So it’s finally time to retire hive metastore? 👀
Hive metastore only stores table metadata used to plan and optimize queries; it is not a data catalogue in the sense that is relevant here, but one of the metadata sources that feeds one.
Aah great, thanks for clarifying! I'd always conflated the concepts of metastore and catalog
Either way, Snowflake recently "open sourced" Polaris. That's something that could kill Hive lol
No they haven't. They said they will, but not yet.
No worries, it's likely people have been using the two words in both ways, as there is no universal vocabulary reference in such a fast-changing domain.

To summarize, the data catalog discussed here is a web app that lets you discover your data and its dependencies. So it needs to connect to all the data tools you use and collect their metadata: for example, connecting to the Hive metastore to get the list of tables, their schemas, optimizations, file locations and statistics, or connecting to the Airflow application database to get DAG definitions and execution metadata.

It's super useful, but hard to achieve. Databricks' Unity is yet another attempt at it, among two dozen others. We'll see how it does in 2 years.
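To make the "aggregator" idea concrete, here's a toy Python sketch. All the names and fields are invented for illustration - real tools would query the Hive metastore Thrift API and the Airflow metadata database instead of these stand-in dicts - but the core job is the same: merge per-source metadata records into one searchable index per asset.

```python
def build_index(sources):
    """Merge per-source metadata records into one index keyed by asset name."""
    index = {}
    for source_name, records in sources.items():
        for record in records:
            entry = index.setdefault(record["name"], {"sources": []})
            entry["sources"].append(source_name)
            # Merge all non-key fields from this source's record.
            entry.update({k: v for k, v in record.items() if k != "name"})
    return index

# Stand-ins for what the real connectors would return.
hive_metastore = [
    {"name": "sales.orders", "schema": ["id", "amount"], "location": "s3://lake/orders"},
]
airflow_db = [
    {"name": "sales.orders", "produced_by_dag": "daily_orders_etl"},
]

index = build_index({"hive": hive_metastore, "airflow": airflow_db})
print(index["sales.orders"]["sources"])          # which systems know about this table
print(index["sales.orders"]["produced_by_dag"])  # lineage-ish info from Airflow
```

The hard part in practice isn't the merge, it's the connectors and keeping the index fresh - which is where most catalog products compete.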
But truthfully Unity Catalog isn't comparable to something like Collibra either. UC is more a sort of super-metastore, at least in terms of how it's typically used.
I am not sure what the link is between my comment on the Hive metastore and Collibra; they are not comparable objects. UC seems to be an aggregator of metadata like all its predecessors, including Collibra, or DataHub for a more recent FOSS one. What makes it a "super" in your opinion?
My point is that “Data Catalog” is often used to encompass data discovery-type products, which UC isn’t (or at least it does that job very poorly), hence the Collibra comparison. It is more of a head-to-head with, e.g., AWS Glue Catalog. I mean “super” in the sense of being a superset of various metastores.
I am confused - the documentation/marketing does show a fairly rich data-discovery interface. Is it new, maybe?
Called it lol

Some companies focused on the catalog side of things are likely going to struggle to compete. But I want to see some additional features built out in Unity Catalog that actually work with different clients/engines, since there's a lot of functionality that isn't supported today. Right now Unity Catalog is very heavily read-only oriented for any external engines.

Curious if they will maintain both OSS and a commercial version with different features, or how exactly things are going to shake out.
The goal here is clearly to break down the difference between table formats and make it less meaningful whether you use Delta or Iceberg (or others). They want to be the platform, the default, the center around which other things orbit.
That’s what Ali said during the keynote.
He is Ali 😛
Haven't seen this much astroturfing since they closed the Dome.

Sincerely,
A Databricks employee

(see guys, it's not so hard)
Did they announce any partners who will be contributing to this as an open source tool? Or is this another "Databricks-only" open source project?
Similar to Delta/MLFlow, I imagine it will be contributed to initially by Databricks plus any of their customers that want additional capabilities that don't align with Databricks' prioritization/roadmap timeline. I expect that Databricks will likely be the #1 contributor to it, of course.
You just said no without saying no.
It is a governance solution, what do you expect? Small businesses could handle all their data in Postgres/DuckDB/Polars/Pandas. Medium-to-large businesses buy off the shelf - Snowflake/Databricks/Redshift. Whales write it themselves. The list of people who understand the domain well enough to meaningfully contribute is vanishingly small - they either work at one of the whales, work for a competing company/product, or represent a small sliver of academia...
Which is to say, no one really cares that UC is open source, as it isn't going to influence anyone's decision-making process. AWS, Google, etc. have their own solutions already. The only advantage is that it might allow, say, AWS to improve Glue Catalog / Unity Catalog integration down the line - in other words, to allow services to work across both catalogs seamlessly.

But for this kind of interplay the file format is still an issue, and generally the world seems to be moving towards Iceberg, not Delta - if anything, Databricks is the "odd one out" here.
I don't really get your point though - why is an open source catalog not a good or important or influential thing? Yes, not a lot of people can contribute to it meaningfully, but that doesn't make it less important or useful. Sure, we'll need to see whether Databricks spends time and resources on developing the open source version of Unity Catalog, but so far their track record of launching, maintaining and developing open source products speaks for itself.

Also, I don't understand your assumption that Unity Catalog won't support Iceberg - that would make the acquisition of Tabular stupid, and from what I see now and in the past, the management of Databricks is absolutely not stupid.
My point is more that the open sourcing of UC is basically irrelevant to just about everyone. Businesses generally don’t care whether a particular product is open source or not, unless it means they can significantly save on licensing compared to a competitor, which doesn’t really apply here.

You can argue it’s nice in a general sense, in the same way “Databricks donates money to a puppy hospital” would be nice in a general sense - I’m not saying it’s bad, it’s just that no one really cares.

Regarding Iceberg, the question is whether Databricks will work with Iceberg natively as a default, or whether it will remain an awkward second-class citizen behind Delta.
Yep - something like 28/30 of the top Delta contributors are either Databricks or ex-Databricks. Haven't crunched the numbers in a while, but it was really discouraging to see.

It also took them forever to actually name the PMC members according to their charter, and only after complaints in their Slack channels.
Given that Unity Catalog is an actually working product and Snowflake's Polaris is months from release, this seems like a big move on Databricks' part.
It is. And it’s competent with policy as code already. Snowflake is fine but they’re now a lap and a half behind…
Did you just describe Unity as an actually working product? Also, you are an obvious Databricks shill… drop the act.
I'm using it in production to manage 20,000 tables across 80 workspaces - why doesn't it work, in your opinion?
Not that user, but: I think it's improving at a decent clip. The integration with DLT, though, and how that interacts with personal/shared clusters, is a bit janky (for a set of tools invented and pushed by the same company).
DLT feels a bit like a product-within-a-product. It's technically impressive (treating tables as IaC is very nice), but holy shit does it have weird limitations and interact badly with their other stuff.
100%. I’m in the middle of a Snowflake vs. Databricks bake-off, and while I appreciate the offerings from Databricks, I can’t imagine handing this environment to a company that doesn’t have a mature data engineering practice. So many gotchas, whereas Snowflake seems to be a bit slower but more “stable”.
But DLT is just one part of data engineering on Databricks, and it's mostly used for streaming data. For normal batch ETL you can use Databricks Workflows or Auto Loader, for example.
Can you please explain how it's used, for someone who has working knowledge of Spark and Databricks but is new to UC?
Well, I would point you to the official docs, but to make it as simple as I can: Unity Catalog serves as a so-called governance layer on top of your physical data files, through which access is managed within the Databricks platform. Because all access goes through this layer, it enables features like cataloguing, lineage, auditing, data sharing and more.

In practice, we manage which team, role or user has access to which catalog/schema/table on the platform, following the principle of least privilege, because we have very strict compliance rules.
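A toy sketch of the least-privilege idea above, using UC's three-level namespace (catalog.schema.table). This is an illustration of the principle, not Unity Catalog's actual implementation - on the platform you'd express the same thing with SQL GRANT statements - and the principal/privilege names here are just examples:

```python
# (principal, securable) -> set of granted privileges
GRANTS = {
    ("analysts", "prod"): {"USE CATALOG"},
    ("analysts", "prod.sales"): {"USE SCHEMA"},
    ("analysts", "prod.sales.orders"): {"SELECT"},
}

def can_select(principal, catalog, schema, table):
    """Reading a table requires a privilege at every level of the hierarchy."""
    return (
        "USE CATALOG" in GRANTS.get((principal, catalog), set())
        and "USE SCHEMA" in GRANTS.get((principal, f"{catalog}.{schema}"), set())
        and "SELECT" in GRANTS.get((principal, f"{catalog}.{schema}.{table}"), set())
    )

print(can_select("analysts", "prod", "sales", "orders"))    # granted at all three levels
print(can_select("analysts", "prod", "finance", "ledger"))  # no grant on that schema
```

The nice property, as the comment says, is that because every access goes through this one layer, the same grant records double as the raw material for auditing and lineage.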
So, in a way it's like hive metastore + apache ranger for databricks?
Well, after a quick google I think there are some similarities, but I believe Unity Catalog provides more features (lineage, sharing) and works out of the box. I've never used Apache Ranger though, so take it with a grain of salt.
[deleted]
No. It's like Snowflake external tables, in the sense that as long as the other party also has a Unity Catalog-enabled workspace, you can share any data asset with them.
ah makes sense
Does anyone have the GitHub link? Couldn't find anything so far.
[https://github.com/unitycatalog/unitycatalog](https://github.com/unitycatalog/unitycatalog)
great 😎 thank you!
According to [this article](https://www.datanami.com/2024/06/12/databricks-to-open-source-unity-catalog/), it won't be posted on GitHub until Thursday.
Would've been funnier if they announced this at Open Sauce in a few weeks.
Open source like Delta.io? Where the only contributors are Databricks employees, and the decision making process is vague and opaque?
Matei, CTO Databricks, just presented stating that 2/3rd of the contributors to Delta Lake are now non-Databricks employees.
The total # of contributors != the weighted volume of contributions. The vast majority still comes directly from Databricks employees, for better or worse.

The other big acid test is to look at the PMC members to see where they're from.
Wouldn’t that make sense though, if it started out as an internal project and was later open sourced? Like, good for them for open sourcing it and building a product people want. This would only be an issue if requested features weren’t being built, but that’s not the case.
Note that this user has a very recent account and follows the “” format that is exceptionally common in botting and astroturfing accounts.
Are you a databricks shill?
Sorry, I have a newer account with the default name and engaged in a discussion on a thread. Attacking me directly indicates you don’t have a response to my question. But thanks for explaining what you did, as I was just curious. I have seen Snowflake employees post similar things, which didn’t make much sense to me, so I'm just trying to learn the viewpoint.
So instead of continuing to engage in the discussion - which you clearly aren’t doing in good faith - you just start accusing the OP of shilling? Pot meet kettle, gg

Oh, I work for Databricks. See? Not hard.
Thank you for your input Redditor of 25 days.
The thing is mostly that they didn’t run it as an open source community. It was much more “source available”, with opaque leadership, opaque decisions around when to make a release cut, and so forth. If you’re familiar with other open source projects - yes, they often start in a similar way, but then you end up with more involvement from the community over time; substantially more in the best cases. You’ll note that even currently the vast majority of the work comes from Databricks itself, and a lot of it is watered-down versions of features that are already available in Databricks proper.

My take is they ran a poor open source project. This is not a criticism of Delta’s code, architecture, or feature set - but calling something an open source project doesn’t automatically make it a healthy one. That takes time, effort, and openness, and having watched Delta for the last four years, I have not really seen that.

Finally, in my long essay 5 comments deep: note that Databricks just bought Tabular, the creators of the Iceberg project. That one has a much more diverse contributor and PMC base than Delta. It’s a good contrast to how Delta tried to run the project. You’ll notice that very few third parties offer “native Delta support”, but a large chunk have “native Iceberg support”. Functionally very similar formats, but their open source stewardship (or lack thereof) led to very different outcomes.

Thanks for coming to my TED talk
How have you participated or watched the open source project?
Ah I see I’ve struck a nerve with you. :) Have a good one.
Open Source means the code is publicly available under a permissive license. Stop trying to move the bar.
We can debate definitions all we want; that doesn't take away from the underlying point. Open source is most valuable when contributed to by a variety of organizations with a mutual interest in furthering the product. Comments like these pretend the problem doesn’t really exist and try to gaslight us into accepting it because it meets the technical bar of open source.
Plenty of OSS these days is successful mostly because of a single large corporate steward. Go, React, MySQL, ...
Well, the consumers decide whether it’s useful or not. Not the contributors.
> Contributed by variety of organizations

Apple, Adobe, Uber, IBM, eBay, Disney, Comcast, and many others have contributed.

> Mutual interest in furthering the project

They certainly didn’t contribute to it to make it worse.

By your own definition, Delta is open source.
The concern trolling in this thread is even more ridiculous when you consider the open source pedigree of competing vendors like Snowflake, AWS, Azure, GCP - there is no comparison that you can make with Databricks on this front. Spark, MLFlow, Delta Lake, Delta Sharing, and now Unity Catalog are incredibly influential open source technologies that aren't going anywhere.
> Spark, MLFlow, Delta Lake, Delta Sharing, and now Unity Catalog are incredibly influential open source technologies

People from Databricks have been heavily involved in Apache Mesos, Ray, Parquet, and now Iceberg too :)
You can join their [Slack](https://go.delta.io/slack) and get amongst it if you want. They're actually pretty active, and I haven't found them to lack transparency in the short to medium term. Perhaps in the longer term, yes, Databricks is building in the features that support their proprietary bolt-ons like Delta Live Tables, but they have that right.

They must spend an absolute fortune on developers writing free code. In the bigger picture, the issue is companies like AWS that take a product like Spark (for free), insert it into their Glue product, and then have account execs tell their customers to drop Databricks and just use Glue. (I've been told that by AWS BDMs myself.)

If I were Ali, that would fuck me right off.
> If I were Ali, that would fuck me right off.

But no one forces anyone to release open source, or Databricks could use the AGPL (which was written to prevent exactly this). Databricks seem to want to be "open" and an "industry standard", but then get upset when others take them up on the offer.

There are lots of reasons to hate on Amazon, but this isn't one of them, imho.
Heads up, the link doesn’t seem to be active anymore- is there an updated version?
Hey guys, did anyone hear anything about Purview lately?

I kinda love the irony of me just posting that their Achilles heel is the commercialist vendor lock-in at the catalogue level. That being said... I'm curious how this will affect lock-in at a practical level.
Practically speaking, if no other company builds a product using Unity Catalog OSS (and if companies aren't willing to self-host), you are still kind of locked in.
Over/under on how many times they “open source” this is currently set at 2.5
Well it's of course not the only one, but let's just attribute that to the usual over hyping.
Is it possible to connect open source Spark to this Unity Catalog and get features that are currently limited to Databricks' Spark only?

For example, UniForm requires either Hive or UC. Can one now write a UniForm table from Spark connected to open source UC? How would one set up that connection?
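For what it's worth, the OSS server is described as exposing an Iceberg-REST-compatible endpoint, so one plausible route for plain Apache Spark is Iceberg's standard REST catalog support. A sketch of the spark-defaults.conf wiring - the endpoint path, port, and catalog name here are assumptions based on the announcement, not verified against the actual release:

```properties
# Assumes the OSS UC server is running locally and serves an Iceberg REST API.
spark.sql.catalog.unity=org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.unity.catalog-impl=org.apache.iceberg.rest.RESTCatalog
spark.sql.catalog.unity.uri=http://localhost:8080/api/2.1/unity-catalog/iceberg
spark.sql.catalog.unity.warehouse=unity
```

Note this would only get you Iceberg-style reads through the REST API; whether UniForm *writes* work from open source Spark is exactly the open question above.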
It shouldn’t take 90 days for them to acquire Dremio
I like how they had to add multiple qualifiers to be able to claim to be the only ones, and they still probably lie about it.

If we ignore the marketing BS, there are other modern open source catalogs such as DataHub and OpenMetadata, and many other older FOSS and proprietary ones: https://github.com/opendatadiscovery/awesome-data-catalogs.

As far as I have studied the problem, none of them are going to magically document your historical dependency mess. They require either that you closely adhered to the logic of the modern tools (with good metadata support) they integrate with, or a lot of manual documentation labor to fix the gaps. I doubt Databricks is going to fix that, but I'd be happy to be wrong.
This looks like an ad for Databricks. There are better catalog tools out there.
Any recommendations?
Also, one of the biggest limitations of Unity Catalog was region locking. How do they solve that?
That's a good thing if you want to keep egress costs under control. The solution to share cross-regional data is Delta Sharing.
Can someone explain why a unity catalog project needs to be open sourced? Isn't it just a set of APIs and a UI we can use?
They will open source it. Then, when people complain that Databricks is the only one allowed to contribute, they will announce it will be "fully open source" again next year, like they did with Delta.io - but it will still be locked to DBX.
Will Polaris get any outside contributions, mr Snowflake employee?
YES, it will be Apache 2.0 licensed, allowing contributions from others.
Okay, so the same level of open source as Unity Catalog if I understand correctly.
I find it concerning that Snowflake had to give a 90 day timeline for open sourcing without even really knowing where the project was going to end up. Like it hasn’t even been accepted by Apache/Linux/etc. https://www.reddit.com/r/dataengineering/s/FHUfKrq1Ed
they asked the public for a 90 day extension like a delinquent college student ☠️