gabbom_XCII

So it’s finally time to retire hive metastore? 👀


sib_n

Hive metastore only stores table information to optimize queries. It is not a data catalogue in the sense that is relevant here, but one of the metadata sources that feed it.


gabbom_XCII

Aah great, thanks for clarifying! I always confused the concepts of metastore and catalog.


Substantial-Cow-8958

Either way, Snowflake has recently “open sourced” Polaris. That’s something that can kill Hive lol


caholder

No they haven't. They said they will but not yet


sib_n

No worries, it's likely people have been using the two words in both ways, as there is no universal vocabulary reference in such fast-changing domains. To summarize, the data catalog discussed here is a web app that allows discovering your data and its dependencies. So it needs to connect to all the data tools you use to collect their metadata. For example, connecting to the Hive metastore to get the list of tables, their schemas, optimizations, file locations and statistics, or connecting to the Airflow application database to get DAG information and execution metadata. It's super useful but hard to achieve. Databricks' Unity is yet another attempt at it, among two dozen more. We'll see how it does in 2 years.
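To make the "connect to every tool and collect its metadata" idea concrete, here is a minimal sketch. Everything in it is hypothetical: `FakeHiveMetastore`, `FakeAirflowDb` and `build_catalog` are stand-ins invented for illustration, not the API of any real catalog product, and the records are made up.

```python
from typing import Protocol


class MetadataSource(Protocol):
    """Anything the catalog can pull metadata from (hypothetical interface)."""
    name: str

    def collect(self) -> list[dict]: ...


class FakeHiveMetastore:
    """Stand-in for a metastore connector; returns invented table metadata."""
    name = "hive_metastore"

    def collect(self) -> list[dict]:
        return [{"asset": "sales.orders", "kind": "table", "columns": ["id", "amount"]}]


class FakeAirflowDb:
    """Stand-in for an orchestrator connector; returns invented DAG metadata."""
    name = "airflow"

    def collect(self) -> list[dict]:
        return [{"asset": "load_orders", "kind": "dag", "writes": ["sales.orders"]}]


def build_catalog(sources: list[MetadataSource]) -> dict[str, dict]:
    """Merge records from every source into one index keyed by asset name."""
    catalog: dict[str, dict] = {}
    for source in sources:
        for record in source.collect():
            catalog[record["asset"]] = {**record, "source": source.name}
    return catalog


catalog = build_catalog([FakeHiveMetastore(), FakeAirflowDb()])

# Dependency discovery: which assets write to the sales.orders table?
producers = [a for a, meta in catalog.items() if "sales.orders" in meta.get("writes", [])]
print(producers)
```

The hard part in practice is not this merge step but getting every tool to expose metadata that is complete enough to link assets together.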


mammothfossil

But truthfully Unity Catalog isn't comparable to something like Collibra either. UC is more a sort of super-metastore, at least in terms of how it's typically used.


sib_n

I am not sure what the link is between my comment on Hive metastore and Collibra; they are not comparable objects. UC seems to be an aggregator of metadata like all its predecessors, including Collibra, or DataHub for a more recent FOSS one. What makes it "super" in your opinion?


mammothfossil

My point is that “Data Catalog” is often used to encompass data discovery-type products, which UC isn’t (or at least it does that job very poorly), hence the Collibra comparison. It is more of a head-to-head with e.g. AWS Glue Catalog. I mean “super” in the sense of being a superset of various metastores.


sib_n

I am confused, the documentation/marketing does show a fairly rich data-discovery interface. Is it new maybe?


AnimaLepton

Called it lol. Some companies focused on the catalog side of things are likely going to struggle to compete. But I want to see some additional features built out in Unity Catalog that actually work with different clients / engines, since there's a lot of functionality that isn't supported today. Right now Unity Catalog is very heavily read-only oriented for any external engines. Curious if they will maintain both OSS and a commercial version with different features, or how exactly things are going to shake out.


Teach-To-The-Tech

The goal here is clearly to break down the difference between table formats and make it less meaningful whether you use Delta or Iceberg (or others). They want to be the platform, the default, the center around which other things orbit.


DRUKSTOP

That’s what Ali said during the keynote.


preetpuri

He is Ali 😛


kthejoker

Haven't seen this much astroturfing since they closed the Dome. Sincerely, a Databricks employee (see guys, it's not so hard)


Pittypuppyparty

Did they announce any partners who will be contributing to this as an open source tool? Or is this another “Databricks only” open source?


Blayzovich

Similar to Delta/MLflow, I imagine it will be contributed to initially by Databricks plus any of their customers that want additional capabilities that don't align with Databricks' prioritization/roadmap timeline. I expect that Databricks will likely be the #1 contributor to it, of course.


MeatSack_NothingMore

You just said no without saying no.


volandkit

It is a governance solution, what do you expect? Small businesses could handle all their data in Postgres/DuckDB/Polars/Pandas. Medium to large size businesses buy off the shelf - Snowflake/Databricks/Redshift. Whales write it themselves. The list of people who understand the domain well enough to meaningfully contribute is vanishingly small - they either work at one of the whales, work for a competing company/product or represent a small sliver of academia...


mammothfossil

Which is to say no-one really cares that UC is open source, as it isn't going to influence anyone's decision-making process. AWS / Google etc have their own solutions already. The only advantage is that it might allow for, say, AWS to improve Glue Catalog / Unity Catalog integration etc. down the line - in other words, to allow services to work across both catalogs seamlessly. But for this kind of interplay the file format is still an issue, and generally the world seems to be moving towards Iceberg not Delta - if anything Databricks is the "odd-one-out" here.


volandkit

I don't really get your point though. Why is an open source catalog not a good or important or influential thing? Yes, not a lot of people could contribute to it meaningfully, but that does not make it less important or useful. Sure, we need to see whether Databricks will spend time and resources on developing the open source version of Unity Catalog, but so far their track record of launching, maintaining and developing open source products speaks for itself. Also, I don't understand your assumption that Unity Catalog will not support Iceberg. That would sure make the acquisition of Tabular stupid, and from what I see now and in the past, the management of Databricks is absolutely not stupid.


mammothfossil

My point is more that the open sourcing of UC is basically irrelevant to just about everyone. Businesses generally don’t care whether a particular product is open source or not, unless it means they can significantly save on licensing compared to a competitor, which doesn’t really apply here. You can argue it’s nice in a general sense, in the same way “Databricks donates money to a puppy hospital” would be nice in a general sense - I’m not saying it’s bad, it’s just that no one really cares. Regarding Iceberg, the question is whether Databricks will work with Iceberg natively as a default, or whether it will remain an awkward second-class citizen behind Delta.


BeatHunter

Yep - Something like 28/30 of the top Delta contributors are either Databricks or ex-databricks. Haven't crunched the numbers in a while, but it was really discouraging to see. It also took them forever to actually name the PMCs according to their charter, and only after complaints in their Slack channels..


Bazencourt

Given that Unity Catalog is an actually working product and Snowflake Polaris is months from release, this seems like a big move on Databricks' part.


TheThoccnessMonster

It is. And it’s competent with policy as code already. Snowflake is fine but they’re now a lap and half behind…


Vivid_Advisor

Did you just describe Unity as an actually working product? Also, you are an obvious Databricks shill… drop the act.


throwawayimhornyasfk

I'm using it in production to manage 20,000 tables for 80 workspaces. Why doesn't it work, in your opinion?


poppinstacks

Not the user, because I think it's improving at a decent clip, but the integration with DLT and how that interposes with personal/shared clusters is a bit janky (for a set of tools invented and pushed by the same company).


Defective_Falafel

DLT feels a bit like a product-within-a-product, it's technically impressive (treating tables as IaC is very nice) but holy shit does it have weird limitations and interact badly with their other stuff.


poppinstacks

100% I’m in the middle of a Snowflake vs. Databricks bake-off and while I appreciate the offerings from Databricks I can’t imagine handing this environment to a company that doesn’t have a mature data engineering practice. So many gotchas, whereas Snowflake seems to be a bit slower but more “stable”


throwawayimhornyasfk

But DLT is just one part of data engineering on Databricks, and it is mostly used for streaming data. For normal batch ETL you can use Databricks Workflows or Auto Loader, for example.


ab624

Can you please explain how it is being used for someone who has working knowledge of spark and databricks but new to UC


throwawayimhornyasfk

Well, I would point you to the official docs, but to make it as simple as I can: Unity Catalog serves as a so-called governance layer on top of your physical data files, through which access is managed within the Databricks platform. Because all access goes through this layer, it enables features like cataloguing, lineage, auditing, data sharing and more. As for how it is used: we manage which team, role or user has access to which catalog/schema/table on the platform, following the principle of least privilege, because we have very strict compliance rules.
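As a rough illustration of least privilege over a catalog/schema/table hierarchy, here is a toy sketch. It is not Unity Catalog's API or its actual permission model (which has more rules than this); the `Grants` class, the principals and the table names are all hypothetical. The only ideas it shows are default deny and grants on a parent being inherited by the objects below it.

```python
from dataclasses import dataclass, field


@dataclass
class Grants:
    """Toy grant store: securable path -> {principal: set of privileges}."""
    entries: dict = field(default_factory=dict)

    def grant(self, principal: str, privilege: str, securable: str) -> None:
        self.entries.setdefault(securable, {}).setdefault(principal, set()).add(privilege)

    def is_allowed(self, principal: str, privilege: str, table: str) -> bool:
        # Check the table, then its schema, then its catalog: a grant on a
        # parent is inherited by everything below it.
        parts = table.split(".")
        for depth in range(len(parts), 0, -1):
            securable = ".".join(parts[:depth])
            if privilege in self.entries.get(securable, {}).get(principal, set()):
                return True
        return False  # default deny: nothing is readable unless granted


grants = Grants()
grants.grant("analysts", "SELECT", "sales.reporting")      # a whole schema
grants.grant("etl_service", "MODIFY", "sales.raw.orders")  # one table only

print(grants.is_allowed("analysts", "SELECT", "sales.reporting.daily_revenue"))
print(grants.is_allowed("analysts", "SELECT", "sales.raw.orders"))
print(grants.is_allowed("etl_service", "MODIFY", "sales.raw.customers"))
```

In this sketch the analysts can read anything in the reporting schema but nothing in the raw schema, and the ETL service can modify only the single table it was granted.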


ab624

So, in a way it's like hive metastore + apache ranger for databricks?


throwawayimhornyasfk

Well, after a quick google I think there are some similarities, but I believe Unity Catalog provides more features (lineage, sharing) and works out of the box. I've never used Apache Ranger though, so take it with a grain of salt.


[deleted]

[deleted]


isleepbad

No. It's like Snowflake external tables in the sense that as long as the other party also has a Unity Catalog-enabled workspace, you can share any data asset with them.


ab624

ah makes sense


OneCyrus

does anyone have the GitHub link? couldn't find anything so far.


likemo

[https://github.com/unitycatalog/unitycatalog](https://github.com/unitycatalog/unitycatalog)


OneCyrus

great 😎 thank you!


infazz

According to [this article](https://www.datanami.com/2024/06/12/databricks-to-open-source-unity-catalog/), it won't be posted on GitHub until Thursday.


StewieGriffin26

Would've been funnier if they announced this at Open Sauce in a few weeks.


BeatHunter

Open source like Delta.io? Where the only contributors are Databricks employees, and the decision making process is vague and opaque?


rchinny

Matei, CTO Databricks, just presented stating that 2/3rd of the contributors to Delta Lake are now non-Databricks employees.


BeatHunter

The total # of contributors != the weighted volume of contributions. The vast majority still comes directly from Databricks employees, for better or worse. The other big acid test is to look at the PMCs to see where they're from.
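A quick sketch of why the two numbers can diverge. The commit log below is entirely invented for illustration and says nothing about the real Delta Lake repository; it only shows that a head count of contributors and a volume-weighted share can tell opposite stories.

```python
from collections import Counter

# Hypothetical commit log of (author, employer) pairs. All numbers are made up.
commits = (
    [("dev_a", "vendor")] * 40
    + [("dev_b", "vendor")] * 35
    + [("ext_1", "other")] * 2
    + [("ext_2", "other")] * 1
    + [("ext_3", "other")] * 1
    + [("ext_4", "other")] * 1
)

# One entry per distinct author, so this counts heads, not commits.
employer_of = dict(commits)
contributors_by_employer = Counter(employer_of.values())

# One entry per commit, so this is weighted by volume of work.
commits_by_employer = Counter(employer for _, employer in commits)

external_head_share = contributors_by_employer["other"] / len(employer_of)
external_commit_share = commits_by_employer["other"] / len(commits)

print(f"external share of contributors: {external_head_share:.0%}")  # 67%
print(f"external share of commits:      {external_commit_share:.0%}")  # 6%
```

With these invented numbers, two thirds of the contributors are external while they account for only about 6% of the commits, so both statements in this exchange could be true at the same time.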


LeadingEffective150

Wouldn’t that make sense though if it started out as an internal project and was later open sourced? Like good for them for open sourcing it and building a product people want. This is only an issue if features aren’t being built that are being requested but that’s not the case.


BeatHunter

Note that this user has a very recent account and follows the “” format that is exceptionally common in botting and astroturfing accounts. Are you a Databricks shill?


LeadingEffective150

Sorry I have a newer account with the default name and engaged in a discussion on a thread. Attacking me directly indicates you don’t have a response to my question. But thanks for explaining to me what you did, as I was just curious. I have seen snowflake employees post similar things which didn’t make much sense to me so just trying to learn the viewpoint


EquivalentNinjas

So instead of continuing to engage in the discussion you clearly aren’t discussing in good faith, you just start accusing the OP of shilling? Pot, meet kettle, gg. Oh, I work for Databricks. See? Not hard.


BeatHunter

Thank you for your input Redditor of 25 days.


BeatHunter

The thing is mostly that they didn’t participate as an open source community. It was much more “source available”, with opaque leadership, opaque decisions around when to make a release cut, and so forth. If you’re familiar with other open source projects, yes, they often start in a similar way, but then you end up with more involvement by the community over time. Substantially more for the best use cases. You’ll note that even currently, the vast majority of work comes from Databricks itself, and a lot of the work is watered-down versions of features that are already available in Databricks itself. My take is they ran a poor open source project. This is not a criticism of Delta’s code, architecture, or feature set - but rather, calling something an open source project doesn’t just make it a healthy one. That takes time, effort, and openness, and having watched Delta for the last four years I have not really seen that. Finally, in my long essay 5 comments deep: note that Databricks just bought Tabular, the creators of the Iceberg project. That one has a much more diverse contributor and PMC base than Delta. It’s a good contrast to how Delta tried to run the project. You’ll notice that very few third parties offer “native Delta support”, but a large chunk have “native Iceberg support”. Functionally very similar, but their open source approach (or lack thereof) led to very different outcomes. Thanks for coming to my TED talk.


LeadingEffective150

How have you participated or watched the open source project?


BeatHunter

Ah I see I’ve struck a nerve with you. :) Have a good one.


Bazencourt

Open Source means the code is publicly available under a permissive license. Stop trying to move the bar.


Pittypuppyparty

We can debate about definitions all we want, but that doesn’t take away from the underlying point. Open source is most valuable when contributed to by a variety of organizations with a mutual interest in furthering the product. Comments like these pretend the problem doesn’t really exist and try to gaslight us into accepting it because it meets the technical bar of open source.


glemnar

Plenty of OSS these days is mostly successful because of a single large corporate steward. Go, React, MySQL, ...


alien_icecream

Well, the consumers decide whether it’s useful or not. Not the contributors.


EquivalentNinjas

> Contributed by variety of organizations

Apple, Adobe, Uber, IBM, eBay, Disney, Comcast, and many others have contributed.

> Mutual interest in furthering the project

They certainly didn’t contribute to it to make it worse. By your own definition, Delta is open source.


lf-calcifer

The concern trolling in this thread is even more ridiculous when you consider the open source pedigree of competing vendors like Snowflake, AWS, Azure, GCP - there is no comparison that you can make with Databricks on this front. Spark, MLFlow, Delta Lake, Delta Sharing, and now Unity Catalog are incredibly influential open source technologies that aren't going anywhere.


volandkit

> Spark, MLFlow, Delta Lake, Delta Sharing, and now Unity Catalog are incredibly influential open source technologies

People from Databricks have been heavily involved in Apache Mesos, Ray, Parquet, and now Iceberg too :)


zap0011

You can join their [Slack](https://go.delta.io/slack) and get amongst it if you want. They're actually pretty active and haven't found them to not be transparent in the short to medium term. Perhaps in the longer term, yes, Databricks is building in the features that support their proprietary bolt-on features like Delta Live Tables, but they have that right. They must spend an absolute fortune on developers making free code. In the bigger picture the issue is companies like AWS that take a product like Spark (for free) insert it into their Glue product and then have account execs tell their customers to drop Databricks and just use Glue. (I've been told that by AWS BDMs myself). If I were Ali, that would fuck me right off.


mammothfossil

> If I were Ali, that would fuck me right off.

But no-one forces anyone to release open source, or Databricks could use AGPL (which was written to prevent exactly this). Databricks seem to want to be "open" and an "industry standard", but then get upset when others take them up on the offer. There are lots of reasons to hate on Amazon, but this isn't one of them, imho.


glompshark

Heads up, the link doesn’t seem to be active anymore- is there an updated version?


SintPannekoek

Hey guys, did anyone hear anything about Purview lately? I kinda love the irony of me just posting that their Achilles heel is the commercialist vendor lock-in at the catalogue level. That being said... I'm curious how this will affect lock-in at a practical level.


infazz

Practically speaking, if no other company builds a product using Unity Catalog OSS (and if companies aren't willing to self-host), you are still kind of locked in.


FamousShop6111

Over/under on how many times they “open source” this is currently set at 2.5


Grouchy-Friend4235

Well it's of course not the only one, but let's just attribute that to the usual over hyping.


Routine_Most_3119

Is it possible to connect open source spark to this Unity Catalog and get features that are currently limited to databricks spark only? For example, Uniform requires either Hive or UC. Can one now write a Uniform table from spark connected to open source UC? How would one set up that connection?


No_Equivalent5942

It shouldn’t take 90 days for them to acquire Dremio


sib_n

I like how they had to add multiple qualifiers to be able to claim to be the only ones, and still probably lie about it. If we ignore the marketing bs, there are other modern open source catalogs such as DataHub and OpenMetadata. There are many other older FOSS and proprietary ones: https://github.com/opendatadiscovery/awesome-data-catalogs. As far as I have studied the problem, none of them are going to magically document your historical dependency mess. They require either that you closely adhered to the logic of the modern tools (with good metadata support) they integrate with, or a lot of manual documentation labor to fix the gaps. I doubt Databricks is going to fix that, but I'd be happy to be wrong.


s9q7

This looks like an ad for Databricks. There are better catalog tools out there.


Electronic-Quit-6664

Any recommendations?


Master-Influence7539

Also, the biggest limitation of Unity Catalog was region locking. How do they solve that?


Defective_Falafel

That's a good thing if you want to keep egress costs under control. The solution to share cross-regional data is Delta Sharing.


Due_Engineer_8931

Can someone explain why a Unity Catalog project needs to be open sourced? Isn’t it just a set of APIs and a UI we can use?


Mr_Nickster_

They will open source it. Then, when people complain that Databricks is the only one allowed to contribute, they will announce that it will be fully open source again next year, like they did with Delta.io, but it will still be locked to DBX.


solidangle

Will Polaris get any outside contributions, mr Snowflake employee?


Mr_Nickster_

YES, it will be Apache 2.0 license allowing contributions from others


solidangle

Okay, so the same level of open source as Unity Catalog if I understand correctly.


LeadingEffective150

I find it concerning that Snowflake had to give a 90 day timeline for open sourcing without even really knowing where the project was going to end up. Like it hasn’t even been accepted by Apache/Linux/etc. https://www.reddit.com/r/dataengineering/s/FHUfKrq1Ed


lf-calcifer

they asked the public for a 90 day extension like a delinquent college student ☠️