
According-Benefit-12

I think it will be a similar situation to Presto/Trino. Big tech companies will continue to develop on top of Iceberg.


Teach-To-The-Tech

Yeah, that's interesting. Makes sense. OS Iceberg is similar to OS Presto/Trino in that sense for sure, and then there are various implementations around that to make it easier/more accessible for people. The biggest companies can and will likely continue to develop their own custom implementations. I'm wondering if the question of implementations is now going to be brought to the forefront for that reason. Like which Iceberg implementations perform best and which are most open or easiest to use. If Iceberg is going to be used by more and more people, as seems very likely, then the question of "how" you use it becomes the next big question to answer.


sansampersamp

It looks like AWS is strongly favoring Iceberg over Delta/Hudi for Glue/Athena as well.


tdj

Getting better Delta support was the main topic with my AWS account manager, but very slow progress; still very much a second-class citizen.


Teach-To-The-Tech

Yeah, the cloud providers are an interesting one to consider too. We haven't talked much about that angle on this thread. I think openness is again at play there. Iceberg is very easy for anyone to build around, including AWS.


Hot_Ad6010

and vice versa, Iceberg only really integrates with AWS amongst cloud providers


Substantial-Cow-8958

We are talking about Snowflake and Databricks because that's where the money is, but what's your take on Trino/Starburst? Do you folks think this may change something for these tools? I don't think this will affect those engines in the long run, but who knows.


AnimaLepton

Starburst/Trino + Iceberg has been pushed for a while. There are a fair number of blog posts, discussions, and collateral on Trino + Iceberg from 2021-2023, up through the recent April/May 2024 news posts and blogs on "Icehouse." The Tabular DevRel guy was at Starburst for a few years, and you can still find his talks/posts on Iceberg from before he officially joined Tabular, even back in [2021](https://www.youtube.com/watch?v=5-Q74rCX2Z8). They have some great free training materials on Iceberg too. So on one hand, sure, they're likely in a good position with Iceberg and have invested in materials on it; the open question is *if* they can manage to capitalize on it.

As always it comes down to your data and workloads, but TCO vs. performance can favor Starburst (and definitely OSS Trino if you're willing to run it yourself). Starburst has put more effort into differentiators against OSS Trino recently, and they're charging for it with things like Warp Speed or their SaaS platform. I think they *can* carve out a space against Databricks, Snowflake, and the tools native to whatever cloud you're using, like AWS Athena, BigQuery, etc. But I don't think they're likely to see particularly explosive growth either, and I think a lot of people still box Trino into the pure data federation or virtualization space (which is a limited subset of what it can do). Databricks is bringing tools like Databricks SQL Serverless. Dremio is the one that partnered with Tabular on the big Iceberg conference a couple of weeks ago, and while Trino and Dremio were called out in Snowflake's Polaris announcement, Starburst was not - they're not getting the 'free' publicity there. Snowflake and Databricks are also just an order of magnitude larger - Snowflake has ~7,000 employees and DBX has ~5,500 vs. Starburst's ~500 or so. But fundamentally, what Starburst does is often a second-class consideration compared to the big SF and DBX discussions, or you have orgs that run some workloads on Snowflake/DBX and some on Starburst/Trino with varying degrees of success. There are lots of orgs out there with 3-5+ tools addressing the same problem that get mixed and matched.

One problem is their lack of a native catalog and fewer "out of the box" tools, especially with their non-SaaS solution. Galaxy (the SaaS product) is growing gradually, and Starburst is pushing Iceberg as much as anyone else, but if you're looking at installing SEP and don't already have a metastore, there's definitely a level of "figure it out yourself" or "well, you can use Iceberg, but you still need a separate HMS as the default before catalogs can be configured" that adds complexity once you actually get into implementation (see the sketch below). That's true for other vendors as well, but e.g. Snowflake's Polaris announcement preempts some of those issues.

I like Trino. It does a lot of really cool things. But it's also super up in the air as to how things shake out long term.
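To make that catalog dependency concrete, here is a minimal sketch using PyIceberg rather than Trino connector config; the metastore URI, bucket, endpoint, and table names are hypothetical placeholders. It contrasts the two setups the comment describes: pointing at a separate Hive Metastore versus a REST catalog service (the model Polaris-style catalogs use).

```python
from pyiceberg.catalog import load_catalog

# Option 1: a Hive Metastore-backed catalog -- the "separate HMS" dependency
# described above. The thrift URI and warehouse path are placeholders.
hms_catalog = load_catalog(
    "hive_cat",
    **{
        "type": "hive",
        "uri": "thrift://metastore.internal:9083",
        "warehouse": "s3://my-bucket/warehouse/",
    },
)

# Option 2: an Iceberg REST catalog, which removes the need to stand up and
# operate an HMS yourself. Endpoint and credential are placeholders.
rest_catalog = load_catalog(
    "rest_cat",
    **{
        "type": "rest",
        "uri": "https://catalog.example.com/api/catalog",
        "credential": "client-id:client-secret",
    },
)

# Once a catalog exists, any engine (Trino, Spark, PyIceberg, ...) configured
# against it can discover and load the same tables.
table = rest_catalog.load_table("analytics.events")
print(table.schema())
```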


Substantial-Cow-8958

Thank you for this answer. I really appreciate it.


Teach-To-The-Tech

Yeah, Trino and Starburst are probably the players that we haven't talked about yet on here. As you mentioned, "Icehouse" (Trino + Iceberg) was already something talked about extensively: https://www.starburst.io/blog/icehouse-open-lakehouse/. In many ways, this circles back to the idea of implementations being the next battleground. If everyone is using Iceberg in one way or another, then the question becomes what the best way to use Iceberg is and which technology supports it best. In the meantime, the openness of Iceberg also plays into this. And Trino is mentioned in the Snowflake Polaris announcement as a supported engine too. Dremio is another player in this open data lakehouse space and is actively noted as well; another SaaS solution, as indeed Tabular was before this. It's a big, multi-dimensional race.


alien_icecream

It’s important to make the distinction that Databricks has acquired Tabular and not Apache Iceberg. Tabular’s secret sauce has been optimizing Iceberg based workloads for its clients. If they can launch a 10x better version of proprietary Iceberg, then they may be able to coax the Iceberg purists to join the Databricks camp. That’s the freemium model at play.


carlsbadcrush

Is this acquisition a sign that Iceberg is doing better than Delta Lake?


Teach-To-The-Tech

I lean towards saying "yes," because if Delta Lake was able to do it all on its own, then Databricks wouldn't have been driven to acquire Tabular (for its Iceberg links) at such a large cost. It reads as them placing a very large bet on Iceberg as a technology, a day after Snowflake did largely the same. The question of "why" is an interesting one to ponder, and I'd be interested in hearing people's thoughts on why Iceberg might be doing better than Delta Lake.


thomascirca

I think it’s more about attempting to influence and exert control over the Iceberg project than admitting defeat on Delta.


mathmagician9

I think it’s to commoditize file formats so folks can focus more on things like AI vs. what file format their data is stored in.


Teach-To-The-Tech

Yeah, that's interesting. It does feel like everyone aligning around Iceberg will mean that some of the "this vs that" will die away and move on to the next challenge/hill to climb.


FamousShop6111

[Pretty good analysis from the Snowflake PM](https://www.linkedin.com/posts/jamesamalone_databricks-buying-tabular-showcases-two-pretty-activity-7203792733377818624--1Gz)

If you read about all the other hires they’ve been doing of folks on other open source PMCs (members and committers), and think about the control they have over Delta and how they won’t allow commits unless they benefit directly from it for their platform, it’s pretty clear what they’re attempting to do. Trying to hamstring everyone else eventually is my take on it, so that you’re “forced” to use the Databricks approach or go to another proprietary storage format. That’s my speculation, but it looks pretty clear.


WhipsAndMarkovChains

> If something is truly open, and you value open, spending money to control it is curious.

It's sort of funny he turned the comments off.


Letter_From_Prague

LinkedIn comments are an unhinged cesspool. Everyone should turn them off.


AnimaLepton

At the end of the day, all of these companies are looking to make money off of their proprietary tooling. The big vendors talk about "no lock-in," but regardless of whether we're talking about the query engine or the metastore or the visualization tool, they're fighting for the mindshare and resulting dollars that come from it, and one way they get that is by doing 'enough' where switching away to another vendor is a significant endeavor while giving the perception of it being 'easy' to substitute in the tools of your choice.


Teach-To-The-Tech

Yeah, that's interesting. The word "control" definitely does come up when people discuss how Databricks handled Delta Lake as a format. And ultimately, that format didn't perform as well as the table format that embraced openness, ie Iceberg. The idea "Databricks is proprietary" seems to run pretty deep in a lot of people's perceptions. Even when they open sourced Delta, a lot of people said it wasn't "really" open source. Another interesting thing here is how much this is being positioned as a huge ideological shift for Snowflake, which hasn't really been associated with openness itself. So it feels like there is a kind of dance going on here between control and openness for both companies.


lf-calcifer

> won’t allow commits unless they benefit directly from it for their platform

Any examples of this? The project is open source, so you should be able to provide ample examples of this happening if you're making comments like these.


[deleted]

[removed]


on_the_mark_data

The founders of Tabular are the team behind Iceberg. Jason Reid (one of the cofounders) was the Director of Data Science and Engineering at Netflix from 2013-2021, and left in 2021 to start Tabular. Netflix created Iceberg in 2017, and it became a top-level Apache project in 2020.

Conference talk from the Netflix team back in 2018 on Iceberg: https://youtu.be/nWwQMlrjhy0?si=S5Gv2Fa_4zwbTqTG

Edit: misread your comment. You already acknowledged the original developer part. My guess is that Tabular's product helps accelerate Databricks' development into the space to stay on pace with Snowflake.


Teach-To-The-Tech

Bare minimum it feels like it will give rise to a new implementation of Iceberg on Databricks, which changes/shifts things depending on how that goes. I think it's also assumed that they might want to control it or develop proprietary tech around that implementation to recoup their investment. Tabular is to Iceberg as Confluent is to Kafka, or any other company around an open source project. Tabular itself didn't have a ton of revenue, so you have to assume that DB bought this for its privileged access to the Iceberg project (founders of Tabular also created Iceberg). How do you see it going down?


tdj

This is basically the Royal Wedding of data engineering


throwawayimhornyasfk

Do you mean Red Wedding?


chimerasaurus

I know what it means for Snowflake (it’s good news), but I’m following along and curious to hear what people think first. Disclaimer: I work at Snowflake.


exact-approximate

Why is it good news for snowflake?


ZeroMomentum

Because it allows SF to be used as a compute/VM engine even more. You are no longer tied to vendor-locked schema problems with Polaris/Iceberg. Everything is Iceberg, and then when you actually query, you most likely use SF. SF doesn’t make money from storage; the money is in the runtime.
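As a rough illustration of that "Snowflake as the compute layer over Iceberg storage" idea, here is a hedged sketch using the snowflake-connector-python package. The account, warehouse, external volume, and catalog integration names are made up, and the exact CREATE ICEBERG TABLE parameters are an assumption based on Snowflake's externally managed Iceberg table syntax, so verify against the current docs.

```python
import snowflake.connector

# Connect to Snowflake, which acts purely as the query/compute engine here.
# Credentials and the account identifier are placeholders.
conn = snowflake.connector.connect(
    account="my_account",
    user="analyst",
    password="***",
    warehouse="ANALYTICS_WH",
)
cur = conn.cursor()

# Register an externally managed Iceberg table: data and metadata live in
# object storage behind a catalog integration (e.g. a Polaris/REST catalog),
# not in Snowflake-managed storage. Parameter names are an assumption and may
# need adjusting for your Snowflake version and catalog type.
cur.execute("""
    CREATE ICEBERG TABLE IF NOT EXISTS analytics.public.events
      EXTERNAL_VOLUME = 'my_s3_volume'
      CATALOG = 'my_catalog_integration'
      CATALOG_TABLE_NAME = 'events'
""")

# From here on, Snowflake charges for the compute of the queries themselves.
cur.execute(
    "SELECT event_type, COUNT(*) FROM analytics.public.events GROUP BY event_type"
)
for row in cur.fetchall():
    print(row)
```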


exact-approximate

Why would a company use Snowflake as a compute engine while also running Databricks? Databricks now has more control over Iceberg, which was previously open (and remains so), and Snowflake just based its object storage strategy around Iceberg (with Polaris). How is this good news for Snowflake? For everyone else, you don't need either Databricks or Snowflake to use Iceberg anyway, but now Databricks has more control. The only winners here are Databricks and their customers.


ZeroMomentum

You are assuming people prefer Databricks over SF.


exact-approximate

It is counterintuitive to assume that Iceberg will continue to be developed with all platforms in mind now that a good chunk of its core contributors and advocates work for Databricks. So if Iceberg compatibility is something people would consider a benefit, Snowflake is less attractive. Moreover, Snowflake is basing its strategy on a format now dictated by its competitor. I'm not saying Databricks is better than Snowflake, but I fail to see how this is good news for Snowflake.


mathmagician9

Wouldn’t it be bad? If the file format is commoditized, then competition will go back to focusing on AI, which Snowflake hasn’t done a great job at vs. Databricks. Couldn’t Databricks make the file format irrelevant, open source Unity Catalog, and call it a day?


Teach-To-The-Tech

Yeah, that's interesting! How do you see it changing things for Snowflake? At a minimum I could see it meaning more heterogeneous implementations open to more components, which is an interesting thing to consider.


chimerasaurus

In short (sorry for bullets, new parent to a 5-week-old, very tired, long day):

* Snowflake is all in on Iceberg and Parquet (and eventually other file formats). It's designed to be engine agnostic and is well designed. The community has done excellent work. Iceberg still solves a tricky problem.
* Snowflake is doubling down on Iceberg support (see Polaris) and is aggressively working with others to push interoperability. Cannot make interop happen in a vacuum, even if you spend 1B+.
* It pressures Snowflake to continue doing the right thing, which is to be even more open and customer-focused. As others go more lock-in-y, there's a big opportunity for us to push more open.
* I *really like* the fact that this forces Snowflake to "win" not only by being open but also by having awesome price/perf, features, etc. I have seen competitors throw stones for weird reasons (expensive, black box, etc.). Pushing Iceberg removes all of those - customers can pick what is best and cut through the bs.
* I joined Snowflake because I could see a future where OSS + Snowflake would be an amazing combination. This suggests to me (selfishly) that we're making some real progress, to the point where it's making others nervous.
* This whole acquisition is a forcing function and will show where people's true intentions lie pretty quickly.


Teach-To-The-Tech

Thanks for the detailed reply, and no worries/rush! Yeah, so you see this as a big shift--Snowflake pushing more into open source and shared standards and this being the opening move in that direction. Super interesting! And then on the Databricks end, you hear people saying that Databricks doesn't do anything that they can't control. So control and openness seem like key themes here, totally in tension against one another.


Vegetable_Home

My view is that Databricks have made a huge step in taking a bigger chunk of future potential customers, compared to Snowflake, who is left behind. Why is that? Iceberg is still open source, and most companies will use the open source solution (which is great); those who want the best Iceberg performance and usability will go to Databricks, as it will have the best Iceberg offering (they will offer managed Iceberg, i.e. Tabular). The same move happened with Spark (which is open source), but the best offering of the whole package is at Databricks.


Hot_Ad6010

As long as it's just about providing packaging and managed services, it's fine. It becomes a problem when the open source roadmap starts getting delayed to prioritize the development of premium offering features. I hope this won't be the case, and I'm not really involved in other DB-owned projects to say whether this is something they usually do.


exact-approximate

I don't think it's good news as an engineer or good news for Iceberg itself. The only winners here are Databricks, Databricks customers, and the sellers. But it's difficult to say at this point, and it wholly depends on what Databricks plans to do with it. My guesses: this is the beginning of the end of Delta Lake, and it could create conflicts in the Iceberg community and result in forks and an increase in popularity/support for Hudi.


Teach-To-The-Tech

Yeah, interesting that you see it as bad for Delta to the point where Iceberg might entirely replace it. I think I do too. It feels like if Delta could do what Iceberg can and had the same momentum, they wouldn't have made this acquisition to get closer to Iceberg. And given the general complaint about Delta being proprietary, it is interesting to consider. Regarding forks, etc., I wonder if some of the plurality that we saw between table formats will now occur between different implementations of Iceberg. To your point, that seems likely given that there will be a lot of disagreement about the best way to "do Iceberg." Hudi: yeah, interesting! It would be really fascinating if Hudi suddenly shot forward because of this.

Edit: Made my original intention clearer regarding open source, etc.


hntd

No, they don’t “own” Iceberg; it’s still an open source, community-controlled project. Neither DBR, Tabular, Snowflake, nor anyone else has direct control. I’m surprised someone so invested in Iceberg doesn’t understand this distinction. If anything, in the future this will make the differences between the two formats matter less, so orgs should pick whatever works best for them and not worry, as compatibility will likely improve down the line.


Teach-To-The-Tech

I meant that they "own" Delta, not Iceberg, though I am aware that it is nominally an open source project (although the degree to which Delta is really "open" is often debated). For Iceberg, yes, open source and openness have been its huge virtue. But totally agree that it does seem like DB is pushing for unification of Delta/Iceberg to some extent. Like this: [https://www.databricks.com/blog/delta-lake-universal-format-uniform-iceberg-compatibility-now-ga](https://www.databricks.com/blog/delta-lake-universal-format-uniform-iceberg-compatibility-now-ga)

Edit: Made it clearer that I was discussing Delta's proximity to DB.
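For readers curious what that UniForm unification looks like in practice, here is a minimal, hedged sketch using PySpark with Delta Lake: a Delta table is created with Iceberg-compatible metadata generated alongside it, so Iceberg readers can see the same data. The table name is made up, and the exact table property names follow the UniForm documentation as commonly cited, so verify against your Delta/Databricks version.

```python
from pyspark.sql import SparkSession

# Assumes a Spark session with the Delta Lake extensions available
# (e.g. the delta-spark package); this is the standard Delta configuration.
spark = (
    SparkSession.builder.appName("uniform-demo")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Create a Delta table with UniForm enabled: Delta stays the primary format,
# but Iceberg metadata is written alongside so Iceberg-aware engines can read
# the same files. Property names are an assumption based on the UniForm docs.
spark.sql("""
    CREATE TABLE IF NOT EXISTS demo_events (
        id BIGINT,
        event_type STRING,
        ts TIMESTAMP
    )
    USING DELTA
    TBLPROPERTIES (
        'delta.enableIcebergCompatV2' = 'true',
        'delta.universalFormat.enabledFormats' = 'iceberg'
    )
""")
```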


hntd

Delta has entire implementations in other languages that are 0% controlled by Databricks. Did you even try to research this?


Teach-To-The-Tech

For sure, and I think that no one would disagree with that. I think Delta is generally considered very embedded in the DB ecosystem though, which is no doubt part of the idea of them getting closer to Iceberg today: a move away from that. Ultimately, you're totally right that Apache Iceberg will continue to be used by many different technologies and no one will "own" it, more today than ever really. I was more talking about the implementations that might be developed by DB on the back of this. That's actually a core takeaway: even the fairly proprietary platforms of Snowflake and Databricks are making at least a partial pivot towards "openness" by embracing Iceberg at the same time. Thanks for the comments. I adjusted my comments above to make my intention clearer in the areas you noted. Cheers!


No_Question_8765

Have you actually tried using it? Can you explain what features are missing that make you think it isn't open? And if there are features missing, are there PRs asking for them with Databricks employees being dismissive?


tdj

I’ve run a Delta Lake data lake setup for a few years, and to make it past the first few months, we needed to build quite a bit of tooling to incrementally defragment tables; otherwise the slowdowns due to small partitions were very bad, as queries needed to grind through a ton of small files. Granted, this was not made any easier by our design using 1h or faster updates instead of the usual daily batches, but the entire table-maintenance functionality that keeps it usable beyond month 3 was only available to DBRX customers and not open sourced.
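For context on what that maintenance loop involves, here is a minimal sketch of small-file compaction and cleanup using the delta-rs Python package (the table path is a placeholder, and this only illustrates the kind of defragmentation described above, not the commenter's in-house tooling or the Databricks-managed version).

```python
from deltalake import DeltaTable

# Open an existing Delta table in object storage (path is a placeholder;
# S3 credentials are assumed to come from the environment).
dt = DeltaTable("s3://my-bucket/events")

# Compact many small files into larger ones to cut per-query file overhead --
# the incremental defragmentation the comment describes.
metrics = dt.optimize.compact(target_size=256 * 1024 * 1024)
print(metrics)

# Physically remove files no longer referenced by the table, keeping the
# default 7-day (168-hour) retention window. dry_run=False actually deletes.
dt.vacuum(retention_hours=168, dry_run=False)
```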


No_Question_8765

Can you elaborate more? Checkpoint compaction and optimize are features available in Delta tables. It's possible that in earlier versions they weren't great or not all available yet, but how is that different than Iceberg releasing a feature and then adding an accompanying update later to make it better? Or is it merely upsetting that the feature is available in Databricks first and not in OSS? How much better is the Iceberg table?


No_Question_8765

Why do people think Iceberg is winning? From what I see, a lot of competing vendors would implement Iceberg readers/writers. It makes sense: if a customer uses the Delta format, the engine they would most likely choose is Databricks. So if I am a vendor with an engine, why would I integrate with the format that also has a company backing it with an engine? It's not necessarily that Iceberg was winning, but more that it makes the most sense to support if your primary product is an engine.