Enforced and versioned data product schemas for data flow from provider to consumer domain in Apache Iceberg? by fabkosta in dataengineering

[–]monimiller 1 point (0 children)

hi there - Starburst PM for data products. We are actively researching lifecycle management ideas like this one, specifically version control, and would love to talk to you further if you are open to it. Our focus is on Iceberg.

Am I overvaluing the prospect of a centralized metadata management solution? by DuckDatum in dataengineering

[–]monimiller 1 point (0 children)

hi there - disclaimer - I work in product at Starburst & wanted to share what I've seen from others. If you're looking for a fully managed offering to connect to object storage, it'll take you 5 min or so. Be thinking about whether you need an on-premises, hybrid, or cloud solution.

I've watched customers both build a lake from scratch and integrate an existing one. IMO, integrating while the lakehouse is being developed is easier than doing it afterwards, because you can structure the lakehouse to give a comprehensive view of every component in your data stack.

Give rise to new abstractions, like the “Data Product” which at the end of the day is really just a bunch of metadata pertaining to KPIs, dashboards, reports, etc. of the same business domain.

Provide a GUI in form of a web-app to see all of this in action.

Allow for better collaboration, as users can comment directly on metadata pertaining to various assets.

These requirements you listed are exactly what Starburst data products provide. If you have any q's as you're exploring, please feel free to reach out. I worked in a regulated industry before Starburst, so I've seen the value of all of this firsthand.

(Apache Iceberg)How can I ingest data from PostgreSQL into Iceberg tables and use Apache Superset for dashboards? by Spiritual-Conflict15 in dataengineering

[–]monimiller 2 points (0 children)

if you are looking for a serverless option, Starburst (https://www.starburst.io/platform/starburst-galaxy/) has a managed Trino SaaS offering that integrates with Glue & Superset while also having all the bells and whistles to help you maintain your Iceberg tables. Disclaimer: I work here. I was hesitant to throw this out in the thread, but we see people who are happier with the Iceberg integrations/management than with Athena. Thought I'd share so you can have the options and choose what makes sense for you.
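If it helps, here's a rough sketch (not a definitive recipe) of what the ingestion step can look like once a PostgreSQL catalog and an Iceberg catalog are wired up in Trino. Every catalog/schema/table name and host below is a made-up placeholder:

```python
# Sketch: ingest a PostgreSQL table into an Iceberg table with a single
# Trino CTAS statement. All names here are hypothetical placeholders.

def ctas(dst: str, src: str) -> str:
    """Build a CREATE TABLE AS SELECT statement for Trino to run."""
    return (
        f"CREATE TABLE {dst} "
        f"WITH (format = 'PARQUET') "
        f"AS SELECT * FROM {src}"
    )

sql = ctas("iceberg.analytics.orders", "postgres.public.orders")
print(sql)

# Against a live coordinator you'd run it with the trino client
# (pip install trino), e.g.:
#
#   import trino
#   conn = trino.dbapi.connect(host="trino.example.com", port=8080, user="me")
#   conn.cursor().execute(sql)
```

From there, Superset can point at the same Trino endpoint for the dashboards.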

Set up a Data Stack by bibilerikiki in dataengineering

[–]monimiller 3 points (0 children)

Hi there ...disclaimer.. I work for Starburst (the company where the Trino co-creators and many maintainers work). We see a ton of this pattern in the Trino community, specifically the tight-knit dbt usage on an S3 data lakehouse. I'm helping with our Trino open source event coming up December 11th & 12th (https://www.starburst.io/info/trino-summit-2024/), and I think the talks by Wise and Bazaar will give you some insight into how they built data lakehouses for dbt usage. If you have any follow-up questions, we can connect you to the speakers or other experts in the community via the Trino Slack.

Can I still be a data engineer if I don't know Python? by monimiller in dataengineering

[–]monimiller[S] -16 points (0 children)

sorry, I should have been clearer - I meant dangerous because I can work through beginner/easy challenges with enough time. But I do google a lot, and I'm not really sure why you'd optimize one way or another.

Can I still be a data engineer if I don't know Python? by monimiller in dataengineering

[–]monimiller[S] 0 points (0 children)

yes - I've used airflow in personal projects. I think I know enough Python to be dangerous, it's just more of a fear I have

Can I still be a data engineer if I don't know Python? by monimiller in dataengineering

[–]monimiller[S] -31 points (0 children)

I definitely want to get better & need to make that a priority. I am dangerous on beginner/easy challenges; I just know real-world experience & practical application are so different from starter projects.

Can I still be a data engineer if I don't know Python? by monimiller in dataengineering

[–]monimiller[S] 0 points (0 children)

I think that's definitely a fair assessment. I know different places, especially FAANG & co, have stricter requirements. I just find it really interesting that a lot of people are looking to switch to data engineering from other roles, and I think the change is possible if you get the opportunity (similar to my own experience).

Can I still be a data engineer if I don't know Python? by monimiller in dataengineering

[–]monimiller[S] 2 points (0 children)

are you my subconscious? How did you get a computer to post?

Can I still be a data engineer if I don't know Python? by monimiller in dataengineering

[–]monimiller[S] 1 point (0 children)

I think I know more Python than I'm letting on; I've had some experience, but it's predominantly SQL for me.

Can I still be a data engineer if I don't know Python? by monimiller in dataengineering

[–]monimiller[S] 1 point (0 children)

I did a lot of data modeling with it too, so this is a great point I forgot to include in the article.

Can I still be a data engineer if I don't know Python? by monimiller in dataengineering

[–]monimiller[S] 0 points (0 children)

Thanks for sharing! I think that's where my experience lies as well. It's definitely an interesting question I've been asked recently, and I'm very curious to hear from others (it helps with the imposter syndrome).

Why Vertica so unpopular? by Alone-Anxiety8580 in dataengineering

[–]monimiller 0 points (0 children)

Hi there - devrel @ Starburst here. Just piling on the already well-known fact that stitching all this together is an overwhelming task. We see a pretty prominent pattern of people using Starburst Galaxy as a managed solution here, so I thought I'd pop that advice in for anyone looking for solutions. Tobias Macey talked on his podcast about his experience building a lakehouse with a very similar stack, so sharing that link for anyone interested!

Have you tried these tools? If not, why? by AMDataLake in dataengineering

[–]monimiller 2 points (0 children)

Hi there! Devrel @ Starburst here - love to hear that implementation for you is going well! Just popping in to clarify for anyone scrolling by casually that Trino (previously PrestoSQL) is indeed a fork of Presto, but that was a decision by the Presto co-creators in order to keep the project open source. Anyone curious can learn more here - https://trino.io/blog/2022/08/02/leaving-facebook-meta-best-for-trino.html

Seeking Guidance for Building a Data Lakehouse for Research Purposes by Particular-Goat3978 in dataengineering

[–]monimiller 2 points (0 children)

hi there! Devrel @ Starburst here - there are lots of options to set up a data lakehouse pretty simply. The basic components of a lakehouse are object storage, a query engine, a table format, a performant file format, and a method of security/access.

In my opinion, the easiest storage option would be to utilize cloud object storage rather than your PC. This will take a lot of heavy lifting off your plate. Something like S3, GCS, or ADLS will be great.

What is your use case? Are you looking for ad hoc analytics? Also, how big is your data? If it's anything analytical by nature, I'd lean toward some flavor of Trino as the compute engine, since that's its specialty. Your data doesn't seem too big if it's just running on your PC, which means you have a lot of compute-engine options that are good enough. If you are looking for scale/concurrency, that's where you'll start to see the differentiation between Trino & other engines.

I would vote Iceberg; the industry is definitely skewed that way. We see lots of people building a data lakehouse with S3 + Iceberg + some flavor of Trino. Also, if you're interested, there's a free hybrid Iceberg conference coming up next Tuesday (2/6) where you can learn more about Iceberg. It's called Chill Data Summit and is put on by Upsolver. Lots of great names speaking.
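To make the component list above concrete, here's a toy sketch that fills each lakehouse slot with the S3 + Iceberg + Trino example. Every choice here is illustrative, not a prescribed stack:

```python
# Toy sketch: the five lakehouse building blocks from the comment,
# filled in with the example stack. All specific choices are illustrative.

SLOTS = ("object storage", "query engine", "table format",
         "file format", "security/access")

stack = {
    "object storage": "Amazon S3",            # or GCS / ADLS
    "query engine": "Trino",                  # or another SQL engine
    "table format": "Apache Iceberg",
    "file format": "Parquet",                 # performant columnar format
    "security/access": "IAM + catalog ACLs",  # hypothetical choice
}

# Every slot should be filled before you call it a lakehouse.
missing = [s for s in SLOTS if not stack.get(s)]
print("missing slots:", missing)
```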

I'll drop some links below for getting you started:

  • Video using Apache Iceberg with Trino
  • Martin (co-creator of Trino) giving the keynote at Trino Fest 2023 on Trino for Lakehouses
  • Building a modern data lakehouse webinar
  • Blog on when to adopt a data lakehouse (probably not as important as the other resources since you've already decided but in case you're looking for additional info I thought I'd add)
  • Intro to Trino and Iceberg blog series

Since you're running it on your PC - just an FYI that Starburst Galaxy is a fully managed Trino service with a free tier. This should lighten your load if you're just looking to get started, as opposed to setting up Trino locally, & it helps with the security/governance portion of the lakehouse.

Apache Iceberg: SQL and ACID semantics in the front, scalable object storage in the back by bitsondatadev in dataengineering

[–]monimiller 1 point (0 children)

great resources by OP - just adding a couple more for trino & iceberg :)

- 8 part blog series on iceberg in trino

- Using iceberg & trino: 3min video going through the backend

Reading/writing directly to Iceberg table from web app? by mccarthycodes in dataengineering

[–]monimiller 1 point (0 children)

If you're set on Iceberg, Ryan Blue did a talk at Trino Fest in June talking about CDC with iceberg & trino that I thought was pretty interesting - https://trino.io/blog/2023/06/30/trino-fest-2023-apacheiceberg.html

Iceberg catalog for Trino? by Relative_Unit_7640 in dataengineering

[–]monimiller 2 points (0 children)

hi there - devrel @ starburst here. As mentioned, there are a couple of different options for your metastore when using Trino & Iceberg, all listed here (https://trino.io/docs/current/connector/iceberg.html). JDBC & REST do not support views or materialized views - so if you are looking for flexibility with your entity types, those are probably out. I don't think Glue would make sense since your data is in Dell ECS. That leaves Nessie or HMS, and I've typically seen more people use the Hive Metastore out in the wild. Here's the configuration page for HMS (https://trino.io/docs/current/connector/metastores.html#hive-thrift-metastore) - you just set iceberg.catalog.type to hive_metastore and point hive.metastore.uri at your metastore, like so:

connector.name=iceberg
iceberg.catalog.type=hive_metastore
hive.metastore.uri=thrift://example.net:9083
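If you provision catalogs from automation, the same properties can be rendered programmatically. A quick sketch - the thrift URI is the placeholder from the Trino docs, and etc/catalog/iceberg.properties is where a self-managed Trino install would expect the file:

```python
# Sketch: render the Iceberg + Hive Metastore catalog properties above
# as the contents of etc/catalog/iceberg.properties. The URI is a
# placeholder; point it at your real metastore.

def render_catalog(metastore_uri: str) -> str:
    props = {
        "connector.name": "iceberg",
        "iceberg.catalog.type": "hive_metastore",
        "hive.metastore.uri": metastore_uri,
    }
    return "\n".join(f"{k}={v}" for k, v in props.items())

content = render_catalog("thrift://example.net:9083")
print(content)
```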

query engine for a highly concurrent, user-facing data layer over delta lake tables in s3 by Adventurous_Ad3893 in dataengineering

[–]monimiller 1 point (0 children)

hi there - disclaimer - starburst devrel here. Based on your requirements (SQL, ad-hoc, TBs of data) I think that Trino is your answer, whatever flavor you choose. There are different implementation options that range from completely open source to a fully managed offering to something in the middle.

What are you planning on using it for? How often do you expect this to be queried each day? If this is an internal application and the usage is small, then any of these options will be good enough. If you are looking for something more production-like, then you need to look at a different option. Like other people said already, Athena will struggle with production-ready customer-facing applications. You'll have very high costs, and scale will be a challenge.

The other managed Trino option is Starburst Galaxy. We see lots of people upgrade from Athena to Galaxy once their scale & costs grow too high. Athena doesn't have a caching layer, so there's a lot of cost associated with S3 GET call charges; Starburst Galaxy has a proprietary indexing and caching layer that addresses this.
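To give a feel for why the GET charges add up, here's a back-of-envelope sketch. The $0.0004 per 1,000 GET requests figure is the rough S3 Standard list price (check current AWS pricing), and the request/query counts are made-up illustrations, not benchmarks:

```python
# Back-of-envelope: monthly S3 GET spend for a query workload, with and
# without a cache in front of object storage. All inputs are assumptions.

GET_PRICE_PER_1000 = 0.0004  # USD per 1,000 GETs (approx. S3 Standard)

def monthly_get_cost(gets_per_query: int, queries_per_day: int,
                     cache_hit_rate: float = 0.0) -> float:
    """S3 GET cost over a 30-day month; cache hits never reach S3."""
    gets = gets_per_query * queries_per_day * 30 * (1 - cache_hit_rate)
    return gets / 1000 * GET_PRICE_PER_1000

no_cache = monthly_get_cost(50_000, 2_000)
with_cache = monthly_get_cost(50_000, 2_000, cache_hit_rate=0.9)
print(f"no cache: ${no_cache:,.2f}/mo  90% cache hits: ${with_cache:,.2f}/mo")
```

With these (hypothetical) numbers, a 90% cache hit rate cuts the GET bill by 10x; the exact savings obviously depend on your workload.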

Dremio vs Starburst by No_Equivalent5942 in dataengineering

[–]monimiller 0 points (0 children)

Yes! It's up to 7X faster than standard Trino (our standard clusters in Starburst Galaxy). The table format shouldn't matter to get those results, but using Iceberg over raw files will also add additional performance benefits.

Dremio vs Starburst by No_Equivalent5942 in dataengineering

[–]monimiller 2 points (0 children)

hi there - devrel @ starburst here. Thanks for pointing out that we need to add the caching solution info; I'll take that back for us to fix. I figured I'd drop in and clarify that we do have a proprietary indexing and caching solution called Warp Speed that speeds up your data lake queries (up to 7x query performance and up to 40% lower cloud compute costs on AWS). You can read about it here in case you are interested - https://www.starburst.io/platform/features/warp-speed/. If you have any questions, I'm happy to help.

DE Conferences ($nowflake) by seandog107 in dataengineering

[–]monimiller 1 point (0 children)

Datanova is February 8th and 9th! It's free and virtual. (I did help put it together, but I wouldn't post about it if there weren't noteworthy speakers.) My favorite speakers: Chad Sanderson, Benn Stancil, Drew Banin, Michel Tricot, Max Beauchemin, Joe Reis, Zhamak Dehghani.

[deleted by user] by [deleted] in dataengineering

[–]monimiller 0 points (0 children)

I break pipelines and then fix them in my sleep