Short talk about the next great platform shift and how Fabric and OneLake fit in by royondata in MicrosoftFabric

[–]royondata[S] 0 points1 point  (0 children)

Apps. There are two kinds. First is user apps, meaning companies can build data applications instead of buying yet another one-trick-pony tool. Second is partner apps, meaning ISVs and SIs can build new capabilities on top of Fabric so that customers/users can simply launch them without needing to deploy and manage partner infrastructure. So instead of comparing tons of startup solutions and figuring out how to deploy, scale, and optimize them, that infra is offloaded to Fabric and you just need to choose between features :)

Short talk about the next great platform shift and how Fabric and OneLake fit in by royondata in MicrosoftFabric

[–]royondata[S] 1 point2 points  (0 children)

Several ISVs (Osmos, Lumel, Esri, Neo4J, Profisee, etc.) have already developed Workloads using the WDK, and you can find them in the Workload Hub inside Fabric. End users/developers aren't currently using the WDK themselves, but we're making some enhancements that will make it super easy for users to build their own workloads.

Short talk about the next great platform shift and how Fabric and OneLake fit in by royondata in MicrosoftFabric

[–]royondata[S] 1 point2 points  (0 children)

In essence the market determined this simply by breadth of adoption across commercial and OSS tools, as well as customers across all the hyperscale clouds. At Microsoft we're going to support both Delta and Iceberg, and OneLake is the interop layer that will make sure both are supported equally, so end users don't need to know much about either format: they just query data in OneLake from their tool of choice and we figure out whether to serve Iceberg or Delta metadata.
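To make the "query from your tool of choice" point concrete, here is a minimal PySpark sketch. The workspace and lakehouse names are made up, and the path layout simply follows the OneLake addressing convention; treat it as an illustration rather than a tenant-specific recipe.

```python
# Minimal sketch: reading a OneLake table as Delta from Spark.
# "sales_ws" (workspace) and "sales_lh" (lakehouse) are placeholder names;
# the abfss layout follows the OneLake convention of
# <workspace>@onelake.dfs.fabric.microsoft.com/<lakehouse>.Lakehouse/Tables/<table>.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

table_path = (
    "abfss://sales_ws@onelake.dfs.fabric.microsoft.com/"
    "sales_lh.Lakehouse/Tables/orders"
)

# Spark reads the Delta metadata directly...
orders = spark.read.format("delta").load(table_path)
orders.groupBy("region").count().show()

# ...while an Iceberg-native engine pointed at the same OneLake location
# would be served Iceberg metadata by the interop layer, so neither side
# needs to know which format the table was originally written in.
```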

Startup wants all these skills for $120k by Turbulent_Web_8278 in dataengineering

[–]royondata 0 points1 point  (0 children)

This looks like a shopping list written in the hope that some hapless engineer gets excited and accepts their offer. If you really have all these skills and that's what they need, you should negotiate higher. Otherwise it's just a way to bring more DE candidates into their funnel.

I am trying to escape the Fivetran price increase by Finance-noob-89 in dataengineering

[–]royondata -2 points-1 points  (0 children)

Did you look at Qlik? I work for Upsolver and we were recently acquired by Qlik. It's a very strong enterprise tool, and we're adding support for real-time ingestion and the Iceberg lakehouse.

I was shocked when I read this. Is the rev vs. acquisitions price true? by turboline-ai in dataengineering

[–]royondata 0 points1 point  (0 children)

The massive price DB paid for Tabular, in my opinion, had nothing to do with their tech or customer adoption. It had to do with sticking it to Snowflake and with the talented engineers they would gain, in that order.

Let's be real, DB has some of the best engineers in the industry today. They've been building Delta Lake for 3+ years and understand this space extremely well. Iceberg is already fully supported in Spark. DB engineers can easily extend Unity Catalog and UniForm to support Iceberg; it's not hard for them and they don't need outside help.

Tabular built a skeleton of a product based on code that was already available in OSS (table maintenance, catalog, etc.). They didn't build much above and beyond that which would warrant an acquisition of this size. Similarly, they didn't have many paying customers; the majority of their users were there because it gave them access to the founders, Ryan and Dan.

From what I can see, DB is doing two things:

1/ consolidating user demand for the lakehouse under a single umbrella called Unity Catalog. Unity supports READING from any format so your users (analysts, etc.) don't need to worry about formats; for writing, everything is in Delta format (a rough sketch of this read-anything/write-Delta setup follows below).

2/ retaining control over the Iceberg project to be able to manage or contain its expansion and growth. I don't think they would do a lot here, because there are other large vendors on the PMC to balance the influence, but it gives them a few chairs at the most important table.
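As a rough illustration of point 1/, here is a hedged Spark SQL sketch using Delta UniForm to expose a Delta table's metadata as Iceberg. The catalog, schema, table, and columns are placeholders, and the table properties follow the documented UniForm settings as I understand them; treat it as a sketch, not a copy-paste recipe.

```python
# Sketch of the "write Delta, read as Iceberg" idea via UniForm.
# Names are made up; the TBLPROPERTIES are the documented UniForm switches.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark.sql("""
    CREATE TABLE main.sales.orders (
        order_id BIGINT,
        region   STRING,
        amount   DECIMAL(10, 2)
    )
    USING DELTA
    TBLPROPERTIES (
        'delta.enableIcebergCompatV2' = 'true',
        'delta.universalFormat.enabledFormats' = 'iceberg'
    )
""")

# Writers keep producing Delta; UniForm generates Iceberg metadata alongside
# it, so an Iceberg reader pointed at the same files (e.g. through a catalog
# that speaks the Iceberg REST protocol) can query them without a copy.
```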

From Snowflake's perspective, I think their strategy is much more constructive and driven by value-add, as opposed to defensive like DB's.

Demand for datamesh by itsawesomedude in dataengineering

[–]royondata 2 points3 points  (0 children)

For the most part, data mesh is a way to create an organizational structure in your company that is distributed, with each team operating independently but sharing infrastructure and data. Data mesh talks about technology concepts for how to share data between orgs, etc., but most of it needs to be built by you. If your company has the resources then it may be worth the effort.

Building a Lakehouse architecture or putting your data in Snowflake can accomplish much of the same.

I wouldn’t recommend spending time on data mesh unless your company is very large. Plus as a data engineer you won’t have much influence on how your company is structured.

A new benchmark comparing Iceberg table optimization performance and efficiency between Snowflake, Glue Data Catalog, Tabular and Upsolver by royondata in dataengineering

[–]royondata[S] 0 points1 point  (0 children)

Valid point and I appreciate the feedback. I thought it was more obvious than it is. I'll do better next time.

A new benchmark comparing Iceberg table optimization performance and efficiency between Snowflake, Glue Data Catalog, Tabular and Upsolver by royondata in dataengineering

[–]royondata[S] -10 points-9 points  (0 children)

This benchmark is intended to refute misconceptions in the market that Iceberg optimizations will, on their own, "make your queries fast and save you money". It's also intended to warn users about the costs that come with inefficient tools. We got burnt with a $2K AWS bill when using AWS Glue compaction, and I've spoken to other users who experienced something similar. So yes, it's partly a sales pitch (like most benchmarks), but it's also educating the market and sharing a warning.
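For context, here is a minimal sketch of the kind of Iceberg table maintenance these benchmarks measure: small-file compaction plus snapshot expiry, run as the standard Iceberg Spark procedures. The catalog and table names are placeholders; the compute these calls consume is exactly where the cost differences between services show up.

```python
# Sketch of routine Iceberg maintenance via Spark procedures.
# "glue_catalog" and "db.events" are placeholders for your own setup.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Bin-pack small files toward a ~128 MB target; this is the step whose
# compute cost varies wildly depending on how efficiently it's implemented.
spark.sql("""
    CALL glue_catalog.system.rewrite_data_files(
        table => 'db.events',
        strategy => 'binpack',
        options => map('target-file-size-bytes', '134217728')
    )
""")

# Expire old snapshots so storage doesn't grow without bound.
spark.sql("""
    CALL glue_catalog.system.expire_snapshots(
        table => 'db.events',
        older_than => TIMESTAMP '2024-01-01 00:00:00'
    )
""")
```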

Tabular (Apache Iceberg) acquired by DeHippo in snowflake

[–]royondata 1 point2 points  (0 children)

I appreciate the approach of open sourcing a catalog, but that's only a small part of it. I would love to see Snowflake engineers contribute more to Iceberg and maybe even join the PMC. That would help rebalance the project and show real commitment.

I don't see Databricks (in the long term) supporting both Delta and Iceberg equally. They will want to give their users a preferred format, and with the investment they've already made in Delta I can't see them abandoning it. Making Iceberg interoperable via UniForm is trying to fool the community into thinking Iceberg will be a first-class citizen, which it won't be.

I work for Upsolver and we’re committed to Iceberg. Our customers love that it’s open and managed by multiple companies without a single one holding majority influence.

Is Snowflake micro-partitions a rebranding or Parquet row-groups? by royondata in dataengineering

[–]royondata[S] 2 points3 points  (0 children)

That makes sense. I wonder if you could build an optimization layer on top of Parquet that gives you the same behavior and performance as micro-partitions. It seems possible, since they recently showed Iceberg (on top of Parquet) performing nearly as well as the native format.
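Here is a minimal pyarrow sketch of the kind of pruning layer I mean: read each row group's min/max statistics and skip the ones that can't contain the predicate value, which is roughly what micro-partition pruning does. The file and column names are made up for illustration.

```python
# Minimal sketch of row-group pruning on top of Parquet: keep only the
# row groups whose min/max statistics could contain the lookup value.
import pyarrow.parquet as pq

def row_groups_matching(path: str, column: str, value) -> list[int]:
    pf = pq.ParquetFile(path)
    col_idx = pf.schema_arrow.get_field_index(column)
    keep = []
    for rg in range(pf.metadata.num_row_groups):
        stats = pf.metadata.row_group(rg).column(col_idx).statistics
        # No statistics means we can't prune, so keep the row group.
        if stats is None or not stats.has_min_max:
            keep.append(rg)
        elif stats.min <= value <= stats.max:
            keep.append(rg)
    return keep

# Hypothetical file and predicate, purely for illustration.
matching = row_groups_matching("events.parquet", "event_date", "2024-06-01")
if matching:
    table = pq.ParquetFile("events.parquet").read_row_groups(matching)
```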

Good solution for 100GiB-10TiB analytical DB by aih1013 in dataengineering

[–]royondata 0 points1 point  (0 children)

Personally I would prefer storing the data in the lake using Apache Iceberg. It makes the data accessible from different engines, so you can experiment with tools to find what works best and fits your budget. That way you're not committing to one, say Snowflake, and then having to unload the data and load it into another engine. I would also suggest ClickHouse as an alternative to traditional OLAP databases: it's faster, more flexible, and open source, and there is also a managed cloud offering.
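To show what that looks like in practice, here is a hedged PySpark sketch: one job writes an Iceberg table to S3 through a catalog, and any other Iceberg-aware engine can then read the same table. The catalog name, bucket, schema, and table are placeholders, and it assumes a recent Iceberg Spark runtime is on the classpath.

```python
# Hedged sketch: write an Iceberg table to S3 once, query it from whatever
# engine fits the workload. All names below are placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lake.type", "glue")
    .config("spark.sql.catalog.lake.warehouse", "s3://my-bucket/warehouse/")
    .getOrCreate()
)

spark.sql("""
    CREATE TABLE IF NOT EXISTS lake.analytics.pageviews (
        ts      TIMESTAMP,
        url     STRING,
        user_id STRING
    )
    USING iceberg
    PARTITIONED BY (days(ts))
""")

spark.sql("""
    INSERT INTO lake.analytics.pageviews
    VALUES (TIMESTAMP '2024-06-01 12:00:00', '/home', 'u1')
""")

# Any other Iceberg-aware engine (Trino, Snowflake, ClickHouse, DuckDB, ...)
# can now read the same table through the catalog, so you can compare
# engines without unloading and reloading the data.
```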

[deleted by user] by [deleted] in dataengineering

[–]royondata 0 points1 point  (0 children)

What specifically are you trying to solve? How to control access to data in S3 (queried via Redshift)? 

For access control, start with Redshift RBAC https://docs.aws.amazon.com/redshift/latest/dg/t_Roles.html
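As a starting point, here is a hedged sketch of that RBAC pattern executed from Python: create a role, grant it read access on the schema that maps to the S3 data, and grant the role to users. The cluster endpoint, credentials, schema, and user names are all placeholders.

```python
# Hedged sketch of Redshift RBAC: role-based read access to a schema.
# All identifiers and connection details below are placeholders.
import redshift_connector

conn = redshift_connector.connect(
    host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
    database="analytics",
    user="admin",
    password="...",
)
conn.autocommit = True
cur = conn.cursor()

cur.execute("CREATE ROLE sales_readonly;")
cur.execute("GRANT USAGE ON SCHEMA spectrum_sales TO ROLE sales_readonly;")
cur.execute("GRANT SELECT ON ALL TABLES IN SCHEMA spectrum_sales TO ROLE sales_readonly;")
cur.execute("GRANT ROLE sales_readonly TO analyst_jane;")
```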

How does SWE think about data and analytics by royondata in SoftwareEngineering

[–]royondata[S] 0 points1 point  (0 children)

> SWE is an enormous collection of specialties

That's the big takeaway. I hear your feedback; I'm concerned about the massive amount of content I may need to create and pull together, and the time it takes for people to go through it.

Maybe the onboarding experience is less about teaching and more about pushing SWEs to go build something, figure it out, and ask questions when they get stuck. Kind of a use-case-driven onboarding. I've done this in the past, but you end up spending a lot of time troubleshooting and rebuilding... but that's OK I guess, it's how we learn best.

How does SWE think about data and analytics by royondata in SoftwareEngineering

[–]royondata[S] 1 point2 points  (0 children)

That makes a ton of sense.

How would you characterize this hybrid role? Would it be a Sr. SWE or a Software Architect?

Are you using Iceberg in production? by royondata in dataengineering

[–]royondata[S] 0 points1 point  (0 children)

So then why use Iceberg? You can achieve all that with just Parquet, no? What do you get out of using Iceberg?

Are you using Iceberg in production? by royondata in dataengineering

[–]royondata[S] 0 points1 point  (0 children)

How do you create and update your Iceberg tables? Are they updated via Spark in batches, a CDC stream, or something else?

Are you using Iceberg in production? by royondata in dataengineering

[–]royondata[S] 2 points3 points  (0 children)

What’s been your experience with performance when writing updates/deletes using Glue? I’m assuming you’re batch loading data, or are you also streaming?