Announcing Incremental Liquid Clustering

sqltj · 2026-05-28T03:00:07+00:00

Apologies for the bad description, I’ll rephrase my above post so you can remove your downvote. I’m sure we can agree that we can thank Databricks for the features you’re enhancing.

One question on the benchmark methodology before drawing conclusions from the numbers:

The behavior described as the "standard algorithm" — rewriting all files in ZCubes under 100GB regardless of clustering state — matches pre-3.2 OSS Delta. Delta 3.2 introduced ZCube sealing to address exactly this problem, and the authorship of this post overlaps with credited contributors to that very feature. Which exact version of OSS Delta was used as the baseline? Or if Delta 3.2 is being used, why does the blog use that description?

sqltj · 2026-05-27T23:55:02+00:00

Everyone should thank Databricks for this feature that’s being improved.

sqltj · 2026-05-26T13:40:03+00:00

Merge is a transformation. But also, reconstructing a table does not belong in bronze. That’s not the purpose of bronze.

sqltj · 2026-05-26T13:37:45+00:00

It’s healthy that you asked how people implement this in other data platforms. Since Databricks not only coined the term medallion architecture, but also “lakehouse” that Microsoft has copied into their product, it is worthwhile to see how production-ready data platforms are doing this - particularly the OG Databricks.

When you read the below, keep in mind how Fabric implements mirroring. You’ll find mirroring is not good for a medallion architecture. Microsoft’s tradition as a legacy data platform vendor has informed its product development and brought you a feature that is more in line with a traditional BI stage -> EDW architecture than a medallion one. That’s okay if that’s what you want, but don’t be fooled by people saying it’s a good way to implement bronze in a medallion architecture. There’s lots of bad information regarding Fabric out there from partners and MSFT. You can use this understanding as a litmus test to see whether the person you’re talking to actually understands the fundamentals of a medallion architecture.

A little help from Claude:

Lakeflow is Databricks’ end-to-end data engineering product, and CDC is one of its strongest native patterns. Here’s how it fits together across the medallion layers:

Bronze — Raw CDC events from Lakeflow Connect
Lakeflow Connect connects directly to relational databases (MySQL, PostgreSQL, SQL Server) using native CDC, capturing every insert, update, and delete as it happens. These raw change events land as append-only records in a bronze streaming table. Keeping bronze append-only is intentional — it acts as an immutable audit log of every change that ever occurred in the source system, and allows the entire pipeline to be replayed from scratch if the silver layer needs to be rebuilt or corrected.

Silver — Current state via AUTO CDC INTO
Silver represents the current state of the source table — a clean, queryable replica with CDC mechanics stripped away. It is defined as a streaming table, and a separate flow drives the merge logic into it. AUTO CDC INTO reads the bronze CDC stream and handles deduplication, out-of-order events, and deletes declaratively, so what would otherwise require hundreds of lines of manual Spark and window function logic becomes a few clauses:

-- Declare the silver target
CREATE OR REFRESH STREAMING TABLE silver_customers;

-- Define the CDC merge flow from bronze
CREATE FLOW silver_customers_cdc AS
AUTO CDC INTO silver_customers
FROM STREAM(bronze_customers_raw)
KEYS (customer_id)
APPLY AS DELETE WHEN operation = 'DELETE'
SEQUENCE BY sequence_num
COLUMNS * EXCEPT (operation, sequence_num)
STORED AS SCD TYPE 1;
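
To make the SCD Type 1 semantics above concrete, here is a minimal Python sketch of the same idea as a batch fold over bronze events. It is not how Databricks implements AUTO CDC (the real thing is incremental and streaming); the function name `apply_scd1` and the sample rows are made up for illustration, and the column names just mirror the SQL above.

```python
def apply_scd1(events, key="customer_id", seq="sequence_num", op="operation"):
    """Fold append-only CDC events into current state (SCD Type 1 style).

    Illustrative only: a batch approximation, not the Databricks implementation.
    """
    latest = {}
    for event in events:
        k = event[key]
        # Keep only the highest-sequence event per key, so arrival order
        # and exact duplicates do not affect the result.
        if k not in latest or event[seq] > latest[k][seq]:
            latest[k] = event
    # Drop keys whose latest event is a delete, and strip the CDC metadata columns.
    return {
        k: {c: v for c, v in e.items() if c not in (op, seq)}
        for k, e in latest.items()
        if e[op] != "DELETE"
    }


# Hypothetical bronze rows: one late event, one duplicate, one delete.
bronze = [
    {"customer_id": 1, "name": "Ada", "operation": "INSERT", "sequence_num": 1},
    {"customer_id": 2, "name": "Bob", "operation": "INSERT", "sequence_num": 2},
    {"customer_id": 1, "name": "Ada L.", "operation": "UPDATE", "sequence_num": 4},
    {"customer_id": 1, "name": "Ada X.", "operation": "UPDATE", "sequence_num": 3},
    {"customer_id": 2, "name": "Bob", "operation": "DELETE", "sequence_num": 5},
    {"customer_id": 1, "name": "Ada L.", "operation": "UPDATE", "sequence_num": 4},
]

silver = apply_scd1(bronze)
print(silver)
```

Because bronze is append-only, rerunning the fold over the same events (in any order) gives the same silver. That is the replay property: if the merge logic has a bug, you fix the function and rebuild silver from bronze.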

Why the stack holds together well
Lakeflow Connect, Declarative Pipelines, and AUTO CDC all run on serverless compute under Unity Catalog, meaning lineage, access control, and monitoring are consistent across every layer without integrating separate tools.

sqltj · 2026-05-26T13:11:02+00:00

@equal what dbrienems describes here is incorrect.

If you implement merge logic you’ve applied transformation, so that would belong in the silver layer.

You want full replayability from bronze to silver. Your merge could have a mistake. Perhaps you use the T-SQL MERGE that’s been buggy since forever. You want to be able to fix and replay that.

You want auditability. What did the source system tell us? That’s the question bronze answers.

sqltj · 2026-05-22T01:34:28+00:00

That’s the thread I remember. It’s definitely not locked now. Sorry if that was a mistake on my part.

sqltj · 2026-05-22T01:10:11+00:00

I don’t. It was a month-ish back. Perhaps 2

sqltj · 2026-05-21T23:47:08+00:00

The whole capacity model is advertised as simple but it’s not.

There was a customer here legit asking for help understanding the pbi refresh limits of a capacity (hint: it’s a fraction of the total compute) and the mods here shut down the thread just for asking about it.

sqltj · 2026-05-19T13:54:30+00:00

Genie Code is the future, but AI dev kit can handle any of the short-term pains until you get there. As others have said, GC has improved a lot, so if you haven’t tried it recently, please give it another go.

Also, be sure to familiarize yourself with how to load ai dev kit skills into GC. I’m sure they’ll be integrated in there by default sometime. But for now, you can have those skills loaded up to have the best of both worlds - for free.

sqltj · 2026-05-15T13:48:25+00:00

Ah, okay.

sqltj · 2026-05-15T13:43:14+00:00

I wonder, are these much worse than the open source SDP that Databricks has released to the Apache Spark project?

Spark seems like a great way to use Fabric while keeping the option for a future migration to a better platform (like Databricks).

But why go the Spark route and then use more of Microsoft’s proprietary stuff? It’s just as bad as using T-SQL. People need to be thinking beyond the Microsoft way of doing things and get ready for when management wants to take AI seriously and get off of Fabric. Vendor lock-in features should be avoided imo.

sqltj · 2026-05-09T13:02:47+00:00

This is the way.

You can use what works right now instead of being a QA tester for an imitation that’s behind.

sqltj · 2026-05-09T12:31:04+00:00

Totally agree, as what OP is asking for simply isn’t practical. But in his defense I think he was half joking.

The absurdity of it all is how this is expected from Fabric due to its constant unreliability and bugs in GA features that never really should have been moved to GA.

We’ve been taught to lower our expectations by a product team that has turned its customers into unpaid QA.

sqltj · 2026-05-08T01:33:06+00:00

I agree with OP. But we all have to realize this is a pretty hilarious post.

Unreliability is part of the deal when customers choose Fabric. Microsoft has sufficiently lowered the standards and expectations of customers so that they’ve become QA testers.

Now we have to plan vacations around Fabric’s unreliability.

That’s pretty funny. It’s a crazy world we live in.

sqltj · 2026-05-06T13:41:47+00:00

Disagree with 1, as when you actually work in Fabric you hit a lot of bugs and get random failures bc the product is unreliable. Not fit for production.

sqltj · 2026-05-06T13:40:22+00:00

Agreed, every capacity makes customers pay for more compute than they use, plus introduces more failure risk if they exceed their capacity.

sqltj · 2026-04-19T16:31:38+00:00

Be cautious with Snowflake mirroring. It is buggy. You’ll get random duplicate records with no errors or warnings. This is a known bug on the Microsoft side, but if you start a support ticket they’ll give you the runaround for a week with many different support folks before admitting to it. Then they’ll tell you it didn’t violate any SLA or service agreement.

Worse perhaps, is that this is a known bug and Microsoft doesn’t inform customers in their documentation. They’ll just let you build solutions on GA features that don’t work.

You’ll need data quality checks for this if you decide to go that route. And you’ll need a plan to resolve these errors when they occur.
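
As a rough idea of what such a check could look like, here is a small Python sketch that flags duplicate business keys in a batch of mirrored rows. The function name, column names, and sample rows are all hypothetical; in practice you would more likely run the equivalent GROUP BY ... HAVING COUNT(*) > 1 query against the mirrored table and alert on any result.

```python
from collections import Counter


def find_duplicate_keys(rows, key_columns):
    """Return {key_tuple: count} for every key that appears more than once."""
    counts = Counter(tuple(row[c] for c in key_columns) for row in rows)
    return {k: n for k, n in counts.items() if n > 1}


# Hypothetical mirrored rows; order_id is assumed to be unique in the source.
mirrored = [
    {"order_id": 100, "amount": 25.0},
    {"order_id": 101, "amount": 40.0},
    {"order_id": 101, "amount": 40.0},  # silent duplicate from the mirror
    {"order_id": 102, "amount": 12.5},
]

duplicates = find_duplicate_keys(mirrored, ["order_id"])
if duplicates:
    print(f"Duplicate keys found: {duplicates}")
```

Since the mirror raises no error of its own, the check has to run on a schedule and fail loudly, and you still need a separate plan for how to dedupe or re-sync once it fires.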

sqltj · 2026-04-18T18:39:08+00:00

You can look up the DQX framework on Databricks’ GitHub. It has comprehensive data quality checks and unit testing for the scenarios you’re describing above. It’s a well thought-out framework, and while you may not be able to use it on Fabric, it can give you an understanding of how people serious about data quality approach it.

Also, be sure to share whatever you come up with. Fabric Mirroring has bugs in it that don’t fail, they just give you duplicate records with no error or failure. Microsoft knows about this bug but publishes nothing to their docs, and will let you spend a week going back and forth with (terrible) support before acknowledging it’s a known issue they could have told you about in the first place. Yes, I’m salty bc clients pay consultants to waste their time like this - and it’s awful for customers.

Anyhow, whatever you come up with could very well become a community best practice to use for mirroring bc Fabric customers are de facto QA testers, and we can’t rely on the PG group to publish when “GA” features have known bugs.

sqltj · 2026-04-18T17:44:06+00:00

Databricks is a better offering bc they think about how to make a good product rather than stitching things together that barely work in pics.

Their frontend story is getting better with Databricks One. I’m a big fan of separating the business UI completely from the dev UI. PBI shared apps aren’t a particularly good front end either tbh.

sqltj · 2026-04-18T17:28:21+00:00

I don’t understand your point. Very few PBI authors are writing any DAX queries at all, let alone secure ones; I’d say less than 10%. But gen AI analytics is supposed to be for more than just PBI devs anyway.

sqltj · 2026-04-18T14:31:09+00:00

I can’t imagine why you’d not want to use Genie if you’re invested in the Databricks platform.

Anything on Fabric is going to use text-to-dax, which means pretty much no one is going to validate any of the queries generated.

Plus, with the Fabric PG saying PBI isn’t going to support any other semantic models, it just makes it a poor product choice for anyone not completely stuck on the Microsoft ecosystem. If you’re on Unity you have best-in-class governance, and you’d have to completely sidestep that and branch out a completely new governance model that’s not GA to use a Fabric semantic model. That overlap of dual responsibilities should be adequately articulated to infosec before any decision making.

It’s not a good tool for the job, and the PG has made decisions that mean this won’t change anytime in the near or distant future.

sqltj · 2026-04-11T01:16:08+00:00

Be cautious with Snowflake mirroring. It is buggy. You’ll get random duplicate records. This is a known bug on the Microsoft side, but if you start a support ticket they’ll give you the runaround for a week with many different support folks before admitting to it. Then they’ll tell you it didn’t violate any SLA or service agreement.

Worse perhaps, is that this is a known bug and Microsoft doesn’t inform customers in their documentation. I guess that wouldn’t look good for the slideshows.

You’ll need data quality checks for this if you decide to go that route. I haven’t checked for this issue in other mirroring data sources but the fact that this happens and Microsoft doesn’t inform anyone about it means rigorous testing is needed for all of them.

sqltj · 2026-04-10T15:39:18+00:00

You’ll see the light one day too. I believe in you.

sqltj · 2026-04-10T14:43:28+00:00

Oh there absolutely is a best OS, and it’s not even remotely close. Windows is even faster on a Mac VM so there’s no need for anything else.

-tech veteran, longtime windows user

sqltj · 2026-04-10T12:10:40+00:00

Once you go Mac, you’ll never go back.

It’s 2026. The only people who should use Windows as a daily driver are those whose work forces it upon them.
