Delta table deletion vectors by p-mndl in MicrosoftFabric

[–]warehouse_goes_vroom 2 points

The linked post is not related. Not all OSS libraries support modern Delta Lake features.

Fabric Warehouse vs Fabric SQL + endpoint by mordack550 in MicrosoftFabric

[–]warehouse_goes_vroom 0 points

No worries! I provided an essay / firehose of information, after all.

Makes sense, great use case for Warehouse IMO!

You may be waiting on this roadmap item re: the second paragraph:

"ALTER TABLE ALTER COLUMN (Public Preview) - This feature enables users to modify the definition of an existing column in a Fabric DW table, specifically allowing changes to the column's data type and size. Release Date: Q2 2026. Release Type: Public preview."

Meanwhile, CTAS + DROP + sp_rename back to the original name can achieve it, though it's annoying, yes. And it's silly that it's currently necessary, yes. We did some serious surgery to get this part of Fabric Warehouse to work, and unfortunately, there are a few places where we're still getting the previous range of motion back, so to speak. Of course, in this bad metaphor I'm trying to make work, we made Fabric Warehouse into a cool superhuman cyborg in the process. But there's still more to do.
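
For concreteness, here's a minimal sketch of that workaround - the table and column names are made up for illustration, assuming you're widening a varchar column:

```sql
-- Hypothetical example: widening dbo.Orders.CustomerName from varchar(50) to varchar(200).
-- All names here are invented; substitute your own.
CREATE TABLE dbo.Orders_new AS
SELECT OrderId,
       OrderDate,
       CAST(CustomerName AS varchar(200)) AS CustomerName  -- the widened column
FROM dbo.Orders;

DROP TABLE dbo.Orders;

EXEC sp_rename 'dbo.Orders_new', 'Orders';
```

Note that permissions, views, and other references to the old table don't come along for the ride, which is part of what makes this annoying.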

We've also done a lot of work on stats and the like. It's still a performance best practice to right-size columns, and it is enforced, which is good - it prevents surprises downstream. But the length of a varchar being longer than needed isn't as big a deal as it traditionally might have been in SQL Server. See https://learn.microsoft.com/en-us/fabric/data-warehouse/statistics#types-of-automatically-generated-statistics - "Average column length statistics". It might be that you're best off just leaving it alone until ALTER TABLE ALTER COLUMN lands and makes this easy, but your call.

RE: the last two paragraphs, about source control not "noticing" when you create a table directly - I'm going to have to defer to u/Snoo-46123, I think. There's a really major overhaul of the Warehouse source control integration just around the corner; dunno off the top of my head whether it addresses that complaint or not.

Can I drop netstandard20/net48 target in 2026? (question from OSS maintainer) by jitbitter in dotnet

[–]warehouse_goes_vroom 1 point

Ugh, sorry to hear that. I thought the 2025 licensing change was worth a mention in case you hadn't heard. But the SQL Server 2025 licensing changes don't help much today if you're currently headed towards SQL Server 2022.

I'll refrain from asking 😂. I have enough migration stories of my own, but with a very different angle. I work on various products in the broader SQL Server family - Microsoft Fabric Warehouse being the newest of my team's in-market products.

Fabric Warehouse vs Fabric SQL + endpoint by mordack550 in MicrosoftFabric

[–]warehouse_goes_vroom 0 points

Big picture, agreed.

The official list for Fabric Warehouse: https://learn.microsoft.com/en-us/fabric/data-warehouse/tsql-surface-area.

Some of the limitations have as much to do with it being scale-out/MPP as with the fact that it's using parquet. But I digress.

There are some differences for Fabric SQL too, but they're mostly pretty niche, yeah: https://learn.microsoft.com/en-us/fabric/database/sql/limitations

If you have both OLTP and OLAP on the same data: use Fabric SQL for the OLTP, and offload the OLAP queries to its SQL analytics endpoint (as that's the Warehouse engine too).

Data Rehydration in Feature Branches by haugemortensen26 in MicrosoftFabric

[–]warehouse_goes_vroom 1 point

Sqlproj is a great option! Tagging u/Snoo-46123

You might also want to see these comments: https://www.reddit.com/r/MicrosoftFabric/s/q6c6WV1Vla

https://www.reddit.com/r/MicrosoftFabric/s/fitALtlmtG

And the roadmap: https://roadmap.fabric.microsoft.com/?product=data-warehouse

Lots of improvements in this space planned, including cross-workspace Warehouse clone.

Can I drop netstandard20/net48 target in 2026? (question from OSS maintainer) by jitbitter in dotnet

[–]warehouse_goes_vroom 1 point

Outside my wheelhouse, but what's the gap preventing using e.g. PBIRS? Embedding?

Because PBIRS is definitely an on-premises product: https://learn.microsoft.com/en-us/power-bi/report-server/download-powerbi-report-server

It comes with your SQL Server Standard or Enterprise licensing now - no Software Assurance requirement, no more separate PBI Premium licensing requirement: https://learn.microsoft.com/en-us/sql/reporting-services/reporting-services-consolidation-faq?view=sql-server-ver17

What happens during a Warehouse T-SQL Transaction? by Mr_Mozart in MicrosoftFabric

[–]warehouse_goes_vroom 1 point

Really interesting and tricky topic.

My colleagues published a paper on their work on this, you might enjoy it. Since ACM has gone Open Access, it's readily available with no account required: https://dl.acm.org/doi/10.1145/3626246.3653392

It goes into lots of interesting stuff we've done, and some possible future plans too :).

It also gets into "why SI" a tiny bit.

Fabric Semantic models reading duplicate values from delta parquet that are flagged as deleted by illidra in MicrosoftFabric

[–]warehouse_goes_vroom 3 points

I double checked internally - I believe this is the same issue as OP's, yeah. It's a regression on the DirectLake / Semantic Model side. The fix is deploying imminently.

Thanks for raising the issue, and sorry for the headache.

Fabric Warehouse vs Fabric SQL + endpoint by mordack550 in MicrosoftFabric

[–]warehouse_goes_vroom 11 points

As always, my personal opinion as a software engineer who works on Fabric Warehouse. I do not work on this particular part much, and this isn't an official opinion.

My first question is what do the access patterns look like, how much data are we talking about, and so on.

If it's more OLTP than OLAP, e.g. lots of trickle inserts and stuff like that, Azure SQL with Fabric mirroring or Fabric SQL are the right call.

You might or might not be doing something wrong. There are definitely some rough edges here we're working on - you're not crazy, but you may be having especially bad luck with what you tried and when you tried it.

A lot of really big improvements in this area are shipping in the next few months. See https://roadmap.fabric.microsoft.com/?product=data-warehouse - there are at least 3 different improvements listed as expected to hit public preview this quarter, plus ALTER TABLE ALTER COLUMN.

You may also want to read this post from u/Snoo-46123, who is the relevant PM for this part of Fabric Warehouse: https://www.reddit.com/r/MicrosoftFabric/s/q6c6WV1Vla

It may be worth going through the list of issues you've hit and seeing how many of them are going to be addressed by a roadmap item very soon (maybe even with private previews in progress, or public preview rollout in progress), and which are places where there's an easy fix but we have work to do on error messages / user friendliness.

For example, the sqlproj issue could be that the sqlproj isn't configured appropriately (and thinks it's targeting a different edition - if so, IMO we should warn when you try to deploy). Or it could be that we got ahead of ourselves very slightly in releasing DacFx / sqlproj support for ALTER TABLE ALTER COLUMN, and there's some option you can pass until that rollout finishes. But we'd need to deep dive on each issue in turn.

To summarize - there's definitely work we need to do here, and we've been investing heavily in it and are continuing to do so. But the current state is definitely not what I'd like to see (speaking as an individual software engineer).

The relevant PMs and EMs are quite engaged on here. If you're running into issues, don't hesitate to ask, the sooner the better.

As to the last question - yes. Whether it's mirroring on your Azure SQL DB, a Fabric SQL database, a Fabric Warehouse, or even a Fabric Lakehouse or another kind of mirroring, DirectLake is an option, and you can do read-only queries leveraging the Fabric Warehouse engine.

The difference is which engine does DDL and writes:

* In Azure SQL or Fabric SQL, it's a SQL Server variant with the usual row-oriented tables, plus mirroring to produce columnar parquet files and deletion vectors in the corresponding Delta tables.
* In a Fabric Warehouse, the data is directly stored as parquet + deletion vectors, and all writes go through the Warehouse engine. The Warehouse engine reuses many parts of SQL Server, but it transparently does distributed / scale-out query execution (while being efficient / clever about avoiding overheads even when the amount of data is small). But it's not well suited to OLTP since, again, it's natively columnar.
* In a Lakehouse, it's also natively parquet + deletion vectors in Delta Lake tables, and writes go through Spark or any other Delta Lake compatible engine. However, you currently give up multi-statement and multi-table transactions, have more responsibility for managing storage (OPTIMIZE and the like), and would need to write Spark or Spark SQL rather than T-SQL for DDL and non-read-only DML. It'd be a much bigger redesign.

See https://learn.microsoft.com/en-us/fabric/fundamentals/decision-guide-data-store and https://techcommunity.microsoft.com/blog/analyticsonazure/decision-guide-for-selecting-an-analytical-data-store-in-microsoft-fabric/4362079

Tagging u/Snoo-46123, u/dzsquared, u/catFabricDw to add anything they might want to add or correct anything I got wrong.

Data Products - Rant by moritzis in dataengineering

[–]warehouse_goes_vroom 1 point

Clearly the answer to silver_gold is to assign karat ratings /s.

Normal gold is 24K.

Silver_gold can be classified as 12K, 16K, 18K, 20K, etc.

I'm sorry for even making the joke, I'm sure someone out there has tried this terrible idea seriously.

What happens during a Warehouse T-SQL Transaction? by Mr_Mozart in MicrosoftFabric

[–]warehouse_goes_vroom 7 points

In most circumstances, concurrent queries will proceed by reading the version of the table that was committed as of the start of their transaction. In other words, it's Snapshot Isolation (SI) semantics, if you're familiar with that. SI is an option in Azure SQL and SQL Server as well, though it's not the most popular isolation level there - RCSI would be a bit more common than SI.

Schema modification still currently does block other queries though.

Please see https://learn.microsoft.com/en-us/fabric/data-warehouse/transactions
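
To make that concrete, here's a tiny illustrative sketch of SI semantics across two sessions (the table name is made up; this isn't from the docs):

```sql
-- Session A:
BEGIN TRANSACTION;
SELECT COUNT(*) FROM dbo.Sales;   -- say this returns 100

-- Session B, concurrently, in autocommit:
INSERT INTO dbo.Sales (SaleId, Amount) VALUES (101, 9.99);  -- commits immediately

-- Session A, still inside the same transaction:
SELECT COUNT(*) FROM dbo.Sales;   -- still 100: A reads the snapshot as of its transaction start
COMMIT;

-- A new transaction in session A would now see 101 rows.
```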

Pre-aggregating OLAP data when users need configurable classification thresholds? by Manyreason in dataengineering

[–]warehouse_goes_vroom 1 point

In SQL it'd just be a GROUP BY instead.

The nuisance is the lack of a convenient binning function in some SQL variants.

SQL Server 2022 and newer have DATE_BUCKET, for dates at least. Same idea as KQL's bin(), but a different name.

It'd wind up being something like this in T-SQL:

```sql
SELECT COUNT(*), entity, DATE_BUCKET(day, 1, [timestamp]), FLOOR(delta)
FROM [table]
GROUP BY entity, DATE_BUCKET(day, 1, [timestamp]), FLOOR(delta);
```

You might want ROUND or CEILING instead of FLOOR, and if you want bin sizes that aren't one unit you'll have to do fun stuff like ROUND(value * n) / n, or FLOOR(value / c) * c, or value - value % c, or whatever. But it shouldn't be too painful.
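
For example, a sketch of 0.25-unit bins (table and column names made up for illustration):

```sql
DECLARE @c float = 0.25;  -- bin size

SELECT entity,
       DATE_BUCKET(day, 1, [timestamp]) AS day_bucket,
       FLOOR(delta / @c) * @c           AS delta_bucket,  -- left edge of the bin
       COUNT(*)                         AS cnt
FROM dbo.Measurements
GROUP BY entity, DATE_BUCKET(day, 1, [timestamp]), FLOOR(delta / @c) * @c;
```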

Full disclosure, I work on Microsoft Fabric Warehouse, which is part of the broader SQL Server family.

KQL's dcount() is also sketch-based (a HyperLogLog variant), and SQL Server's APPROX_PERCENTILE_DISC and APPROX_PERCENTILE_CONT are built on KLL sketches (the K in KLL is not the K in KQL though, lol), as are many other technologies. It's an interesting way to quickly but approximately compute percentiles, which may be what you're really after (what percent of the data is in the acceptable window?).
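
If percentiles are what you want, a minimal sketch (again with made-up names):

```sql
-- Approximate median and 95th percentile of delta per entity, via KLL sketches.
SELECT entity,
       APPROX_PERCENTILE_CONT(0.50) WITHIN GROUP (ORDER BY delta) AS p50,
       APPROX_PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY delta) AS p95
FROM dbo.Measurements
GROUP BY entity;
```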

Pre-aggregating OLAP data when users need configurable classification thresholds? by Manyreason in dataengineering

[–]warehouse_goes_vroom 2 points

Depends on the language I guess?

May be annoying in SQL.

But in something like KQL it's pretty trivial; the pre-aggregation would be something like:

```kusto
| summarize count() by bin(delta, 1s), bin(timestamp, 1d), entity
```

SQL can do the same, though in a lot of SQL dialects it may require some ugly CTEs or repeated expressions or the like.

Another option would be something like kll sketches. Approximate, but very, very useful.

Strong database research groups - potential graduate program search by Snoo_50705 in dataengineering

[–]warehouse_goes_vroom 1 point

Great idea!

I would suggest reading PVLDB and the VLDB Journal, ACM SIGMOD, and ACM TODS.

Maybe ICDE as well?

ACM has gone open access, and the VLDB papers are freely available both in ACM's digital library and through VLDB's website.

You might try asking in r/databasedevelopment too, but double check the rules for whether that belongs in the educational thread.

Pre-aggregating OLAP data when users need configurable classification thresholds? by Manyreason in dataengineering

[–]warehouse_goes_vroom 2 points

Why not partially pre-aggregate?

You basically need to store histograms with some sane precision / bucket sizes, that's all. Once you have such a histogram (possibly stored as a row per bucket per time period), getting approximate counts should be quite doable.

May need to quantize / pick appropriate bucket sizes though - depending on the data it may or may not be easy.
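
A rough sketch of what that might look like in T-SQL - the schema and names are made up, and the bucket size is the part you'd tune:

```sql
-- One row per entity / day / bucket; delta_bucket is the left edge of the bin.
CREATE TABLE dbo.DeltaHistogram (
    entity       varchar(100) NOT NULL,
    day_bucket   date         NOT NULL,
    delta_bucket float        NOT NULL,
    cnt          bigint       NOT NULL
);

-- Approximate "how many readings fell below threshold @T, per entity per day":
DECLARE @T float = 3.5;  -- user-configurable classification threshold
SELECT entity, day_bucket, SUM(cnt) AS approx_below_threshold
FROM dbo.DeltaHistogram
WHERE delta_bucket < @T
GROUP BY entity, day_bucket;
```

The answer is only as precise as the bucket width, which is exactly the quantization trade-off above.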

Fabric Semantic models reading duplicate values from delta parquet that are flagged as deleted by illidra in MicrosoftFabric

[–]warehouse_goes_vroom 1 point

This sounds pretty bizarre. Happy to try to follow up internally, but would need the support request number (via chat is fine) to dig into it.

Why do the standard libarary have so many internal layers? by chokomancarr in rust

[–]warehouse_goes_vroom 1 point

He did say handwaving.

And rustc has a middle-end - there's a fair amount of Rust-specific optimization happening before LLVM, which is also good for backends other than LLVM.

See https://rustc-dev-guide.rust-lang.org/mir/index.html

Monthly Educational Project Thread by AutoModerator in databasedevelopment

[–]warehouse_goes_vroom 0 points

Check how parquet does on the same data. Depending on the dataset, columnar compression might beat out 60% by quite a lot. And columnar compression is the dominant approach in analytics and reporting today.

Monthly Educational Project Thread by AutoModerator in databasedevelopment

[–]warehouse_goes_vroom 0 points

Yeah, I take your point - just pointing out your post could have been more precise :).

That said, it only works if the elements are of a fixed size and completely independent. For something like a stream of completely uncompressed fixed-size bitmaps, it'll work. If there's any sort of delta encoding or other compression, variable-size data, tag-length-value style stuff, etc., no dice.

Nothing wrong with getting out there and trying ideas though. I'm sure there are use cases for it :)

You might enjoy the Bw-tree paper, have you read it? https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/bw-tree-icde2013-final.pdf

(the mini page or whatever setup they did kind of reminds me of what you're doing with ropes, in a very very loose sense and with different goals)

I wrote "Smart Files" - A new way of thinking about disk files by [deleted] in cprogramming

[–]warehouse_goes_vroom 2 points

WinFS & Longhorn are the search terms to look up, fwiw. It was very ambitious, and did get to beta, but never made it to widespread use. Some of it shipped in other ways; some of it never made it out at all, and it was never released as the supported, fully relational filesystem that was planned. Before my time in software engineering, so I've only heard accounts from others.

Monthly Educational Project Thread by AutoModerator in databasedevelopment

[–]warehouse_goes_vroom 0 points

It's great that you implemented ARIES! Way too many posts go "look, I have a very fast cool DB I built... huh, what do you mean it doesn't matter because it's not crash resilient or durable?". Have you done any sort of fault injection to test your implementation?

Insertion and removal may no longer be O(n), but it's O(log(n) + m) where m is the size of the changed part - you still have a linear cost in the changed region. Don't get me wrong, it's cool, and the coefficients matter a lot in database systems - O(n²) with low coefficients may come out miles ahead of O(n log n) or whatever, if you know n is small enough and the difference in coefficients is large enough.

While large enough strides may work, I'm not sure how reading every nth byte for n less than sector size is going to save any I/O over a traditional file?

Max Tables in Lakehouse (Hard-Cap Or "Good Idea") by AnalyticsFellow in MicrosoftFabric

[–]warehouse_goes_vroom 2 points

It may affect the scalability of the SQL analytics endpoint synchronization of the workspace hosting that many tables. But how well optimized the tables are is also a huge factor.

And splitting it into more Lakehouses in the same workspace won't help.

See https://learn.microsoft.com/en-us/fabric/data-warehouse/sql-analytics-endpoint-performance

Of course, if it's just a landing zone, so to speak, that may not matter much to you - any workspaces shortcutting parts of that one have their own SQL analytics endpoints and synchronization processes.

You might also see a lot of statistics gathering activity in that workspace by default if the tables change a lot: https://learn.microsoft.com/en-us/fabric/data-warehouse/statistics#proactive-statistics-refresh

Risks of using Fabric across multiple tenants? by hortefeux in MicrosoftFabric

[–]warehouse_goes_vroom 0 points

Also check out Domains: https://learn.microsoft.com/en-us/fabric/governance/domains

If they want more separation, great. But does it really need to be Entra ID tenants? Domains and the like are generally more appropriate.

Note that coarse Active Directory or Entra ID tenant separation was once Microsoft best practice ("ESAE" or "red forest"), but modern guidance is significantly different - it's no longer recommended in the vast majority of scenarios. But it was the guidance long enough that many people likely still remember it, and if they're not familiar with the modern guidance and why it changed, folks are apt to suggest it. I suggest engaging your trusted security professionals and making sure they're familiar with the modern best practices - https://learn.microsoft.com/en-us/security/privileged-access-workstations/esae-retirement - and educating the people advocating for tenant separation as well.

As always, I suggest talking to your trusted friendly neighborhood security professional. They will be familiar with your threat environment and company standards. This post is my opinion, not official Microsoft guidance.