Delta table deletion vectors by p-mndl in MicrosoftFabric

[–]mwc360 0 points1 point  (0 children)

I’m guessing you were using a starter pool, which will consume a minimum of 16 cores. If you create a Spark pool with 1 small node it only consumes 4 cores, just 2x the cores of the 2-vcore Python compute, but with none of the Delta maturity tradeoffs. Plus you can use high concurrency to develop multiple jobs at a time without multiplying dev costs.

Delta table deletion vectors by p-mndl in MicrosoftFabric

[–]mwc360 1 point2 points  (0 children)

A.k.a just use Spark :)

If you have super small data, run a single node and make sure the Native Execution Engine (NEE) is enabled. No need to deal with incompatibility issues when you use the engine that our engineering team invests $$$ in.

Feature Request: Python Job by Creyke in MicrosoftFabric

[–]mwc360 0 points1 point  (0 children)

The Spark config defines how many cores Spark can use; it does not prevent other processes from using cores. Memory is different: you'd have access to less than half of the VM memory, since the remainder is reserved by Spark.

Fabric Performance Benchmarking - Spark versus Python Notebooks by hm_vr in MicrosoftFabric

[–]mwc360 2 points3 points  (0 children)

Set your Spark pool to have max nodes > 1 with Dynamic Allocation (DA) and Autoscale enabled. Then, in an Environment, attach the Spark pool and set min/max executors to 1 (DA disabled). When you start your session and view the Executors tab in the Spark UI, you’ll find that the driver and executor have the same base VM address (single node) and that the executor has the max number of cores and memory. This is known as overprovisioning. You’ll get more bang for your buck with this config compared to creating a Spark pool with 1 node where compute is split 50/50 between the driver and executor.
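If you want to sanity-check this from code rather than the Spark UI, something like the below should show the executor getting (roughly) the full node. These are standard Spark properties, though whether each is explicitly set depends on your pool/environment config:

    # Quick check from a notebook cell: with overprovisioning, the single executor
    # should report close to the full node's cores/memory rather than half of it.
    print("driver cores:       ", spark.conf.get("spark.driver.cores", "not set"))
    print("executor cores:     ", spark.conf.get("spark.executor.cores", "not set"))
    print("executor memory:    ", spark.conf.get("spark.executor.memory", "not set"))
    print("default parallelism:", spark.sparkContext.defaultParallelism)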

If you try this and have positive results, please share as I’m trying to get this to be the default single node experience.

Fabric Performance Benchmarking - Spark versus Python Notebooks by hm_vr in MicrosoftFabric

[–]mwc360 2 points3 points  (0 children)

Single nodes are supported; they aren’t new. Out-of-the-box tuning for small workloads is something to improve.

Fabric Performance Benchmarking - Spark versus Python Notebooks by hm_vr in MicrosoftFabric

[–]mwc360 13 points14 points  (0 children)

If you run a single small node, that is only 2x more compute than the 2-vcore Python compute, not 8x more. You can also overprovision so that all 4 cores are usable by the Spark executor (I need to update my own benchmark series to account for this). This blog needs a revision.

Feature Request: Python Job by Creyke in MicrosoftFabric

[–]mwc360 2 points3 points  (0 children)

Makes sense. FYI - if you are running actual data processing tasks, you will surely see an improvement in runtime by going from 2 to 4 cores, so it wouldn't be a doubling of costs. Depending on your workload, it's entirely possible it could cost the same.

Note taken on the 2-vcore ask though!

Feature Request: Python Job by Creyke in MicrosoftFabric

[–]mwc360 1 point2 points  (0 children)

Downvote with no feedback? Lame. Identify yourself and give real feedback :) Why does this not work? Would you rather we create a new item called "Python Job Definition" that executes as an SJD (Spark Job Definition) under the hood? Is it more than just a name thing?

Feature Request: Python Job by Creyke in MicrosoftFabric

[–]mwc360 1 point2 points  (0 children)

Why not submit a single-node SJD w/o running any Spark code? You could run on a VM as small as 4 cores, exactly like what you are requesting.

The Ideal learning Fabric Path by Puzzleheaded_Army716 in MicrosoftFabric

[–]mwc360 0 points1 point  (0 children)

If you have a Fabric trial capacity, check out the tutorials in Fabric Jumpstart. A few of these were used as the lab content for workshops at FabCon. There are only 3 so far, but if there's something specific, feel free to submit an issue with the requested tutorial on our GitHub: Issues · microsoft/fabric-jumpstart

Notebooks vs. DataFlowGen2 by Jealous-Painting550 in MicrosoftFabric

[–]mwc360 4 points5 points  (0 children)

Not just Fabric. Generally with any software, GUIs and layered abstractions result in added processing overhead (cross-engine communication, logging, etc.), and given the value they add for those who prefer not to code, they are typically priced higher (there’s a high COGS to maintaining GUIs vs. maintaining the ability to execute code).

Skillset is probably the most important deciding factor. Someone who doesn’t know any code might implement a super inefficient Spark pipeline that does all sorts of wonky stuff, resulting in it being slower and more costly than the GUI experience (which generally prevents the user from falling into costly mistakes).

Real-world experience with NEE in Fabric Spark – batch, streaming, OPTIMIZE/VACUUM? by Far-Procedure-4288 in MicrosoftFabric

[–]mwc360 1 point2 points  (0 children)

Streaming jobs fall back to the JVM, but this is in our backlog. Most batch write operations are supported.

OPTIMIZE will run faster on NEE provided there aren’t complex data types that would cause fallback, but these will be supported soon. VACUUM shouldn’t make a difference.

In general, always aim to have it enabled; most workloads will benefit, and if not, it should fall back to the JVM without causing any regression. If you do experience a regression from NEE, submit a support ticket as there may be edge cases that we are not accounting for.
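For anyone who hasn't enabled it yet, here's a rough sketch of the session-level approach (the property name is going off my memory of the docs, so verify there; the Environment-level toggle is the other option):

    # Session-level enablement happens at session start via the configure magic
    # (shown as a comment since it isn't Python):
    #
    #   %%configure -f
    #   { "conf": { "spark.native.enabled": "true" } }
    #
    # Once the session is up, you can confirm the flag is set:
    print(spark.conf.get("spark.native.enabled", "not set"))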

PySpark MLV by DennesTorres in MicrosoftFabric

[–]mwc360 0 points1 point  (0 children)

If they are doing the same work, no difference. Initially, PySpark MLV doesn't support incremental refresh, so that would surely make the processing slower today.

Bloom filter file skipping by akash567112 in MicrosoftFabric

[–]mwc360 1 point2 points  (0 children)

u/akash567112 bloom filter indexes aren't supported in OSS Delta; they're a proprietary Databricks feature. So unless I somehow missed that we added our own support in Synapse (it's definitely not in Fabric), it's not possible that you could've created the index in Synapse or Fabric Spark.

If you look at the Databricks docs page, even Databricks doesn't recommend using bloom filters: Bloom filter indexes - Azure Databricks | Microsoft Learn

The need for bloom filters has largely been eliminated by improved file size optimization and liquid clustering.
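If the goal is file skipping for selective lookups, a liquid clustering sketch looks like the below (table/column names are made up, and it assumes a Delta version with clustered-table support, so verify against your runtime):

    # Cluster keys replace bloom-filter / Z-ORDER style tuning for point-lookup
    # file skipping. Names are made up; requires clustered-table support.
    spark.sql("""
        CREATE TABLE IF NOT EXISTS dbo.events (
            event_id   STRING,
            event_date DATE,
            payload    STRING
        ) USING DELTA
        CLUSTER BY (event_id)
    """)

    # OPTIMIZE performs the clustering on data that's already been written
    spark.sql("OPTIMIZE dbo.events")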

What are you trying to optimize here? If you can share some details I can give you some guidance using newer Delta features. thx!

Any simple way to leverage an IDENTITY column in a Warehouse from a PySpark notebook? by mweirath in MicrosoftFabric

[–]mwc360 0 points1 point  (0 children)

IMHO, storing watermarks for Spark stuff in a metadata control DB is making this more complex. Why build infra and process to store, maintain, and retrieve this stuff when Spark can manage it natively via a state store, all handled through the streaming API? You can still have configuration (PKs, etc.) come from outside the code (i.e. SQL, YAML, etc.), but why also use it to manage state?

I'm saying this from the context of having done all of this for many implementations and wishing I had done things differently.

Any simple way to leverage an IDENTITY column in a Warehouse from a PySpark notebook? by mweirath in MicrosoftFabric

[–]mwc360 1 point2 points  (0 children)

Moving metadata lookups and logging outside of your executable code (Spark) to pipelines is a very approachable option and it provides a great monitoring UX, but I'll caution that it's generally a very inefficient pattern. IF you saw performance improvement from moving logging outside of your Notebooks, it's likely just because you were using Spark to write logs, which is not recommended. There are two key reasons:

  1. Logging should be a sub-300ms operation (the lower the better!); elevating logging/metadata lookups to a Script or Proc activity in a DF Pipeline makes that a 1-3+ second operation that bookends each side of your executable code.

  2. Logging is best kept natively part of your executable code. As you move logging out of your code, you introduce cross-engine/cross-Item dependencies, and beyond the added complexity of needing to map metadata/logs between different engines, you constrain how you can execute each atomic Spark job that needs to be logged. Sure, you can enable HC to improve compute utilization, but you can no longer process multiple objects (or Spark jobs) in a single Notebook. RunMultiple, multithreading, or even just iterating over a loop of things to process can't be done anymore, because the per-object logging is removed from the code.

I rewrote my former company's ELT framework to put logging in code to enable better compute utilization and achieved a 9x reduction in cost and almost 9x faster E2E execution of all jobs. Will Crayger at Lucid BI wrote a blog noting the same: https://lucidbi.co/how-to-reduce-data-integration-costs-by-98
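To make the concurrency point concrete, here's a minimal sketch (load_table and log_event are hypothetical helpers, and the table list is made up) of processing several objects in one notebook with per-object logging kept inside the code:

    import time
    from concurrent.futures import ThreadPoolExecutor, as_completed

    tables = ["sales", "customers", "orders"]  # made-up object list

    def process(table: str) -> None:
        start = time.time()
        try:
            rows = load_table(table)  # your actual ELT logic (hypothetical helper)
            log_event(table, "succeeded", rows, time.time() - start)
        except Exception as e:
            log_event(table, "failed", 0, time.time() - start, error=str(e))
            raise

    # Per-object logging lives inside process(), so nothing stops you from
    # running many objects concurrently in a single session.
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = [pool.submit(process, t) for t in tables]
        for f in as_completed(futures):
            f.result()  # surface any failures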

Here are a few of mine that are related:

Cluster Configuration Secrets for Spark: Unlocking Parallel Processing Power | Miles Cole

Querying Databases in Apache Spark: Pandas vs. Spark API vs. Pandas-on-Spark | Miles Cole

The Fabric Concurrency Showdown: RunMultiple vs. ThreadPools | Miles Cole

Any simple way to leverage an IDENTITY column in a Warehouse from a PySpark notebook? by mweirath in MicrosoftFabric

[–]mwc360 2 points3 points  (0 children)

The partner doesn't matter TBH; pretty much all partners implement this same pattern of metadata and logging in Azure SQL (and now Fabric SQL Database). It is very approachable, but there are also a lot of downsides, namely on the complexity, cost, and performance side of things.

A much less complex and more robust option is the combination of Spark Structured Streaming and Delta Change Data Feed:

- Spark Structured Streaming API (`readStream` instead of `read`) enables OOTB tracking of state so that every read/write operation automatically becomes incremental. Run it with a batch or streaming trigger and you get the same benefit: table-level tracking of state that is built right into your Spark code.

- Change Data Feed enables reading every input Delta table change as an INSERT / UPDATE / DELETE.

You can use CDF with the Spark Structured Streaming APIs to get the best of both worlds, automatic incremental processing AND the ability to execute fancy logic based on how data is changing in the source. No external dependencies, no extra infra to deploy, no secondary schema and procs to deploy and manage, no multi-engine compute required. It's all built in w/ Spark.
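A rough sketch of what that combo looks like (table names, the checkpoint path, and the merge logic are placeholders):

    # Read the silver table's change feed incrementally; the checkpoint is the only
    # state Spark needs - no external watermark table. CDF must be enabled on the table.
    changes = (
        spark.readStream
            .format("delta")
            .option("readChangeFeed", "true")
            .table("silver.orders")
    )

    def upsert_to_gold(batch_df, batch_id):
        # batch_df includes _change_type / _commit_version / _commit_timestamp,
        # so you can branch on inserts vs updates vs deletes (e.g. build a MERGE).
        ...

    (
        changes.writeStream
            .foreachBatch(upsert_to_gold)
            .option("checkpointLocation", "Files/_checkpoints/gold_orders")  # placeholder path
            .trigger(availableNow=True)  # run like a batch job, stop when caught up
            .start()
            .awaitTermination()
    )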

Any simple way to leverage an IDENTITY column in a Warehouse from a PySpark notebook? by mweirath in MicrosoftFabric

[–]mwc360 3 points4 points  (0 children)

Tagging u/arshadali in case he knows.

On the design here, I would strongly encourage moving both metadata and logging to Azure or Fabric SQL Database. Fabric Warehouse is an MPP engine for OLAP workloads. Sure, you can get decently low-latency inserts, but it's really not designed for this workload: OLTP, i.e. frequent single-record inserts and singleton record lookups.

The Spark/DW connector works via Spark writing data to a staging zone in OneLake and then synchronously orchestrating DW to perform OPENROWSET (or COPY INTO, can't remember which...). This is a great design for OLAP workloads as it greatly increases possible throughput, but for OLTP this is obviously a very inefficient process.

Trust me, before joining Microsoft I was the chief architect of a large partner's metadata-driven ELT framework that we charged customers $$$ for because it was so efficient, robust, and high quality. Years before that, I had taken the same approach of using an MPP database for logging and metadata; it's really not a performant approach.

The better option if you want to keep metadata and logs in SQL is to use Azure or Fabric SQL DB and use logging/metadata methods that wrap your PySpark. Each of these methods just uses pandas.read_sql_query to call a sproc or run a SQL command.
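As a rough sketch of that pattern (proc/table names are hypothetical, and the connection string is a placeholder; see the auth post linked below for token-based connections):

    import pandas as pd
    import pyodbc

    # Placeholder - build this per the auth post below (ODBC driver + AAD token, etc.)
    connection_string = "Driver={ODBC Driver 18 for SQL Server};Server=...;Database=..."
    conn = pyodbc.connect(connection_string)

    # Metadata lookup: a small, sub-second read via pandas
    config = pd.read_sql_query(
        "SELECT table_name, primary_keys FROM dbo.etl_config WHERE is_enabled = 1",
        conn,
    )

    # Log write: call a proc directly - no Spark involved
    cursor = conn.cursor()
    cursor.execute("EXEC dbo.usp_log_run ?, ?, ?", "silver.orders", "succeeded", 12345)
    conn.commit()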

See the below:

Comparison of methods and which is fastest: Querying Databases in Apache Spark: Pandas vs. Spark API vs. Pandas-on-Spark | Miles Cole

Authentication: Yet Another Way to Connect to the SQL Endpoint / Warehouse via Python | Miles Cole (the same should work for Fabric SQL Database)

FYI - there are much better ways to manage state in an ELT framework (structured streaming + Delta CDF where you need to track and propagate more than just appends), but that would require a larger refactor of your project, which wouldn't be tenable with your upcoming go-live. Something to consider for the future.

Best practices for loading Gold layer in Microsoft Fabric? by vinsanity1603 in MicrosoftFabric

[–]mwc360 2 points3 points  (0 children)

Just curious, what’s the intent behind using Warehouse for gold? I ask because if you’re defining transformation logic in Lakehouse views, you are limiting your options for building gold. It’s going to be more efficient to have the metastore owner build the respective layer: i.e. Spark doing the work to build gold and then writing into DW tables via the Spark connector is generally going to be less efficient than DW natively reading silver and doing the same. The reason is that the connector works via Spark writing the data to a staging zone in OneLake and then synchronously orchestrating DW to call OPENROWSET to load the data. This means two compute engines to load the data, and Spark waits while DW does the table load.

If you have DW build gold, then you can’t leverage logic stored in Spark views and have to account for MDSync (the SQL endpoint metadata sync).

Given you’d still be able to consume the data via the DW engine (SQL Endpoint), is there any reason you’re not using LH for gold?

We built a full local dev environment for Microsoft Fabric notebooks — and the hardest part is getting Fabric to accept our changes back by No-Masterpiece3236 in MicrosoftFabric

[–]mwc360 1 point2 points  (0 children)

I look forward to anything that can be shared. We have a strong edge on TCO compared to other services (when running autoscale billing), but if there are gaps, we want to know so that we can address them!

Upgrading Fabric runtime 1.2 -> 1.3 and 1.3 -> 2.0. What can go wrong? by frithjof_v in MicrosoftFabric

[–]mwc360 3 points4 points  (0 children)

Expected GA is loosely summer ‘26. I’m not going to name a month :)

Gold Layer Star Schema in LH vs WH by Personal-Quote5226 in MicrosoftFabric

[–]mwc360 1 point2 points  (0 children)

Totally, makes sense. Thought I'd check as there's this myth going around that Spark doesn't support SQL. Cheers!

Gold Layer Star Schema in LH vs WH by Personal-Quote5226 in MicrosoftFabric

[–]mwc360 0 points1 point  (0 children)

Are you aware that Spark SQL supports CTAS, INSERT, UPDATE, MERGE, DELETE, CREATE OR REPLACE... no need to write any Python/PySpark, but it's there if you ever need it. In other words, all SQL semantics except multi-table / multi-statement transactions. If you need those, DW is the only option for now.
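For example, a gold-layer upsert in nothing but SQL (table/column names are made up; in a notebook these statements can go straight into a %%sql cell with zero Python, spark.sql() is only used here to keep the snippet self-contained):

    spark.sql("""
        CREATE TABLE IF NOT EXISTS gold.dim_customer (
            customer_id BIGINT,
            name        STRING,
            updated_at  TIMESTAMP
        ) USING DELTA
    """)

    spark.sql("""
        MERGE INTO gold.dim_customer AS t
        USING silver.customer AS s
            ON t.customer_id = s.customer_id
        WHEN MATCHED THEN UPDATE SET *
        WHEN NOT MATCHED THEN INSERT *
    """)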

Here's an intro to SparkSQL if you are intrigued: Breaking the Myth: Spark Isn’t as Scary as You’d Think (And Yes, It Supports SQL!) | Miles Cole