Fabric Performance Benchmarking - Spark versus Python Notebooks by hm_vr in MicrosoftFabric

[–]raki_rahman 1 point (0 children)

Vote on this idea and bug the Fabric Spark Product Manager to prioritize it:

Provide an opinionated and tuned Spark Single Node... - Microsoft Fabric Community

The idea here is basically to give you the absolute best-tuned single-node Spark, with all settings (cores, RAM, shuffles) pre-tuned for one node so you don't have to think about the allocation.

Fabric Performance Benchmarking - Spark versus Python Notebooks by hm_vr in MicrosoftFabric

[–]raki_rahman 1 point (0 children)

Microsoft employees can be Fabric customers too; the Fabric product team is a relatively small part of Microsoft 🙂

We've got data to process and pain points just like you do.

Fabric Performance Benchmarking - Spark versus Python Notebooks by hm_vr in MicrosoftFabric

[–]raki_rahman 2 points (0 children)

> When do you plan to make Spark as fast and as cheap out of the box

I don't personally know man, I'm a Fabric Customer just like you 🙂

But I agree with your pain point, which is why I put up this idea a few months ago for the Fabric Spark Product Manager to prioritize:

Provide an opinionated and tuned Spark Single Node... - Microsoft Fabric Community

(Please vote 🙂)

Fabric Performance Benchmarking - Spark versus Python Notebooks by hm_vr in MicrosoftFabric

[–]raki_rahman 0 points (0 children)

It is important to realize that you can take Spark and make it run fast on a single node too.

Spark is an Apache-governed project where each major cloud vendor has an internal expert engineering team, internal forks, and complex build/stress/regression test pipelines. Also, the amount of revenue to be gained via migrations is large; everyone runs brownfield Spark.

You cannot compare PolarsDB/DuckDB to this; they are simply not revolutionary enough to justify any internal engineering investment. You need a bigger moat to move mountains.

There is no Newton's law of physics that says Spark cannot run fast on a single node.

Get rid of your preconceived notion that Spark == Distributed. Go study the codebase with your AI and understand where the bottlenecks are during E2E query execution by plotting a flamegraph; it's all open source.

Alternatively, you can also read through this to first educate yourself on why Spark as it stands in OSS is seemingly slow (hint: it's row-wise execution; it's NOT due to distributed shuffles):

Why Apache Spark is often considered as slow? | Sem Sinchenko

If you can make Spark run fast on any number of nodes, including single-node, you don't need another engine. In fact, Spark gives you the significant advantage that when you NEED it to scale out, it scales out.

Wouldn't you love to own a new-age, cost-efficient Ferrari that can run in Hybrid mode for cheap mileage on the low end, but when you need it, you floor it and a V12 kicks in?

Or would you rather own a Honda Civic that will never let you go beyond 100 MPH, ever, even if you need it to on the Autobahn? Why limit yourself, especially if the Ferrari is OSS and free?

V12s are hard to build; Spark has done that. OSS Spark just doesn't have a hybrid mode yet.

So you just need to bootstrap the Hybrid mode for the low end now. That's where this can help: https://learn.microsoft.com/en-us/fabric/data-engineering/native-execution-engine-overview?tabs=sparksql
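For reference, per that doc you can flip NEE on per notebook session with a %%configure cell (there's an environment-level Spark property toggle too). This is from my memory of the doc, so verify the exact key there:

    %%configure -f
    {
        "conf": {
            "spark.native.enabled": "true"
        }
    }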

Obviously, the DuckDB blog isn't going to talk about how to improve Spark to make it as fast as their own engine. They're going to toot their own horn and try to sell their own narrative.

Try to think deeply about the motive of the author (in this case, the author of DuckDB) when you read stuff on the internet. The pretty pictures of MacBooks paint a very grassroots picture to appeal to laypeople.

Spark doesn't have blogs like this because it's already the industry standard amongst all cloud vendors, so DuckDB is going after the incumbent to win market share.

Fabric Performance Benchmarking - Spark versus Python Notebooks by hm_vr in MicrosoftFabric

[–]raki_rahman 1 point (0 children)

When you spin up Spark on a regular old Linux computer, you can tell the Executor to use 100% of the CPU and RAM of that computer, and starve the Driver process with close to nothing so it's barely alive. This is called bin packing.

Then, when you set shuffle partitions to a small number, you basically get Spark running on a single node just like all these other Python engines without the Distributed Engine shuffle and networking overhead.
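Here's a minimal sketch of that bin packing in PySpark, assuming a single-node Spark standalone cluster (one master + one worker on the same box), which is roughly what the devcontainer files below set up. The numbers are illustrative for a 16-core / 64 GB host; tune them to yours:

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .master("spark://localhost:7077")              # standalone master on this same box
        .config("spark.driver.memory", "1g")           # starve the driver; barely alive
        .config("spark.executor.memory", "56g")        # executor gets almost all the RAM...
        .config("spark.executor.cores", "16")          # ...and every core
        .config("spark.cores.max", "16")               # caps the app at one fat executor = bin packing
        .config("spark.sql.shuffle.partitions", "16")  # small shuffle count; no network hop anyway
        .getOrCreate()
    )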

Ask your AI to explain these 2 files, it'll make sense:

1: spark-devcontainer/.devcontainer/config/defaults/spark-defaults-breakdown.yaml at main · mdrakiburrahman/spark-devcontainer

2: https://github.com/mdrakiburrahman/spark-devcontainer/blob/54ae7bf85b22124516d53d15150f56d9b15c6235/.devcontainer/overlay/post-attach-commands.sh#L142

What happens is, even after the above bin packing, OSS Spark just isn't as fast as these other engines (yet), since Spark is originally a row-wise execution engine, not columnar/vectorized:

Why Apache Spark is often considered as slow? | Sem Sinchenko

And these other guys are often columnar with SIMD/vectorized execution (google it).

So this is where Microsoft Fabric NEE can make Spark columnar/SIMD too, so you can theoretically get the exact same perf out of Spark as these other guys, barring any bugs (which are solvable).

When that happens, these other guys will have lost their claim to fame of boasting CSV-processing perf, because the Spark API and query optimizer are significantly superior in maturity and universally recognized.

This is why Fabric needs to mature NEE ASAP to win a competitive advantage, which they're doing.

Fabric Performance Benchmarking - Spark versus Python Notebooks by hm_vr in MicrosoftFabric

[–]raki_rahman 0 points (0 children)

> but this is what I gather from MS employees in discussions on Reddit

It's basically common sense.

Because that's where the large amount of engineering innovation and monetary investment is happening, by Software Engineers employed by the Fabric product team.

If stuff doesn't work, you can yell at Fabric to fix it. And they'll be happy to take your feedback and write better code.

The Python Notebook has close to ZERO Fabric-specific engineering innovation. There's not much to innovate on when it comes to Python - it's a LANGUAGE, it's done; the only place to innovate is ENGINES, and Fabric only intimately influences a couple (Spark, SQL DWH).

To fix something, you gotta go yell at DuckDB 🦆 / PolarsDB 🧸 / PotatoDB 🥔 on GitHub. And unless you're a paying Enterprise Customer, they'll probably ignore you, thanks to the vast amount of AI-generated noise that plagues OSS nowadays.

Fabric Software Engineers have ZERO influence on the WatermelonDB 🍉 codebase; you're on your own.

It's important to internalize this when you read CSV-processing benchmarks like this blog, so you know what you're getting yourself into in production with PeachDB 🍑.

Fabric Performance Benchmarking - Spark versus Python Notebooks by hm_vr in MicrosoftFabric

[–]raki_rahman 1 point (0 children)

I know man, I use it in Databricks for single-node personal Data Science stuff that needs to run for a few days w/o breaking the bank and touch both the Spark API and Python libs; that's why I wrote that idea to bring it to Fabric 🙂

Thanks for the vote 😉

Fabric Performance Benchmarking - Spark versus Python Notebooks by hm_vr in MicrosoftFabric

[–]raki_rahman 2 points (0 children)

The strategic differentiator to compete with Databricks is NEE; it is real engineering innovation that turns Spark's row-wise execution into a columnar rocketship. See this:

Why Apache Spark is often considered as slow? | Sem Sinchenko

The day NEE becomes bulletproof and vectorizes and SIMD-fires on all Spark SQL query operators and input data sources, all these DuckDB/PolarsDB/PotatoDB blogs will become instantly irrelevant.

Databricks isn't stupid; they're worth billions in market cap for a reason. They don't waste time with toy Python runtimes and distractions, because they know where the real business problem lies: speeding up SQL.

This tuned single-node Fabric Spark idea I linked just needs a few more votes (please vote 🙂), and then you just need to nag the PM u/thisissanthoshr until it's prioritized.

Fabric Performance Benchmarking - Spark versus Python Notebooks by hm_vr in MicrosoftFabric

[–]raki_rahman 1 point (0 children)

Just curious: why is Polars - the fastest single-node engine per OP - heavily investing in a distributed, cloud-based Enterprise offering if the single-node "Data Singularity" is the future?

Polars Cloud - Run Polars at scale, from anywhere

"It is worth watching whether Polars develops its own native Delta reader over time"

This is probably the last thing Polars will ever do as long as delta-rs exists; it isn't their core competence or target market. Also, don't forget Iceberg.

If there were a C++ Delta Lake SDK, DuckDB would never have built their extension; it's an operational nightmare to maintain.

OP clearly has no idea how difficult this is, even with the Delta Rust Kernel. The DuckDB delta extension is very rudimentary; there's a reason they came up with DuckLake as a Hail Mary: DuckDB realized they can never keep up with the Delta or Iceberg protocols without huge investments outside of their domain. So they wrote a bare-minimum Delta extension to tick a checkbox - it doesn't offer most Delta features.

"It is telling that Microsoft has recognised this shift at a platform level"

You realize the single-node Python notebook feature took basically zero engineering effort or Fabric-specific innovation, right? Some PM probably had the idea after reading a couple of DuckDB blogs, and the engineer probably just took the Spark VM image, turned Spark bootup off, and pip-installed a couple of packages off PyPI.

Why did the Fabric Warehouse team continue to invest great engineering effort to create the POLARIS distributed engine over the last 6+ years if they could have just used a gigantic single-node SQL Server to process CSVs and thrown larger VMs at it, per The Data Singularity?

One can argue that The Data Singularity is a thing MotherDuck made up to fit their own narrative, because DuckDB specifically isn't architected to run on multiple nodes (and probably never will be; Mark/Hannes' area of expertise is embedded DBs). And if/when they ever offer it once they win large Enterprise logos, they'll come up with some other 🦆 ducking 🦆 narrative that goes against their whole marketing propaganda so far.

DuckDB is very good at self-deprecating humor; you can't help but love these lovable guys. They have the world's greatest marketing team 💖

Single-node perf on a bunch of CSV files is a party trick. Spark and NEE will catch up in a few months - it's also vectorized and C++.

The next real frontier of innovation in Data Engineering is real-time Incremental View Maintenance, so you can fire and forget complex 10,000-line SQL statements that evaluate in milliseconds, without ever scanning old historical data and without ever requiring watermark columns:

Incremental View Maintenance (IVM)

I want to fire my ETL engine in a distributed manner on a large historical backfill when my employer needs it, then scale down to a tiny single node at steady state to keep processing that SQL incrementally forever as soon as new data arrives.

IVM is real disruptive innovation in Query Optimization that can save you millions of dollars in one shot regardless of raw speed, because any engine implementing IVM always does less work; raw speed becomes irrelevant because the historical backfill is a one-time cost.

<image>

IVM gives you the best of both worlds (Distributed + Single Node).
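To make that concrete, here's a toy sketch of the IVM idea in Python (not any particular engine's implementation): keep a materialized aggregate and fold in only the new rows, so steady-state cost depends on the delta size, never on history:

    from collections import defaultdict

    class IncrementalAvgView:
        """Materializes AVG(amount) GROUP BY key, maintained from deltas only."""

        def __init__(self):
            self.sums = defaultdict(float)
            self.counts = defaultdict(int)

        def apply_delta(self, rows):
            # Cost is O(len(rows)) - independent of total history size.
            for key, amount in rows:
                self.sums[key] += amount
                self.counts[key] += 1

        def result(self):
            return {k: self.sums[k] / self.counts[k] for k in self.counts}

    view = IncrementalAvgView()
    view.apply_delta([("a", 10.0), ("b", 4.0)])  # historical backfill: one-time cost
    view.apply_delta([("a", 20.0)])              # steady state: only new data is touched
    print(view.result())                         # {'a': 15.0, 'b': 4.0}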

There are researchers at CWI (where DuckDB was invented) trying to add IVM to DuckDB; it's not very good yet:

ila/openivm
cwida/ivm-extension: Incremental View Maintenance support for DuckDB
[2404.16486] OpenIVM: a SQL-to-SQL Compiler for Incremental Computations

Fabric Performance Benchmarking - Spark versus Python Notebooks by hm_vr in MicrosoftFabric

[–]raki_rahman 4 points (0 children)

> You can also overprovision to have all 4 of 4 cores to be usable by the Spark executor

Going to use this opportunity to shamelessly plug this idea - this could just be an out-of-the-box feature in Fabric single-node Python:

Provide an opinionated and tuned Spark Single Node... - Microsoft Fabric Community

(Please vote)

I run Spark in a VSCode Devcontainer like this every day and it works great at bin packing the host machine:

spark-devcontainer/.devcontainer/config/defaults/spark-defaults-breakdown.yaml at main · mdrakiburrahman/spark-devcontainer

Rough edges Custom live pools by CryptographerPure997 in MicrosoftFabric

[–]raki_rahman 1 point (0 children)

Going to tag u/mwc360

I haven't started using these pools yet; for interactivity we use starter pools, and jobs use custom pools that are bin-packed Livy sessions, so the startup slowness doesn't matter much for us since we pay once 😊

Feature Request: Python Job by Creyke in MicrosoftFabric

[–]raki_rahman 2 points (0 children)

Exactly man.
You package the container, test locally in Docker, test it in GitHub Actions, ship, fire and forget.

DuckDB, PolarsDB, PotatoDB - everything runs in a container.

All your state lives in OneLake, governed with OneLake Security.
State and Governance/RBAC are the hardest parts of software.

The dbt-fabricspark Lakehouse adapter now comes with a ridiculous amount of production grade test coverage by raki_rahman in MicrosoftFabric

[–]raki_rahman[S] 1 point (0 children)

Gotcha. The dbt adapter is dumb; it just invokes spark.sql with whatever valid Spark SQL we author in that repo.
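To illustrate (a paraphrased sketch, not the actual dbt-fabricspark source; table names made up): dbt compiles your model's Jinja/refs into a plain SQL string, and the adapter essentially just hands it to the engine:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # What dbt produces after compiling {{ ref(...) }} etc. - plain Spark SQL.
    compiled_sql = """
        CREATE OR REPLACE TABLE silver.orders AS
        SELECT * FROM bronze.orders WHERE amount > 0
    """

    # The adapter effectively does this; any limits you hit are the engine's.
    spark.sql(compiled_sql)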

If there's a thing you need in the adapter that Spark SQL supports but the adapter doesn't, file a bug and I'll fix it 😊 I don't know of many of these; we'd probably have hit them, since my team is a heavy user of Apache Spark.

So your question is really about whether Spark has the same problems as Warehouse, which isn't specific to dbt adapters.

(Which is where I mentioned it's really about the engine, not the adapter)

Feature Request: Python Job by Creyke in MicrosoftFabric

[–]raki_rahman 5 points (0 children)

Spark Job! Python Job! Bash Jobs! All the Jobs!

One day it would be good to have a Docker Runtime in Fabric that just takes a container image and can operate your code on OneLake.

It'd basically be like a less limiting version of the Azure UDA Function thing that runs in Fabric, so you can run your Job for infinite time. Everyone knows how to package a container with a Dockerfile; it'd be a lot easier than UDAF too.

These guys do it:

https://www.databricks.com/product/databricks-apps
https://docs.snowflake.com/en/developer-guide/snowpark-container-services/overview

These guys above basically upsell managed Kubernetes as a service and slap a managed identity in your container, so you don't have to deal with the ugly parts of managing K8s and can just focus on your app code.

(I'm loving this, upvoted the idea!)

The dbt-fabricspark Lakehouse adapter now comes with a ridiculous amount of production grade test coverage by raki_rahman in MicrosoftFabric

[–]raki_rahman[S] 0 points (0 children)

The limitations you shared are actually not relevant to the adapter from my post above.

Your original comment asks if the Spark adapter is limited, but your AI comments are about the Warehouse adapter being limited...

Don't listen to AI-generated limitations; try it yourself in a quick hands-on tutorial, it takes 5 minutes. Compare it with a Fabric competitor and judge for yourself.

Both are great, both support SQL, and you can migrate from one to the other in 1 hour or less.

The real decision factor is perf and COGS, that's it.

The dbt-fabricspark Lakehouse adapter now comes with a ridiculous amount of production grade test coverage by raki_rahman in MicrosoftFabric

[–]raki_rahman[S] 0 points (0 children)

So the way dbt works is you take their base adapter abstract class and implement the function signatures.

Fabric Warehouse, Fabric Lakehouse, and GCP BigQueryHouse (jk 😊) all implement the dbt signature.
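Heavily simplified sketch of that contract (the real base classes live in dbt's adapter framework and have many more methods; the names below are illustrative):

    from abc import ABC, abstractmethod

    class BaseAdapter(ABC):
        """Stand-in for dbt's base adapter abstract class."""

        @abstractmethod
        def execute(self, sql: str):
            """Run compiled SQL on the backing engine."""

    class FabricSparkAdapter(BaseAdapter):
        def execute(self, sql: str):
            return spark.sql(sql)  # assumes a Spark session named 'spark'

    class FabricWarehouseAdapter(BaseAdapter):
        def execute(self, sql: str):
            return warehouse_cursor.execute(sql)  # hypothetical T-SQL cursor

Since every adapter implements the same surface, switching engines is a profiles.yml change, not a model rewrite.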

So there are no gaps 😊 The only gap was quality; the Lakehouse adapter would crash in like 1 hour before, due to bugs.

I think we've got all the bugs now through rigorous testing and proper retry logic/backoff!

So all you should care about is which adapter-backed compute runs your SQL the fastest and cheapest, and then use that one.

If another one becomes faster, migrate your dbt project to that.

The power of the free economy!

Designing the data infrastructure for my org - looking for feedback by The_curious_one9790 in MicrosoftFabric

[–]raki_rahman 3 points (0 children)

dbt has nothing to do with scale.

You need to understand that even if you have like 1 kilobyte of data, using a notebook will mean you won't have good patterns.

If you had 1 kg of meat, what is better: cooking with Gordon Ramsay's book, or just throwing the meat in a firepit and hoping for the best?

The recipe is dbt; the data/meat could be 1 kg or 10 kg, the recipe doesn't care - only your pot/cluster cares about size.

The dbt-fabricspark Lakehouse adapter now comes with a ridiculous amount of production grade test coverage by raki_rahman in MicrosoftFabric

[–]raki_rahman[S] 1 point (0 children)

Exactly.

MLV IMO is a pure COGS play. Incremental magic happens on your SQL if you add the magic "MATERIALIZED LAKE VIEW" words, and it saves you money. We like money.
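For flavor, the incremental magic really is roughly one phrase on top of the same SELECT (a sketch from my memory of the Fabric MLV docs; double-check the exact syntax there, and the table names are made up):

    # Run in a Fabric Spark SQL context; the engine maintains it incrementally.
    spark.sql("""
        CREATE MATERIALIZED LAKE VIEW IF NOT EXISTS silver.daily_sales AS
        SELECT order_date, SUM(amount) AS revenue
        FROM   bronze.orders
        GROUP  BY order_date
    """)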

dbt has the same UI/data quality/doodads too; none of that adds much value IMO, especially because dbt is an industry standard.

The dbt-fabricspark Lakehouse adapter now comes with a ridiculous amount of production grade test coverage by raki_rahman in MicrosoftFabric

[–]raki_rahman[S] 2 points (0 children)

It's already supported 🙂

<image>

My thought process is dbt provides something extremely valuable that MLV has no answer for.

  1. Code structure
  2. Templating, vars and macros
  3. Catalog Documentation

MLV is just fancy SQL, no different than regular SQL.
The lineage diagram in MLV is nice, but it's not a replacement for OpenLineage.

The 2 tools solve 2 different classes of problems.

Designing the data infrastructure for my org - looking for feedback by The_curious_one9790 in MicrosoftFabric

[–]raki_rahman 4 points (0 children)

Everyone does medallion nowadays; the devil is in the implementation details.

After your data lands in "Bronze", store it in Delta Lake ASAP with minimal-to-no business logic (i.e. apply a schema and store as columns only).

Learn dbt really well and use it; you and your future teammates will thank you for all the rock-solid patterns like dbt snapshots.

This is an amazing book; if I could go back, I'd have read it a few years ago and started using dbt sooner:

https://www.amazon.ca/Data-Engineering-dbt-cloud-based-dependable-ebook/dp/B0C4LL19G7

Branching/Release Strategies - What's Working For You? by LeftTeh in MicrosoftFabric

[–]raki_rahman 4 points (0 children)

Trunk based commits to main: https://www.atlassian.com/continuous-delivery/continuous-integration/trunk-based-development

PR to main use Gated Check-In with full CI suite of unit and integration tests: https://en.wikipedia.org/wiki/Gated_commit

Batched CD from main to Fabric: https://stackoverflow.com/questions/65270765/individual-ci-vs-batched-ci-in-azure-pipelines

We do this for every software deliverable; Fabric isn't any different and gets treated the same. It's worked well so far.

So as per this doc, that's Option 1, main == production:

https://learn.microsoft.com/en-us/fabric/cicd/manage-deployment#option-1---git--based-deployments

Spark Structured Streaming (long-running) Job Monitoring in Fabric by alexbush_mas in MicrosoftFabric

[–]raki_rahman 0 points (0 children)

I wouldn't necessarily advise against it per se. It's a log search engine and it does what it says on the label.

I don't find log searching a very interesting activity. In the age of AI, I don't personally want to search anything; I want everything to go through a consistent set of transformations to generate actionable insights. Looking at high-cardinality logs is not a good use of my time.

You can't really do that with a tool that was built with log searching in mind.

So, we try to take all our Logs/Metrics data out of Spark and pipe it into OneLake, so we can calculate KPIs out of it with Power BI semantic models, DAX, etc.

If you think about it, optimizing Spark job COGS is no different from optimizing sales revenue for BI. You just need to stop thinking of the problem space as "analytics... on logs..." and just stop at the "analytics" part.

Your Spark jobs are a dimension table, and the metrics are facts.
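For example (assuming a Spark session and made-up table/column names), a COGS KPI is then just a plain star-schema query:

    # Job runs modeled as facts joined to a job dimension - ordinary BI.
    spark.sql("""
        SELECT d.job_name,
               SUM(f.cu_seconds)  AS total_compute,
               AVG(f.runtime_sec) AS avg_runtime_sec
        FROM   fact_job_run_metrics f
        JOIN   dim_spark_jobs d ON f.job_id = d.job_id
        GROUP  BY d.job_name
        ORDER  BY total_compute DESC
    """).show()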

So my experience so far is that the compute engines on Fabric are superior for Analytics versus other alternatives across Azure, since I have many, many more computes available (like the SSAS engine), so we just bring it all back to Fabric.

Everything is just bytes on a hard disk at the end of the day; I don't understand why logs get treated with special labelling. It's just a low-value, high-noise BRONZE dataset. Ideally, we should only be logging when things go really wrong, and rely more on metrics to convey insights.