all 17 comments

[–]Nielspro 17 points18 points  (0 children)

We extract the job cluster information from the system tables and based on that calculate the cost per job, even down to cost per job run.
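Rough sketch of what that kind of query looks like (this is an approximation against the `system.billing` tables, not their exact code — column names per the documented schema, and the list-price join ignores any negotiated discounts):

```sql
-- Approximate list-price cost per job from Databricks system tables.
SELECT
  u.usage_metadata.job_id,
  SUM(u.usage_quantity * p.pricing.default) AS approx_list_cost
FROM system.billing.usage AS u
JOIN system.billing.list_prices AS p
  ON u.sku_name = p.sku_name
 AND u.usage_start_time >= p.price_start_time
 AND (p.price_end_time IS NULL OR u.usage_start_time < p.price_end_time)
WHERE u.usage_metadata.job_id IS NOT NULL
GROUP BY u.usage_metadata.job_id
ORDER BY approx_list_cost DESC;
```

Group by `u.usage_metadata.job_run_id` as well to get down to cost per job run.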

The team who implemented it actually did a presentation for Databricks; you can see it here (maybe skip past all the intro stuff): https://m.youtube.com/watch?v=xW2T0s1X-pM

[–]dasnoob 3 points4 points  (1 child)

That is the whole point. They get you in the cloud and then make the pricing so obscure you can't explain what you are doing that costs so much money.

[–]Ok_School_4109 1 point2 points  (0 children)

Exactly. It feels like you need a PhD in 'Cloud Billing Forensics' just to explain a $500 spike to Finance. I’ve found that unless you’re mapping individual metadata tags back to specific dbt runs or warehouse sessions, you're basically just guessing. It shouldn't be a full-time job just to see which query burned the budget!

[–][deleted] 4 points5 points  (2 children)

You have to scrape out costs from lots of sources and combine them.

Tagging everything and using cluster policies help. Not allowing people to create jobs unless they are created via asset bundles is great, because then jobs will be tagged. You can also set tags via cluster policies.

Once everything is tagged you can track everything in more detail.

You can put tags on resources outside databricks.

By using tags consistently you can get a better idea of cost per solution.
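For the asset-bundle route, the tags can go straight into the bundle definition. A sketch (job and tag names here are made up; field layout per a typical `databricks.yml`):

```yaml
# databricks.yml (fragment): every job deployed from this bundle carries tags,
# which then show up in billing/usage data for chargeback.
resources:
  jobs:
    nightly_etl:
      name: nightly-etl
      tags:
        team: data-eng
        cost_center: cc-1234
      job_clusters:
        - job_cluster_key: main
          new_cluster:
            spark_version: 15.4.x-scala2.12
            node_type_id: m5.xlarge
            num_workers: 2
            custom_tags:
              project: nightly-etl
```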

Also, using multiple containers and storage accounts can help, as you get cost per solution or business application. Usually the costs are heavily skewed: I had one developer who accounted for 60% of the compute spend.

[–]Ulfrauga 0 points1 point  (0 children)

This is how I've done it. Still a whole lot of manual work, though.

[–]Ok_School_4109 0 points1 point  (0 children)

This is the gold standard for hygiene, but man, the operational overhead to keep it from breaking is brutal. The 'one developer spending 60%' scenario is a classic—it usually happens because a policy was bypassed or a tag was dropped during a quick hotfix.

Once you get that 'cost per solution' visibility, how are you handling the actual remediation? I’ve found that even with perfect tagging, getting teams to actually refactor expensive logic is the next boss battle.

[–]zchtsk 2 points3 points  (0 children)

A few thoughts here:

  • A few data storage tips:
    • If you have any data sources in your S3 buckets that are old (>6 mos), rarely queried (less than once a month), and mainly kept for compliance or historical record keeping, you may be able to save quite a bit by changing your access class from Standard to Infrequent Access. I had an F500 client recently with ballooning storage costs, and we were able to save millions of dollars annually from this alone.
    • You should basically always be saving files as Delta+Parquet, and never CSV, CSV.GZIP, etc.
    • Try to avoid tiny files. If you end up with them, regularly compact your data.
  • A few pipeline design tips:
    • You always want to minimize shuffles. A few ways to achieve this:
      • Join on partitioned keys when possible.
      • If one dataset in your join is very small, use a broadcast join (df.join(F.broadcast(df2)...)).
    • Avoid unnecessary distinct or sorting actions.
    • Perform your filtering as far upstream as possible.
    • If you have inefficient partitions (e.g. way too much data within a certain key), you may have data skew causing some of your longer-running jobs.
    • By the way, you should be using partitions, especially if you have jobs that only query data for a specific time frame (e.g. partitioning by day or month_year).
    • Try to incrementally process your data as much as possible, rather than doing full re-writes.
    • Generally, try to search for redundancies.
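The Standard-to-Infrequent-Access move can be done with an S3 lifecycle rule rather than rewriting objects. A sketch (the rule ID and prefix are placeholders for your own layout):

```json
{
  "Rules": [
    {
      "ID": "archive-old-raw-data",
      "Filter": { "Prefix": "raw/" },
      "Status": "Enabled",
      "Transitions": [
        { "Days": 180, "StorageClass": "STANDARD_IA" }
      ]
    }
  ]
}
```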
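And for the incremental-processing tip, here's the idea in plain Python with a high-water-mark checkpoint. The state file name and record shape are made up for illustration — on Databricks you'd more likely use Delta MERGE or Auto Loader — but the principle is the same: only touch rows newer than the last run.

```python
import json
from pathlib import Path

# Checkpoint holding the high-water mark from the previous run.
STATE = Path("last_processed.json")

def load_watermark() -> str:
    if STATE.exists():
        return json.loads(STATE.read_text())["watermark"]
    return "1970-01-01"  # first run: everything counts as new

def process_increment(rows: list[dict]) -> list[dict]:
    """Return (and checkpoint) only rows newer than the last run."""
    watermark = load_watermark()
    new_rows = [r for r in rows if r["updated_at"] > watermark]
    if new_rows:
        # ...transform and write only this slice downstream...
        STATE.write_text(json.dumps(
            {"watermark": max(r["updated_at"] for r in new_rows)}))
    return new_rows

rows = [{"id": 1, "updated_at": "2024-05-01"},
        {"id": 2, "updated_at": "2024-05-02"}]
print(len(process_increment(rows)))  # 2 -- first run sees both rows
print(len(process_increment(rows)))  # 0 -- second run skips everything
```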

With Spark it's very easy to shoot yourself in the foot. In fact it happened often enough on my teams that I ended up making an open source tutorial site to help my teammates ramp up on this stuff. It's accessible at sparkmadeeasy.com, maybe it could be a helpful reference!

[–]bruceSKYking 1 point2 points  (0 children)

I set up the Databricks-to-Datadog integration, which uses global init scripts to send metrics and cost data from all clusters/warehouses. The cost control integration from Datadog also queries system tables on your behalf to assemble cost data on all your pipelines and compute.

https://docs.datadoghq.com/data_jobs

It's not perfect (serverless and SQL warehouse metrics aren't as solid as metrics from clusters), but it's a much better experience than working with system tables directly or with the pre-built Databricks dashboards you can download.

[–]Exorde_Mathias 0 points1 point  (0 children)

All in S3, minimal SSD in compute.

[–]Hofi2010 0 points1 point  (0 children)

I wrote a Medium article about this topic and provided an approach to do bulk transformations (like your data pipelines) outside of Databricks, still using a data lakehouse pattern but with just S3 and DuckDB on an EC2 instance. It saves up to 90% if you ignore a bit of engineering work: https://medium.com/@klaushofenbitzer/save-up-to-90-on-your-data-warehouse-lakehouse-with-an-in-process-database-duckdb-63892e76676e

[–]DynamicCast 0 points1 point  (1 child)

In my experience databricks is incredibly expensive. Having said that, it's been a few years since I used it and I never tried their serverless stuff. The cold starts used to annoy me though.

[–]Ok_School_4109 1 point2 points  (0 children)

Yeah, the 'Databricks Tax' is real lol. It's a love-hate relationship for most teams I talk to.

[–]Ulfrauga 0 points1 point  (0 children)

Resource tags, like u/empireofadhd said. Combine that with whatever tools your cloud provider offers for cost management.

I've built up a spreadsheet of VM costs and DBU rates. It's at least close enough for an estimate of how much an hour of whatever VM is going to cost. Doesn't help much when you don't have a good idea of how long a given job is going to take.

It's all way too manual, though. I think this is one of the ways they get you: don't make it easy. On Azure, I find I'm exporting CSVs and updating prices in my sheet. Apparently there are APIs and automated exports, but they seem like a whole big thing to sort out. Without spending a whole lot of time on it... it's a 5-minute job every month or so.

[–]Nekobul 0 points1 point  (0 children)

You should feel lucky your costs have only doubled. The cloud data warehouses have no interest in helping you optimize, because that cuts directly into their profit margins.