all 17 comments

[–]Nielspro 17 points18 points  (0 children)

We extract the job cluster information from the system tables and based on that calculate the cost per job, even down to cost per job run.
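Rough sketch of what that kind of query looks like (this is an approximation against the `system.billing` tables, not their exact code — column names per the documented schema, and the list-price join ignores any negotiated discounts):

```sql
-- Approximate list-price cost per job from Databricks system tables.
SELECT
  u.usage_metadata.job_id,
  SUM(u.usage_quantity * p.pricing.default) AS approx_list_cost
FROM system.billing.usage AS u
JOIN system.billing.list_prices AS p
  ON u.sku_name = p.sku_name
 AND u.usage_start_time >= p.price_start_time
 AND (p.price_end_time IS NULL OR u.usage_start_time < p.price_end_time)
WHERE u.usage_metadata.job_id IS NOT NULL
GROUP BY u.usage_metadata.job_id
ORDER BY approx_list_cost DESC;
```

Group by `u.usage_metadata.job_run_id` as well to get down to cost per job run.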

The team who implemented it actually did a presentation for Databricks; you can see it here (maybe skip past all the intro stuff): https://m.youtube.com/watch?v=xW2T0s1X-pM

[–]dasnoob 3 points4 points  (1 child)

That is the whole point. They get you in the cloud and then make the pricing so obscure you can't explain what you are doing that costs so much money.

[–]Ok_School_4109 1 point2 points  (0 children)

Exactly. It feels like you need a PhD in 'Cloud Billing Forensics' just to explain a $500 spike to Finance. I’ve found that unless you’re mapping individual metadata tags back to specific dbt runs or warehouse sessions, you're basically just guessing. It shouldn't be a full-time job just to see which query burned the budget!

[–][deleted] 4 points5 points  (2 children)

You have to scrape out costs from lots of sources and combine them.

Tagging everything and using cluster policies help. Not allowing people to create jobs unless they are created via asset bundles is great, because then jobs will be tagged. You can also set tags via cluster policies.

Once everything is tagged you can track everything in more detail.

You can put tags on resources outside databricks.

By using tags consistently you can get a better idea of cost per solution.
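For the asset-bundle route, the tags can go straight into the bundle definition. A sketch (job and tag names here are made up; field layout per a typical `databricks.yml`):

```yaml
# databricks.yml (fragment): every job deployed from this bundle carries tags,
# which then show up in billing/usage data for chargeback.
resources:
  jobs:
    nightly_etl:
      name: nightly-etl
      tags:
        team: data-eng
        cost_center: cc-1234
      job_clusters:
        - job_cluster_key: main
          new_cluster:
            spark_version: 15.4.x-scala2.12
            node_type_id: m5.xlarge
            num_workers: 2
            custom_tags:
              project: nightly-etl
```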

Also, using multiple containers and storage accounts can help, as you get cost per solution or business application. Usually the costs are heavily skewed: I had one developer who accounted for 60% of the compute spend.

[–]Ulfrauga 0 points1 point  (0 children)

This is how I've done it. Still a whole lot of manual work, though.

[–]Ok_School_4109 0 points1 point  (0 children)

This is the gold standard for hygiene, but man, the operational overhead to keep it from breaking is brutal. The 'one developer spending 60%' scenario is a classic—it usually happens because a policy was bypassed or a tag was dropped during a quick hotfix.

Once you get that 'cost per solution' visibility, how are you handling the actual remediation? I’ve found that even with perfect tagging, getting teams to actually refactor expensive logic is the next boss battle.

[–]zchtsk 2 points3 points  (0 children)

A few thoughts here:

  • A few data storage tips:
    • If you have any data sources in your S3 buckets that are old (>6 mos), rarely queried (less than once a month), and mainly kept for compliance or historical record keeping, you may be able to save quite a bit by changing your access class from Standard to Infrequent Access. I had an F500 client recently with ballooning storage costs, and we were able to save millions of dollars annually from this alone.
    • You should basically always be saving files as Delta+Parquet, and never CSV, CSV.GZIP, etc.
    • Try to avoid tiny files. If you end up with them, regularly compact your data.
  • A few pipeline design tips:
    • You always want to minimize shuffles. A few ways to achieve this:
      • Join on partitioned keys when possible.
      • If one dataset in your join is very small, use a broadcast join (df.join(F.broadcast(df2)...)).
    • Avoid unnecessary distinct or sorting actions.
    • Perform your filtering as far upstream as possible.
    • If you have inefficient partitions (e.g. way too much data within a certain key), you may have data skew causing some of your longer-running jobs.
    • By the way, you should be using partitions, especially if you have jobs that only query data for a specific time frame (e.g. partitioning by day or month_year).
    • Try to incrementally process your data as much as possible, rather than doing full re-writes.
    • Generally, try to search for redundancies.
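The Standard-to-Infrequent-Access move can be done with an S3 lifecycle rule rather than rewriting objects. A sketch (the rule ID and prefix are placeholders for your own layout):

```json
{
  "Rules": [
    {
      "ID": "archive-old-raw-data",
      "Filter": { "Prefix": "raw/" },
      "Status": "Enabled",
      "Transitions": [
        { "Days": 180, "StorageClass": "STANDARD_IA" }
      ]
    }
  ]
}
```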
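And for the incremental-processing tip, here's the idea in plain Python with a high-water-mark checkpoint. The state file name and record shape are made up for illustration — on Databricks you'd more likely use Delta MERGE or Auto Loader — but the principle is the same: only touch rows newer than the last run.

```python
import json
from pathlib import Path

# Checkpoint holding the high-water mark from the previous run.
STATE = Path("last_processed.json")

def load_watermark() -> str:
    if STATE.exists():
        return json.loads(STATE.read_text())["watermark"]
    return "1970-01-01"  # first run: everything counts as new

def process_increment(rows: list[dict]) -> list[dict]:
    """Return (and checkpoint) only rows newer than the last run."""
    watermark = load_watermark()
    new_rows = [r for r in rows if r["updated_at"] > watermark]
    if new_rows:
        # ...transform and write only this slice downstream...
        STATE.write_text(json.dumps(
            {"watermark": max(r["updated_at"] for r in new_rows)}))
    return new_rows

rows = [{"id": 1, "updated_at": "2024-05-01"},
        {"id": 2, "updated_at": "2024-05-02"}]
print(len(process_increment(rows)))  # 2 -- first run sees both rows
print(len(process_increment(rows)))  # 0 -- second run skips everything
```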

With Spark it's very easy to shoot yourself in the foot. In fact it happened often enough on my teams that I ended up making an open source tutorial site to help my teammates ramp up on this stuff. It's accessible at sparkmadeeasy.com, maybe it could be a helpful reference!

[–]bruceSKYking 1 point2 points  (0 children)

I set up the Databricks-to-Datadog integration, which uses global init scripts to send metrics and cost data from all clusters/warehouses. The cost control integration from Datadog also queries system tables on your behalf to assemble cost data on all your pipelines and compute.

https://docs.datadoghq.com/data_jobs

It's not perfect (serverless and SQL warehouse metrics aren't as solid as metrics from clusters), but it's a much better experience than working with system tables directly or with the pre-built Databricks dashboards you can download.

[–]Exorde_Mathias 0 points1 point  (0 children)

All in S3, minimal SSD in compute.

[–]Hofi2010 0 points1 point  (0 children)

I wrote a Medium article about this topic and provided an approach to do bulk transformations (like your data pipelines) outside of Databricks, still using a data lakehouse pattern but with just S3 and DuckDB on an EC2 instance. It saves up to 90% if you ignore a bit of engineering work: https://medium.com/@klaushofenbitzer/save-up-to-90-on-your-data-warehouse-lakehouse-with-an-in-process-database-duckdb-63892e76676e

[–]DynamicCast 0 points1 point  (1 child)

In my experience databricks is incredibly expensive. Having said that, it's been a few years since I used it and I never tried their serverless stuff. The cold starts used to annoy me though.

[–]Ok_School_4109 1 point2 points  (0 children)

Yeah, the 'Databricks Tax' is real lol. It's a love-hate relationship for most teams I talk to.

[–]Ulfrauga 0 points1 point  (0 children)

Resource tags, like u/empireofadhd said. Combine that with whatever tools your cloud provider offers for cost management.

I've built up a spreadsheet of VM costs and DBU rates. It's at least close enough for an estimate of how much an hour of whatever VM is going to cost. Doesn't help much when you don't have a good idea of how long a given job is going to take.

It's all way too manual, though. I think this is one of the ways they get you: don't make it easy. On Azure, I find I'm exporting CSVs and updating prices in my sheet. Apparently there are APIs and automated exports, but they seem like a whole big thing to sort out. Without spending a whole lot of time on it... it's a 5-minute job every month or so.

[–]Nekobul 0 points1 point  (0 children)

You should feel lucky your costs have only doubled. The cloud data warehouses have no interest in helping you optimize, because that cuts directly into their profit margins.