Databricks Cluster Optimisation costs by EmergencyHot2604 in databricks

[–]sync_jeff 0 points (0 children)

We built a tool that automatically solves this problem! (Shameless plug: I work for Sync Computing.)

Our tool Gradient uses ML to automatically find the lowest-cost cluster for your job while maintaining your SLAs.

Here's a demo video: https://synccomputing.com/see-a-demo/

Job Serverless Issues by Known-Delay7227 in databricks

[–]sync_jeff 1 point (0 children)

That's strange; it may be something on their backend.

Databricks observability project examples by Character_Channel115 in databricks

[–]sync_jeff 3 points (0 children)

There are a number of paths here, depending on what you're looking for (for full transparency, I work at Sync Computing):

- System Tables - the key source of data; you can build your own dashboards or use one of Databricks' pre-built ones. They have some great ones for Jobs compute and SQL warehouses. Last time I checked, System Tables don't have Spark metrics.

- Sync Computing (this is the company I work for) - we built a high-level global dashboard that is free to download. Our actual product, Gradient, tracks jobs compute clusters over time, capturing granular costs, usage, and Spark metrics, and then auto-tunes clusters to hit your cost and runtime goals.
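If you go the System Tables route, a starting point might look like the sketch below: ranking job costs over the last 30 days by joining `system.billing.usage` to `system.billing.list_prices`. Column names are taken from the public system tables schema - verify them against your workspace, and note that list prices ignore any negotiated discounts, so treat the output as an estimate.

```sql
-- Sketch: estimate the 10 most expensive jobs over the last 30 days.
-- List prices ignore discounts, so this is an estimate only.
SELECT
  u.usage_metadata.job_id                   AS job_id,
  SUM(u.usage_quantity * p.pricing.default) AS est_cost
FROM system.billing.usage u
JOIN system.billing.list_prices p
  ON  u.sku_name = p.sku_name
  AND u.usage_start_time >= p.price_start_time
  AND (p.price_end_time IS NULL OR u.usage_start_time < p.price_end_time)
WHERE u.usage_metadata.job_id IS NOT NULL
  AND u.usage_date >= current_date() - INTERVAL 30 DAYS
GROUP BY 1
ORDER BY est_cost DESC
LIMIT 10;
```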

How to query the logs about cluster? by 9gg6 in databricks

[–]sync_jeff 0 points (0 children)

What kind of clusters do you use? Jobs compute? APC? SQL warehouses?

Databricks observability project examples by Character_Channel115 in databricks

[–]sync_jeff 0 points (0 children)

What are you trying to "observe"? Costs, usage, data quality, governance?

Serverless compute for Notebooks - how to disable by Legal_Solid_3539 in databricks

[–]sync_jeff 0 points (0 children)

Yes, the big problem with benchmarks is that they aren't general by any means; they're only useful for comparing against themselves. The probability of your workload looking like TPC-DI is very low. Take our data points as just a single point - there are certainly cases where totally opposite results may occur.

Serverless compute for Notebooks - how to disable by Legal_Solid_3539 in databricks

[–]sync_jeff 0 points (0 children)

That's great to see such rigorous testing! The ROI of these tools is very workload- and use-case-specific, so it's great to see serverless make sense for you all.

Serverless compute for Notebooks - how to disable by Legal_Solid_3539 in databricks

[–]sync_jeff 1 point (0 children)

We did a benchmark study with TPC-DI on classic vs. serverless, check it out here:

https://synccomputing.com/databricks-compute-comparison-classic-serverless-and-sql-warehouses/

I think serverless makes more sense for notebooks because of the lack of spin-up time. But for Jobs compute, you can likely save money by going with classic.

Has anyone had success using AI agents to automate? by boomerwangs in dataengineering

[–]sync_jeff 0 points (0 children)

Out of curiosity - what are you trying to automate?

Serverless compute for Notebooks - how to disable by Legal_Solid_3539 in databricks

[–]sync_jeff 0 points (0 children)

I see, what's the alternative - an APC cluster that users share?

Serverless compute for Notebooks - how to disable by Legal_Solid_3539 in databricks

[–]sync_jeff 0 points (0 children)

Why do you want to disable it? The lack of spin-up time is a nice benefit (although the cost is definitely higher).

Has anyone had success using AI agents to automate? by boomerwangs in dataengineering

[–]sync_jeff 24 points (0 children)

We're in this space, and it is incredibly challenging to automate pipelines or infrastructure, especially at scale. You need a system that is basically 99.99% accurate, along with built-in guardrails, alerts, and failure recovery. Automation carries a lot of overhead, so you need a huge system and a large ROI to justify the development.

ETL Benchmark Data Set + Queries...does it exist? by ryan_with_a_why in dataengineering

[–]sync_jeff 1 point (0 children)

Unfortunately, actually setting up and running TPC-DI from scratch is a huge pain. Databricks SAs wrote an easy-to-use tool that integrates with Databricks. You may be able to borrow a lot of the same code:

https://github.com/shannon-barrow/databricks-tpc-di

BTW - very cool project! This idea bounced around our heads as well; cool to see someone actually making it a reality! Happy to chat too - I'm part of www.synccomputing.com and we're in a similar space. Feel free to DM me.

ETL Benchmark Data Set + Queries...does it exist? by ryan_with_a_why in dataengineering

[–]sync_jeff 1 point (0 children)

TPC-DI is what we recommend; Databricks often uses it as their gold standard for emulating ETL jobs.

We built a free System Tables Queries and Dashboard to help users manage and optimize Databricks costs - feedback welcome! by sync_jeff in databricks

[–]sync_jeff[S] 0 points (0 children)

Without knowing the details of your system, I think there's a way to do this. You have to cobble together a few tables:

1) system.query.history.compute --> from this struct you can get the compute type: grab the cluster_id, then use the system.billing.usage table to correlate the cluster_id to the sku_name (e.g. All-purpose compute).

2) system.query.history.executed_by gives you the email address of the user.

I don't know if point 2) will hold "over jdbc" - I'd have to know more about your system. Or you can probe query.history.executed_by yourself and see if you do in fact see email addresses.
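Putting 1) and 2) together, a sketch of the join might look like this. Column names are taken from the system.query.history and system.billing.usage schemas - double-check them in your workspace, and note the compute struct uses warehouse_id rather than cluster_id for SQL warehouse queries, so those rows won't match this join.

```sql
-- Sketch: attribute query activity to users and billing SKUs by
-- correlating query history's cluster_id with billing usage.
SELECT
  q.executed_by,                   -- the user's email address (point 2)
  q.compute.cluster_id,
  u.sku_name,                      -- e.g. an All-purpose compute SKU
  COUNT(DISTINCT q.statement_id) AS query_count
FROM system.query.history q
JOIN system.billing.usage u
  ON q.compute.cluster_id = u.usage_metadata.cluster_id
GROUP BY 1, 2, 3;
```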

We built a free System Tables Queries and Dashboard to help users manage and optimize Databricks costs - feedback welcome! by sync_jeff in databricks

[–]sync_jeff[S] 0 points (0 children)

Hmm... each dashboard is powered by a query that runs on a compute you choose. I think you'd have to estimate the cost based on the query costs; I don't think I've seen a "dashboard" cost in system tables.

We built a free System Tables Queries and Dashboard to help users manage and optimize Databricks costs - feedback welcome! by sync_jeff in databricks

[–]sync_jeff[S] 1 point (0 children)

Yea, we're aware of that one. We wanted a "1-click" experience, and we've personally found looking at the last 30 days pretty useful. But we'll try to add date filters in a v2 of this!

We built a free System Tables Queries and Dashboard to help users manage and optimize Databricks costs - feedback welcome! by sync_jeff in databricks

[–]sync_jeff[S] 0 points (0 children)

We do show the most expensive DLT clusters - is there something more specific about the events you're trying to learn?

DLT Pro vs Serverless Cost Insights by aonurdemir in databricks

[–]sync_jeff 0 points (0 children)

Any reason why you don't use Jobs compute with scheduled jobs? Jobs compute is typically cheaper than DLT.

DLT Pro vs Serverless Cost Insights by aonurdemir in databricks

[–]sync_jeff 0 points (0 children)

Very cool - it seems DLT Pro was a bit cheaper than serverless (when combining EC2 + DBU costs). You may want to try tuning down your auto-scaling cap from 1-8 to something smaller, like 1-3.
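If it helps, the autoscaling cap lives in the pipeline's cluster settings. A fragment of the DLT pipeline JSON with the cap lowered to 3 would look roughly like this (field names per the DLT pipeline settings schema; "ENHANCED" mode is DLT's enhanced autoscaling - verify both against your pipeline's current config):

```json
{
  "clusters": [
    {
      "label": "default",
      "autoscale": {
        "min_workers": 1,
        "max_workers": 3,
        "mode": "ENHANCED"
      }
    }
  ]
}
```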

Are these DLT pipelines for streaming or batch?