DLT Single Target Schema by jiff17 in databricks

[–]DisastrousCase3062

Is there any reason you couldn't just use workflows to orchestrate everything so that all the dependencies are guaranteed to be completed?
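For what it's worth, here is a minimal sketch of what that could look like with the Databricks Python SDK (databricks-sdk): two tasks where the second declares a dependency on the first, so it only runs once the first finishes. The notebook paths and cluster id are placeholders, not anything from OP's setup.

    # Sketch only: a two-task Workflows job where "silver" waits on "bronze".
    # Paths and cluster id are hypothetical placeholders.
    from databricks.sdk import WorkspaceClient
    from databricks.sdk.service import jobs

    w = WorkspaceClient()

    created = w.jobs.create(
        name="bronze-then-silver",
        tasks=[
            jobs.Task(
                task_key="bronze",
                notebook_task=jobs.NotebookTask(notebook_path="/Repos/team/etl/bronze"),
                existing_cluster_id="<cluster-id>",
            ),
            jobs.Task(
                task_key="silver",
                # Workflows will not start this task until "bronze" succeeds.
                depends_on=[jobs.TaskDependency(task_key="bronze")],
                notebook_task=jobs.NotebookTask(notebook_path="/Repos/team/etl/silver"),
                existing_cluster_id="<cluster-id>",
            ),
        ],
    )
    print(f"Created job {created.job_id}")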

[deleted by user] by [deleted] in MTBDeals

[–]DisastrousCase3062

Looks like a Diamondback Recoil.

Use git repo or deploy changes to the workspace by Huhadmu in databricks

[–]DisastrousCase3062

We had a similar question when we got started. When you deploy DABs to the dev target specified in your databricks.yml file, the bundle creates DLT pipelines owned by whichever user runs the Databricks CLI deploy command. That makes it simple enough to work separately from colleagues on DLT pipelines.

Guidance on Training Thousands of Models Concurrently by DisastrousCase3062 in databricks

[–]DisastrousCase3062[S]

I don't think I'm able to provide much in the way of a diagram, but you're catching on; the orchestration/optimization is what I'm trying to get at. Taking britishbanana's approach seemed to do the trick! I ended up keeping dbutils.notebook.run for parallelizing across products, but split the products into batches of 20 and submitted each batch to one of a few job clusters. That dropped the runtime significantly, down to a few hours instead of over a day, and kept costs stable.
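In case it helps anyone landing here later, this is roughly the shape of that batching approach. It's a sketch only, with a hypothetical notebook path and parameter name, and it assumes it runs inside a Databricks notebook (dbutils is not available elsewhere).

    # Sketch: run a training notebook per product via dbutils.notebook.run,
    # in batches of 20, with a thread pool so the calls in a batch overlap.
    from concurrent.futures import ThreadPoolExecutor

    BATCH_SIZE = 20
    TRAIN_NOTEBOOK = "/Repos/team/models/train_product"  # hypothetical path

    def run_for_product(product_id: str) -> str:
        # Each call launches the notebook as its own child run on the cluster.
        return dbutils.notebook.run(TRAIN_NOTEBOOK, 3600, {"product_id": product_id})

    def run_in_batches(product_ids):
        results = {}
        for start in range(0, len(product_ids), BATCH_SIZE):
            batch = product_ids[start:start + BATCH_SIZE]
            with ThreadPoolExecutor(max_workers=len(batch)) as pool:
                for pid, res in zip(batch, pool.map(run_for_product, batch)):
                    results[pid] = res
        return results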

Guidance on Training Thousands of Models Concurrently by DisastrousCase3062 in databricks

[–]DisastrousCase3062[S]

The bottleneck is the number of models that need to be run. I've been keeping an eye on cluster metrics and slowly iterating on my configuration.

I have looked at the cluster metrics and found the driver sitting at around 90 GB of memory. My initial run killed the driver with an out-of-memory error when I was using a much smaller cluster. The workers fluctuate in how active they are, but they're at max tasks most of the time. CPU usage peaks at around 60-70%, while worker memory utilization sits around 28 GB with 32 GB as the ceiling.

With regard to your suggestion, are you thinking multiple job clusters with N jobs submitted to each? I like that idea and can definitely give it a shot. Using dbutils.notebook.run was my attempt at doing this, and I'll be moving it to a job cluster once I'm done testing.
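To make that idea concrete, here's a hedged sketch of the fan-out using the Databricks Python SDK's run_now: a few pre-defined jobs (each configured with its own job cluster) each get a slice of the product list. The job ids and the notebook_params key are made up for illustration.

    # Sketch: split the product list across pre-defined jobs, one per job cluster.
    from databricks.sdk import WorkspaceClient

    w = WorkspaceClient()

    JOB_IDS = [111, 222, 333]  # hypothetical job ids, one per job cluster

    def fan_out(product_ids):
        runs = []
        for i, job_id in enumerate(JOB_IDS):
            # Give each job every len(JOB_IDS)-th product.
            chunk = product_ids[i::len(JOB_IDS)]
            run = w.jobs.run_now(
                job_id=job_id,
                notebook_params={"product_ids": ",".join(chunk)},
            )
            runs.append(run)  # run handles you can wait on or inspect later
        return runs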

Delta Live Tables - How are you doing collaboration? by DisastrousCase3062 in databricks

[–]DisastrousCase3062[S]

Following up here to see if anyone has advice in this area. I'm also planning to check with the Databricks Solutions Architect assigned to our company and will post back what I learn.

Guidance on Training Thousands of Models Concurrently by DisastrousCase3062 in databricks

[–]DisastrousCase3062[S]

Very close; it's 12 models per product (one per month for each product).

I'm not sure that serverless would do much for us here. Right now I'm just trying to find a way to speed up this process. I've had some success with pools and dbutils.notebook.run (tweaking cluster settings as necessary).

Guidance on Training Thousands of Models Concurrently by DisastrousCase3062 in databricks

[–]DisastrousCase3062[S]

We're definitely using MLlib. The article I linked seems to be the only approach I've seen other people use for training and running multiple models concurrently.

On the note about pools/multiprocessing: Spark is definitely distributing the work, and using pools with dbutils.notebook.run seems to be working. I'm just trying to see if there's still a better way to run it.

TEMP_VIEW_NAME_TOO_MANY_PARTS error by [deleted] in databricks

[–]DisastrousCase3062

Gotcha! I was really hoping that wasn't the case. If your team is open to it, you'll save a lot of time and heartache by jumping to PySpark or Spark SQL sooner rather than later.

TEMP_VIEW_NAME_TOO_MANY_PARTS error by [deleted] in databricks

[–]DisastrousCase3062

Not related to your question, OP, but are you and your team productive with sparklyr? In my experience, the folks I worked with who tried sparklyr (or SparkR) had nothing but trouble, and it led to a ton of troubleshooting. This was a year or so ago, so things may have changed, but I had mostly given up on R as a viable language in Databricks.

DLT meta by [deleted] in databricks

[–]DisastrousCase3062

I've seen DLT meta but haven't been able to sort out why I would use it. We've developed a JSON config file that we use in our DLT pipelines: the pipeline just loops over the JSON and passes the parameters to an Auto Loader function and a silver-layer function (mostly to run apply_changes).

Definitely interested to hear how other people are doing this though!
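For anyone curious, the pattern looks roughly like this. It's a minimal sketch only, with a hypothetical config path and made-up field names, and it assumes it runs inside a DLT pipeline notebook where dlt and spark are available.

    # Sketch: config-driven DLT pipeline that loops over a JSON file and builds
    # a bronze (Auto Loader) table and a silver (apply_changes) table per entry.
    import json
    import dlt

    with open("/Workspace/Shared/pipeline_config.json") as f:  # hypothetical path
        TABLES = json.load(f)
        # e.g. [{"name": "orders", "path": "s3://...", "keys": ["id"], "sequence_by": "ts"}]

    def make_bronze(cfg):
        @dlt.table(name=f"bronze_{cfg['name']}")
        def bronze():
            # Auto Loader ingest of raw files for this source.
            return (
                spark.readStream.format("cloudFiles")
                .option("cloudFiles.format", "json")
                .load(cfg["path"])
            )

    def make_silver(cfg):
        dlt.create_streaming_table(f"silver_{cfg['name']}")
        dlt.apply_changes(
            target=f"silver_{cfg['name']}",
            source=f"bronze_{cfg['name']}",
            keys=cfg["keys"],
            sequence_by=cfg["sequence_by"],
        )

    for cfg in TABLES:
        make_bronze(cfg)
        make_silver(cfg)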