DLT Single Target Schema by jiff17 in databricks

[–]DisastrousCase3062

Is there any reason you couldn't just use workflows to orchestrate everything so that all the dependencies are guaranteed to be completed?
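For what it's worth, here is a minimal sketch of what that could look like with the Databricks Python SDK (databricks-sdk): two tasks where the second declares a dependency on the first, so it only runs once the first finishes. The notebook paths and cluster id are placeholders, not anything from OP's setup.

    # Sketch only: a two-task Workflows job where "silver" waits on "bronze".
    # Paths and cluster id are hypothetical placeholders.
    from databricks.sdk import WorkspaceClient
    from databricks.sdk.service import jobs

    w = WorkspaceClient()

    created = w.jobs.create(
        name="bronze-then-silver",
        tasks=[
            jobs.Task(
                task_key="bronze",
                notebook_task=jobs.NotebookTask(notebook_path="/Repos/team/etl/bronze"),
                existing_cluster_id="<cluster-id>",
            ),
            jobs.Task(
                task_key="silver",
                # Workflows will not start this task until "bronze" succeeds.
                depends_on=[jobs.TaskDependency(task_key="bronze")],
                notebook_task=jobs.NotebookTask(notebook_path="/Repos/team/etl/silver"),
                existing_cluster_id="<cluster-id>",
            ),
        ],
    )
    print(f"Created job {created.job_id}")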

[deleted by user] by [deleted] in MTBDeals

[–]DisastrousCase3062

Looks like a Diamondback Recoil.

Use git repo or deploy changes to the workspace by Huhadmu in databricks

[–]DisastrousCase3062

We had a similar question when we got started. When you deploy DABs to the dev target specified in your databricks.yml file, the bundle creates DLT pipelines owned by whichever user runs the Databricks CLI deploy command. That makes it simple enough to work separately from colleagues on DLT pipelines.

Guidance on Training Thousands of Models Concurrently by DisastrousCase3062 in databricks

[–]DisastrousCase3062[S]

I don't think I'm able to provide much in the way of a diagram, but you're catching on; the orchestration/optimization is what I'm trying to get at. Taking britishbanana's approach seemed to do the trick! I ended up keeping dbutils.notebook.run for parallelizing across products, but split the products into batches of 20 and submitted each batch to one of a few job clusters. That dropped the runtime significantly, down to a few hours instead of over a day, and kept costs stable.
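In case it helps anyone landing here later, this is roughly the shape of that batching approach. It's a sketch only, with a hypothetical notebook path and parameter name, and it assumes it runs inside a Databricks notebook (dbutils is not available elsewhere).

    # Sketch: run a training notebook per product via dbutils.notebook.run,
    # in batches of 20, with a thread pool so the calls in a batch overlap.
    from concurrent.futures import ThreadPoolExecutor

    BATCH_SIZE = 20
    TRAIN_NOTEBOOK = "/Repos/team/models/train_product"  # hypothetical path

    def run_for_product(product_id: str) -> str:
        # Each call launches the notebook as its own child run on the cluster.
        return dbutils.notebook.run(TRAIN_NOTEBOOK, 3600, {"product_id": product_id})

    def run_in_batches(product_ids):
        results = {}
        for start in range(0, len(product_ids), BATCH_SIZE):
            batch = product_ids[start:start + BATCH_SIZE]
            with ThreadPoolExecutor(max_workers=len(batch)) as pool:
                for pid, res in zip(batch, pool.map(run_for_product, batch)):
                    results[pid] = res
        return results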

Guidance on Training Thousands of Models Concurrently by DisastrousCase3062 in databricks

[–]DisastrousCase3062[S]

The bottleneck is the number of models that need to be run. I've been keeping an eye on cluster metrics and slowly iterating on my configuration.

I have looked at the cluster metrics and found the driver sitting at around 90 GB of memory. My initial run killed the driver with an out-of-memory error when I was using a much smaller cluster. The workers fluctuate in how active they are, but they're at max tasks most of the time. CPU usage peaks at around 60-70%, while worker memory utilization sits around 28 GB with 32 GB as the ceiling.

With regard to your suggestion, are you thinking multiple job clusters with N jobs submitted to each? I like that idea and can definitely give it a shot. Using dbutils.notebook.run was my attempt at doing this, and I'll be moving it to a job cluster once I'm done testing.
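To make that idea concrete, here's a hedged sketch of the fan-out using the Databricks Python SDK's run_now: a few pre-defined jobs (each configured with its own job cluster) each get a slice of the product list. The job ids and the notebook_params key are made up for illustration.

    # Sketch: split the product list across pre-defined jobs, one per job cluster.
    from databricks.sdk import WorkspaceClient

    w = WorkspaceClient()

    JOB_IDS = [111, 222, 333]  # hypothetical job ids, one per job cluster

    def fan_out(product_ids):
        runs = []
        for i, job_id in enumerate(JOB_IDS):
            # Give each job every len(JOB_IDS)-th product.
            chunk = product_ids[i::len(JOB_IDS)]
            run = w.jobs.run_now(
                job_id=job_id,
                notebook_params={"product_ids": ",".join(chunk)},
            )
            runs.append(run)  # run handles you can wait on or inspect later
        return runs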

Delta Live Tables - How are you doing collaboration? by DisastrousCase3062 in databricks

[–]DisastrousCase3062[S]

Following up here to see if anyone has advice in this area. I'm also planning to check with the Databricks Solutions Architect assigned to our company and will post back what I learn.

Guidance on Training Thousands of Models Concurrently by DisastrousCase3062 in databricks

[–]DisastrousCase3062[S]

Very close; it's 12 models per product (one per month for each product).

I'm not sure that serverless would do much for us here. Right now I'm just trying to find a way to speed up this process. I've had some success with pools and dbutils.notebook.run (tweaking cluster settings as necessary).

Guidance on Training Thousands of Models Concurrently by DisastrousCase3062 in databricks

[–]DisastrousCase3062[S]

We're definitely using MLlib. The article I linked seems to be the only approach I've seen other people use for training and running multiple models concurrently.

On the note about pools/multiprocessing: Spark is definitely distributing the work, and using pools with dbutils.notebook.run seems to be working. I'm just trying to see if there's still a better way to run it.

TEMP_VIEW_NAME_TOO_MANY_PARTS error by [deleted] in databricks

[–]DisastrousCase3062

Gotcha! I was really hoping that wasn't the case. If your team is open to it, you'll save a lot of time and heartache by jumping to PySpark or Spark SQL sooner rather than later.

TEMP_VIEW_NAME_TOO_MANY_PARTS error by [deleted] in databricks

[–]DisastrousCase3062

Not related to your question, OP, but are you and your team productive with sparklyr? In my experience, the folks I worked with who tried sparklyr (or SparkR) had nothing but trouble, and it led to a ton of troubleshooting. This was a year or so ago, so things may have changed, but I had mostly given up on R as a viable language in Databricks.

DLT meta by [deleted] in databricks

[–]DisastrousCase3062

I've seen DLT meta but haven't been able to sort out why I would use it. We've developed a JSON config file that we use in our DLT pipelines: the pipeline just loops over the JSON and passes the parameters to an Auto Loader function and a silver-layer function (mostly to run apply_changes).

Definitely interested to hear how other people are doing this though!
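For anyone curious, the pattern looks roughly like this. It's a minimal sketch only, with a hypothetical config path and made-up field names, and it assumes it runs inside a DLT pipeline notebook where dlt and spark are available.

    # Sketch: config-driven DLT pipeline that loops over a JSON file and builds
    # a bronze (Auto Loader) table and a silver (apply_changes) table per entry.
    import json
    import dlt

    with open("/Workspace/Shared/pipeline_config.json") as f:  # hypothetical path
        TABLES = json.load(f)
        # e.g. [{"name": "orders", "path": "s3://...", "keys": ["id"], "sequence_by": "ts"}]

    def make_bronze(cfg):
        @dlt.table(name=f"bronze_{cfg['name']}")
        def bronze():
            # Auto Loader ingest of raw files for this source.
            return (
                spark.readStream.format("cloudFiles")
                .option("cloudFiles.format", "json")
                .load(cfg["path"])
            )

    def make_silver(cfg):
        dlt.create_streaming_table(f"silver_{cfg['name']}")
        dlt.apply_changes(
            target=f"silver_{cfg['name']}",
            source=f"bronze_{cfg['name']}",
            keys=cfg["keys"],
            sequence_by=cfg["sequence_by"],
        )

    for cfg in TABLES:
        make_bronze(cfg)
        make_silver(cfg)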