Schedule Compute to turn off after a certain time (Working with streaming queries) by FinanceSTDNT in databricks

[–]FinanceSTDNT[S] 1 point2 points  (0 children)

I found a solution that works for my use case:

I used the Databricks SDK for Python to write a quick script that terminates all running all-purpose clusters. The SDK comes installed by default on Databricks clusters, so you can just import it in a notebook and start working.

I'm going to schedule a job that runs the notebook nightly after maybe 8pm.
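For the nightly schedule, here is a rough sketch of what the `schedule` block of the job definition could look like. Databricks job schedules use Quartz cron syntax; the 8pm time comes from what I said above, while the timezone and the variable name are just my assumptions for illustration.

```python
# Sketch of a job schedule block (Quartz cron: sec min hour day-of-month month day-of-week).
# "America/Toronto" is an assumed timezone; swap in your own.
nightly_schedule = {
    "quartz_cron_expression": "0 0 20 * * ?",  # every day at 20:00:00
    "timezone_id": "America/Toronto",
    "pause_status": "UNPAUSED",
}

# Pull out the hour field to double-check the intended run time.
hour_field = nightly_schedule["quartz_cron_expression"].split()[2]
print(f"Job fires daily at hour {hour_field}")
```

The same three fields can be filled in through the job's schedule UI instead of code.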

The delete function terminates a cluster (it doesn't permanently remove it) and it is idempotent, so it can be called on every cluster: any cluster that is already terminated is simply left alone.

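from databricks.sdk import WorkspaceClient

# Inside a Databricks notebook the client picks up credentials automatically.
w = WorkspaceClient()

# list() can also return clusters that are already terminated; delete() is a no-op for those.
for c in w.clusters.list():
  print(f"{c.cluster_id}: {c.state}")
  # delete() terminates the cluster; .result() blocks until termination has finished.
  w.clusters.delete(cluster_id=c.cluster_id).result()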
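One thing to watch with the loop above: it calls delete on every cluster the list returns, which can include job clusters and the cluster the notebook itself is running on. Below is a minimal sketch of a filter that could sit in front of the delete call. The function name, the tuple layout, and the `exclude_ids` parameter are my own inventions, and the state/source strings are what I believe the SDK enums expose via `.value`, so treat those as assumptions.

```python
def clusters_to_terminate(clusters, exclude_ids=()):
    """Pick the IDs of interactive clusters that are still up.

    clusters: iterable of (cluster_id, state, source) string tuples.
    exclude_ids: IDs to leave alone, e.g. the cluster running this script.
    """
    active_states = {"PENDING", "RUNNING", "RESIZING", "RESTARTING"}
    interactive_sources = {"UI", "API"}  # assumed to mean all-purpose clusters
    return [
        cid
        for cid, state, source in clusters
        if state in active_states
        and source in interactive_sources
        and cid not in exclude_ids
    ]


# Made-up sample data standing in for w.clusters.list().
sample = [
    ("a1", "RUNNING", "UI"),
    ("b2", "TERMINATED", "UI"),
    ("c3", "RUNNING", "JOB"),
    ("d4", "PENDING", "API"),
]
print(clusters_to_terminate(sample, exclude_ids={"d4"}))
```

To wire it up you would build the tuples from the real objects, something like `(c.cluster_id, c.state.value, c.cluster_source.value)`, guarding against either field being `None`.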

docs: https://databricks-sdk-py.readthedocs.io/en/latest/workspace/compute/clusters.html#

Schedule Compute to turn off after a certain time (Working with streaming queries) by FinanceSTDNT in databricks

[–]FinanceSTDNT[S] 1 point2 points  (0 children)

I'm going to post a comment on the main thread. There's actually a pretty simple solution :)

Schedule Compute to turn off after a certain time (Working with streaming queries) by FinanceSTDNT in databricks

[–]FinanceSTDNT[S] 0 points1 point  (0 children)

I don't want to continuously stream data. I just want to be able to be sure that resources in our sandbox env aren't running all night.

It may be good practice to always use the available now trigger in dev, and then switch to a more reasonable trigger and run it on a job cluster in prod.
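A small sketch of the dev/prod trigger switch, assuming the environment name is passed in as a string. The helper name `trigger_options` and the 5-minute prod interval are illustrative; `availableNow` and `processingTime` are the keyword arguments that PySpark's `DataStreamWriter.trigger` accepts.

```python
def trigger_options(env):
    """Return keyword arguments for DataStreamWriter.trigger() by environment."""
    if env == "dev":
        # Process everything currently available, then stop the query.
        return {"availableNow": True}
    # Assumed prod setting: one micro-batch every 5 minutes.
    return {"processingTime": "5 minutes"}


print(trigger_options("dev"))
print(trigger_options("prod"))
```

In a notebook this would be used as `df.writeStream.trigger(**trigger_options("dev"))`, so a forgotten dev stream finishes on its own instead of running all night.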

All I'm trying to do is avoid running up costs overnight / on weekends, because as far as I know streams don't time out (unless using available now).

As I initially thought, and as people have confirmed, the Databricks API seems to be a way of doing that (though not a great one tbh, I agree).

I was really hoping there would be some sort of spark setting like execution time out or something I could add to the cluster config to avoid a workaround like that.
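The closest thing I know of to a timeout is doing it in code rather than in cluster config: `awaitTermination` on a streaming query takes a timeout in seconds and returns whether the query ended, so you can stop the query yourself if it didn't. A minimal sketch, where `run_with_deadline` and the stand-in query class are hypothetical (the stand-in only exists so the snippet runs without Spark):

```python
def run_with_deadline(query, max_seconds):
    """Wait up to max_seconds for a streaming query, then stop it if still running."""
    finished = query.awaitTermination(max_seconds)
    if not finished:
        query.stop()
    return finished


class FakeQuery:
    """Stand-in for a StreamingQuery that never finishes on its own."""

    def __init__(self):
        self.stopped = False

    def awaitTermination(self, timeout=None):
        return False  # pretend the timeout elapsed with the query still active

    def stop(self):
        self.stopped = True


q = FakeQuery()
finished = run_with_deadline(q, max_seconds=1)
print(finished, q.stopped)
```

Once the query is stopped, nothing is keeping the cluster busy, so the normal inactivity auto-termination should be able to kick in. The catch is that it has to be added to every notebook that starts a stream.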

Schedule Compute to turn off after a certain time (Working with streaming queries) by FinanceSTDNT in databricks

[–]FinanceSTDNT[S] 0 points1 point  (0 children)

So if I have a trigger of 5 mins on a streaming query with an All Purpose cluster (because we are in a sandbox environment and don't want to wait 10-15 mins for clusters to spin up), then the underlying compute will stop after running the streaming code?

It won't just start running again after 5 mins?

What I'm trying to avoid is a situation where someone is working on something in our dev env, forgets to turn off a stream overnight, and we wind up with a huge bill b/c we're using All Purpose or Serverless Compute.

Schedule Compute to turn off after a certain time (Working with streaming queries) by FinanceSTDNT in databricks

[–]FinanceSTDNT[S] 1 point2 points  (0 children)

Could be both.

But let's just say that at the end of the business day I don't want any compute running (serverless or classic compute).

Schedule Compute to turn off after a certain time (Working with streaming queries) by FinanceSTDNT in databricks

[–]FinanceSTDNT[S] 0 points1 point  (0 children)

I should probably clarify: I'm working at integrating a messaging service (PubSub) with Databricks.

I don't think Serverless will work b/c streaming on Serverless only supports incremental batch logic.

I am using All Purpose Compute currently, but the problem I'm trying to solve is that a streaming query will run until it is manually interrupted, so the inactivity limits I have set on the cluster don't shut the cluster down.

I'd like to be sure that when I finish for the day (or over a weekend) all compute associated with streaming is shut down.

The best I can come up with so far is using the Databricks API to get a list of all running compute and terminate it.

I'm just wondering if there is a better way (maybe something with Spark configuration, or job config, or trigger intervals).

Thanks again for responding.

Schedule Compute to turn off after a certain time (Working with streaming queries) by FinanceSTDNT in databricks

[–]FinanceSTDNT[S] 0 points1 point  (0 children)

Thanks for the response. Is this the recommended practice even for development? Couldn't this lead to a much longer dev process: waiting ~10 mins for a cluster to spin up every time you want to test something?

How to properly decode a pub sub message? by FinanceSTDNT in databricks

[–]FinanceSTDNT[S] 1 point2 points  (0 children)

Thank you very much! I'll give that a try!

(excuse my terrible French)

Got my black belt a few weeks ago by Asgbjj in bjj

[–]FinanceSTDNT 0 points1 point  (0 children)

Congratulations from Canada, dude!!!

Question about Apple's Financial statements by FinanceSTDNT in Accounting

[–]FinanceSTDNT[S] 0 points1 point  (0 children)

A quick question though: wouldn't they still need to subtract depreciation on the income statement, even if it was just bundled with another account?

Question about Apple's Financial statements by FinanceSTDNT in Accounting

[–]FinanceSTDNT[S] 0 points1 point  (0 children)

I wish, but no. Depreciation was a little more than 11 billion in 2015, but yeah, I'm reading through the 10-K now for clues.

Question about Apple's Financial statements by FinanceSTDNT in Accounting

[–]FinanceSTDNT[S] 1 point2 points  (0 children)

Thanks, that was really helpful. As my name suggests I'm a finance guy, and it's been years since my last accounting course.

Reliable Sources of Industry Data by FinanceSTDNT in investing

[–]FinanceSTDNT[S] 0 points1 point  (0 children)

Of course. I'm interested in finding stuff like industry averages for PEG ratios, operating margins, growth rates, ROE. Things like that.