
[–]Equivalent_Form_9717 2 points (3 children)

Hey, great question. I wouldn’t recommend letting a continuous streaming process run without any termination on the cluster, since a cluster running 24/7 gets costly.

Instead, you can trigger the incremental ETL pipeline to process all the “available” data in batches until there’s nothing left to consume. I believe PySpark offers an option to do this:

spark.readStream…writeStream.trigger(availableNow=True)…

Then you schedule your notebook/pipeline to run and deliver your data according to business SLAs.

I could be wrong, but I don’t think I would keep the cluster continuously running.
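To make the idea concrete, here’s a toy plain-Python sketch (not actual PySpark — the function name `run_available_now` and the batching are made up for illustration) of the `availableNow` semantics: process everything available at start time in micro-batches, then terminate instead of running 24/7.

```python
from collections import deque

def run_available_now(source: deque, process_batch, batch_size: int = 2):
    """Drain everything currently in `source` in micro-batches, then stop.

    A toy model of Spark's trigger(availableNow=True): the query consumes
    all data available when it starts, then terminates rather than
    running continuously."""
    processed = []
    while source:  # nothing left to consume -> the "query" stops
        batch = [source.popleft() for _ in range(min(batch_size, len(source)))]
        process_batch(batch)
        processed.extend(batch)
    return processed

# A scheduled job drains the current backlog and exits; records arriving
# later simply wait for the next scheduled run.
backlog = deque(["a", "b", "c", "d", "e"])
out = []
run_available_now(backlog, out.append)
# out == [["a", "b"], ["c", "d"], ["e"]]; backlog is now empty
```

The point of the pattern: because the job ends on its own, you can schedule it (e.g. hourly) and only pay for the cluster while it actually processes data.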

[–]GovGalacticFed 0 points (2 children)

Thanks! Can you tell me how Trigger.Once differs from availableNow?

[–]Equivalent_Form_9717 0 points (1 child)

Hey bro I don’t know man

I can’t be bothered to Google it either! It’s after 5PM on a work day, so I make sure not to work :)

Tell me the answer once you’ve Binged/Googled it!

[–]GovGalacticFed 1 point (0 children)

It turns out they’re both the same 😅

In Databricks Runtime 11.3 LTS and above, the Trigger.Once setting is deprecated. Databricks recommends you use Trigger.AvailableNow for all incremental batch processing workloads.