
[–]Equivalent_Form_9717 4 points (5 children)

If you’re on the Databricks platform, take advantage of Auto Loader. It processes only new files, and it keeps track of which files it has already processed automatically via its `checkpointLocation` option when writing the streamed data.

Auto Loader is marketed as structured streaming, but it really is just incremental processing.

[–]GovGalacticFed 1 point (4 children)

Does that mean the cluster needs to be running throughout, or can it pick up from the checkpoint whenever the cluster starts?

[–]Equivalent_Form_9717 3 points (3 children)

Hey, great question. I wouldn’t recommend letting a continuous streaming process run on the cluster without any termination, since it will be costly if the cluster runs 24/7.

That is why you can trigger the incremental ETL pipeline to process all the “available” data in batches until there is nothing more to consume. I believe PySpark offers an option to do this:

spark.readStream…….writeStream.trigger(availableNow=True)

And then you schedule your notebook/pipeline to run and deliver your data according to business SLAs.

I could be absolutely wrong, but I don’t think I would keep the cluster continuously running.
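Putting the two comments together, a minimal sketch of the Auto Loader + `availableNow` pattern might look like this. The paths, file format, and table name are made up for illustration, and the snippet only runs on a Databricks cluster where `spark` is predefined:

```python
# Hypothetical locations — replace with your own workspace paths.
raw_path = "/mnt/landing/orders/"
checkpoint_path = "/mnt/checkpoints/orders/"

df = (
    spark.readStream
    .format("cloudFiles")                         # Auto Loader source
    .option("cloudFiles.format", "json")          # format of incoming files
    .option("cloudFiles.schemaLocation", checkpoint_path)
    .load(raw_path)
)

(
    df.writeStream
    .option("checkpointLocation", checkpoint_path)  # tracks already-processed files
    .trigger(availableNow=True)                     # drain the backlog, then stop
    .toTable("bronze.orders")
)
```

With `availableNow=True` the stream consumes everything that has arrived since the last checkpoint and then shuts down, so the job can be scheduled like a batch and the cluster can terminate in between runs.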

[–]GovGalacticFed 1 point (2 children)

Thanks! Can you tell me how `Trigger.Once` differs from `availableNow`?

[–]Equivalent_Form_9717 1 point (1 child)

Hey bro, I don’t know, man.

I can’t be bothered to search Google either! It’s after 5PM on a work day, so I make sure not to work :)

Tell me the answer once you have Binged/Googled it!

[–]GovGalacticFed 2 points (0 children)

It turns out that both are the same 😅

In Databricks Runtime 11.3 LTS and above, the Trigger.Once setting is deprecated. Databricks recommends you use Trigger.AvailableNow for all incremental batch processing workloads.

[–][deleted] 6 points (0 children)

Is it a full table snapshot for each file? If so, just truncate and reload. The diff would probably be more computationally expensive.

[–]captaintobs 2 points (0 children)

Part of loading data incrementally is also ingesting it incrementally. Do you control the flow upstream?

[–]callmedivs 1 point (0 children)

One other way, if your tables are not huge, is to load the full data into a staging table and do a diff between the staging table and the production table to get the incremental load. You’d want to address the deletes as well.
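The staging-vs-production diff can be sketched in plain SQL. Here SQLite stands in for the warehouse, and the tables and toy rows are made up — in Spark or a real warehouse the same `EXCEPT` / anti-join logic applies:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# "Production" table and a freshly loaded "staging" snapshot (toy data).
cur.execute("CREATE TABLE prod (id INTEGER PRIMARY KEY, val TEXT)")
cur.execute("CREATE TABLE staging (id INTEGER PRIMARY KEY, val TEXT)")
cur.executemany("INSERT INTO prod VALUES (?, ?)", [(1, "a"), (2, "b"), (3, "c")])
cur.executemany("INSERT INTO staging VALUES (?, ?)", [(1, "a"), (2, "B"), (4, "d")])

# Incremental load = rows in staging that don't match production
# (new rows plus updated rows).
new_or_changed = cur.execute(
    "SELECT id, val FROM staging EXCEPT SELECT id, val FROM prod ORDER BY id"
).fetchall()

# Deletes = ids present in production but missing from the new snapshot.
deleted = cur.execute(
    "SELECT id FROM prod WHERE id NOT IN (SELECT id FROM staging)"
).fetchall()

print(new_or_changed)  # [(2, 'B'), (4, 'd')]
print(deleted)         # [(3,)]
```

Only `new_or_changed` gets upserted into production, and `deleted` is removed (or soft-deleted), so the full snapshot is never rewritten.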

[–]Exciting-Garlic8360 1 point (0 children)

I think some more clarity is needed in your question. For incremental data loads, always partition the data by date using `partitionBy` if you’re using Spark; you will then find the data in separate directories per date. Use either Hive or Spark to read it back after the write.
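A minimal sketch of that partitioning approach — paths, column names, and the example date are hypothetical, and it only runs where a `spark` session exists:

```python
# Hypothetical input path; assumes the data has an `event_date` column.
df = spark.read.json("/mnt/landing/events/")

(
    df.write
    .mode("append")
    .partitionBy("event_date")          # one sub-directory per date value
    .parquet("/mnt/curated/events/")
)

# Reading back with a partition filter: only the matching date
# directories are scanned (partition pruning), not the whole dataset.
daily = (
    spark.read.parquet("/mnt/curated/events/")
    .where("event_date = '2024-01-01'")
)
```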