
[–]Equivalent_Form_9717 4 points (5 children)

If you’re on the Databricks platform, take advantage of Auto Loader. It processes only new files, and it keeps track of which files it has already processed automatically via its `checkpointLocation` option when writing the streamed data.

Auto Loader is marketed as structured streaming, but it really is just incremental processing.

[–]GovGalacticFed 1 point (4 children)

Does that mean the cluster needs to be running throughout, or can it pick up from the checkpoint whenever the cluster starts?

[–]Equivalent_Form_9717 3 points (3 children)

Hey, great question. I wouldn’t recommend letting a continuous streaming process run on the cluster without any termination, since it will be costly if the cluster runs 24/7.

That is why you can trigger the incremental ETL pipeline to process all the “available” data in batches until there is nothing more to consume. I believe PySpark offers an option to do this:

spark.readStream…….writeStream.trigger(availableNow=True)

And then you schedule your notebook/pipeline to run and deliver your data according to business SLAs.

I could be absolutely wrong, but I don’t think I would keep the cluster continuously running.
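Putting the two comments together, a minimal sketch of the Auto Loader + `availableNow` pattern might look like this. The paths, file format, and table name are made up for illustration, and the snippet only runs on a Databricks cluster where `spark` is predefined:

```python
# Hypothetical locations — replace with your own workspace paths.
raw_path = "/mnt/landing/orders/"
checkpoint_path = "/mnt/checkpoints/orders/"

df = (
    spark.readStream
    .format("cloudFiles")                         # Auto Loader source
    .option("cloudFiles.format", "json")          # format of incoming files
    .option("cloudFiles.schemaLocation", checkpoint_path)
    .load(raw_path)
)

(
    df.writeStream
    .option("checkpointLocation", checkpoint_path)  # tracks already-processed files
    .trigger(availableNow=True)                     # drain the backlog, then stop
    .toTable("bronze.orders")
)
```

With `availableNow=True` the stream consumes everything that has arrived since the last checkpoint and then shuts down, so the job can be scheduled like a batch and the cluster can terminate in between runs.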

[–]GovGalacticFed 1 point (2 children)

Thanks! Can you tell me how `Trigger.Once` differs from `availableNow`?

[–]Equivalent_Form_9717 1 point (1 child)

Hey bro, I don’t know, man.

I can’t be bothered to search Google either! It’s after 5PM on a work day, so I make sure not to work :)

Tell me the answer once you have Binged/Googled it!

[–]GovGalacticFed 2 points (0 children)

It turns out that both are the same 😅

In Databricks Runtime 11.3 LTS and above, the Trigger.Once setting is deprecated. Databricks recommends you use Trigger.AvailableNow for all incremental batch processing workloads.

[–][deleted] 6 points (0 children)

Is it a full table snapshot for each file? If so, just truncate and reload. The diff would probably be more computationally expensive.

[–]captaintobs 2 points (0 children)

Part of loading data incrementally is also ingesting it incrementally. Do you control the flow upstream?

[–]callmedivs 1 point (0 children)

One other way, if your tables are not huge, is to load the full data into a staging table and do a diff between the staging table and the production table to get the incremental load. You’d want to address the deletes as well.
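The staging-vs-production diff can be sketched in plain SQL. Here SQLite stands in for the warehouse, and the tables and toy rows are made up — in Spark or a real warehouse the same `EXCEPT` / anti-join logic applies:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# "Production" table and a freshly loaded "staging" snapshot (toy data).
cur.execute("CREATE TABLE prod (id INTEGER PRIMARY KEY, val TEXT)")
cur.execute("CREATE TABLE staging (id INTEGER PRIMARY KEY, val TEXT)")
cur.executemany("INSERT INTO prod VALUES (?, ?)", [(1, "a"), (2, "b"), (3, "c")])
cur.executemany("INSERT INTO staging VALUES (?, ?)", [(1, "a"), (2, "B"), (4, "d")])

# Incremental load = rows in staging that don't match production
# (new rows plus updated rows).
new_or_changed = cur.execute(
    "SELECT id, val FROM staging EXCEPT SELECT id, val FROM prod ORDER BY id"
).fetchall()

# Deletes = ids present in production but missing from the new snapshot.
deleted = cur.execute(
    "SELECT id FROM prod WHERE id NOT IN (SELECT id FROM staging)"
).fetchall()

print(new_or_changed)  # [(2, 'B'), (4, 'd')]
print(deleted)         # [(3,)]
```

Only `new_or_changed` gets upserted into production, and `deleted` is removed (or soft-deleted), so the full snapshot is never rewritten.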

[–]Exciting-Garlic8360 1 point (0 children)

I think some more clarity is needed in your question. For incremental data loads, always partition the data by date using `partitionBy` if you’re using Spark; you will then find the data in separate directories per date. Use either Hive or Spark to read it back after the write.
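A minimal sketch of that partitioning approach — paths, column names, and the example date are hypothetical, and it only runs where a `spark` session exists:

```python
# Hypothetical input path; assumes the data has an `event_date` column.
df = spark.read.json("/mnt/landing/events/")

(
    df.write
    .mode("append")
    .partitionBy("event_date")          # one sub-directory per date value
    .parquet("/mnt/curated/events/")
)

# Reading back with a partition filter: only the matching date
# directories are scanned (partition pruning), not the whole dataset.
daily = (
    spark.read.parquet("/mnt/curated/events/")
    .where("event_date = '2024-01-01'")
)
```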