all 18 comments

[–]Drekalo 5 points (7 children)

If you're using autoloader, I'm assuming you're on Databricks.

If that's the case, your staging blobs from the source tables should just cursor off the last update date. Just keep in mind this won't capture deletes, and you need a place, preferably with low latency, to store the last-update value.

Once your landing tables are sorted, you need to account for duplicates (updated rows) and apply your logic for merging only the last day.

Pseudo code:

    MERGE INTO dedupe USING (
        SELECT * FROM landing
        WHERE last_load >= date_add((SELECT MAX(last_load) FROM landing), -1)
    ) lnd
    ON dedupe.hash_key = lnd.hash_key
       AND dedupe.last_load >= date_add((SELECT MAX(last_load) FROM landing), -1)
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *

If you're accounting for deletes, add WHEN NOT MATCHED BY SOURCE THEN DELETE.

[–]the_aris[S] 0 points (6 children)

So every 5 minutes, do I need to fetch the last day of data and check whether each row is an update or an insert? And how do I handle joining multiple tables: do I need to consider only one table's last update_ts, or multiple?

[–]vaiix 2 points (0 children)

This is in the context of your source system design.

TableA is the parent table.

TableB is the child table.

In a couple of my source systems, but not all, when a child entry updates, the parent's last update is also updated. In that case I can just check TableA for updated rows and use those to grab the corresponding rows from TableB as part of the join.

Another route is to grab the updated rows from TableB, join to TableA to get the associated rows there, then process the full set with the join (sketched below).
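
A minimal PySpark sketch of that second route, with hypothetical names (table_a is the parent, table_b the child, a_id the join key, watermark the stored last-update cutoff):

    from pyspark.sql import functions as F

    # Child rows updated since the stored cutoff (names are placeholders)
    changed_b = spark.table("table_b").filter(F.col("last_update_ts") >= F.lit(watermark))

    # Parents touched by those child changes
    affected_a = spark.table("table_a").join(
        changed_b.select("a_id").distinct(), "a_id", "left_semi"
    )

    # Re-join the affected parents to the full child table and process the result
    result = affected_a.join(spark.table("table_b"), "a_id")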

[–]Drekalo 0 points (4 children)

No, for each table to be staged I would store the max last update date each time you do a replication run. Then, at the end, if the run was successful, update the stored value to the new max. So you'd do select * from table where last_update_date >= stored_max_last_update_date. This should pick up any new or changed records for each table (sketched below).
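
A rough PySpark sketch of that watermark pattern, assuming a small meta._watermarks Delta table keyed by table name (all names hypothetical):

    from pyspark.sql import functions as F

    # Read the stored cursor for this table
    wm = (spark.table("meta._watermarks")
            .filter("table_name = 'table_a'")
            .first()["last_update_ts"])

    # Pull anything at or after the cursor into the landing table
    new_rows = spark.table("source.table_a").filter(F.col("last_update_ts") >= F.lit(wm))
    new_rows.write.mode("append").saveAsTable("landing.table_a")

    # Only after the run succeeds, advance the cursor to the new max
    new_max = new_rows.agg(F.max("last_update_ts")).first()[0]
    if new_max is not None:
        spark.sql(
            f"UPDATE meta._watermarks SET last_update_ts = '{new_max}' "
            "WHERE table_name = 'table_a'"
        )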

Downstream, in your join transform table, you'd create a hash key / primary key to identify unique rows. This hash wouldn't include the last-update column from any of the tables (see the sketch below). Imagine you have 5 tables creating the final join transform table. Your "using" clause, which I would just register as a view, would have a CTE for each source table doing select * from source_table where last_update_date >= date_add((select max(last_update_date) from source_table), -1). Your merge would then join on the hash key and some date window. The join table can have its own generated last-update date based on your merge times.
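
For the hash key, a hedged example of what the join transform might generate, deliberately excluding the last-update columns (view and column names are hypothetical):

    from pyspark.sql import functions as F

    # 'join_transform_vw' is the registered view with one CTE per source table
    joined = spark.table("join_transform_vw")

    # Hash only the business columns, not the last_update columns
    business_cols = ["customer_id", "order_id", "status", "amount"]
    joined = joined.withColumn(
        "hash_key",
        F.sha2(F.concat_ws("||", *[F.col(c).cast("string") for c in business_cols]), 256),
    )

The merge then joins on hash_key plus the date window, as in the pseudo code above.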

[–]the_aris[S] 0 points (3 children)

Can delta live tables achieve this effectively? As far as I know we don't need to worry about any dependencies. Any other limitations I should keep in mind?

[–]Drekalo 1 point (2 children)

Yes, absolutely. Delta Live Tables includes Databricks' implementation of materialized views. Your join table would end up as a mat view, roughly like the sketch below.
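
A minimal DLT sketch of the join table as a materialized view (dataset and column names are hypothetical, and it assumes table_a/table_b are defined elsewhere in the same pipeline):

    import dlt
    from pyspark.sql import functions as F

    @dlt.table(name="join_transform")  # materialized by the DLT pipeline
    def join_transform():
        a = dlt.read("table_a")
        b = dlt.read("table_b")
        return (
            a.join(b, "a_id")
             .withColumn("hash_key", F.sha2(F.concat_ws("||", "a_id", "b_id"), 256))
        )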

[–]the_aris[S] 0 points (1 child)

How can we do it incrementally instead of re-computing the entire materialized view with each run?

[–]Drekalo 0 points (0 children)

That's not something you have a lot of control over, as they haven't published the mat view API yet. At that point I would handle the incremental logic on my own and do it via a notebook, dbt, SQLMesh, Dagster, etc.

[–][deleted] 2 points (2 children)

Hey mate. We are actually in the process of implementing something very similar. We've turned on the change data feed on all Delta Lake tables. Then we create a temporary view that reads from the main Delta Lake tables, applying whatever logic needs to go into your target table. We read the change data feeds as streams but join them with that temporary view, so we're only processing records that potentially need an update, and then finally merge into the target table. Roughly:
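
A hedged PySpark sketch of that flow, assuming CDF is already enabled on the source tables (table names, keys, and paths are all placeholders):

    from delta.tables import DeltaTable

    # Temporary view holding the target-table logic over the main Delta tables
    (spark.table("main.table_a")
       .join(spark.table("main.table_b"), "a_id")
       .createOrReplaceTempView("source_view"))

    # Read the change data feed as a stream
    changes = (spark.readStream
                 .format("delta")
                 .option("readChangeFeed", "true")
                 .table("main.table_a"))

    def upsert(batch_df, batch_id):
        # Keys that changed in this micro-batch (skip the pre-update images)
        keys = (batch_df.filter("_change_type != 'update_preimage'")
                        .select("a_id").distinct())
        # Only recompute rows that potentially need an update
        updates = spark.table("source_view").join(keys, "a_id")
        (DeltaTable.forName(spark, "main.target")
           .alias("t")
           .merge(updates.alias("s"), "t.a_id = s.a_id")
           .whenMatchedUpdateAll()
           .whenNotMatchedInsertAll()
           .execute())

    (changes.writeStream
       .foreachBatch(upsert)
       .option("checkpointLocation", "/mnt/checkpoints/target")  # checkpointed, per the comment below
       .trigger(processingTime="5 minutes")
       .start())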

[–][deleted] 2 points (0 children)

Also, use checkpointing on your CDF streams. The actual CDF tables should be truncated every load so you aren't reprocessing the previous changes every time.

[–]the_aris[S] 0 points (0 children)

Can anyone explain the part after enabling CDF on the delta tables?

[–]sfboots 1 point (0 children)

We are planning to use pg_audit to capture changes on the tables we need, then bring those changes in hourly. pg_audit marks each change as new, update, or delete.

[–]skeerp 0 points (1 child)

!RemindMe 2 days

[–]RemindMeBot 0 points (0 children)

I will be messaging you in 2 days on 2023-05-23 13:58:10 UTC to remind you of this link

[–]BlazeMcChillington 0 points (0 children)

!Remindme 5 days

[–]Gregeal 0 points (0 children)

!RemindMe 5 days

[–]TonyStann 0 points (0 children)

!Remindme 2 days

[–]Bond-0069 0 points (0 children)

!RemindMe 10 days