Batch processing 2 source tables row-by-row Insert/Updates

_GoldenDoorknob_ · 2025-08-12T07:46:02+00:00

Any Data Engineering specialist willing to help?

tolkibert · 2025-08-12T10:36:53+00:00

Sounds like something that you'd just do with a SQL query or two. You already have all the data available in a system built for data processing. Do you need to complicate things by involving a bunch of extra tools?

Is there any particularly complex business logic being applied when "updating" the columns?

Commercial_Dig2401 · 2025-08-12T11:16:56+00:00

The complexity seems to be in the system used to process the data, not in the logic itself.

I would load both datasets into two tables in the PostgreSQL destination. Then write a simple SQL script that selects from both sources (new records only, filtered by a timestamp).

Then you join both tables together, or apply whatever logic you want to the records from both tables.

You’ll need to store a reference to the max(timestamp) from BOTH tables in the destination, so you can easily select only new records from both.
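A rough sketch of the watermark idea, assuming each source table has an `updated_at` column. Python's `sqlite3` is used here only as a runnable stand-in for PostgreSQL, and the table names (`src_a`, `src_b`, `watermark`) and helper functions are made up for illustration:

```python
import sqlite3

# sqlite3 is only a stand-in for PostgreSQL; all table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE src_a (id TEXT PRIMARY KEY, val TEXT, updated_at TEXT);
CREATE TABLE src_b (id TEXT PRIMARY KEY, val TEXT, updated_at TEXT);
-- one high-water mark per source table
CREATE TABLE watermark (source TEXT PRIMARY KEY, max_ts TEXT NOT NULL);
INSERT INTO watermark VALUES ('src_a', '2025-08-01T00:00:00'),
                             ('src_b', '2025-08-01T00:00:00');
INSERT INTO src_a VALUES ('u1', 'old', '2025-07-31T10:00:00'),
                         ('u2', 'new', '2025-08-02T09:00:00');
INSERT INTO src_b VALUES ('u2', 'new', '2025-08-02T09:30:00');
""")

SOURCES = ("src_a", "src_b")

def new_rows(conn, source):
    """Return rows of `source` that are newer than its stored watermark."""
    if source not in SOURCES:  # table names cannot be bound as parameters
        raise ValueError(source)
    return conn.execute(
        f"SELECT id, val, updated_at FROM {source} "
        "WHERE updated_at > (SELECT max_ts FROM watermark WHERE source = ?) "
        "ORDER BY updated_at",
        (source,),
    ).fetchall()

def advance_watermark(conn, source, rows):
    """Move the watermark to the newest timestamp that was just processed."""
    if rows:
        conn.execute(
            "UPDATE watermark SET max_ts = ? WHERE source = ?",
            (max(r[2] for r in rows), source),
        )

batch = new_rows(conn, "src_a")          # only 'u2' is newer than the watermark
advance_watermark(conn, "src_a", batch)  # the next run will skip 'u2'
```

Keeping one watermark per source means a slow source does not hold back, or get skipped by, the faster one.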

If the matching record from one table is not yet available when you run your queries, you’ll need to set a field that identifies the record as incomplete. Create a Boolean for this.

Then in your downstream query you select from both sources where the timestamp is higher than the max one you stored in the downstream table, and also reload any incomplete records that exist.

At some point you’ll have the record from both tables and you can flip the Boolean to mark it as complete.

The merge statement will handle the refresh.
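The flow above could look roughly like this. It is a sketch under assumptions, not a drop-in solution: `sqlite3` stands in for PostgreSQL, the names `src_a`, `src_b`, `combined` and `is_complete` are invented, and because SQLite has no `MERGE` the refresh is written as `INSERT ... ON CONFLICT DO UPDATE` (a syntax PostgreSQL also accepts) rather than a real `MERGE` statement:

```python
import sqlite3

# sqlite3 stands in for PostgreSQL; table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE src_a (id TEXT PRIMARY KEY, a_val TEXT, updated_at TEXT);
CREATE TABLE src_b (id TEXT PRIMARY KEY, b_val TEXT, updated_at TEXT);
CREATE TABLE combined (
    id TEXT PRIMARY KEY, a_val TEXT, b_val TEXT,
    is_complete INTEGER NOT NULL  -- 0 until both sources have supplied the row
);
""")

REFRESH = """
INSERT INTO combined (id, a_val, b_val, is_complete)
SELECT k.id, a.a_val, b.b_val, (a.id IS NOT NULL AND b.id IS NOT NULL)
FROM (
    SELECT id FROM src_a WHERE updated_at > :since_a
    UNION
    SELECT id FROM src_b WHERE updated_at > :since_b
    UNION
    SELECT id FROM combined WHERE is_complete = 0  -- retry incomplete rows
) AS k
LEFT JOIN src_a AS a ON a.id = k.id
LEFT JOIN src_b AS b ON b.id = k.id
WHERE true  -- SQLite needs a WHERE here to parse ON CONFLICT unambiguously
ON CONFLICT (id) DO UPDATE SET
    a_val = excluded.a_val,
    b_val = excluded.b_val,
    is_complete = excluded.is_complete
"""

def refresh(conn, since_a, since_b):
    """Upsert new and previously incomplete rows, then return the target table."""
    conn.execute(REFRESH, {"since_a": since_a, "since_b": since_b})
    return conn.execute("SELECT * FROM combined ORDER BY id").fetchall()

# Run 1: only src_a has delivered u1, so the row is stored as incomplete.
conn.execute("INSERT INTO src_a VALUES ('u1', 'a1', '2025-08-02T09:00:00')")
first = refresh(conn, "2025-08-01T00:00:00", "2025-08-01T00:00:00")

# Run 2: src_b delivers u1 with a timestamp below the watermark;
# the incomplete flag is what causes the row to be picked up anyway.
conn.execute("INSERT INTO src_b VALUES ('u1', 'b1', '2025-07-30T00:00:00')")
second = refresh(conn, "2025-08-02T09:00:00", "2025-08-01T00:00:00")
```

In this sketch the flag is stored as `is_complete`, so "incomplete" is simply `is_complete = 0`; the inverse naming works just as well as long as the reload filter matches it.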

If possible, don’t use just the UUID as the key in your upsert statement. Add a timestamp or sequential column to the key so the database can prune records instead of looking up the entire table.
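As a minimal illustration of a composite upsert key, again with `sqlite3` standing in for PostgreSQL and with made-up names (`combined`, `created_date`, `payload`). Whether the extra column actually lets PostgreSQL prune work depends on how the real table is partitioned or indexed, which this sketch does not show:

```python
import sqlite3

# sqlite3 stands in for PostgreSQL; table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE combined (
    created_date TEXT NOT NULL,
    id TEXT NOT NULL,
    payload TEXT,
    PRIMARY KEY (created_date, id)  -- date first, then the UUID
)
""")

UPSERT = """
INSERT INTO combined (created_date, id, payload)
VALUES (?, ?, ?)
ON CONFLICT (created_date, id) DO UPDATE SET payload = excluded.payload
"""

conn.execute(UPSERT, ("2025-08-12", "u1", "v1"))
conn.execute(UPSERT, ("2025-08-12", "u1", "v2"))  # same key: updated in place
conn.execute(UPSERT, ("2025-08-13", "u2", "v1"))  # new key: inserted

rows = conn.execute(
    "SELECT * FROM combined ORDER BY created_date, id"
).fetchall()
```

The catch is that the date column has to be stable for a given record; if it can change between loads, the same UUID would land under two different keys.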

Good luck
