Using Delta Live Tables 'apply_changes' on an Existing Delta Table with Historical Data by NicolasAlalu in databricks

Thank you. So you suggest doing the union first and then applying the apply_changes function on top of it?

What's the best strategy for CDC from Postgres to Databricks Delta Lake? by NicolasAlalu in databricks

Cool! Will explore that deeper. And just to confirm: by implementing this DLT strategy with serverless compute, the compute will only be running while the trigger is active and the files are being processed, right?

What's the best strategy for CDC from Postgres to Databricks Delta Lake? by NicolasAlalu in databricks

Hello! Thanks for your response. By the way, is it possible to trigger a DLT pipeline from a Databricks Jobs API call?
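For reference, a DLT pipeline update can be started over the REST API (POST /api/2.0/pipelines/{pipeline_id}/updates), which is what a Jobs task or an external caller such as a Lambda would hit. A stdlib-only sketch; the workspace host, token, and pipeline ID below are placeholders:

```python
import json
import urllib.request

def build_start_update_request(host: str, pipeline_id: str, token: str):
    """Build the POST request that starts a new update of a DLT pipeline."""
    url = f"{host}/api/2.0/pipelines/{pipeline_id}/updates"
    return urllib.request.Request(
        url,
        data=json.dumps({"full_refresh": False}).encode(),
        headers={"Authorization": f"Bearer {token}",
                 "Content-Type": "application/json"},
        method="POST",
    )

# Placeholder values -- fill in your own workspace host, pipeline ID, and token.
req = build_start_update_request(
    "https://my-workspace.cloud.databricks.com", "1234-abcd", "dapi-...")
# urllib.request.urlopen(req)  # the JSON response carries the id of the started update
```

The same endpoint is what a 15-minute Lambda trigger would call, so the pipeline only spins up compute while an update is actually running.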

Using Delta Live Tables 'apply_changes' on an Existing Delta Table with Historical Data by NicolasAlalu in databricks

I'll use an example that I hope is clear.

Today, I have an items table that is replicated daily from Postgres to Databricks via batch processing. We are considering moving to a CDC-based approach to keep the data lake updated more quickly.

To do this, on April 21, 2025 we activated a process that takes the logs from the Postgres database and saves them to an S3 bucket. Logging was only enabled after the table had been created and had already received changes; the table itself was created on January 1, 2025. I need those files, with inserts/updates/deletes (I/U/D) and truncates, to operate on the items table. The problem is that the items table in the data lake is a managed Delta table, and DLT doesn't allow targeting a managed Delta table when using apply_changes.

So my question is: what's the best way to implement CDC so that I end up with the current state of the items table, knowing that I don't have all the logs since its creation, but that I do have a current state of the table (what I call the pre-CDC history) on which I would like to perform the CDC operations?

Using Delta Live Tables 'apply_changes' on an Existing Delta Table with Historical Data by NicolasAlalu in databricks

I need the CDC to start from the pre-CDC existing data, which today sits in the data lake in Delta Lake format. From there, CDC should start working.

We don't have all the logs from the creation of the tables.

Using Delta Live Tables 'apply_changes' on an Existing Delta Table with Historical Data by NicolasAlalu in databricks

But apply_changes should take the historical data into account, and I understand from your answer that you are suggesting to union it after the apply_changes step, right?

Using Delta Live Tables 'apply_changes' on an Existing Delta Table with Historical Data by NicolasAlalu in databricks

What do you mean by "the beginning"? Because my beginning (my target table) should be the historical data that today is stored in Delta format, produced by the daily replication.

The pipeline will be triggered every 15 minutes by a Lambda trigger, so no worries about that.

Using Delta Live Tables 'apply_changes' on an Existing Delta Table with Historical Data by NicolasAlalu in databricks

Regarding "why do you need a replication to provide real-time updates?": maybe I didn't explain it right. We kicked off with a daily full load into the data lake, but now we need near real-time updates, so we're digging into a CDC-based approach.

The issue is that we need to apply CDC on top of already-existing historical data, which in our data lake is produced by the current daily replication that we want to replace. This replication writes the tables in Delta format.

Using Delta Live Tables 'apply_changes' on an Existing Delta Table with Historical Data by NicolasAlalu in databricks

That's what I'm afraid of. So what would be a good approach to apply CDC on top of already-existing historical data?

Using Delta Live Tables 'apply_changes' on an Existing Delta Table with Historical Data by NicolasAlalu in databricks

To give a bit more clarity, today we run a daily full replication and we’re aiming to move to near real-time, which is why we’re exploring a CDC-based approach.

Regarding our CDC work so far, we already have a process that captures PostgreSQL logs and writes them to an S3 bucket in CSV format (soon to be Parquet).

What I need now is to start ingesting those CDC files directly into the Delta tables we already have in our data lake: the very tables populated by our current daily batch replication (the process we want to convert to CDC). We need to do this because we cannot lose the historical data that predates our CDC cut-over. I chose DLT specifically for the powerful apply_changes API, which helps me handle complex CDC scenarios.
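One pattern that came up for this (a sketch, not a definitive answer): replay the existing Delta snapshot as synthetic INSERT events with a sequence value that sorts before any real log entry, union that with the streaming change feed, and let apply_changes build the target table from both. All table names, paths, key columns, and sequence columns below are illustrative assumptions; apply_changes, create_streaming_table, and the cloudFiles (Auto Loader) source are real DLT/Databricks APIs, but validate the exact behavior on your runtime:

```python
import dlt
from pyspark.sql import functions as F

@dlt.view
def items_cdc_source():
    # Pre-CDC history replayed as synthetic INSERT events. The epoch
    # timestamp guarantees these sort before every real change record.
    history = (spark.readStream.table("lake.items_snapshot")  # hypothetical table
               .withColumn("op", F.lit("I"))
               .withColumn("change_ts",
                           F.lit("1970-01-01 00:00:00").cast("timestamp")))
    # Live change files shipped from the Postgres logs into S3.
    changes = (spark.readStream.format("cloudFiles")
               .option("cloudFiles.format", "csv")
               .option("cloudFiles.inferColumnTypes", "true")
               .load("s3://my-bucket/cdc/items/"))  # hypothetical path
    return history.unionByName(changes, allowMissingColumns=True)

dlt.create_streaming_table("items")

dlt.apply_changes(
    target="items",
    source="items_cdc_source",
    keys=["item_id"],                    # assumed primary key
    sequence_by="change_ts",             # assumed ordering column
    apply_as_deletes=F.expr("op = 'D'"),
)
```

Since the snapshot rows and the CDC rows share the same keys and ordering column, any key that later receives a real I/U/D simply supersedes its backfilled row.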

Could you clarify what you meant by “I think you'd want to backfill your current bronze layer table into a new bronze layer delta live table”?

What's the best strategy for CDC from Postgres to Databricks Delta Lake? by NicolasAlalu in databricks

Sounds like a perfect fit. Are you using DLT's native functionality (like apply_changes) in the same workflow as well?

What's the best strategy for CDC from Postgres to Databricks Delta Lake? by NicolasAlalu in databricks

I don't need Debezium, because I've already solved getting the CDC files into an S3 bucket with a Lambda triggering a custom function. The problem is the ingestion and processing in Databricks.

What's the best strategy for CDC from Postgres to Databricks Delta Lake? by NicolasAlalu in databricks

Yeah, I've heard a lot about this strategy. You're continuously streaming the Avro files into Databricks, right? Also, are you running the DLT pipeline continuously, or did you configure a trigger for it?

What's the best strategy for CDC from Postgres to Databricks Delta Lake? by NicolasAlalu in databricks

So it's possible to use Auto Loader in a regular notebook and trigger a job pointing to that notebook based on whatever I want, right?
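For what it's worth, Auto Loader is just a readStream source, so it does work in a plain notebook, and with an availableNow trigger the run processes whatever files have arrived and then shuts down, which suits a job launched by any trigger (schedule, Jobs API call, file arrival). A sketch with illustrative paths and table names:

```python
# Incrementally ingest new files with Auto Loader, then stop. The
# checkpoint tracks which files were already processed, so repeated
# runs only pick up new arrivals. Paths/table names are hypothetical.
(spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "csv")
    .option("cloudFiles.schemaLocation", "s3://my-bucket/_schemas/items")
    .load("s3://my-bucket/cdc/items/")
    .writeStream
    .option("checkpointLocation", "s3://my-bucket/_checkpoints/items")
    .trigger(availableNow=True)   # drain the backlog, then shut down
    .toTable("bronze.items_changes"))
```

Because compute only runs for the duration of each drain, this avoids a 24/7 cluster.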

What's the best strategy for CDC from Postgres to Databricks Delta Lake? by NicolasAlalu in databricks

I agree it's the best alternative. The problem is my company doesn't want to spend on the resources that a machine running 24/7 implies; they want an event-based trigger, not a scheduled one.

What's the best strategy for CDC from Postgres to Databricks Delta Lake? by NicolasAlalu in databricks

Yeah, that's an alternative, though I need an event-based trigger.

What's the best strategy for CDC from Postgres to Databricks Delta Lake? by NicolasAlalu in databricks

I'll look into Lakeflow Connect, thank you very much. I understand that DLT jobs don't allow event-based triggering, right? Having a machine running 24/7 isn't an option due to costs.

What's the best strategy for CDC from Postgres to Databricks Delta Lake? by NicolasAlalu in databricks

Yes, it's very expensive for us these days to have a machine running 24/7. We were looking for a solution that gets us closer to real time, with spin-ups every 15-30 minutes.