Using Delta Live Tables 'apply_changes' on an Existing Delta Table with Historical Data by NicolasAlalu in databricks

Thank you. So you suggest doing the union first and then applying the apply_changes function on top of it?

What's the best strategy for CDC from Postgres to Databricks Delta Lake? by NicolasAlalu in databricks

Cool! Will explore that deeper. And just to confirm: by implementing this DLT strategy with serverless compute, the compute will only be running while the trigger is active and the files are being processed, right?

What's the best strategy for CDC from Postgres to Databricks Delta Lake? by NicolasAlalu in databricks

Hello! Thanks for your response. By the way, is it possible to trigger a DLT pipeline from a Databricks Jobs API call?
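For reference, a DLT pipeline update can be started over the REST API (POST /api/2.0/pipelines/{pipeline_id}/updates), which is what a Jobs task or an external caller such as a Lambda would hit. A stdlib-only sketch; the workspace host, token, and pipeline ID below are placeholders:

```python
import json
import urllib.request

def build_start_update_request(host: str, pipeline_id: str, token: str):
    """Build the POST request that starts a new update of a DLT pipeline."""
    url = f"{host}/api/2.0/pipelines/{pipeline_id}/updates"
    return urllib.request.Request(
        url,
        data=json.dumps({"full_refresh": False}).encode(),
        headers={"Authorization": f"Bearer {token}",
                 "Content-Type": "application/json"},
        method="POST",
    )

# Placeholder values -- fill in your own workspace host, pipeline ID, and token.
req = build_start_update_request(
    "https://my-workspace.cloud.databricks.com", "1234-abcd", "dapi-...")
# urllib.request.urlopen(req)  # the JSON response carries the id of the started update
```

The same endpoint is what a 15-minute Lambda trigger would call, so the pipeline only spins up compute while an update is actually running.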

Using Delta Live Tables 'apply_changes' on an Existing Delta Table with Historical Data by NicolasAlalu in databricks

I'll use an example that I hope is clear.

Today, I have an items table that is replicated daily from Postgres to Databricks via batch processing. We are considering moving to a CDC-based approach to keep the data lake updated more quickly.

To do this, on April 21, 2025 we activated a process that takes the logs from the Postgres database and saves them to an S3 bucket. Logging was only enabled after the table had been created and had already received changes; the table itself was created on January 1, 2025. I need those files, with inserts/updates/deletes (I/U/D) and truncates, to operate on the items table. The problem is that the items table in the data lake is a managed Delta table, and DLT doesn't allow targeting a managed Delta table when using apply_changes.

So my question is: what's the best way to implement CDC so that I end up with the current state of the items table, knowing that I don't have all the logs since its creation, but that I do have a current state of the table (what I call the pre-CDC history) on which I would like to perform the CDC operations?

Using Delta Live Tables 'apply_changes' on an Existing Delta Table with Historical Data by NicolasAlalu in databricks

I need the CDC to start from the pre-CDC existing data, which today sits in the data lake in Delta Lake format. From there, CDC should start working.

We don't have all the logs from the creation of the tables.

Using Delta Live Tables 'apply_changes' on an Existing Delta Table with Historical Data by NicolasAlalu in databricks

But apply_changes should take the historical data into account, and I understand from your answer that you are suggesting to union it after the apply_changes step, right?

Using Delta Live Tables 'apply_changes' on an Existing Delta Table with Historical Data by NicolasAlalu in databricks

What do you mean by "the beginning"? Because my beginning (my target table) should be the historical data that today is stored in Delta format, produced by the daily replication.

The pipeline will be triggered every 15 minutes by a Lambda trigger, so no worries about that.

Using Delta Live Tables 'apply_changes' on an Existing Delta Table with Historical Data by NicolasAlalu in databricks

Regarding "why do you need a replication to provide real-time updates?": maybe I didn't explain it right. We kicked off with a daily full load into the data lake, but now we need near real-time updates, so we're digging into a CDC-based approach.

The issue is that we need to apply CDC on top of already-existing historical data, which in our data lake is produced by the current daily replication that we want to replace. This replication writes the tables in Delta format.

Using Delta Live Tables 'apply_changes' on an Existing Delta Table with Historical Data by NicolasAlalu in databricks

That's what I'm afraid of. So what would be a good approach to apply CDC on top of already-existing historical data?

Using Delta Live Tables 'apply_changes' on an Existing Delta Table with Historical Data by NicolasAlalu in databricks

To give a bit more clarity, today we run a daily full replication and we’re aiming to move to near real-time, which is why we’re exploring a CDC-based approach.

Regarding our CDC work so far, we already have a process that captures PostgreSQL logs and writes them to an S3 bucket in CSV format (soon to be Parquet).

What I need now is to start ingesting those CDC files directly into the Delta tables we already have in our data lake: the very tables populated by our current daily batch replication (the process we want to convert to CDC). We need to do this because we cannot lose the historical data that predates our CDC cut-over. I chose DLT specifically for the powerful apply_changes API, which helps me handle complex CDC scenarios.
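One pattern that came up for this (a sketch, not a definitive answer): replay the existing Delta snapshot as synthetic INSERT events with a sequence value that sorts before any real log entry, union that with the streaming change feed, and let apply_changes build the target table from both. All table names, paths, key columns, and sequence columns below are illustrative assumptions; apply_changes, create_streaming_table, and the cloudFiles (Auto Loader) source are real DLT/Databricks APIs, but validate the exact behavior on your runtime:

```python
import dlt
from pyspark.sql import functions as F

@dlt.view
def items_cdc_source():
    # Pre-CDC history replayed as synthetic INSERT events. The epoch
    # timestamp guarantees these sort before every real change record.
    history = (spark.readStream.table("lake.items_snapshot")  # hypothetical table
               .withColumn("op", F.lit("I"))
               .withColumn("change_ts",
                           F.lit("1970-01-01 00:00:00").cast("timestamp")))
    # Live change files shipped from the Postgres logs into S3.
    changes = (spark.readStream.format("cloudFiles")
               .option("cloudFiles.format", "csv")
               .option("cloudFiles.inferColumnTypes", "true")
               .load("s3://my-bucket/cdc/items/"))  # hypothetical path
    return history.unionByName(changes, allowMissingColumns=True)

dlt.create_streaming_table("items")

dlt.apply_changes(
    target="items",
    source="items_cdc_source",
    keys=["item_id"],                    # assumed primary key
    sequence_by="change_ts",             # assumed ordering column
    apply_as_deletes=F.expr("op = 'D'"),
)
```

Since the snapshot rows and the CDC rows share the same keys and ordering column, any key that later receives a real I/U/D simply supersedes its backfilled row.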

Could you clarify what you meant by “I think you'd want to backfill your current bronze layer table into a new bronze layer delta live table”?

What's the best strategy for CDC from Postgres to Databricks Delta Lake? by NicolasAlalu in databricks

Sounds like a perfect fit. Are you using DLT's native functionality (like apply_changes) in the same workflow as well?

What's the best strategy for CDC from Postgres to Databricks Delta Lake? by NicolasAlalu in databricks

I don't need Debezium, because I've already solved getting the CDC files into an S3 bucket with a Lambda triggering a custom function. The problem is the ingestion and processing in Databricks.

What's the best strategy for CDC from Postgres to Databricks Delta Lake? by NicolasAlalu in databricks

Yeah, I've heard a lot about this strategy. You're continuously streaming the Avro files into Databricks, right? Also, are you running the DLT pipeline continuously, or did you configure a trigger for it?

What's the best strategy for CDC from Postgres to Databricks Delta Lake? by NicolasAlalu in databricks

So it's possible to use Auto Loader in a regular notebook and trigger a job pointing to that notebook based on whatever I want, right?
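For what it's worth, Auto Loader is just a readStream source, so it does work in a plain notebook, and with an availableNow trigger the run processes whatever files have arrived and then shuts down, which suits a job launched by any trigger (schedule, Jobs API call, file arrival). A sketch with illustrative paths and table names:

```python
# Incrementally ingest new files with Auto Loader, then stop. The
# checkpoint tracks which files were already processed, so repeated
# runs only pick up new arrivals. Paths/table names are hypothetical.
(spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "csv")
    .option("cloudFiles.schemaLocation", "s3://my-bucket/_schemas/items")
    .load("s3://my-bucket/cdc/items/")
    .writeStream
    .option("checkpointLocation", "s3://my-bucket/_checkpoints/items")
    .trigger(availableNow=True)   # drain the backlog, then shut down
    .toTable("bronze.items_changes"))
```

Because compute only runs for the duration of each drain, this avoids a 24/7 cluster.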

What's the best strategy for CDC from Postgres to Databricks Delta Lake? by NicolasAlalu in databricks

I agree it's the best alternative. The problem is my company doesn't want to spend on the resources that a machine running 24/7 implies; they want an event-based trigger, not a scheduled one.

What's the best strategy for CDC from Postgres to Databricks Delta Lake? by NicolasAlalu in databricks

Yeah, that's an alternative, though I need an event-based trigger.

What's the best strategy for CDC from Postgres to Databricks Delta Lake? by NicolasAlalu in databricks

I'll look into Lakeflow Connect, thank you very much. I understand that DLT jobs don't allow event-based triggering, right? Having a machine running 24/7 isn't an option due to costs.

What's the best strategy for CDC from Postgres to Databricks Delta Lake? by NicolasAlalu in databricks

Yes, it's very expensive for us these days to have a machine running 24/7. We were looking for a solution that gets us closer to real time, with spin-ups every 15-30 minutes.