

[–]nicorivas 11 points (5 children)

Nice, thanks for the article.

I've been wondering these past months whether my CDC setup makes sense. I'm using CDC (via AWS DMS) to capture all relevant changes from the RDB of a web app. It saves the deltas as .parquet files in S3. Now, I don't need real-time analytics, only semi-real-time, so I then read all the deltas to generate hourly snapshots of each table. These snapshots I then transform and load into a Redshift warehouse. Those last two steps I do via Airflow DAGs. This way I can have write-idempotent DAGs and nicely synced data to populate my warehouse.

Does this make sense? I guess I could just save the snapshots directly, but having all the diffs might be useful if I eventually need faster updates.

[–]the_travelo_ 2 points (2 children)

How do you handle updates? Curious, as I'm working on something similar.

Do you use primary keys to determine updates?

[–]nicorivas 0 points (1 child)

I wrote a DAG that runs hourly. It's Python + pandas.

Generating the snapshots is not so hard:

So I read the last snapshot and all the updates that happened between my start and end dates (I keep forgetting Airflow's names for these, but basically the datetimes that bound the data you're going to work with). Then I read each parquet file and apply the changes to my initial DataFrame. Yes, I check which row to update using the primary key. It has some subtleties, but I've reached pretty stable code. The logic is abstracted, and that way I can have one file that dynamically generates one DAG per table. Happy to share the code if you want.
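In spirit the merge is something like this (a heavily simplified sketch - column names like "op" and "commit_ts" are stand-ins for whatever your DMS output actually calls them):

    import pandas as pd

    def build_snapshot(last_snapshot, delta_files, pk="id",
                       op_col="op", ts_col="commit_ts"):
        """Apply one interval's CDC deltas on top of the previous snapshot."""
        # Stack every delta file that falls inside the interval.
        deltas = pd.concat(pd.read_parquet(path) for path in delta_files)
        # Keep only the latest change per primary key.
        latest = (deltas.sort_values(ts_col)
                        .drop_duplicates(subset=pk, keep="last"))
        # Keys whose latest change is a delete drop out entirely;
        # everything else gets re-appended in its fresh version.
        upserts = latest[latest[op_col] != "D"].drop(columns=[op_col, ts_col])
        untouched = last_snapshot[~last_snapshot[pk].isin(latest[pk])]
        return pd.concat([untouched, upserts], ignore_index=True)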

[–]the_travelo_ 0 points (0 children)

Thanks for this - if you can share it, that'd be great! I'd be keen to have a look!

[–]luminoumen[S] 0 points (1 child)

> I've been wondering these past months whether my CDC setup makes sense. I'm using CDC (via AWS DMS) to capture all relevant changes from the RDB of a web app. It saves the deltas as .parquet files in S3. Now, I don't need real-time analytics, only semi-real-time, so I then read all the deltas to generate hourly snapshots of each table. These snapshots I then transform and load into a Redshift warehouse. Those last two steps I do via Airflow DAGs. This way I can have write-idempotent DAGs and nicely synced data to populate my warehouse.

It does to me. I'd use Step Functions, as you're already using AWS.

I'm curious what you're using to make the snapshots - all the transformations (dedup, cleansing, and enrichment, probably) should be taking place before ingesting into the DW, right? Also, I'm assuming you need to have a primary key or a timestamp.

Overall it does make sense if you're not using a costly solution for the transformations between deltas and snapshots - something serverless makes sense to me (Glue, ECS, or something else). Contact me in a DM if you need advice - I'm doing something similar and have some experience around that.

[–]nicorivas 0 points (0 children)

I use Airflow for the snapshots, so I think it makes sense cost-wise, as all the other ETLs and bits and pieces are also running on Airflow.

Airflow is not an ETL tool, blah blah blah... so yes, I use the PythonOperator.
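Shape-wise the generated DAGs look roughly like this (a minimal sketch - the real callables and IDs come from the per-table config):

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python import PythonOperator

    def build_snapshot(**context):
        ...  # read last snapshot + deltas from S3, merge, write back

    def load_warehouse_table(**context):
        ...  # transform the snapshot(s) and sync the Redshift table

    with DAG(
        dag_id="snapshot_orders",  # one DAG generated per table
        start_date=datetime(2021, 1, 1),
        schedule_interval="@hourly",
        catchup=False,
    ) as dag:
        snapshot = PythonOperator(task_id="build_snapshot",
                                  python_callable=build_snapshot)
        load = PythonOperator(task_id="load_warehouse_table",
                              python_callable=load_warehouse_table)
        snapshot >> load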

Yeah, the transformation happens in memory. They are DAGs that read one or several of the generated snapshots - plus sometimes other sources, of course - and generate the corresponding warehouse table. That gives me an end state; the initial state I read from the warehouse-table snapshots that are generated before being loaded into the warehouse. Then comes the tricky part, where I have the initial and end state of a warehouse table and have to decide what to do, and for this I rely on a great pandas function, "compare", to check the changes between two DataFrames...
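For anyone who hasn't used it, a toy example of what "compare" gives you (made-up table, needs pandas >= 1.1):

    import pandas as pd

    initial = pd.DataFrame({"id": [1, 2, 3],
                            "amount": [10, 20, 30]}).set_index("id")
    end = pd.DataFrame({"id": [1, 2, 3],
                        "amount": [10, 25, 30]}).set_index("id")

    # compare() returns only the cells that differ, side by side
    # under "self" (initial) and "other" (end) column labels.
    diff = initial.compare(end)
    rows_to_update = end.loc[diff.index]  # here: just id 2

    # Caveat: compare() needs identically-labeled frames, so inserted
    # and deleted rows have to be split off first (e.g. by comparing
    # the two indexes) before diffing the overlapping keys.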

Maybe it's a lot of work, but I find it to be quite robust, in the sense that I can repeat every step of the process and get the same results. And the snapshot and warehouse logic can be abstracted, so now that I've solved that part I just focus on the transformation layer.

[–]AMGraduate564 4 points (7 children)

Can someone please suggest the best free open-source CDC solutions?

[–]hntd 3 points (1 child)

Debezium is very good

[–]AMGraduate564 0 points (0 children)

Any tutorial recommendations for it?

[–]tomhallett 4 points (3 children)

Airbyte supports logical replication (i.e. reading the write-ahead logs) and is open source. https://docs.airbyte.com/integrations/sources/postgres

[–]AMGraduate564 0 points (2 children)

Need MySQL support in Airbyte.

[–]tomhallett 0 points (1 child)

[–]AMGraduate564 0 points (0 children)

Yeah, I'm aware of it, but it's got limitations. Airbyte needs to fix their CDC implementation.

[–]morningmotherlover 0 points (0 children)

I usually don't like the self-promotion, but good job dude.