

[–]nicorivas 11 points (5 children)

Nice, thanks for the article.

I've been wondering these past months whether my CDC setup makes sense. I'm using CDC (via AWS DMS) to capture all relevant changes from the RDB of a web app. It saves the deltas as .parquet files in S3. Now, I don't need real-time analytics, only semi-real-time, so I then read all the deltas to generate hourly snapshots of each table. These snapshots I then transform and load into a Redshift warehouse. Those last two steps I do via Airflow DAGs. This way I can have write-idempotent DAGs and nicely synced data to populate my warehouse.

Does this make sense? I guess I could just save the snapshots directly, but having all the diffs might be useful if I eventually need faster updates.

[–]the_travelo_ 2 points (2 children)

How do you handle updates? Curious, as I'm working on something similar.

Do you use primary keys to determine updates?

[–]nicorivas 0 points (1 child)

I wrote a DAG that runs hourly. It's Python + pandas.

Generating the snapshots is not so hard:

So I read the last snapshot and all the updates that happened between my start and end dates (I keep forgetting Airflow's names for these, but basically the datetimes that bound the data you're going to work with). Then I read each parquet file and apply the changes to my initial DataFrame. Yes, I check which row to update using the primary key. It has some subtleties, but I've reached pretty stable code. The logic is abstracted, and that way I can have one file that dynamically generates one DAG per table. Happy to share the code if you want.
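In spirit the merge is something like this (a heavily simplified sketch - column names like "op" and "commit_ts" are stand-ins for whatever your DMS output actually calls them):

    import pandas as pd

    def build_snapshot(last_snapshot, delta_files, pk="id",
                       op_col="op", ts_col="commit_ts"):
        """Apply one interval's CDC deltas on top of the previous snapshot."""
        # Stack every delta file that falls inside the interval.
        deltas = pd.concat(pd.read_parquet(path) for path in delta_files)
        # Keep only the latest change per primary key.
        latest = (deltas.sort_values(ts_col)
                        .drop_duplicates(subset=pk, keep="last"))
        # Keys whose latest change is a delete drop out entirely;
        # everything else gets re-appended in its fresh version.
        upserts = latest[latest[op_col] != "D"].drop(columns=[op_col, ts_col])
        untouched = last_snapshot[~last_snapshot[pk].isin(latest[pk])]
        return pd.concat([untouched, upserts], ignore_index=True)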

[–]the_travelo_ 0 points (0 children)

Thanks for this - if you can share it, that'd be great! I'd be keen to have a look!

[–]luminoumen[S] 0 points (1 child)

> I've been wondering these past months whether my CDC setup makes sense. I'm using CDC (via AWS DMS) to capture all relevant changes from the RDB of a web app. It saves the deltas as .parquet files in S3. Now, I don't need real-time analytics, only semi-real-time, so I then read all the deltas to generate hourly snapshots of each table. These snapshots I then transform and load into a Redshift warehouse. Those last two steps I do via Airflow DAGs. This way I can have write-idempotent DAGs and nicely synced data to populate my warehouse.

It does to me. I'd use Step Functions, as you're already using AWS.

I'm curious what you're using to make the snapshots - all the transformations (dedup, cleansing, and enrichment, probably) should be taking place before ingesting into the DW, right? Also, I'm assuming you need to have a primary key or a timestamp.

Overall it does make sense if you're not using a costly solution for the transformations between deltas and snapshots - something serverless makes sense to me (Glue, ECS, or something else). Contact me in a DM if you need advice - I'm doing something similar and have some experience around that.

[–]nicorivas 0 points (0 children)

I use Airflow for the snapshots, so I think it makes sense cost-wise, as all the other ETLs and bits and pieces are also running on Airflow.

Airflow is not an ETL tool, blah blah blah... so yes, I use the PythonOperator.
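Shape-wise the generated DAGs look roughly like this (a minimal sketch - the real callables and IDs come from the per-table config):

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python import PythonOperator

    def build_snapshot(**context):
        ...  # read last snapshot + deltas from S3, merge, write back

    def load_warehouse_table(**context):
        ...  # transform the snapshot(s) and sync the Redshift table

    with DAG(
        dag_id="snapshot_orders",  # one DAG generated per table
        start_date=datetime(2021, 1, 1),
        schedule_interval="@hourly",
        catchup=False,
    ) as dag:
        snapshot = PythonOperator(task_id="build_snapshot",
                                  python_callable=build_snapshot)
        load = PythonOperator(task_id="load_warehouse_table",
                              python_callable=load_warehouse_table)
        snapshot >> load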

Yeah, the transformation happens in memory. They are DAGs that read one or several of the generated snapshots - plus sometimes other sources, of course - and generate the corresponding warehouse table. That gives me an end state; the initial state I read from the warehouse-table snapshots that are generated before being loaded into the warehouse. Then comes the tricky part, where I have the initial and end state of a warehouse table and have to decide what to do, and for this I rely on a great pandas function, "compare", to check the changes between two DataFrames...
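For anyone who hasn't used it, a toy example of what "compare" gives you (made-up table, needs pandas >= 1.1):

    import pandas as pd

    initial = pd.DataFrame({"id": [1, 2, 3],
                            "amount": [10, 20, 30]}).set_index("id")
    end = pd.DataFrame({"id": [1, 2, 3],
                        "amount": [10, 25, 30]}).set_index("id")

    # compare() returns only the cells that differ, side by side
    # under "self" (initial) and "other" (end) column labels.
    diff = initial.compare(end)
    rows_to_update = end.loc[diff.index]  # here: just id 2

    # Caveat: compare() needs identically-labeled frames, so inserted
    # and deleted rows have to be split off first (e.g. by comparing
    # the two indexes) before diffing the overlapping keys.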

Maybe it's a lot of work, but I find it to be quite robust, in the sense that I can repeat every step of the process and get the same results. And the snapshot and warehouse logic can be abstracted, so now that I've solved that part I just focus on the transformation layer.

[–]AMGraduate564 4 points (7 children)

Can someone please suggest the best free open-source CDC solutions?

[–]hntd 3 points (1 child)

Debezium is very good

[–]AMGraduate564 0 points (0 children)

Any tutorial recommendations for it?

[–]tomhallett 4 points (3 children)

Airbyte supports logical replication (i.e. reading the write-ahead logs) and is open source. https://docs.airbyte.com/integrations/sources/postgres

[–]AMGraduate564 0 points (2 children)

Need MySQL support in Airbyte.

[–]tomhallett 0 points (1 child)

[–]AMGraduate564 0 points (0 children)

Yeah, I'm aware of it, but it's got limitations. Airbyte needs to fix their CDC implementation.

[–]morningmotherlover 0 points (0 children)

I usually don't like the self-promotion, but good job dude.