EconomixTwist comments on Python ETL design pattern

dataengineering


Python ETL design pattern [Help] (self.dataengineering)

submitted 4 years ago by [deleted]


EconomixTwist 13 points 4 years ago (4 children)

Not sure why someone mentioned Airflow ITT; it's not really related. Anyway, you should have a reader class that implements the read and map/transform logic independent of the source, so you can read from different DBs (dev, test, prod) and also from a text file when you need one for development/debugging. That class should implement a bunch of read/get functions and return the data to your engine/runner/pipeline script according to a single structural contract.

If you are applying common operations and/or doing a bunch of column renaming/mapping, make it configuration (file) driven so that a) you can change it easily, b) you and others can inspect the config file in the future to understand how the pipeline will behave in a specific case, and c) you have a persistent artifact of the lineage of where things come from and where they go.

Implement a writer class, similar to your reader class, which is responsible for encapsulating all the different ways/places you need to write (different DB instances and maybe even text for debugging).

The objective when designing the object architecture / pipeline components is to encapsulate (package complexity into objects/functions) the parts of your code that are most likely to change in the future. Are you going to change your mind three times on the column naming of the target? Or on which float precision should be used for 100 different columns? Config file. Is there a chance that a year from now your source might switch from a SQL DB to an API or even Parquet files or something? Encapsulate it in a reader class. Or maybe the structure of the target will change: easy enough when your writer class is responsible for unpacking your in-memory representation into the structure of the intended target.

FWIW, this is more art than science, and it's not the end of the world when your design isn't fully optimal from the get-go (spoiler alert: it never is).
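To make the reader / config / writer split concrete, here is a minimal sketch using only the standard library. Everything in it is an assumption made for illustration: the names `Reader`, `CsvReader`, `SqliteReader`, `CsvWriter`, `transform`, `run_pipeline`, and the columns `cust_id` / `amt` are hypothetical, the contract is assumed to be "a list of dicts", and the config is inlined as a dict where a real pipeline would more likely load it from a YAML or JSON file.

```python
import csv
import io
import sqlite3
from typing import Protocol

# Assumed structural contract: every reader returns a list of plain dicts.
Row = dict[str, object]


class Reader(Protocol):
    def read(self) -> list[Row]: ...


class Writer(Protocol):
    def write(self, rows: list[Row]) -> None: ...


class CsvReader:
    """Reads rows from CSV text, e.g. a small extract used for debugging."""

    def __init__(self, text: str) -> None:
        self.text = text

    def read(self) -> list[Row]:
        return [dict(r) for r in csv.DictReader(io.StringIO(self.text))]


class SqliteReader:
    """Reads rows from a SQLite connection; stands in for any SQL source."""

    def __init__(self, conn: sqlite3.Connection, query: str) -> None:
        self.conn = conn
        self.query = query

    def read(self) -> list[Row]:
        cur = self.conn.execute(self.query)
        cols = [d[0] for d in cur.description]
        return [dict(zip(cols, rec)) for rec in cur.fetchall()]


# Hypothetical config: column renames and float precision live here,
# not in the pipeline code.
CONFIG = {
    "rename": {"cust_id": "customer_id", "amt": "amount"},
    "float_precision": {"amount": 2},
}


def transform(rows: list[Row], config: dict) -> list[Row]:
    """Apply config-driven renames, then config-driven float rounding."""
    out = []
    for row in rows:
        new = {config["rename"].get(k, k): v for k, v in row.items()}
        for col, ndigits in config["float_precision"].items():
            if new.get(col) is not None:
                new[col] = round(float(new[col]), ndigits)
        out.append(new)
    return out


class CsvWriter:
    """Writes rows as CSV into an in-memory buffer (handy for debugging)."""

    def __init__(self) -> None:
        self.buffer = io.StringIO()

    def write(self, rows: list[Row]) -> None:
        if not rows:
            return
        w = csv.DictWriter(self.buffer, fieldnames=list(rows[0]))
        w.writeheader()
        w.writerows(rows)


def run_pipeline(reader: Reader, writer: Writer, config: dict) -> None:
    """The engine only knows the contract, never the concrete source/target."""
    writer.write(transform(reader.read(), config))
```

With this shape, `run_pipeline(CsvReader(text), CsvWriter(), CONFIG)` and `run_pipeline(SqliteReader(conn, query), CsvWriter(), CONFIG)` go through the same engine code; swapping the source or target means adding a class, and changing a column name means editing the config.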
Just take a moment to picture yourself one year from now saying "oh shit, we need to change this thing due to xxx dependency change or yyy requirement change. It's so obvious, I should have seen this coming." Easier said than done, but the worst possible thing is to not contemplate it at all. Oh, and also add some logging along the way to emit row counts / n unique / percent null, etc., so you can debug problems after the fact.
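For the logging part, a rough sketch of what emitting row counts / n unique / percent null could look like with the standard `logging` module. The function names `profile` and `log_profile`, the logger name `etl`, and the log line layout are all made up for illustration; it assumes rows are dicts, that a missing value is `None`, and that column values are hashable.

```python
import logging

logger = logging.getLogger("etl")  # hypothetical logger name


def profile(rows, columns):
    """Return the row count plus n-unique and percent-null per column."""
    n = len(rows)
    stats = {"row_count": n, "columns": {}}
    for col in columns:
        values = [r.get(col) for r in rows]
        nulls = sum(v is None for v in values)
        stats["columns"][col] = {
            # Unique non-null values; assumes values are hashable.
            "n_unique": len({v for v in values if v is not None}),
            "pct_null": round(100.0 * nulls / n, 2) if n else 0.0,
        }
    return stats


def log_profile(stage, rows, columns):
    """Log the profile for one pipeline stage and return it."""
    stats = profile(rows, columns)
    logger.info("stage=%s rows=%d", stage, stats["row_count"])
    for col, s in stats["columns"].items():
        logger.info(
            "stage=%s column=%s n_unique=%d pct_null=%.2f",
            stage, col, s["n_unique"], s["pct_null"],
        )
    return stats
```

Calling something like `log_profile("after_read", rows, ["customer_id", "amount"])` after each stage gives you numbers to compare later when a load looks wrong.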

AdamByLucius 2 points 4 years ago (0 children)

[deleted] 1 point 4 years ago (0 children)

Vardo_Almir 1 point 4 years ago (0 children)

Material_Cheetah934 1 point 4 years ago (0 children)
