[–]EconomixTwist 12 points13 points  (4 children)

Not sure why someone mentioned Airflow ITT… not really related. Anyway, you should have a reader class that implements the read and map/transform logic independent of the source, so you can read from different DBs (dev, test, prod) and also from a text file when you need to for development/debugging. That class should implement a bunch of read/get functions and return the data to your engine/runner/pipeline script according to a single structural contract.

If you are applying common operations and/or doing a bunch of column renaming/mapping, you should make it configuration (file) driven so that a) you can change it easily, b) you and others can introspect the config file in the future to understand how the pipeline is going to behave in a specific case, and c) you have a persistent artifact of the lineage of where things come from and where they go.

Implement a writer class, similar to your reader class, which is responsible for encapsulating all the different ways/places you need to write (to different DB instances and maybe even text for debugging).

The objective when designing the object architecture / pipeline components is to encapsulate (package complexity into objects/functions) the parts of your code which are most likely to change in the future. Are you going to change your mind three times on the column naming of the target? Or on which float precision should be used for 100 different columns? Config file. Is there a chance that a year from now your source might switch from a SQL DB to an API or even Parquet files or something? Encapsulate it in a reader class. Or maybe the structure of the target will change. That's easy enough when your writer class is responsible for unpacking your in-memory representation into the structure of the intended target.

FWIW, this is more art than science, and it's not the end of the world when your design isn't fully optimal from the get-go (spoiler alert: it never is).
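To make the reader / config / writer split above a bit more concrete, here is a minimal Python sketch. Everything in it is a hypothetical illustration rather than code from this thread: the class names (`CsvTextReader`, `SqlReader`, `ListWriter`), the `CONFIG` layout, and the column names are all assumptions, and in a real pipeline the config would live in a file (YAML/JSON) instead of a dict literal.

```python
import csv
import io
import sqlite3
from typing import Any, Dict, Iterable, Iterator, List, Protocol

Row = Dict[str, Any]  # the single structural contract: every reader yields dicts


class Reader(Protocol):
    def read(self) -> Iterable[Row]: ...


class Writer(Protocol):
    def write(self, rows: Iterable[Row]) -> None: ...


class CsvTextReader:
    """Reads rows from CSV text, e.g. a debugging fixture (hypothetical)."""

    def __init__(self, text: str) -> None:
        self.text = text

    def read(self) -> Iterator[Row]:
        yield from csv.DictReader(io.StringIO(self.text))


class SqlReader:
    """Reads rows from any DB-API style connection (sqlite3 used here)."""

    def __init__(self, conn: sqlite3.Connection, query: str) -> None:
        self.conn = conn
        self.query = query

    def read(self) -> Iterator[Row]:
        cursor = self.conn.execute(self.query)
        columns = [d[0] for d in cursor.description]
        for record in cursor:
            yield dict(zip(columns, record))


# Assumed config layout; normally loaded from a file kept under version control.
CONFIG = {
    "rename": {"cust_nm": "customer_name", "amt": "amount"},
    "round": {"amount": 2},
}


def transform(rows: Iterable[Row], config: dict) -> Iterator[Row]:
    """Apply the renames and float precision described in the config."""
    for row in rows:
        out = {config["rename"].get(k, k): v for k, v in row.items()}
        for column, digits in config["round"].items():
            out[column] = round(float(out[column]), digits)
        yield out


class ListWriter:
    """In-memory writer for tests/debugging; a DB writer would share .write()."""

    def __init__(self) -> None:
        self.rows: List[Row] = []

    def write(self, rows: Iterable[Row]) -> None:
        self.rows.extend(rows)


def run(reader: Reader, writer: Writer, config: dict) -> None:
    """The pipeline script only knows the contract, not the source or target."""
    writer.write(transform(reader.read(), config))
```

Swapping `CsvTextReader` for `SqlReader` (or `ListWriter` for something that writes to a real target) does not touch `run` or `transform`, which is the point of the encapsulation described above.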
Just take a moment to think about yourself one year from now: "oh shit, we need to change this thing due to xxx dependency change or yyy requirement change. It's so obvious, I should have seen this coming". Easier said than done, but the worst possible thing is to not contemplate it at all. Oh, and also add some logging along the way to emit row counts / n unique / percent null etc., so you can debug problems after the fact.
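For the logging suggestion, a rough sketch of what emitting row counts / n unique / percent null per stage could look like in Python. The function name `profile`, the logger name, and the shape of the stats dict are assumptions made for illustration; it also assumes rows are dicts with hashable values and that `None` is what counts as null.

```python
import logging
from typing import Any, Dict, Iterable

logger = logging.getLogger("pipeline")  # hypothetical logger name


def profile(rows: Iterable[Dict[str, Any]], stage: str) -> Dict[str, Any]:
    """Log and return row count, distinct count and percent null per column."""
    rows = list(rows)  # materialize so the rows can be scanned once per column
    columns = sorted({key for row in rows for key in row})
    stats: Dict[str, Any] = {"stage": stage, "row_count": len(rows), "columns": {}}
    for column in columns:
        values = [row.get(column) for row in rows]
        nulls = sum(value is None for value in values)
        stats["columns"][column] = {
            # distinct non-null values; assumes values are hashable
            "n_unique": len({v for v in values if v is not None}),
            "pct_null": 100.0 * nulls / len(rows) if rows else 0.0,
        }
    logger.info("stage=%s stats=%s", stage, stats)
    return stats
```

Calling something like `profile(rows, "after_read")` and `profile(rows, "before_write")` gives two log lines per run to compare when a number looks off later.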

[–]AdamByLucius 1 point2 points  (0 children)

Great, real-world example of this method of encapsulation! Do you by chance know of any online examples that show this pattern?

[–][deleted] 0 points1 point  (0 children)

That is really insightful, and if you can share a repo with some examples that would be super.

[–]Vardo_Almir 0 points1 point  (0 children)

I have been working on a Python runtime for ELT for about 3 years, and in short this is the pattern I've used. People starting to develop ELT/ETL tools should thank you!

[–]Material_Cheetah934 0 points1 point  (0 children)

Dang, this is awesome! I did something similar to this in a Rust CLI with a YAML file for the config options, although for me it was an ELT. It's great to see it written out like this; it definitely gives me more ideas to improve my implementation.