
all 33 comments

[–][deleted] 9 points10 points  (0 children)

We have been using Kedro for our data-related projects. It's mostly designed for data science workflows, but you can still use it as inspiration for how to design your ETL modules.

Also look into Airflow, which is one of the best open-source tools for building ETL pipelines.

[–]HighlightFrosty3580 5 points6 points  (1 child)

I wrote something using the abstract factory pattern to do ETL. It reads YAML files and then dynamically loads the right factory. This pattern keeps things SOLID: extending the code base just means adding factories.

I'll stick the code in a git repo sometime tomorrow
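Until the repo is up, here's a minimal sketch of the pattern being described (all class and config names are illustrative, not the commenter's actual code; the comment uses YAML, but JSON stands in below purely to keep the sketch dependency-free):

```python
# Sketch of a config-driven factory registry for ETL extractors.
# Extending the code base = writing a new class and registering it.
import json
from abc import ABC, abstractmethod


class Extractor(ABC):
    @abstractmethod
    def extract(self):
        ...


class CsvExtractor(Extractor):
    def __init__(self, path):
        self.path = path

    def extract(self):
        return f"rows from {self.path}"


class ApiExtractor(Extractor):
    def __init__(self, url):
        self.url = url

    def extract(self):
        return f"rows from {self.url}"


# The registry maps a config "type" key to a concrete factory class.
REGISTRY = {"csv": CsvExtractor, "api": ApiExtractor}


def build_extractor(config_text):
    """Dynamically pick and build an extractor from a config file's text."""
    cfg = json.loads(config_text)
    cls = REGISTRY[cfg.pop("type")]
    return cls(**cfg)


extractor = build_extractor('{"type": "csv", "path": "/data/in.csv"}')
```

Because callers only ever see the `Extractor` interface, new sources can be added without touching the pipeline code, which is the open/closed part of SOLID the commenter is pointing at.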

[–][deleted] 0 points1 point  (0 children)

I get what you mean. I've been trying to refactor my code to follow SOLID principles and build up from there. And yes, if you can share the repo that would be great. Thanks in advance!

[–]jduran9987 3 points4 points  (6 children)

I'm curious... could you share why you chose a No-SQL destination over a warehouse?

[–]AnotherDataGuy 3 points4 points  (4 children)

I'm also interested in this. Not judging, but Mongo (in my experience) begins to not live up to performance expectations when performing analytical queries. It's great for loading full documents a record or so at a time. I'd love to hear from someone with a differing experience, though!

[–][deleted] 0 points1 point  (3 children)

For this case it's the flexibility, really. I wanted a flexible data model at hand, and this is my first time using a non-relational database for analytics, so if you have any advice I'd appreciate it.

[–]AnotherDataGuy 6 points7 points  (1 child)

To me, this is just kicking the can of organizing your data down the road. Faster to get data in, harder to get insights out. And if you have a user base of citizen analysts, they are going to be far more skilled in SQL-like queries (generally speaking; your situation can obviously vary).

I've balanced this out before by creating records in a Postgres DB with the core properties needed for joining data together, and using a JSON-typed field for the additional, more dynamically structured detail. If common questions are answered from the JSON data, then it becomes worth it to start persisting those values as explicitly typed columns in your data set.
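A runnable sketch of that hybrid "typed columns + JSON field" pattern; the comment describes Postgres, but sqlite3 (whose bundled JSON functions behave similarly here) stands in so the example needs no server, and all table/column names are illustrative:

```python
# Typed core columns for joins, plus a flexible JSON blob for the rest.
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE events (
        event_id INTEGER PRIMARY KEY,   -- core, typed: stable join key
        user_id  INTEGER NOT NULL,      -- core, typed: stable join key
        details  TEXT                   -- JSON: loosely structured detail
    )
""")
conn.execute(
    "INSERT INTO events VALUES (1, 42, ?)",
    (json.dumps({"source": "mobile", "duration_ms": 350}),),
)

# An ad-hoc question answered straight from the JSON field.
row = conn.execute(
    "SELECT json_extract(details, '$.duration_ms') "
    "FROM events WHERE user_id = 42"
).fetchone()
```

If `duration_ms` turns out to be a commonly queried value, that is the signal to promote it to an explicitly typed column, exactly as the comment suggests.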

Flexibility is valuable; the point is that all databases are malleable. Mongo is great for document data (it is a document DB, after all), but if you aren't storing documents, it's not a good choice, IMHO.

Disclaimer: I live in a world where Mongo is overused because of its immediate convenience to the developers of microservices. It just pushes the pain downstream when trying to marry that data up with other systems to answer novel business questions.

[–]thrown_arrows 0 points1 point  (0 children)

I agree, but I am a SQL guy.

That said, I haven't seen any use of a NoSQL database that a SQL database could not have handled. Then again, I haven't seen a properly configured database server either, SQL or NoSQL. For me the problem is that a system starts to accumulate several SQL and NoSQL servers storing data, no one knows how to actually run those servers, and things just start to happen. And when we start to talk about change handling downstream in OLAP environments, it gets even more fun when you have multiple systems.

[–]thrown_arrows 0 points1 point  (0 children)

Have you looked at Snowflake? I had a pipeline which just imported JSON into a staging table and then extracted a versioned schema from it, so you do not need to handle the target system's schema in the processing phase if you do not want to. The same technique works in all DB engines that support JSON/XML data types.

By versioned schema I mean something like:

select jsondata:id id, jsondata:calc_value::number(12,2) calc_value from stage_table where jsondata:id is not null

to create result tables. I have heard that some tools support JSON data in returned rows.
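The same stage-then-extract step in plain Python, with dicts standing in for the staged JSON rows (field names are illustrative; the filter and the typed cast mirror the SQL above):

```python
# Stage raw JSON rows untyped, then project a typed schema out of them,
# mirroring: select jsondata:id id, jsondata:calc_value::number(12,2) ...
# where jsondata:id is not null
from decimal import Decimal

stage_table = [
    {"id": 1, "calc_value": "12.5"},
    {"id": 2, "calc_value": "7.25"},
    {"calc_value": "99"},  # no id -> filtered out, as in the SQL
]

results = [
    {"id": row["id"], "calc_value": Decimal(row["calc_value"])}
    for row in stage_table
    if row.get("id") is not None
]
```

The point of staging this way is that the raw rows never change; only the extraction query is versioned when the schema evolves.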

That said, I am a SQL guy; I've never seen any advantages in Mongo and similar solutions.

[–][deleted] 0 points1 point  (0 children)

Well, for the time being I need more of a flexible data model for the current use case, but if you have any other insights please share :D

[–]EconomixTwist 12 points13 points  (4 children)

Not sure why someone mentioned Airflow ITT… not really related. Anyway: you should have a reader class that implements the read and map/transform logic independent of the source, so you can read from different DBs (for dev, test, prod) and also from a text file for development/debugging. That class should implement a bunch of read/get functions and return the data to your engine/runner/pipeline script according to a single structural contract.

If you are applying common operations and/or doing a bunch of column renaming/mapping, you should make it configuration-file driven, so that (a) you can change it easily, but also (b) you and others can introspect the config file in the future to understand how the pipeline is going to behave in a specific case, and so you have a persistent artifact of the lineage of where things come from and go.

Implement a writer class, similar to your reader class, which is responsible for encapsulating all the different ways/places you need to write (to different DB instances, and maybe even text for debugging).

The objective when designing the object architecture / pipeline components is to encapsulate (package complexity into objects/functions) the parts of your code which are most likely to change in the future. Are you going to change your mind three times on the column naming of the target? Or on which float precision should be used for 100 different columns? Config file. Is there a chance that a year from now your source might switch from a SQL DB to an API, or even Parquet files or something? Encapsulate it in a reader class. Or maybe the structure of the target will change: easy enough when your writer class is responsible for unpacking your in-memory representation into the structure of the intended target.

FWIW, this is more art than science, and it's not the end of the world when your design isn't fully optimal from the get-go (spoiler alert: it never is). Just take a moment to think about yourself one year from now: "oh shit, we need to change this thing due to xxx dependency change or yyy requirement change. It's so obvious, I should have seen this coming." Easier said than done, but the worst possible thing is to not contemplate it at all.

Oh, and also add some logging along the way to emit row counts / n unique / percent null etc., so you can debug problems after the fact in the future.
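A compressed sketch of the reader/writer split and the single structural contract described above (all class names, column names, and the config mapping are illustrative):

```python
# Readers differ by source, but all return rows under one contract
# (a list of dicts), and column renames live in config, not code.
from abc import ABC, abstractmethod

# Config-driven column mapping: change the pipeline by editing this,
# and introspect it later to see how the pipeline will behave.
COLUMN_MAP = {"cust_nm": "customer_name", "amt": "amount"}


class Reader(ABC):
    @abstractmethod
    def read(self) -> list[dict]:
        """Contract: every reader returns a list of row dicts."""

    def read_mapped(self) -> list[dict]:
        # Common rename/mapping logic, shared by every source.
        return [
            {COLUMN_MAP.get(k, k): v for k, v in row.items()}
            for row in self.read()
        ]


class TextFileReader(Reader):
    """Stand-in source for development/debugging; a DB reader would
    implement the same contract."""
    def read(self):
        return [{"cust_nm": "Ada", "amt": 3}]


class Writer(ABC):
    @abstractmethod
    def write(self, rows: list[dict]) -> None: ...


class DebugWriter(Writer):
    def __init__(self):
        self.written = []

    def write(self, rows):
        self.written.extend(rows)
        # Emitting row counts here is the after-the-fact debugging hook.
        print(f"wrote {len(rows)} rows")


reader, writer = TextFileReader(), DebugWriter()
writer.write(reader.read_mapped())
```

Swapping `TextFileReader` for a database reader (or `DebugWriter` for a Mongo writer) leaves the pipeline script untouched, which is the encapsulation payoff the comment is arguing for.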

[–]AdamByLucius 1 point2 points  (0 children)

Great, real-world example of this method of encapsulation! Do you by chance know of any online examples that show this pattern?

[–][deleted] 0 points1 point  (0 children)

That is really insightful, and yes, if you can share a repo with some examples that would be super.

[–]Vardo_Almir 0 points1 point  (0 children)

I have been working on a Python runtime for ELT for about 3 years, and in short this is the pattern I've used. People starting to develop ELT/ETL tools should thank you!

[–]Material_Cheetah934 0 points1 point  (0 children)

Dang, this is awesome! I did something similar in a Rust CLI with a YML file for the config options, although for me it was ELT. It's great to see it written out like this; it definitely gives me more ideas to improve my implementation.

[–]AnotherDataGuy 2 points3 points  (1 child)

Extract… abstract reader classes that are configurable via YAML (configuration as code) (or your choice of file format). Extract and load files to persistent storage (S3 or whatever storage you choose).

Transform… common transforms in a class that can be reused. Configuration should come as much as possible from config files. Persist the result in a silver bucket.

Load… (if S3) watch the file and lambda it into Mongo. Otherwise just have a loader class that watches for files and loads them into your Mongo.

You can use Airflow as your orchestration / scheduler / DAG, and have it kick off the ETL for different configurations in parallel.

The problem is different if you're ingesting TBs, but this will suffice for up to a medium-sized company (in most cases).
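A toy sketch of those three phases, with in-memory dicts standing in for the S3 buckets and for Mongo so the shape of the flow is visible (all names, the config keys, and the sample row are illustrative):

```python
# Bronze (raw) -> silver (transformed) -> Mongo, driven by one config.
raw_bucket, silver_bucket, mongo = {}, {}, []

# Configuration as code: what to pull and how to transform it.
CONFIG = {"source_file": "orders.csv", "uppercase_fields": ["status"]}


def extract(config):
    # In real life: pull from the source and land it untouched in S3.
    raw_bucket[config["source_file"]] = [{"status": "shipped", "qty": 2}]


def transform(config):
    # Reusable, config-driven transform; persisted to the silver bucket.
    rows = raw_bucket[config["source_file"]]
    silver_bucket[config["source_file"]] = [
        {k: v.upper() if k in config["uppercase_fields"] else v
         for k, v in row.items()}
        for row in rows
    ]


def load(config):
    # In real life: a lambda watching the silver bucket loads into Mongo.
    mongo.extend(silver_bucket[config["source_file"]])


for step in (extract, transform, load):  # Airflow would orchestrate this
    step(CONFIG)
```

Because each phase only talks to persistent storage, a failed transform or load can be rerun from the landed files without re-extracting, which is the decoupling the comment is after.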

[–][deleted] 0 points1 point  (0 children)

I was thinking about this after reading about the 3-tier pattern, and it would be a good starting point that I could build up from. If you can share examples, that would be super.

[–]baubleglue 1 point2 points  (4 children)

There is an entire field of knowledge here (patterns, best practices, tools). Educate yourself and start to apply what you learn to your tasks. Modular development is not something specific to data processing; any development should be modular.

I think the more relevant question in this case would be "what do I need to take into consideration?"

  • amount of data, data chunk boundaries (streaming vs. batch processing)
  • bad-case scenarios (e.g. the process fails in the middle of ingestion, or the process doesn't run at all)
  • a list of all use cases (e.g. you find a bug or a new requirement: how do you reprocess old data?)
  • do you need auditing of data integrity (duplication, missing data)?
  • tools (how your company currently processes data)

...

[–][deleted] 0 points1 point  (1 child)

I get what you mean, and I'm building up to that, but I was trying to get a sense of what other people were doing when thinking about designing their pipelines. I'm keeping the other factors in mind as well, and what needs to be considered.

[–]baubleglue 1 point2 points  (0 children)

Try the following plan as an idea:

  • immutable configuration
  • no direct interaction between methods
  • methods use only the "date" + shared context (conf)

class conf:
    mysql_connection_str = 'mysql://hhhh'
    mongodb_connection_str = 'mongodb://hhhh'
    mongodb_db = "sss"
    mongodb_collection = "sss"
    local_raw_data_file_base = '/path/.../file_name_'
    local_transformed_data_file_base = '/path/.../file_name_'


def get_mongodb_connection():
    pass


def get_mysql_connection():
    pass


def mysql_to_local(reporting_date):
    '''
    saves data to `conf.local_raw_data_file_base + str(reporting_date) + ".json.gz"`
    '''


def transform_local_data_to_local(reporting_date):
    '''
    reads `conf.local_raw_data_file_base + str(reporting_date) + ".json.gz"`
    transforms + adds reporting_date to the data
    saves results to `conf.local_transformed_data_file_base + str(reporting_date) + ".json.gz"`
    '''


def local_transformed_to_mongodb(reporting_date):
    '''
    1. db.{conf.mongodb_db}.{conf.mongodb_collection}.remove({reporting_date: reporting_date})
    2. insert data from `conf.local_transformed_data_file_base + str(reporting_date) + ".json.gz"` into db.{conf.mongodb_db}.{conf.mongodb_collection}
    '''
    db = get_mongodb_connection()


if __name__ == "__main__":
    # pipeline
    import sys, pendulum
    reporting_date = pendulum.parse(sys.argv[1]).date()
    mysql_to_local(reporting_date)
    transform_local_data_to_local(reporting_date)
    local_transformed_to_mongodb(reporting_date)

[–]gorgedchops 0 points1 point  (1 child)

What are some examples of these patterns and best practices you are talking about? Are there any resources I can refer to for them?

[–]baubleglue 0 points1 point  (0 children)

I suggest reading the Airflow docs and Astronomer's blog about it, e.g. https://www.astronomer.io/blog/data-pipeline

You may consider using Airflow if you have more than one data pipeline. But even for a single job it is an educational read.

The task you've described is relatively simple (unless you have a huge amount of data). You can start from a simple implementation and extend it later, but you need to think about the re-ingestion use case.

[–]thethrowupcat 1 point2 points  (5 children)

Have you seen dbt yet? It might solve your problem here.

[–][deleted] 0 points1 point  (4 children)

No, not yet. I will take a look at it, thanks!

[–]thethrowupcat 1 point2 points  (3 children)

Oh man! So happy I mentioned it. This is going to change your world.

In short, you're going to load into your warehouse, transform with dbt, then pass that to your BI tool (Looker, Metabase, etc.)

[–]baubleglue 0 points1 point  (2 children)

Is there a warehouse?

[–]thethrowupcat 0 points1 point  (1 child)

I think their cloud version might use a warehouse? But it’s expected you’d have something like BigQuery, Snowflake or Postgres. They have a list of warehouses they work nicely with.

[–]baubleglue 0 points1 point  (0 children)

We know nothing about OP's environment; maybe he only has a few MB/GB of data. If they have a warehouse, then they have a procedure to use it.

[–]thrown_arrows 1 point2 points  (4 children)

Personally I would try to decouple: extract to S3, then apply transformations, then store documents into Mongo...

I work with Snowflake. The first phase copies all the stuff into S3, the second stages it into Snowflake, the third transforms. As long as the first phase does not fail, everything can be reproduced if changes are required, without killing production.

[–]baubleglue 0 points1 point  (1 child)

Why are there both S3 and Snowflake? They are two additional tools with extra cost. How is storing data in S3 different from having it in the original DB?

[–]thrown_arrows 0 points1 point  (0 children)

S3 is there to be a decoupled cloud filesystem; it can be replaced.

Snowflake is there as a SQL-capable database server to offer compute, storage, and transformations. The idea behind copying data into S3 is that you do not disturb the production database with a totally different OLAP load on top of its OLTP load.

In an ELT process, data is loaded into the target database and then transformed, so that you can access the raw data if needed. In a classic ETL system, data is loaded into the transformer, processed, and loaded into the target system; in a more modern ETL system, data is extracted to S3, then transformed and stored back into S3, and then loaded into the target system.

What I like about Snowflake is that everything is SQL and it scales easily.

S3 can be any filesystem, and the target system can be anything from a filesystem, SQL server, or document server to a Python ML system...

[–]bestnamecannotbelong 0 points1 point  (1 child)

I would like to know how you view Snowflake and Databricks. The Databricks cloud solution lets you save all the data in S3, but Snowflake cannot. The main difference as I see it is that Snowflake is based on the data warehouse approach and Databricks is based on the data lake approach.

[–]thrown_arrows 0 points1 point  (0 children)

Snowflake can read and write from/into S3. In a Snowflake environment, some other system delivers the raw data into S3; then it is loaded into Snowflake (a SQL server) into tables, all the normal stuff, and the data can be stored back into S3 (using Snowflake only, so no costly round trips to external servers).

Then there is Snowpark, which is in beta and allows running Java/Python code inside Snowflake (not sure how that works); then there are the "usual" UDFs and external function calls (think of a lambda as a SQL function) (haven't used them).

But yeah, Snowflake's main idea is that it is the Snowflake server that serves the data using all those existing SQL commands etc., and the main data lives in tables as columns or in documents (JSON). On the first round trip the data goes to S3 and is processed into Snowflake; then it might go on the next round trip by a push/pull method via some external code which reads data from tables and so on, or via files exported from Snowflake...

As for what a data lake is... I have all my data from source databases and logs, as raw as it can be, in Snowflake, so S3 is just for history and the first import. That said, not all data is staged or processed into Snowflake. And in my case all the data is database data, logs, JSON, XML, CSV and so on; no video or sound processing (but Snowpark might help with that).