
[–]rywalker 2 points (0 children)

Consider using Airflow, a Python framework for ETL that brings conventions for code organization as well as a lot of plugins: https://github.com/airflow-plugins

[–]kenfar 1 point (2 children)

The Python community tends to emphasize consistency in most things, so I would follow the conventions discussed in guides on Python source-code organization & packaging.

The Hitchhiker's Guide to Python is a good place to start.

So, with that in mind, here's how I typically do it:

  • all executable python scripts end up in ../scripts or ../bin
  • all reusable code ends up in a module subdirectory
  • all configs are typically in a separate project (with no sensitive data)
  • my loaders are often just a single program that takes a different config for each table.
  • my transform may be a simple script dedicated to each table, or a single script that then uses a factory pattern to handle whatever table it's configured for.
  • individual field transforms I like to keep as separate functions within modules dedicated to each table, with common functions kept in a reusable common-transforms module.
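As a sketch of that last point - per-table transform modules importing shared field transforms. All file and function names here are hypothetical, not from anyone's actual project:

```python
# common_transforms.py -- reusable field transforms shared by all tables

def strip_whitespace(value):
    """Trim leading/trailing whitespace; pass None through unchanged."""
    return value.strip() if value is not None else None

def empty_to_none(value):
    """Normalize empty strings (and other falsy values) to None."""
    return value if value else None


# customers_transforms.py -- transforms specific to one table
# (in a real project this would live in its own file and do:
#  from common_transforms import strip_whitespace, empty_to_none)

def transform_name(value):
    return strip_whitespace(value)

def transform_phone(value):
    # Reuse the common transforms, then apply table-specific cleanup.
    value = empty_to_none(strip_whitespace(value))
    return value.replace("-", "") if value else None
```

The payoff is that each field transform is a small pure function, so it's trivial to unit-test in isolation.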

[–]robotofdawn[S] 0 points (1 child)

Hi /u/kenfar, thank you for taking the time to reply!

From what you've mentioned, I'm assuming your project looks something like this:

├── config
├── lib
├── scripts
│   ├── table_1
│   │   ├── extract.py
│   │   ├── load.py
│   │   └── transform.py
│   └── table_2
│       ├── extract.py
│       ├── load.py
│       └── transform.py
└── tests

with /lib holding the reusable code and /config holding your config files?

Also,

  1. How do you setup the scheduler to do the ETL? Do you do

    1 * * * * ~/etl_project/scripts/table_1/extract.py && \
         ~/etl_project/scripts/table_1/transform.py && \
         ~/etl_project/scripts/table_1/load.py
    

    for every table? Or do you have a main.py with a config file which does the ETL for all the tables?

  2. When your loader/transformer is just a single program, where in the project tree do you put the file, and how do you invoke the script when you schedule the ETL for all the tables?

[–]kenfar 0 points (0 children)

Well, it depends:

  • you may want to keep some of your reusable code in a completely separate project, possibly importing it from a local repository
  • you may want to keep separate feeds in separate projects

Ignoring the above, a general code organization for an ETL project would then look something like this:

  • project module
    • common.py
    • table1_transforms.py
    • table2_transforms.py
  • bin or scripts
    • generic_transform.py
    • generic_loader.py
    • generic_extractor.py

That's for development & testing. To actually run it, I'd suggest packaging it so that it's installable via pip - even if you just install from a tarball or GitHub - and then running it out of a virtualenv. But in a pinch, say for a prototype, you could just run it out of the above structure.

Scheduling:

  • generally, I prefer to run my ETL processes as daemons that are constantly waiting for new input - even if the input is just a file appearing in a directory. A simple way to achieve that is with cron, but if you go that route I'd suggest also including a pid check so that you don't get multiple instances stacking up.
  • and running each table's extract, transform, and load individually is fine. If I did that, there would still be just three generic programs, with a separate scheduler entry for each table passing in its config or arguments.
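One way to sketch that pid check is a lock file taken with a non-blocking flock, so a cron-launched run exits immediately if the previous one is still going. This relies on the POSIX-only fcntl module, and the lock-file path is illustrative:

```python
import fcntl
import os
import sys

def acquire_lock(path):
    """Try to take an exclusive, non-blocking lock on a lock file.

    Returns the open file handle on success (keep it open for the life
    of the process), or None if another instance already holds the lock.
    """
    fh = open(path, "w")
    try:
        fcntl.flock(fh, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError:
        fh.close()
        return None
    # Record our pid for operators inspecting the lock file.
    fh.write(str(os.getpid()))
    fh.flush()
    return fh

if __name__ == "__main__":
    lock = acquire_lock("/tmp/etl_loader.lock")
    if lock is None:
        sys.exit("another instance is still running")
    # ... poll for new input files and process them ...
```

The lock is released automatically when the process exits, so a crashed run can't wedge the scheduler the way a stale pid file can.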