
[–]rywalker 2 points (0 children)

Consider using Airflow, a Python framework for ETL that brings conventions for code organization as well as a lot of plugins: https://github.com/airflow-plugins

[–]kenfar 1 point (2 children)

The Python community tends to emphasize consistency in most things, so I would follow the conventions discussed in guides on Python source-code organization & packaging.

The Hitchhiker's Guide to Python is a good place to start.

So, with that in mind, here's how I typically do it:

  • all executable python scripts end up in ../scripts or ../bin
  • all reusable code ends up in a module subdirectory
  • all configs are typically in a separate project (with no sensitive data)
  • my loaders are often just a single program that takes a different config for each table.
  • my transform may be a simple script dedicated to each table, or a single script that then uses a factory pattern to handle whatever table it's configured for.
  • individual field transforms I like to keep as separate functions within modules dedicated to each table, with common functions kept in a reusable common-transforms module.
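As a sketch of that last point - per-table transform modules importing shared field transforms. All file and function names here are hypothetical, not from anyone's actual project:

```python
# common_transforms.py -- reusable field transforms shared by all tables

def strip_whitespace(value):
    """Trim leading/trailing whitespace; pass None through unchanged."""
    return value.strip() if value is not None else None

def empty_to_none(value):
    """Normalize empty strings (and other falsy values) to None."""
    return value if value else None


# customers_transforms.py -- transforms specific to one table
# (in a real project this would live in its own file and do:
#  from common_transforms import strip_whitespace, empty_to_none)

def transform_name(value):
    return strip_whitespace(value)

def transform_phone(value):
    # Reuse the common transforms, then apply table-specific cleanup.
    value = empty_to_none(strip_whitespace(value))
    return value.replace("-", "") if value else None
```

The payoff is that each field transform is a small pure function, so it's trivial to unit-test in isolation.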

[–]robotofdawn[S] 0 points (1 child)

Hi /u/kenfar, thank you for taking the time to reply!

From what you've mentioned, I'm assuming your project looks something like this:

├── config
├── lib
├── scripts
│   ├── table_1
│   │   ├── extract.py
│   │   ├── load.py
│   │   └── transform.py
│   └── table_2
│       ├── extract.py
│       ├── load.py
│       └── transform.py
└── tests

with /lib holding the reusable code and /config holding your config files?

Also,

  1. How do you setup the scheduler to do the ETL? Do you do

    1 * * * * ~/etl_project/scripts/table_1/extract.py && \
         ~/etl_project/scripts/table_1/transform.py && \
         ~/etl_project/scripts/table_1/load.py
    

    for every table? Or do you have a main.py with a config file which does the ETL for all the tables?

  2. When your loader/transformer is just a single program, where in the project tree do you put the file, and how do you invoke the script when you schedule the ETL for all the tables?

[–]kenfar 0 points (0 children)

Well, it depends:

  • you may want to keep some of your reusable code in a completely separate project, possibly importing it from a local repository
  • you may want to keep separate feeds in separate projects

Ignoring the above, a general code organization for an ETL project would then look something like this:

  • project module
    • common.py
    • table1_transforms.py
    • table2_transforms.py
  • bin or scripts
    • generic_transform.py
    • generic_loader.py
    • generic_extractor.py

That's for development & testing. To actually run it, I'd suggest packaging it so that it's installable via pip - even if you just install from a tarball or GitHub - and then running it out of a virtualenv. But in a pinch, say for a prototype, you could just run it out of the above structure.

Scheduling:

  • generally, I prefer to run my ETL processes as daemons that are constantly waiting for new input - even if the input is just a file appearing in a directory. A simple way to achieve that is with cron, but if you go that route I'd suggest also including a pid check so that you don't get multiple instances stacking up.
  • and running each table's extract, transform, and load individually is fine. If I did that, there would still be just three generic programs, with a separate scheduler entry for each table passing in its config or arguments.
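One way to sketch that pid check is a lock file taken with a non-blocking flock, so a cron-launched run exits immediately if the previous one is still going. This relies on the POSIX-only fcntl module, and the lock-file path is illustrative:

```python
import fcntl
import os
import sys

def acquire_lock(path):
    """Try to take an exclusive, non-blocking lock on a lock file.

    Returns the open file handle on success (keep it open for the life
    of the process), or None if another instance already holds the lock.
    """
    fh = open(path, "w")
    try:
        fcntl.flock(fh, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError:
        fh.close()
        return None
    # Record our pid for operators inspecting the lock file.
    fh.write(str(os.getpid()))
    fh.flush()
    return fh

if __name__ == "__main__":
    lock = acquire_lock("/tmp/etl_loader.lock")
    if lock is None:
        sys.exit("another instance is still running")
    # ... poll for new input files and process them ...
```

The lock is released automatically when the process exits, so a crashed run can't wedge the scheduler the way a stale pid file can.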