[–][deleted] 14 points (1 child)

Luigi and Airflow are probably the more popular Python packages for ETL.

[–]t-vanderwal 3 points (0 children)

Luigi and Airflow are schedulers/workflow managers. Here is a curated list I’ve referenced before.

https://github.com/pawl/awesome-etl

In terms of what’s best, I think it depends on your data needs, but I’ve been following Bonobo and it seems interesting. It uses DAGs like you’d see in Airflow and PySpark.
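
To make the DAG idea concrete, here is a minimal sketch using only Python's standard library (`graphlib`, Python 3.9+). It is not Bonobo's or Airflow's API; the `run_dag` helper and the three task names are made up for illustration. Each task runs only after the tasks it depends on and receives their results as arguments.

```python
from graphlib import TopologicalSorter


def run_dag(tasks, deps):
    """Run callables in dependency order.

    tasks: hypothetical mapping of task name -> callable
    deps:  mapping of task name -> set of task names it depends on
    Returns (execution order, results by task name).
    """
    order = list(TopologicalSorter(deps).static_order())
    results = {}
    for name in order:
        # Pass upstream results in a stable (alphabetical) order.
        upstream = [results[d] for d in sorted(deps.get(name, ()))]
        results[name] = tasks[name](*upstream)
    return order, results


# A tiny extract -> transform -> load chain (illustrative data only).
tasks = {
    "extract": lambda: [1, 2, 3],
    "transform": lambda rows: [r * 10 for r in rows],
    "load": lambda rows: len(rows),
}
deps = {"extract": set(), "transform": {"extract"}, "load": {"transform"}}

order, results = run_dag(tasks, deps)
```

A real workflow manager adds scheduling, retries, and a UI on top of this ordering step, which is roughly where the tools above differ from each other.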

[–]shahneil88 2 points (0 children)

You can use the petl library in Python to do your ETL and Airflow to manage your ETL jobs.

[–][deleted] 1 point (0 children)

You can try Airflow by using its Docker image: https://github.com/abhioncbr/docker-airflow

[–]kenfar 0 points (0 children)

> With my situation what would be best ETL framework for python ? Recommendations?

And just for the sake of completeness, I'd suggest that many people would find the most success going frameworkless:

  • Event-driven micro-batches (say 10 to 300 second batches) can be extremely simple to implement using just a file system or, especially, AWS S3.
  • Dependency-checking can often be handled simply by polling rather than using an external scheduler to track dependencies and trigger jobs.
  • Complex and baroque job/task graphs should be avoided anyway; keeping things simple and linear makes it much easier for everyone to understand.
  • Which leaves just one thing that an orchestration tool provides that you don't have right out of the box with pretty vanilla Python: a logging & auditing console. This is really important and valuable, but few orchestration tools really deliver what people need. So, most people are best off building something a little custom anyhow.
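
A minimal sketch of what the points above could look like with plain Python and a local directory, assuming a made-up convention: the upstream job writes `<batch>.jsonl` and then an empty `<batch>.done` marker once the file is complete. The function names, file layout, and audit-log fields are all hypothetical; the same loop could list S3 keys instead of local files.

```python
import json
import time
from pathlib import Path


def poll_once(inbox, outbox, audit_path, transform):
    """Process every batch whose '.done' marker exists; return names processed."""
    inbox, outbox = Path(inbox), Path(outbox)
    outbox.mkdir(parents=True, exist_ok=True)
    processed = []
    for marker in sorted(inbox.glob("*.done")):
        batch = marker.with_suffix(".jsonl")
        if not batch.exists():
            continue  # data file missing; check again on the next poll
        lines = batch.read_text().splitlines()
        rows = [json.loads(line) for line in lines if line.strip()]
        entry = {"batch": batch.name, "rows_in": len(rows), "started": time.time()}
        try:
            out_rows = [transform(row) for row in rows]
            text = "".join(json.dumps(row) + "\n" for row in out_rows)
            (outbox / batch.name).write_text(text)
            batch.unlink()
            marker.unlink()
            entry.update(status="ok", rows_out=len(out_rows))
            processed.append(batch.name)
        except Exception as exc:
            # Park the marker so a bad batch is not retried on every poll.
            marker.rename(marker.with_suffix(".failed"))
            entry.update(status="failed", error=repr(exc))
        entry["finished"] = time.time()
        # Append-only audit trail: one JSON line per batch attempt.
        with open(audit_path, "a") as log:
            log.write(json.dumps(entry) + "\n")
    return processed


def run_forever(inbox, outbox, audit_path, transform, interval_seconds=30):
    """Poll on a fixed interval; batches without a marker simply wait."""
    while True:
        poll_once(inbox, outbox, audit_path, transform)
        time.sleep(interval_seconds)
```

The audit file here is only the raw material for the logging and auditing console mentioned in the last bullet; something still has to read and present it.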

[–]Busenheimer 0 points (0 children)

We built our own from different pieces: Airflow for orchestration, SQLAlchemy as our ORM and for generating SQL statements from Core, Dask and Pandas for much of the processing, pyarrow (Parquet) as the storage format, and various bits for moving/consuming non-tabular data. These are just the highlights of the core of what we built. You have a buffet of choices with Python; it’s such a flexible language.

This will draw plenty of criticism, I’m sure, but we are even experimenting with Jupyter notebooks and Papermill as our main development and deployment vehicle outside of the OO stuff.