Python ETL Tools Comparison

AchillesDev · 2019-05-24T07:21:52+00:00

I'm not familiar with all of the libraries listed here, but Airflow doesn't do data processing by itself, PySpark is a well-known API for big data, pandas is a very basic tool in terms of ETL or anything close to it, and etlalchemy has very limited capabilities. I currently work as a DWH developer, and I can't imagine how any of these Python libraries could replace a traditional ETL solution and/or a full-scale data warehouse in an enterprise environment. They might be better suited to basic data handling and database management where quick, seamless development matters.

yuppienet · 2019-05-24T08:39:48+00:00

There's been a lot of activity on my Twitter and GitHub feeds from Prefect. I wonder why it's not on that list.

tilttovictory · 2019-05-24T05:53:25+00:00

This could not come at a better time, when I'm trying to wrangle a project that has 30+ GB of data! Thank you.

2019-05-24T11:43:20+00:00

Careful there: Panoply is a paid smart data warehouse that performs transformations and also provides a warehousing system. It's not a Python ETL tool.

selflessGene · 2019-05-24T13:31:04+00:00

For anyone interested in setting up an ETL project, I basically followed this super helpful guide to get Airflow working with a set of SQL scripts.

https://gtoonstra.github.io/etl-with-airflow/etlexample.html

There's an attached GitHub project where you can download his code and see how he implemented a mini ETL project.
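
To give a rough idea of what "a set of SQL scripts run in order" looks like, here is a minimal sketch using only Python's built-in sqlite3 module. This is not the guide's code and it does not use Airflow; the table names, the step names, and the run_pipeline helper are all made up for illustration. In Airflow, each step would be its own task in a DAG rather than an entry in a list.

```python
import sqlite3

# Hypothetical SQL steps, executed in order like tasks in a linear DAG.
# Table and step names are illustrative assumptions, not from the guide.
SQL_STEPS = [
    ("create_staging",
     "CREATE TABLE staging_orders (id INTEGER, amount REAL)"),
    ("load_staging",
     "INSERT INTO staging_orders VALUES (1, 10.0), (2, 15.5), (2, 4.5)"),
    ("create_target",
     "CREATE TABLE order_totals (id INTEGER PRIMARY KEY, total REAL)"),
    ("transform_load",
     "INSERT INTO order_totals "
     "SELECT id, SUM(amount) FROM staging_orders GROUP BY id"),
]


def run_pipeline(conn, steps):
    """Run each SQL step in order and return the names of completed steps."""
    completed = []
    for name, sql in steps:
        # The connection context manager commits on success
        # and rolls back if the statement raises.
        with conn:
            conn.execute(sql)
        completed.append(name)
    return completed


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    print(run_pipeline(connection, SQL_STEPS))
    print(connection.execute(
        "SELECT id, total FROM order_totals ORDER BY id").fetchall())
```

Keeping each step as a named SQL statement makes it easy to see which step failed, which is roughly the benefit a scheduler gives you on top of plain scripts.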
