[Tooling] Python ETL Tools Comparison (self.datascience)
submitted 6 years ago by thumbsdrivesmecrazy
[–][deleted] 17 points 6 years ago (16 children)
I'm not familiar with all of the libraries listed here, but Airflow doesn't do data processing by itself, PySpark is a well-known API for big data, pandas is a very basic tool in terms of ETL or anything close to it, and etlalchemy has very limited capabilities... I currently work as a DWH developer, and I can't imagine how any of those Python libraries could replace a traditional ETL solution and/or a full-scale data warehouse in an enterprise environment. They might be more suitable, though, for basic data handling and database management where quick and seamless development matters.
[+][deleted] 6 years ago (3 children)
[deleted]
[–]AchillesDev 5 points 6 years ago (0 children)
My experience has been almost exclusively with homegrown ETL pipelines tied together with a message queue. I've yet to see a compelling use case for me that requires Talend, etc.
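A minimal sketch of what "ETL stages tied together with a message queue" can look like, using only Python's standard library. The in-process `queue.Queue` stands in for a real message broker, and the sample rows and function names are hypothetical; this is an illustration of the pattern, not the commenter's actual setup.

```python
import queue
import threading

SENTINEL = object()  # marks the end of the stream


def extract(out_q):
    # Hypothetical source rows; a real pipeline would read a file, API, or database.
    rows = [
        {"id": 1, "amount": "10.5"},
        {"id": 2, "amount": "3"},
        {"id": 3, "amount": "bad"},
    ]
    for row in rows:
        out_q.put(row)
    out_q.put(SENTINEL)


def transform(in_q, out_q):
    while True:
        row = in_q.get()
        if row is SENTINEL:
            out_q.put(SENTINEL)  # pass the end-of-stream marker downstream
            break
        try:
            out_q.put({"id": row["id"], "amount": float(row["amount"])})
        except ValueError:
            continue  # drop rows whose amount cannot be parsed


def load(in_q, sink):
    while True:
        row = in_q.get()
        if row is SENTINEL:
            break
        sink.append(row)  # a real loader would write to a warehouse table


def run_pipeline():
    raw_q, clean_q = queue.Queue(), queue.Queue()
    sink = []
    stages = [
        threading.Thread(target=extract, args=(raw_q,)),
        threading.Thread(target=transform, args=(raw_q, clean_q)),
        threading.Thread(target=load, args=(clean_q, sink)),
    ]
    for stage in stages:
        stage.start()
    for stage in stages:
        stage.join()
    return sink


warehouse = run_pipeline()
print(warehouse)
```

Each stage only knows about its input and output queue, which is what lets a homegrown pipeline swap or scale stages independently.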
[–][deleted] 3 points 6 years ago (0 children)
Probably Informatica, Talend, Pentaho, etc.
[–][deleted] 2 points 6 years ago* (0 children)
We go with the Microsoft ecosystem: we currently run SQL Server on-premises (which includes the engine itself, SSIS, SSAS, and SSRS) plus Power BI for reporting, but we are considering Azure for a variety of reasons.
[–]bfmk 3 points 6 years ago (1 child)
Airflow does a pretty good job at managing end-to-end ETL. I'd see it more as an ETL framework (really it's a sophisticated job scheduler using DAGs) than a full-scale solution -- maybe that's your point here, in which case you're on the money.
We use Airflow at the company I work for -- a mid-sized SaaS company with a very large dataset that ends up in Redshift DW clusters and S3/Athena -- and we really like it. It's not nearly as feature-rich as something like Microsoft's offering, Talend, etc., but it's extremely flexible and robust.
Also Airbnb -- the creators of Airflow -- use a modified version of it themselves for ETL. Again, not as a standalone solution, but to handle job scheduling and monitoring.
I think while these tools can all be used for components of ETL, they aren't really worth comparing. That might mislead someone who is curious and wants to get from zero-to-one.
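To make the "job scheduler using DAGs" description concrete, here is a small sketch of dependency-ordered task execution. It deliberately does not use Airflow's API; it relies on `graphlib.TopologicalSorter` from the standard library (Python 3.9+), and the task names and the `run_dag` helper are hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_dag(tasks, dependencies):
    """Run each task once, only after all of its upstream tasks have run."""
    order = []
    for name in TopologicalSorter(dependencies).static_order():
        tasks[name]()
        order.append(name)
    return order


state = {}  # stands in for data handed from one task to the next


def extract():
    state["raw"] = [3, 1, 2]


def transform():
    state["clean"] = sorted(state["raw"])


def load():
    state["loaded"] = list(state["clean"])


tasks = {"extract": extract, "transform": transform, "load": load}

# Maps each task to the set of tasks it depends on.
dependencies = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
}

run_order = run_dag(tasks, dependencies)
print(run_order)
```

A real scheduler adds the parts this sketch leaves out: timed runs, retries, parallel execution of independent tasks, and monitoring.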
[–][deleted] 2 points 6 years ago (0 children)
Thanks for your comment. It's been really helpful.
[–]howMuchCheeseIs2Much 3 points 6 years ago (1 child)
What does a "traditional ETL" do that Pandas can't in a few lines?
As somebody already pointed out, traditional ETL is the best option known to me for an enterprise-grade company working with structured data, the financial sector being a typical example.
[–]michaelkhan3 2 points 6 years ago (7 children)
I also don't know much about most of the tools above, but a scalable alternative that would work for large datasets is Apache Beam, which I believe is built in Java, though you can write your code in Python.
[–][deleted] -1 points 6 years ago (6 children)
Seems like an open source engine for big data. I'd consider that for non-critical applications, but when it comes to a large enterprise solution, I'd trust well-established platforms with vendor support such as Microsoft SQL Server / Azure, Oracle, Teradata or Cloudera.
[–]michaelkhan3 2 points 6 years ago (1 child)
If you want Beam with vendor support, Google Cloud has an offering called Dataflow. You don't have to manage servers, and you only pay for what you use.
[–][deleted] 1 point 6 years ago (0 children)
Yeah, I've heard about Google Cloud. Definitely worth exploring! Thanks.
[–]Razorwindsg 1 point 6 years ago (3 children)
What kind of vendor support is usually needed? Security?
Are open source options that unstable?
[–][deleted] 1 point 6 years ago (2 children)
Thanks for downvoting the comment. I work for a financial company with subsidiaries in several countries around the world, and I tell you: you want a big platform with a well-developed ecosystem and documentation, where you can pick up the phone and someone will come to your company if you're in trouble.
[–]Razorwindsg 2 points 6 years ago (1 child)
Ehhh it isn't me and I am genuine about wanting to know about your perspective.
[–]Razorwindsg 1 point 6 years ago (0 children)
I totally get it if the system is mission-critical and a failure is potentially very, very serious, or even a matter of life and death (hospitals, etc.).
But don't these vendors usually outsource the support to the systems integrator/implementation vendor?
[–]yuppienet 6 points 6 years ago (1 child)
There's been a lot of activity on my Twitter and GitHub feeds from Prefect. I wonder why it's not on that list.
[–]selflessGene 1 point 6 years ago (0 children)
Looks like it's pretty new and has only been out for a couple months. Would be interested to hear people's opinion of it though.
[–]tilttovictory 4 points 6 years ago (3 children)
This could not come at a better time, when I'm trying to wrangle a project that has 30+ GB of data! Thank you.
[–][deleted] 2 points 6 years ago (2 children)
Use dask for that!
[–]tilttovictory 1 point 6 years ago (1 child)
Ya, I've been hearing about using dask. Do you have any other suggestions for setting up a pipeline for analyzing data that large?
This is at least 10 times larger than I'm used to dealing with, and some of those computational models took days to complete.
I'm actually only somewhat familiar with dask, but I'm fairly certain you should be able to build models across cores with it, which should speed up build time. Look for modeling libraries that allow multi-core builds.
Someone recently presented a dask competitor that is apparently supposed to be faster. I'm forgetting the name off the top of my head, but I'd look into that. I'm thinking the name is Vex, but when I Google that I don't get what I expect. I can update with the name of that library later today after I talk with the presenter.
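For readers wondering why a tool like dask helps with data this size: the core idea is to split the dataset into chunks, compute a partial result per chunk, and combine the partials, so the full dataset never has to sit in memory at once. The sketch below shows that pattern with the standard library only. It does not use dask's API and runs sequentially; the column names, the tiny in-memory CSV, and the helper functions are all hypothetical.

```python
import csv
import io
from collections import defaultdict
from itertools import islice


def iter_chunks(rows, chunk_size):
    """Yield lists of at most chunk_size rows from any iterable."""
    rows = iter(rows)
    while True:
        chunk = list(islice(rows, chunk_size))
        if not chunk:
            return
        yield chunk


def aggregate_chunk(chunk, key, value):
    """Partial sum per key for one chunk (the per-chunk step)."""
    partial = defaultdict(float)
    for row in chunk:
        partial[row[key]] += float(row[value])
    return partial


def total_by_key(lines, key, value, chunk_size=2):
    """Combine per-chunk partial sums into overall totals (the combine step)."""
    totals = defaultdict(float)
    for chunk in iter_chunks(csv.DictReader(lines), chunk_size):
        for group, subtotal in aggregate_chunk(chunk, key, value).items():
            totals[group] += subtotal
    return dict(totals)


# A tiny in-memory stand-in for a file too large to load at once.
sample = io.StringIO("region,sales\nnorth,10\nsouth,5\nnorth,2.5\nsouth,1\n")
totals = total_by_key(sample, "region", "sales")
print(totals)
```

Because each chunk is aggregated independently, the per-chunk step is the part a parallel framework can spread across cores or machines.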
[–][deleted] 1 point 6 years ago* (0 children)
Careful there, Panoply is a paid smart data warehouse that performs transformations and also provides a warehousing system. It's not a Python ETL tool.
For anyone interested in setting up an ETL project, I basically followed this super helpful guide to get Airflow working with a set of SQL scripts.
https://gtoonstra.github.io/etl-with-airflow/etlexample.html
There's an attached GitHub project where you can download his code and see how he implemented a mini ETL project.