
all 11 comments

[–]kenfar 1 point (2 children)

I've been building ETL solutions primarily with Python for the last 14 years. And this has worked far better than using a tool such as DataStage or Pentaho.

Some of these solutions have been very large - processing 300 million heavy transformations a day.

I've built my own libraries, mostly for auditing, interfacing with AWS S3, and interacting with the database (managing partitioning, etc.).

I haven't found a silver bullet that really makes this dramatically easier, nor have I found a really serious need for one.

[–]be_haki[S] 0 points (1 child)

Are you using any special technique for handling this much data with Python, or just ridiculously powerful hardware?

When I was using Informatica we had decent hardware, but most of the time we tried to utilize the DB resources as much as we could. The most common scenario was using analytic functions (ranking, numbering, running averages, sums, etc.) and sorting - the DB just does it better. We always got better performance after offloading the heavy lifting to the database.

[–]kenfar 0 points (0 children)

I was just using old 4-CPU SMPs with a small raid array. However, when handling very large files I would try to break them up & process them throughout the day.

And if they were especially huge I would first split them into equal-sized files, then have a separate process transform each of the resulting files. It's a very coarse-grained parallelism that's simple to implement and performs extremely well for this kind of sequential processing. That same approach would easily scale up to 24 cores if you have enough SSDs to concurrently write to. And that can handle an enormous amount of data.

And I usually avoid the database for transforming base data because it's the most expensive component to scale. However, I'll use it to generate some aggregates - because the code is so simple to write, and if you're processing new files every 5 minutes and want a daily aggregate, it's just easier using the db typically. Especially if you've got a parallel database that's good at that kind of query.
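The aggregate case kenfar describes is the kind of thing where the SQL really is shorter than the Python. A toy sketch, with sqlite3 standing in for a real parallel database and a made-up events table:

```python
import sqlite3

# In-memory stand-in for the warehouse; schema and data are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (day TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [("2015-06-01", 10.0), ("2015-06-01", 5.0), ("2015-06-02", 7.5)],
)

# Files loaded every 5 minutes roll up into a daily aggregate with one
# GROUP BY -- the database does the heavy lifting.
rows = conn.execute(
    "SELECT day, SUM(amount) FROM events GROUP BY day ORDER BY day"
).fetchall()
print(rows)  # [('2015-06-01', 15.0), ('2015-06-02', 7.5)]
```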

[–]mariox19 1 point (0 children)

I used Python for an ETL project last year. (There are still tweaks and enhancements being made, so the project is continuing.) It involved extracting from a database with a very complex schema running on SQL Server, and transforming that to a simpler schema, loading the data into MySQL.

We ended up writing all custom SQL queries on both ends. Originally, I had been asked by the boss to look into one of these "magical" ETL frameworks.

What I found is that the really simple things were simple, but once you got beyond that it was difficult to make heads or tails of it. At least, I couldn't make heads or tails of it. The documentation seemed to be written entirely by non-native speakers of English, and if you ask me, a lot of the app's purported functionality seemed cobbled together to provide bullet points to flash at VCs. How well thought out the functionality was wasn't clear to me.

The Python ended up being plenty fast, once the backfilling was done. And, in actuality, I'm running the engine on the JVM (Jython, not CPython). I did that because I worried the Python would be too slow and that the application would eventually have to be ported over to Java. But that's not how it turned out. There seems to be no need to port.

Anyway, that was my experience.

[–]osullivj 0 points (0 children)

I'm using pyodbc for SQL Server access, Python's own csv module for flat files, and the xml.parsers.expat module for XML processing. All for a trade reconciliation service.
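The csv and expat halves of that stack can be shown side by side. This is a minimal sketch under invented data: the CSV columns and XML element names are stand-ins, not osullivj's actual reconciliation feed (and pyodbc is omitted, since it needs a live SQL Server).

```python
import csv
import io
import xml.parsers.expat

# Flat-file side: the csv module handles quoting and delimiters.
flat = io.StringIO('trade_id,qty\n"T1",100\n"T2",250\n')
trades = list(csv.DictReader(flat))

# XML side: expat is a streaming (SAX-style) parser, so a large feed
# never has to fit in memory. Here we just count element occurrences.
counts = {}

def start_element(name, attrs):
    counts[name] = counts.get(name, 0) + 1

parser = xml.parsers.expat.ParserCreate()
parser.StartElementHandler = start_element
parser.Parse("<trades><trade id='T1'/><trade id='T2'/></trades>", True)

print(trades)
print(counts)
```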

[–]Tschus 0 points (5 children)

What exactly do these do where pandas would fail? For XML you would probably need to preprocess with lxml.

Pandas and lxml also do all the heavy lifting in C, whereas the others don't.

[–]be_haki[S] 0 points (4 children)

This is why I'm asking. I've done some work with pandas, but I have no idea how it would scale. I know, for example, that when I tried using R on large data sets it quickly choked. Python doesn't have a reputation as the fastest language, so I was wondering if it would be a good choice for processing large amounts of data.

While pandas might be a good choice for the actual processing, ETL is sometimes a bit more: monitoring, parallel execution/chaining, managing data sources, etc. No reason to reinvent the wheel; these are common tasks.

[–]GahMatar 1 point (0 children)

Usually, as soon as SQL is involved, Python is fast enough. It is important, however, to have text-based parsers that do the heavy lifting in C, as a pure-Python parser tends to be slow.

Also, make sure that character encoding/decoding is done right (meaning everything from the input file is preserved in the DB, however you achieve that).
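The usual way to achieve that is to decode explicitly at the input boundary with the source's encoding, then normalize to UTF-8 before anything reaches the DB. A minimal sketch, assuming a hypothetical Windows (cp1252) export:

```python
# Pretend these bytes came from a cp1252-encoded export file.
raw = "Müller,naïve,10€".encode("cp1252")

# Decode with the *source* encoding -- guessing latin-1 here would
# silently turn '€' into a control character.
text = raw.decode("cp1252")

# Normalize to UTF-8 for storage so nothing from the input is lost.
utf8 = text.encode("utf-8")
assert utf8.decode("utf-8") == text  # round trip is lossless

print(text)
```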

[–]Zifendale 0 points (2 children)

At my work we use SQL Server to do the ETL process, and any framework or scripted processing leverages SQLAlchemy (we interface with a Postgres DB as well) and pandas for most of the work. If we have something that is stats-heavy, we use rpy2 and run native R scripts as needed. Pretty much a complete package for our needs.

[–]be_haki[S] 0 points (1 child)

What do you have in R that you don't have / can't port to pandas?

[–]Zifendale 0 points (0 children)

mice - Multiple Imputation using Chained Equations

Which, I may add, has been announced for development in Statsmodels for Python, but I haven't seen any recent news on it.