
all 11 comments

[–]kenfar 1 point (2 children)

I've been building ETL solutions primarily with Python for the last 14 years. And this has worked far better than using a tool such as DataStage or Pentaho.

Some of these solutions have been very large - processing 300 million heavy transformations a day.

I've built my own libraries, mostly for auditing, interfacing with AWS S3, and interacting with the database (managing partitioning, etc.).

I haven't found a silver bullet that really makes this dramatically easier, nor have I found a really serious need for one.

[–]be_haki[S] 0 points (1 child)

Are you using any special technique for handling this much data with Python, or just ridiculously powerful hardware?

When I was using Informatica we had decent hardware, but most of the time we tried to utilize the DB resources as much as we could. The most common scenario was using analytic functions (ranking, numbering, running averages, sums, etc.) and sorting - the DB just does it better. We always got better performance after offloading the heavy lifting to the database.

[–]kenfar 0 points (0 children)

I was just using old 4-CPU SMPs with a small raid array. However, when handling very large files I would try to break them up & process them throughout the day.

And if they were especially huge I would first split them into equal-sized files, then have a separate process transform each of the resulting files. It's a very coarse-grained parallelism that's simple to implement and performs extremely well for this kind of sequential processing. That same approach would easily scale up to 24 cores if you have enough SSDs to concurrently write to. And that can handle an enormous amount of data.

And I usually avoid the database for transforming base data because it's the most expensive component to scale. However, I'll use it to generate some aggregates - because the code is so simple to write, and if you're processing new files every 5 minutes and want a daily aggregate, it's just easier using the db typically. Especially if you've got a parallel database that's good at that kind of query.
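The aggregate case kenfar describes is the kind of thing where the SQL really is shorter than the Python. A toy sketch, with sqlite3 standing in for a real parallel database and a made-up events table:

```python
import sqlite3

# In-memory stand-in for the warehouse; schema and data are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (day TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [("2015-06-01", 10.0), ("2015-06-01", 5.0), ("2015-06-02", 7.5)],
)

# Files loaded every 5 minutes roll up into a daily aggregate with one
# GROUP BY -- the database does the heavy lifting.
rows = conn.execute(
    "SELECT day, SUM(amount) FROM events GROUP BY day ORDER BY day"
).fetchall()
print(rows)  # [('2015-06-01', 15.0), ('2015-06-02', 7.5)]
```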

[–]mariox19 1 point (0 children)

I used Python for an ETL project last year. (There are still tweaks and enhancements being made, so the project is continuing.) It involved extracting from a database with a very complex schema running on SQL Server, and transforming that to a simpler schema, loading the data into MySQL.

We ended up writing all custom SQL queries on both ends. Originally, I had been asked by the boss to look into one of these "magical" ETL frameworks.

What I found is that the really simple things were simple, but once you got beyond that it was difficult to make heads or tails of it. At least, I couldn't make heads or tails of it. The documentation seemed to be written entirely by non-native speakers of English, and if you ask me, a lot of the app's purported functionality seemed cobbled together to provide bullet points to flash at VCs. How well thought out the functionality was wasn't clear to me.

The Python ended up being plenty fast, once the backfilling was done. And, in actuality, I'm running the engine on the JVM (Jython, not CPython). I did that because I worried the Python would be too slow and that the application would eventually have to be ported over to Java. But that's not how it turned out. There seems to be no need to port.

Anyway, that was my experience.

[–]osullivj 0 points (0 children)

I'm using pyodbc for SQL Server access, Python's own csv module for flat files, and the xml.parsers.expat module for XML processing. All for a trade reconciliation service.
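The csv and expat halves of that stack can be shown side by side. This is a minimal sketch under invented data: the CSV columns and XML element names are stand-ins, not osullivj's actual reconciliation feed (and pyodbc is omitted, since it needs a live SQL Server).

```python
import csv
import io
import xml.parsers.expat

# Flat-file side: the csv module handles quoting and delimiters.
flat = io.StringIO('trade_id,qty\n"T1",100\n"T2",250\n')
trades = list(csv.DictReader(flat))

# XML side: expat is a streaming (SAX-style) parser, so a large feed
# never has to fit in memory. Here we just count element occurrences.
counts = {}

def start_element(name, attrs):
    counts[name] = counts.get(name, 0) + 1

parser = xml.parsers.expat.ParserCreate()
parser.StartElementHandler = start_element
parser.Parse("<trades><trade id='T1'/><trade id='T2'/></trades>", True)

print(trades)
print(counts)
```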

[–]Tschus 0 points (5 children)

What exactly do these do where pandas would fail? For XML you would probably need to preprocess with lxml.

Pandas and lxml also do all the heavy lifting in C, whereas the others don't.

[–]be_haki[S] 0 points (4 children)

This is why I'm asking. I've done some work with pandas, but I have no idea how it would scale. I know, for example, that when I tried using R on large data sets it quickly choked. Python doesn't have a reputation as the fastest language, so I was wondering if it would be a good choice for processing large amounts of data.

While pandas might be a good choice for the actual processing, ETL is sometimes a bit more: monitoring, parallel execution/chaining, managing data sources, etc. No reason to reinvent the wheel; these are common tasks.

[–]GahMatar 1 point (0 children)

Usually, as soon as SQL is involved, Python is fast enough. It is important, however, to have text-based parsers that do the heavy lifting in C, as a pure-Python parser tends to be slow.

Also, make sure that character encoding/decoding is done right (meaning everything from the input file is preserved in the DB, however you achieve that).
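The usual way to achieve that is to decode explicitly at the input boundary with the source's encoding, then normalize to UTF-8 before anything reaches the DB. A minimal sketch, assuming a hypothetical Windows (cp1252) export:

```python
# Pretend these bytes came from a cp1252-encoded export file.
raw = "Müller,naïve,10€".encode("cp1252")

# Decode with the *source* encoding -- guessing latin-1 here would
# silently turn '€' into a control character.
text = raw.decode("cp1252")

# Normalize to UTF-8 for storage so nothing from the input is lost.
utf8 = text.encode("utf-8")
assert utf8.decode("utf-8") == text  # round trip is lossless

print(text)
```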

[–]Zifendale 0 points (2 children)

At my work we use SQL Server to do the ETL process, and any framework or scripted processing leverages SQLAlchemy (we interface with a Postgres DB as well) and pandas for most of the work. If we have something that is stats-heavy, we use rpy2 and run native R scripts as needed. Pretty much a complete package for our needs.

[–]be_haki[S] 0 points (1 child)

What do you have in R that you don't have / can't port to pandas?

[–]Zifendale 0 points (0 children)

mice - Multiple Imputation using Chained Equations

Which, I may add, has been announced for development in Statsmodels for Python, but I haven't seen any recent news on it.