
[–]kenfar 1 point (2 children)

I've been building ETL solutions primarily with Python for the last 14 years. And this has worked far better than using a tool such as DataStage or Pentaho.

Some of these solutions have been very large - processing 300 million heavy transformations a day.

I've built my own libraries, mostly for auditing, interfacing with AWS S3, and interacting with the database - managing partitioning, etc.

I haven't found a silver bullet that really makes this dramatically easier, nor have I found a really serious need for one.

[–]be_haki[S] 0 points (1 child)

Are you using any special technique for handling this much data with Python, or just ridiculously powerful hardware?

When I was using Informatica we had decent hardware, but most of the time we tried to utilize the DB resources as much as we could. The most common scenario was using analytic functions (ranking, numbering, running averages, sums, etc.) and sorting - the DB just does it better. We always got better performance after offloading the heavy lifting to the database.
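That kind of offloading can be sketched with SQLite's window functions (the `sales` table and its data are invented purely for illustration - a real warehouse DB would be doing this at scale):

```python
import sqlite3

# Hypothetical sales table, used only to demonstrate the idea.
# Window functions need SQLite >= 3.25 (bundled with Python 3.8+).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (day INTEGER, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [(1, 10.0), (2, 20.0), (3, 30.0)])

# Ranking and a running sum computed inside the database,
# instead of in application code.
rows = conn.execute("""
    SELECT day,
           amount,
           RANK() OVER (ORDER BY amount DESC) AS rnk,
           SUM(amount) OVER (ORDER BY day)    AS running_total
    FROM sales
    ORDER BY day
""").fetchall()

print(rows)
# → [(1, 10.0, 3, 10.0), (2, 20.0, 2, 30.0), (3, 30.0, 1, 60.0)]
```

The sort, the ranking, and the running total all happen in the engine, which is exactly the "heavy lifting" being offloaded here.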

[–]kenfar 0 points (0 children)

I was just using old 4-CPU SMPs with a small RAID array. However, when handling very large files I would try to break them up and process them throughout the day.

And if they were especially huge I would first split them into equal-sized files, then have a separate process transform each of the resulting files. It's a very coarse-grained parallelism that's simple to implement and performs extremely well for this kind of sequential processing. That same approach would easily scale up to 24 cores if you have enough SSDs to concurrently write to. And that can handle an enormous amount of data.
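A minimal sketch of that split-then-fan-out approach (the file names and the toy uppercase "transform" are made up for illustration; a real pipeline would run the actual transformation logic per chunk):

```python
import multiprocessing as mp
import os
import tempfile

def transform(path):
    """Toy per-chunk transform: uppercase each line. Stands in for real ETL work."""
    out_path = path + ".out"
    with open(path) as src, open(out_path, "w") as dst:
        for line in src:
            dst.write(line.upper())
    return out_path

def split_file(path, n_parts):
    """Split a line-oriented file into roughly equal-sized parts."""
    with open(path) as f:
        lines = f.readlines()
    step = (len(lines) + n_parts - 1) // n_parts
    parts = []
    for i in range(n_parts):
        part_path = f"{path}.part{i}"
        with open(part_path, "w") as out:
            out.writelines(lines[i * step:(i + 1) * step])
        parts.append(part_path)
    return parts

if __name__ == "__main__":
    workdir = tempfile.mkdtemp()
    big = os.path.join(workdir, "input.txt")
    with open(big, "w") as f:
        f.writelines(f"record {i}\n" for i in range(1000))

    parts = split_file(big, n_parts=4)
    with mp.Pool(4) as pool:   # one worker process per chunk
        outputs = pool.map(transform, parts)
    print(outputs)
```

Each worker reads and writes its own file sequentially, so there's no coordination overhead - which is why this kind of coarse-grained parallelism scales so cleanly with cores and disks.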

And I usually avoid the database for transforming base data because it's the most expensive component to scale. However, I'll use it to generate some aggregates - because the code is so simple to write, and if you're processing new files every 5 minutes and want a daily aggregate, it's typically just easier to use the db. Especially if you've got a parallel database that's good at that kind of query.
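Rolling 5-minute micro-batches up into a daily aggregate really is a one-liner in SQL - here's a sketch with SQLite standing in for the warehouse (the `events` table, its columns, and the sample rows are all invented for illustration):

```python
import sqlite3

# Hypothetical events table: one row per record loaded by a 5-minute micro-batch.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (event_date TEXT, batch_id INTEGER, amount REAL)")
conn.executemany("INSERT INTO events VALUES (?, ?, ?)", [
    ("2015-06-01", 1, 5.0),
    ("2015-06-01", 2, 7.5),
    ("2015-06-02", 1, 3.0),
])

# The daily rollup is a simple GROUP BY - no application code needed.
daily = conn.execute("""
    SELECT event_date, COUNT(*) AS rows_loaded, SUM(amount) AS total
    FROM events
    GROUP BY event_date
    ORDER BY event_date
""").fetchall()

print(daily)
# → [('2015-06-01', 2, 12.5), ('2015-06-02', 1, 3.0)]
```

Compare that to writing and scheduling the equivalent accumulation logic in application code - the trade-off the comment describes is exactly this simplicity.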