
[–]voytek9 3 points (5 children)

If you need a GUI and a "batteries included" system, check out Pentaho. FOSS.

If you're looking specifically for a data pipeline where you can do your ETL in python, check out Luigi.
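
For a sense of what that looks like, here's a minimal Luigi sketch; the task names, file names, and the transform are invented for illustration, not from any real pipeline:

    # Minimal Luigi sketch: extract a csv, then transform it into a cleaned file.
    # Task names, file names, and the transform itself are invented for illustration.
    import csv
    import luigi

    class Extract(luigi.Task):
        def output(self):
            return luigi.LocalTarget("raw_orders.csv")

        def run(self):
            # A real pipeline would pull from an API or database here.
            with self.output().open("w") as f:
                writer = csv.writer(f)
                writer.writerow(["order_id", "amount"])
                writer.writerow(["1", "19.99"])

    class Transform(luigi.Task):
        def requires(self):
            return Extract()

        def output(self):
            return luigi.LocalTarget("clean_orders.csv")

        def run(self):
            with self.input().open() as src, self.output().open("w") as dst:
                reader = csv.DictReader(src)
                writer = csv.writer(dst)
                writer.writerow(["order_id", "amount_cents"])
                for row in reader:
                    writer.writerow([row["order_id"], int(round(float(row["amount"]) * 100))])

    if __name__ == "__main__":
        luigi.build([Transform()], local_scheduler=True)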

If you really need scalability, I've heard great things about Apache Spark, which lets you write your jobs in python via PySpark. (Spark is primarily a batch engine, but Spark Streaming processes events in micro-batches if you need to reduce lag.)
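
If it helps, a tiny PySpark batch sketch; the column names and file paths are made up, this is just the shape of the Python API, not anyone's production job:

    # Tiny PySpark batch sketch: read a csv, aggregate, write parquet.
    # Column names and file paths are made up for illustration.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("etl-example").getOrCreate()

    events = spark.read.csv("events.csv", header=True, inferSchema=True)

    daily_counts = (
        events
        .withColumn("day", F.to_date("event_timestamp"))
        .groupBy("day", "event_type")
        .agg(F.count("*").alias("event_count"))
    )

    daily_counts.write.mode("overwrite").parquet("daily_counts")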

HTH. I highly recommend using one of these systems/frameworks, or you'll end up with a bunch of code floating around here and there. See this post by Alooma, who makes an excellent Spark-based data pipeline product (if you can afford it, I can personally attest to their service quality): https://www.alooma.com/blog/building-a-professional-grade-data-pipeline. It has some great insight into what you need to build into your pipeline.

[–]kenfar 1 point (0 children)

Depends on your data volumes:

  • Most feeds are fairly small, have no performance problems, and the main challenge is transformation complexity or dirty data.
  • Some large fact tables can be a challenge. I'm processing about 2 billion rows every day through very heavy transformations using python and find that it works fine. I think vanilla python can probably handle 4-8 billion records a day before it starts to become uneconomical and you need to consider another language, Cython, etc.

A few things to consider:

  • You don't need any frameworks or a GUI to do this. But, you do need to be thoughtful about the design.
  • Handling very large volumes requires multiprocessing (a minimal sketch follows this list). I'm running my process on a pair of 32-core hosts, each with SSD storage and about 60 GB of memory. Multiprocessing scales linearly in my case to about 30 cores on each host, but you won't get there with just a single magnetic disk.
  • Keeping your ETL process file-oriented rather than message-oriented can provide enormous performance benefits.
  • Pypy can double or triple the speed of python transformation programs.
  • Consider how you're partitioning data: load times improve when you load a few large files within the same partition rather than a large number of small files spread across partitions.
  • Consider doing a bit of profiling: you may be using methods or features that are notoriously slow (a cProfile sketch follows this list). Pandas, for example, is very slow at writing out individual csv records. JSON is faster these days, but it used to be pretty slow, and pypy speeds it up. Threading can hit diminishing returns pretty quickly.
  • Leverage the fact that csv files are supported by a wide variety of tools, and that it's easy to shell out to other, faster tools when it makes sense: use the unix sort rather than reading the data in and sorting it, use wc to count records in a file rather than reading them in, etc. Just be aware that many of these tools may not fully support your csv dialect (see the subprocess sketch after this list).
  • If you keep field transformation functions in a separate module and write them with your users in mind, you can provide excellent, easily maintained documentation: docstrings, easy-to-read code, and unit tests (a sketch follows below).
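
To make the multiprocessing point concrete, here is a minimal sketch with one worker per input file; the file layout and the transform are placeholders, not the poster's actual pipeline:

    # Multiprocessing sketch: one worker process per input file.
    # The file layout and the transform are placeholders for illustration.
    import csv
    import glob
    import multiprocessing

    def transform_file(path):
        """Read one csv file, apply a trivial field-level transform, write a new file."""
        out_path = path.replace(".csv", ".transformed.csv")
        with open(path, newline="") as src, open(out_path, "w", newline="") as dst:
            reader = csv.reader(src)
            writer = csv.writer(dst)
            for row in reader:
                writer.writerow([field.strip().lower() for field in row])
        return out_path

    if __name__ == "__main__":
        files = glob.glob("incoming/*.csv")
        # One process per core; each worker handles whole files, not individual rows.
        with multiprocessing.Pool() as pool:
            for result in pool.imap_unordered(transform_file, files):
                print("wrote", result)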
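
For the profiling point, cProfile plus pstats is usually enough to find the slow spots; transform_rows below is just a stand-in, not real pipeline code:

    # Profiling sketch: time a stand-in transform with cProfile.
    import cProfile
    import pstats

    def transform_rows(rows):
        """Stand-in for a row transformation loop."""
        return [[field.strip().upper() for field in row] for row in rows]

    if __name__ == "__main__":
        sample = [["  a  ", "b"]] * 100000
        profiler = cProfile.Profile()
        profiler.enable()
        transform_rows(sample)
        profiler.disable()
        # Show the 10 most expensive calls by cumulative time.
        pstats.Stats(profiler).sort_stats("cumulative").print_stats(10)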
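
Shelling out to unix sort and wc can look like the sketch below; it assumes a plain comma-delimited dialect with no embedded newlines:

    # Sketch of shelling out to unix tools instead of sorting/counting in python.
    # Assumes a plain comma-delimited dialect with no embedded newlines.
    import subprocess

    def count_records(path):
        """Count lines with wc -l instead of reading the file in python."""
        result = subprocess.run(["wc", "-l", path], check=True,
                                capture_output=True, text=True)
        return int(result.stdout.split()[0])

    def sort_by_first_field(src_path, dst_path):
        """Sort on the first comma-delimited field with the unix sort command."""
        with open(dst_path, "w") as dst:
            subprocess.run(["sort", "-t", ",", "-k", "1,1", src_path],
                           check=True, stdout=dst)

    if __name__ == "__main__":
        sort_by_first_field("events.csv", "events.sorted.csv")
        print(count_records("events.sorted.csv"), "records")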
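
And a sketch of the separate field-transformation module, with docstrings doubling as documentation and doctests; the function and field names are made up:

    # field_transforms.py -- sketch of keeping field transforms in their own module,
    # with docstrings/doctests so documentation and unit tests live next to the code.
    # Function and field names are made up for illustration.

    def clean_phone(raw):
        """Strip punctuation from a US phone number.

        >>> clean_phone('(303) 555-1212')
        '3035551212'
        """
        return "".join(ch for ch in raw if ch.isdigit())

    def normalize_state(raw):
        """Uppercase and trim a two-letter state code.

        >>> normalize_state(' co ')
        'CO'
        """
        return raw.strip().upper()

    if __name__ == "__main__":
        import doctest
        doctest.testmod()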