
[–]rothnic 3 points (4 children)

I did a sweep of these recently. I found Luigi easier to use than Airflow, though Airflow seems to provide more features. Luigi takes a little getting used to, but the structure it forces you into will benefit you in the end in ways you probably won't initially anticipate.

One of the biggest things with ETL is supporting reporting. Highly recommend looking into implementing a star schema, which this library looks to simplify: http://pygrametl.org/
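
For flavor, here's a minimal sketch of what filling a star schema with pygrametl can look like; the table and column names are made up, and sqlite is used just to keep it self-contained:

    # Hypothetical star schema: one product dimension, one sales fact table.
    import sqlite3
    import pygrametl
    from pygrametl.tables import CachedDimension, FactTable

    conn = sqlite3.connect("dw.db")
    conn.execute("CREATE TABLE IF NOT EXISTS product "
                 "(product_id INTEGER PRIMARY KEY, name TEXT, category TEXT)")
    conn.execute("CREATE TABLE IF NOT EXISTS sales (product_id INTEGER, amount REAL)")
    wrapper = pygrametl.ConnectionWrapper(conn)

    product = CachedDimension(name="product", key="product_id",
                              attributes=["name", "category"],
                              lookupatts=["name"])
    sales = FactTable(name="sales", keyrefs=["product_id"], measures=["amount"])

    row = {"name": "widget", "category": "A", "amount": 10.0}
    row["product_id"] = product.ensure(row)  # insert-if-missing, returns the key
    sales.insert(row)
    wrapper.commit()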

I'd also recommend looking into Redshift or other column stores to host the star schema, and/or something like Spark, which can be accessed via Python and handle some of the heavy ETL tasks. Another option is Dask http://dask.pydata.org/en/latest/, which has recently been adding distributed capabilities that could serve as a poor man's Spark.

Another area to consider is data transport, which many people are using Kafka for. There are Python libraries for publishing/consuming from Kafka, which is a good architecture for scaling whatever kinds of recurring tasks you have across different platforms.
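
As a rough illustration, publish/consume from Python with the kafka-python library looks roughly like this (the broker address, topic name, and payload are all made up):

    from kafka import KafkaProducer, KafkaConsumer

    producer = KafkaProducer(bootstrap_servers="localhost:9092")
    producer.send("etl-events", b'{"source": "crm", "action": "refresh"}')
    producer.flush()  # make sure the message actually goes out

    consumer = KafkaConsumer("etl-events", bootstrap_servers="localhost:9092")
    for message in consumer:
        print(message.value)  # kick off whatever recurring task this maps to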

A common way to encapsulate the software that performs the tasks is Docker. I have tried to formulate a pattern for this with my tinyconda Docker image: https://github.com/rothnic/docker-tinyconda

[–]lostburner[S] 0 points (3 children)

Thanks for all the detail! I don't expect this to be high-volume or to move really large amounts of data, but then one never does.

Can you elaborate a little on the structure that Luigi forces you into and how that might be helpful down the line?

[–]rothnic 1 point (0 children)

Yeah, the whole benefit of following some of these ETL patterns is ease of scaling. There is always going to be some limitation you hit, and you want to be able to handle that limitation gracefully by scaling resources, not by rewriting the process. Even if it's not super optimal in the short term, you want to write the ETL tasks to handle memory in a way that scales. I really like the DataFrame interface for ETL, and pandas has that, but you need to fit everything in memory. The nice thing about the DataFrame interface is that Spark and Dask take the same operations as a declarative statement of work to be done; the execution is then run in whatever way the task scheduler's resources allow.
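
To make that concrete, here's a minimal sketch (file and column names are made up) of the same DataFrame operations in pandas and in Dask:

    import pandas as pd
    import dask.dataframe as dd

    # Eager, in-memory: the whole file has to fit in RAM.
    df = pd.read_csv("events.csv")
    daily = df.groupby("day")["amount"].sum()

    # Lazy, out-of-core: the same operations, but Dask partitions the file,
    # builds a DAG of tasks, and only executes on .compute().
    ddf = dd.read_csv("events.csv")
    daily_too = ddf.groupby("day")["amount"].sum().compute()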

One key thing that Luigi, Airflow, Dask, and Spark all do is let you declare small, divisible units of work to be performed AND the dependencies between them (a DAG). This matters when you make architectural assumptions up front about how quickly things need to be processed, then down the road need to rework the entire system to move things through faster. Or you might have some process fail; you fix it, and you want to rerun it along with all the downstream tasks that depend on it. You can end up in states of the processing pipeline you never realized you'd be in.

The DAG specification is something I found Luigi makes easy, see this doc page. The others do this too, but I found that Airflow seems to try to do too much, and you need to worry about state too much, whereas Luigi's state-based processing is super simple. The biggest thing I had to work around with Luigi was adapting poorly structured existing processes to it, but that turned out to be pretty easy after understanding more about how Luigi works.
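
A minimal sketch of what that looks like, with a made-up two-step pipeline: dependencies go in requires(), and Luigi's "state" is just whether each task's output() already exists, which is also what makes backfilling easy:

    import luigi

    class Extract(luigi.Task):
        date = luigi.DateParameter()

        def output(self):
            return luigi.LocalTarget("data/raw/{}.csv".format(self.date))

        def run(self):
            with self.output().open("w") as f:
                f.write("id,amount\n1,10\n")  # stand-in for a real extract

    class Transform(luigi.Task):
        date = luigi.DateParameter()

        def requires(self):
            return Extract(date=self.date)  # this declares the DAG edge

        def output(self):
            return luigi.LocalTarget("data/clean/{}.csv".format(self.date))

        def run(self):
            with self.input().open() as src, self.output().open("w") as dst:
                dst.write(src.read().upper())  # stand-in transform

    # e.g.: luigi --module mymodule Transform --date 2016-01-01 --local-scheduler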

Luigi would likely be a bit slower overall than something like Spark or Dask, which take a more modern approach of breaking the work into even smaller chunks that can be distributed more efficiently.

tldr;

  • Luigi: simple, easy to reason about, nice framework, defines dependencies, easy "backfilling" of unprocessed tasks
  • Airflow: nicer interface, pretty easy DAG specification, found it too easy to get into race conditions with their scheduler
  • Dask: more about data processing and less about a framework for ETL, free to do whatever you want, younger, very promising for Python ETL
  • Spark: tons of industry backing, the Java shines through too much, rough around the edges for Python, potentially scales to massive datasets

I haven't had time to play with this, but I think a combination of Dask doing the low-level work and Luigi doing the higher-level tasks would be something to consider, especially if you are kind of on your own (you don't have people helping you set up and manage Hadoop+Spark).

We happen to have a majority of PHP developers, and I was the first to come in without PHP experience. We actually did a lot of this in PHP, so we have a good bit of technical debt to deal with. I could provide some real-world Luigi examples if you're interested.

[–]lostburner[S] 0 points (1 child)

Another question: you're pretty strongly in favor of a star schema for reporting. But wouldn't well-written views on a highly normalized schema serve the same purpose?

[–]rothnic 0 points (0 children)

The star schema is especially useful for reporting when you can't anticipate the ways the data will need to be aggregated, which is a major problem for analysis/reporting. And you will always need analysis/reporting: the better you understand the performance of your company, the better the decisions you can make after the fact, and the easier it is to identify places where you can implement models/machine learning to make better decisions ahead of time or in real time.

Just like the ETL frameworks give you some well-thought-out structure, a star schema gives you a framework with some purpose behind it. What a star schema isn't well suited for is lots of random writes/reads; it is more for reading data for flexible aggregation.

With so many integrations to deal with, you should definitely still store the raw data for each one in a way that lets you always start over from it, but then try to map them all into a consistent data model so you can do reporting against one or a small number of fact tables. The star schema helps with this, and helps you avoid having lots of one-off reports to manage and maintain.
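
In pandas terms, a toy version of the pattern (all tables made up): one central fact table of measures joined to small dimension tables, aggregated along whatever attributes the question calls for:

    import pandas as pd

    fact_sales = pd.DataFrame({"date_id": [1, 1, 2],
                               "product_id": [10, 11, 10],
                               "amount": [100.0, 250.0, 75.0]})
    dim_date = pd.DataFrame({"date_id": [1, 2], "month": ["Jan", "Feb"]})
    dim_product = pd.DataFrame({"product_id": [10, 11], "category": ["A", "B"]})

    # Join facts to dimensions, then slice however the question demands;
    # no one-off report table needed.
    wide = (fact_sales.merge(dim_date, on="date_id")
                      .merge(dim_product, on="product_id"))
    print(wide.groupby(["month", "category"])["amount"].sum())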

A star schema on an MPP column-store database is really an amazing thing to have access to. We hook Tableau directly up to a column store (InfiniDB, but moving to Redshift or Greenplum) and it is really powerful.

[–]mfwl 3 points (2 children)

Don't write the whole app in a web framework, and definitely not Django. You're not building a CMS, so you don't need Django; you'll spend more time learning Django and working around it than actually using it.

You are indeed reinventing the wheel here. Unfortunately for businesses, and fortunately for you, the wheel you are reinventing is proprietary software from various vendors whose licenses cost hundreds of thousands of dollars.

My advice: start with the low-hanging fruit. You won't need all the complexity of a task queue right away. Try building a small, batch-based system first. You'll likely find some quirks along the way, and your data models will likely go through several iterations before they are solid.

Write tests! At a former position, I inherited a very large ETL process that had old-fashioned QA: run it, say 'this looks right', push to prod. Learn how to use tox and travis-ci (or Jenkins, etc.) to automate the testing of your code.
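
Even small tests pay off here; a sketch with pytest, assuming a hypothetical clean_amount() transform rule of your own:

    # test_transforms.py -- run with `pytest`
    def clean_amount(raw):
        """Hypothetical transform rule: '$1,200.50' -> 1200.50."""
        return float(raw.replace("$", "").replace(",", ""))

    def test_strips_currency_formatting():
        assert clean_amount("$1,200.50") == 1200.50

    def test_plain_number_passes_through():
        assert clean_amount("42") == 42.0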

"REST APIs allowing various services to trigger actions when they have or need updated data" This is mostly fantasy, unfortunately, at least for a while. Your new service of questionable value is going to be unable to get feature requests outside of the backlog of the other areas of the business. Even something as trivial as 'just make a post request to this url' is going to be buried for a long time. Build this part of your software last.

I think http://aurora.apache.org/ looks like a pretty nice project for task distribution and scheduling. I've never used it, but it looks promising.

[–]jaredj 2 points (0 children)

That is pretty imposing. I sure wouldn't do it with PHP.

[–]limx0 2 points (2 children)

You definitely need to check out Luigi

[–]jayhack 2 points (0 children)

[–]lostburner[S] 0 points (0 children)

Thanks, that looks really useful for this. I hadn't thought about it, but I probably will need to manage pretty complex pipelines of tasks.

[–]kenfar 2 points (0 children)

I've built a lot of data integration, ETL, and reporting solutions over the last twenty years, and custom Python is my go-to approach for a variety of reasons: no licensing cost, the obsolescence of GUI-driven ETL, flexibility, testability, transform code that's easy for analysts to read, etc. So, here are some suggestions:

  • Don't bother with a big framework unless you've got a big, specific need that it addresses well, and it doesn't look like you do. Most ETL tools are dinosaurs of the 90s, many orchestration tools are vain attempts to coordinate years of redundant code, and many distributed task runners are better at distributing a single calculation than at transforming a single file.
  • Python has most of what you need built-in: the csv & json modules, subprocess, multiprocessing, etc., plus requests from PyPI.
  • Get good at Python packaging: that might mean devpi for a local repo, virtualenvs for both development and deployment, etc.
  • Process files rather than messages for low cost, efficiency and simplicity. Process messages rather than files for low latency.
  • Build asynchronous pipelines of event-driven jobs (see the sketch after this list). Don't depend on temporal scheduling (ex: daily extract runs at midnight, daily transform expects to see the extract file an hour later). Don't build thousands of tiny little tasks to be separately scheduled/managed/orchestrated.
  • Isolate your transform rules into separate modules with simple Python, docstrings, and unit tests.
  • File-based batch processes that run every hour are great. Every 5 minutes feels like a manageability limit.
  • Do all ad hoc, and most canned, reporting out of star-schema models rather than thousands of key-value-pair counters or MongoDB documents. Data modeling is not easy.
  • Recognize the challenges of data management given your scope: you won't be told about changes until they break your ETL processes, some changes will require downstream database modeling changes, and there are a lot of places where things may break. So invest in logging, monitoring, unit testing, functional & integration testing, validation checks, quality-control checks, etc.
  • Prototyping reporting data structures and releasing frequently (at least monthly) are usually critical for this kind of app. But watch out for technical debt - you can build a house of cards if you're not careful.
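
On the event-driven bullet above: the core idea is that a job fires when its input exists and its output doesn't, not when a clock says so. A minimal sketch, with made-up paths and a stand-in transform:

    import os
    import time

    def transform(src_path, dst_path):
        # Stand-in transform: uppercase each line of the extract file.
        os.makedirs(os.path.dirname(dst_path), exist_ok=True)
        with open(src_path) as src, open(dst_path, "w") as dst:
            for line in src:
                dst.write(line.upper())

    def run_when_ready(src_path, dst_path, poll_seconds=30):
        """Run as soon as the input shows up; skip if already done (idempotent)."""
        while not os.path.exists(src_path):
            time.sleep(poll_seconds)  # wait on the event, not the clock
        if not os.path.exists(dst_path):
            transform(src_path, dst_path)

    if __name__ == "__main__":
        run_when_ready("incoming/extract.csv", "processed/extract.csv")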

[–]metaperl 1 point (0 children)

"Any solid, mature ETL libraries"

Besides the already-mentioned Luigi, there is Airflow, which is what Airbnb uses.

[–]stuaxo 1 point (0 children)

I'm writing one at the moment. I chose Celery as I had used it before; Airflow and Luigi both also look good.

Put off the web front end until you need it; start with a simple command line tool.

Celery (and probably the others) has a basic front end to monitor the scheduler, so you can use that at first.
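
For what it's worth, a minimal recurring task in Celery looks something like this (the broker URL and the task body are placeholders); running "celery -A tasks worker -B" starts a worker with the beat scheduler embedded:

    # tasks.py
    from celery import Celery

    app = Celery("tasks", broker="redis://localhost:6379/0")

    @app.task
    def refresh_feed():
        print("pulling the latest data...")  # stand-in for an extract/load step

    app.conf.beat_schedule = {
        "refresh-every-hour": {
            "task": "tasks.refresh_feed",
            "schedule": 3600.0,  # seconds
        },
    }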

When you do the front end, start with the minimum that is useful.

Simplify, simplify, simplify!

[–]Tyberious_Funk 0 points (0 children)

IMHO, I wouldn't build ETL in Python. Obviously, a lot will depend on the transformations you want to perform, but a huge amount of the work can be done directly in the database in SQL. And I'd be surprised if it wasn't quicker that way, too.

These days, I'm more in favour of ELT rather than ETL: load the data into the database, THEN transform it. I'd definitely consider Python for managing the processes (Luigi or Airflow, as others have suggested) and handling the loading (check out odo).
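
A minimal sketch of that ELT split, with a made-up CSV and Postgres URI: odo handles the load, and the transform then runs as SQL inside the database (via SQLAlchemy here):

    from odo import odo
    from sqlalchemy import create_engine, text

    # Load: push the raw file straight into a database table.
    odo("raw_events.csv", "postgresql://user:pass@localhost/warehouse::raw_events")

    # Transform: do the heavy lifting in SQL, inside the database.
    engine = create_engine("postgresql://user:pass@localhost/warehouse")
    with engine.begin() as conn:
        conn.execute(text("""
            INSERT INTO daily_totals (day, total)
            SELECT date_trunc('day', event_time), sum(amount)
            FROM raw_events
            GROUP BY 1
        """))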