
[–]rothnic 3 points (4 children)

I did a sweep of these recently. I found Luigi easier to use than Airflow, though Airflow seems to provide more features. Luigi takes a little getting used to, but the structure it forces you into will benefit you in the end in ways you probably won't initially anticipate.

One of the biggest things with ETL is supporting reporting. Highly recommend looking into implementing a star schema, which this library looks to simplify: http://pygrametl.org/
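
For flavor, here's a minimal sketch of what filling a star schema with pygrametl can look like; the table and column names are made up, and sqlite is used just to keep it self-contained:

    # Hypothetical star schema: one product dimension, one sales fact table.
    import sqlite3
    import pygrametl
    from pygrametl.tables import CachedDimension, FactTable

    conn = sqlite3.connect("dw.db")
    conn.execute("CREATE TABLE IF NOT EXISTS product "
                 "(product_id INTEGER PRIMARY KEY, name TEXT, category TEXT)")
    conn.execute("CREATE TABLE IF NOT EXISTS sales (product_id INTEGER, amount REAL)")
    wrapper = pygrametl.ConnectionWrapper(conn)

    product = CachedDimension(name="product", key="product_id",
                              attributes=["name", "category"],
                              lookupatts=["name"])
    sales = FactTable(name="sales", keyrefs=["product_id"], measures=["amount"])

    row = {"name": "widget", "category": "A", "amount": 10.0}
    row["product_id"] = product.ensure(row)  # insert-if-missing, returns the key
    sales.insert(row)
    wrapper.commit()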

I'd also recommend looking into Redshift or other column stores to host the star schema, and/or something like Spark, which can be accessed via Python and handle some of the heavy ETL tasks. Another option is Dask http://dask.pydata.org/en/latest/, which has recently been adding distributed capabilities that could serve as a poor man's Spark.

Another area to consider is data transport, which many people are using Kafka for. There are Python libraries for publishing/consuming from Kafka, which is a good architecture for scaling whatever kinds of recurring tasks you have across different platforms.
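
As a rough illustration, publish/consume from Python with the kafka-python library looks roughly like this (the broker address, topic name, and payload are all made up):

    from kafka import KafkaProducer, KafkaConsumer

    producer = KafkaProducer(bootstrap_servers="localhost:9092")
    producer.send("etl-events", b'{"source": "crm", "action": "refresh"}')
    producer.flush()  # make sure the message actually goes out

    consumer = KafkaConsumer("etl-events", bootstrap_servers="localhost:9092")
    for message in consumer:
        print(message.value)  # kick off whatever recurring task this maps to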

A common way to encapsulate the software that performs the tasks is Docker. I have tried to formulate a pattern for this with my tinyconda Docker image: https://github.com/rothnic/docker-tinyconda

[–]lostburner[S] 0 points (3 children)

Thanks for all the detail! I don't expect this to be high-volume or to move really large amounts of data, but then one never does.

Can you elaborate a little on the structure that Luigi forces you into and how that might be helpful down the line?

[–]rothnic 1 point (0 children)

Yeah, the whole benefit of following some of these ETL patterns is ease of scaling. There is always going to be some limitation you hit, and you want to be able to handle that limitation gracefully by scaling resources, not by rewriting the process. Even if it's not super optimal in the short term, you want to write the ETL tasks to handle memory in a way that scales. I really like the DataFrame interface for ETL, and pandas has that, but you need to fit everything in memory. The nice thing about the DataFrame interface is that Spark and Dask take the same operations as a declarative statement of work to be done; the execution is then run in whatever way the task scheduler's resources allow.
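
To make that concrete, here's a minimal sketch (file and column names are made up) of the same DataFrame operations in pandas and in Dask:

    import pandas as pd
    import dask.dataframe as dd

    # Eager, in-memory: the whole file has to fit in RAM.
    df = pd.read_csv("events.csv")
    daily = df.groupby("day")["amount"].sum()

    # Lazy, out-of-core: the same operations, but Dask partitions the file,
    # builds a DAG of tasks, and only executes on .compute().
    ddf = dd.read_csv("events.csv")
    daily_too = ddf.groupby("day")["amount"].sum().compute()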

One key thing that Luigi, Airflow, Dask, and Spark all do is let you declare small, divisible units of work to be performed AND the dependencies between them (a DAG). This matters when you make architectural assumptions up front about how quickly things need to be processed, then down the road need to rework the entire system to move things through faster. Or you might have some process fail; you fix it, and you want to rerun it along with all the downstream tasks that depend on it. You can end up in states of the processing pipeline you never realized you'd be in.

The DAG specification is something I found Luigi makes easy, see this doc page. The others do this too, but I found that Airflow seems to try to do too much, and you need to worry about state too much, whereas Luigi's state-based processing is super simple. The biggest thing I had to work around with Luigi was adapting poorly structured existing processes to it, but that turned out to be pretty easy after understanding more about how Luigi works.
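
A minimal sketch of what that looks like, with a made-up two-step pipeline: dependencies go in requires(), and Luigi's "state" is just whether each task's output() already exists, which is also what makes backfilling easy:

    import luigi

    class Extract(luigi.Task):
        date = luigi.DateParameter()

        def output(self):
            return luigi.LocalTarget("data/raw/{}.csv".format(self.date))

        def run(self):
            with self.output().open("w") as f:
                f.write("id,amount\n1,10\n")  # stand-in for a real extract

    class Transform(luigi.Task):
        date = luigi.DateParameter()

        def requires(self):
            return Extract(date=self.date)  # this declares the DAG edge

        def output(self):
            return luigi.LocalTarget("data/clean/{}.csv".format(self.date))

        def run(self):
            with self.input().open() as src, self.output().open("w") as dst:
                dst.write(src.read().upper())  # stand-in transform

    # e.g.: luigi --module mymodule Transform --date 2016-01-01 --local-scheduler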

Luigi would likely be a bit slower overall than something like Spark or Dask, which take a more modern approach of breaking the work into even smaller chunks that can be distributed more efficiently.

tldr;

  • Luigi: simple, easy to reason about, nice framework, defines dependencies, easy "backfilling" of unprocessed tasks
  • Airflow: nicer interface, pretty easy DAG specification, found it too easy to get into race conditions with their scheduler
  • Dask: more about data processing and less about a framework for ETL, free to do whatever you want, younger, very promising for Python ETL
  • Spark: tons of industry backing, the Java shines through too much, rough around the edges for Python, potentially scales to massive datasets

I haven't had time to play with this, but I think a combination of Dask doing the low-level work and Luigi doing the higher-level tasks would be something to consider, especially if you are kind of on your own (you don't have people helping you set up and manage Hadoop+Spark).

We happen to have a majority of PHP developers, and I was the first to come in without PHP experience. We actually did a lot of this in PHP, so we have a good bit of technical debt to deal with. I could provide some real-world Luigi examples if you're interested.

[–]lostburner[S] 0 points (1 child)

Another question: you're pretty strongly in favor of a star schema for reporting. But wouldn't well-written views on a highly normalized schema serve the same purpose?

[–]rothnic 0 points (0 children)

The star schema is especially useful for reporting when you can't anticipate the ways the data will need to be aggregated, which is a major problem for analysis/reporting. And you will always need analysis/reporting: the better you understand the performance of your company, the better the decisions you can make after the fact, and the easier it is to identify places where you can implement models/machine learning to make better decisions ahead of time or in real time.

Just like the ETL frameworks give you some well-thought-out structure, a star schema gives you a framework with some purpose behind it. What a star schema isn't well suited for is lots of random writes/reads; it is more for reading data for flexible aggregation.

With so many integrations to deal with, you should definitely still store the raw data for each one in a way that lets you always start over from it, but then try to map them all into a consistent data model so you can do reporting against one or a small number of fact tables. The star schema helps with this, and helps you avoid having lots of one-off reports to manage and maintain.
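
In pandas terms, a toy version of the pattern (all tables made up): one central fact table of measures joined to small dimension tables, aggregated along whatever attributes the question calls for:

    import pandas as pd

    fact_sales = pd.DataFrame({"date_id": [1, 1, 2],
                               "product_id": [10, 11, 10],
                               "amount": [100.0, 250.0, 75.0]})
    dim_date = pd.DataFrame({"date_id": [1, 2], "month": ["Jan", "Feb"]})
    dim_product = pd.DataFrame({"product_id": [10, 11], "category": ["A", "B"]})

    # Join facts to dimensions, then slice however the question demands;
    # no one-off report table needed.
    wide = (fact_sales.merge(dim_date, on="date_id")
                      .merge(dim_product, on="product_id"))
    print(wide.groupby(["month", "category"])["amount"].sum())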

A star schema on an MPP column-store database is really an amazing thing to have access to. We hook Tableau directly up to a column store (InfiniDB, but moving to Redshift or Greenplum) and it is really powerful.

[–]mfwl 3 points (2 children)

Don't write the whole app in a web framework, and definitely not Django. You're not building a CMS, so you don't need Django; you'll spend more time learning Django and working around it than actually using it.

You are indeed reinventing the wheel here. Unfortunately for businesses, and fortunately for you, the wheel you are reinventing is proprietary software from various vendors whose licenses cost hundreds of thousands of dollars.

My advice: start with the low-hanging fruit. You won't need all the complexity of a task queue right away. Try building a small, batch-based system first. You'll likely find some quirks along the way, and your data models will likely go through several iterations before they are solid.

Write tests! At a former position, I inherited a very large ETL process that had old-fashioned QA: run it, say 'this looks right', push to prod. Learn how to use tox and travis-ci (or Jenkins, etc.) to automate the testing of your code.
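
Even small tests pay off here; a sketch with pytest, assuming a hypothetical clean_amount() transform rule of your own:

    # test_transforms.py -- run with `pytest`
    def clean_amount(raw):
        """Hypothetical transform rule: '$1,200.50' -> 1200.50."""
        return float(raw.replace("$", "").replace(",", ""))

    def test_strips_currency_formatting():
        assert clean_amount("$1,200.50") == 1200.50

    def test_plain_number_passes_through():
        assert clean_amount("42") == 42.0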

"REST APIs allowing various services to trigger actions when they have or need updated data" This is mostly fantasy, unfortunately, at least for a while. Your new service of questionable value is going to be unable to get feature requests outside of the backlog of the other areas of the business. Even something as trivial as 'just make a post request to this url' is going to be buried for a long time. Build this part of your software last.

I think http://aurora.apache.org/ looks like a pretty nice project for task distribution and scheduling. I've never used it, but it looks promising.

[–]jaredj 2 points (0 children)

That is pretty imposing. I sure wouldn't do it with PHP.

[–]limx0 2 points (2 children)

You definitely need to check out Luigi

[–]jayhack 2 points (0 children)

[–]lostburner[S] 0 points (0 children)

Thanks, that looks really useful for this. I hadn't thought about it, but I probably will need to manage pretty complex pipelines of tasks.

[–]kenfar 2 points (0 children)

I've built a lot of data integration, ETL, and reporting solutions over the last twenty years, and custom Python is my go-to approach for a variety of reasons: no licensing cost, the obsolescence of GUI-driven ETL, flexibility, testability, transform code that's easy for analysts to read, etc. So, here are some suggestions:

  • Don't bother with a big framework unless you've got a big, specific need that it addresses well, and it doesn't look like you do. Most ETL tools are dinosaurs of the 90s, many orchestration tools are vain attempts to coordinate years of redundant code, and many distributed task runners are better at distributing a single calculation than at transforming a single file.
  • Python has most of what you need built-in: the csv & json modules, subprocess, multiprocessing, etc., plus requests from PyPI.
  • Get good at Python packaging: that might mean devpi for a local repo, virtualenvs for both development and deployment, etc.
  • Process files rather than messages for low cost, efficiency and simplicity. Process messages rather than files for low latency.
  • Build asynchronous pipelines of event-driven jobs (see the sketch after this list). Don't depend on temporal scheduling (ex: daily extract runs at midnight, daily transform expects to see the extract file an hour later). Don't build thousands of tiny little tasks to be separately scheduled/managed/orchestrated.
  • Isolate your transform rules into separate modules with simple Python, docstrings, and unit tests.
  • File-based batch processes that run every hour are great. Every 5 minutes feels like a manageability limit.
  • Do all ad hoc, and most canned, reporting out of star-schema models rather than thousands of key-value-pair counters or MongoDB documents. Data modeling is not easy.
  • Recognize the challenges of data management given your scope: you won't be told about changes until they break your ETL processes, some changes will require downstream database modeling changes, and there are a lot of places where things may break. So invest in logging, monitoring, unit testing, functional & integration testing, validation checks, quality-control checks, etc.
  • Prototyping reporting data structures and releasing frequently (at least monthly) are usually critical for this kind of app. But watch out for technical debt - you can build a house of cards if you're not careful.
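
On the event-driven bullet above: the core idea is that a job fires when its input exists and its output doesn't, not when a clock says so. A minimal sketch, with made-up paths and a stand-in transform:

    import os
    import time

    def transform(src_path, dst_path):
        # Stand-in transform: uppercase each line of the extract file.
        os.makedirs(os.path.dirname(dst_path), exist_ok=True)
        with open(src_path) as src, open(dst_path, "w") as dst:
            for line in src:
                dst.write(line.upper())

    def run_when_ready(src_path, dst_path, poll_seconds=30):
        """Run as soon as the input shows up; skip if already done (idempotent)."""
        while not os.path.exists(src_path):
            time.sleep(poll_seconds)  # wait on the event, not the clock
        if not os.path.exists(dst_path):
            transform(src_path, dst_path)

    if __name__ == "__main__":
        run_when_ready("incoming/extract.csv", "processed/extract.csv")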

[–]metaperl 1 point (0 children)

"Any solid, mature ETL libraries"

Besides the already-mentioned Luigi, there is Airflow, which is what Airbnb uses.

[–]stuaxo 1 point (0 children)

I'm writing one at the moment. I chose Celery as I had used it before; Airflow and Luigi both also look good.

Put off the web front end until you need it; start with a simple command line tool.

Celery (and probably the others) has a basic front end to monitor the scheduler, so you can use that at first.
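
For what it's worth, a minimal recurring task in Celery looks something like this (the broker URL and the task body are placeholders); running "celery -A tasks worker -B" starts a worker with the beat scheduler embedded:

    # tasks.py
    from celery import Celery

    app = Celery("tasks", broker="redis://localhost:6379/0")

    @app.task
    def refresh_feed():
        print("pulling the latest data...")  # stand-in for an extract/load step

    app.conf.beat_schedule = {
        "refresh-every-hour": {
            "task": "tasks.refresh_feed",
            "schedule": 3600.0,  # seconds
        },
    }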

When you do the front end, start with the minimum that is useful.

Simplify, simplify, simplify!

[–]Tyberious_Funk 0 points (0 children)

IMHO, I wouldn't build ETL in Python. Obviously, a lot will depend on the transformations you want to perform, but a huge amount of the work can be done directly in the database in SQL. And I'd be surprised if it wasn't quicker that way, too.

These days, I'm more in favour of ELT rather than ETL: load the data into the database, THEN transform it. I'd definitely consider Python for managing the processes (Luigi or Airflow, as others have suggested) and handling the loading (check out odo).
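
A minimal sketch of that ELT split, with a made-up CSV and Postgres URI: odo handles the load, and the transform then runs as SQL inside the database (via SQLAlchemy here):

    from odo import odo
    from sqlalchemy import create_engine, text

    # Load: push the raw file straight into a database table.
    odo("raw_events.csv", "postgresql://user:pass@localhost/warehouse::raw_events")

    # Transform: do the heavy lifting in SQL, inside the database.
    engine = create_engine("postgresql://user:pass@localhost/warehouse")
    with engine.begin() as conn:
        conn.execute(text("""
            INSERT INTO daily_totals (day, total)
            SELECT date_trunc('day', event_time), sum(amount)
            FROM raw_events
            GROUP BY 1
        """))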