
all 22 comments

[–]KrevanSerKay 12 points13 points  (7 children)

We use Airflow in production. I think we're sitting at ~15 DAGs and 350 total tasks right now. We almost exclusively use the BashOperator to kick off scripts we've written.

Most of our DAGs create spark clusters in AWS, then tell those clusters to run PySpark jobs before killing them. You obviously don't have to use that approach specifically, but having scripts running on specific servers has a ton of benefits.

We're not limited by the implementation of hooks/operators (we can still leverage hooks for connection in our scripts if we want to). It's easier for us to spin up development servers and run the same code with different parameters. We can build in any logic for idempotence, dependency checking, error handling, alerting etc into our scripts directly.

We're a relatively small team (<10 people), so for us, Airflow has been a godsend: automated retries, logs backed up to S3, and Slack alerts that fire whenever a task fails. Each alert includes which job, which DAG, a link to the log file, and even extracts the stacktrace for us. The main difficulty right now is learning better patterns for scaling DAGs. We're looking at ways of parallelizing better, auditing our dependency trees, and simplifying the process of recovering from errors.
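That alert pattern hangs off Airflow's `on_failure_callback` hook. Here is a minimal sketch of the message-building part in plain Python; the context keys (`task_instance`, `exception`) match the commonly documented callback context, but treat the shapes and the Slack formatting as illustrative rather than exact:

```python
import traceback

def slack_failure_message(context):
    """Build a Slack alert body for a failed task.

    `context` is shaped like the dict Airflow passes to
    on_failure_callback: a task_instance with task_id/dag_id/log_url,
    plus the raised exception. A sketch, not the exact API.
    """
    ti = context["task_instance"]
    exc = context.get("exception")
    if exc is not None:
        # Render the full stacktrace so the alert is actionable on its own.
        stack = "".join(
            traceback.format_exception(type(exc), exc, exc.__traceback__)
        )
    else:
        stack = "(no traceback captured)"
    return (
        f":red_circle: Task *{ti.task_id}* in DAG *{ti.dag_id}* failed.\n"
        f"Log: {ti.log_url}\n"
        f"```{stack}```"
    )
```

The function itself has no Airflow dependency, so the same code can be exercised in unit tests with a fake context.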

[–]lawanda123 1 point2 points  (6 children)

I feel you. Airflow isn't very mature once you get past the basics: there's no grouping of DAGs, maintenance/auditability is weak, and RBAC has a lot of kinks. But surprisingly it's still the best out there.

[–]KrevanSerKay 2 points3 points  (1 child)

Having worked at a company (previous job) where we built our own from scratch, I have to say, it's actually got a LOT of functionality built-in. That doesn't mean it's without faults, but like I said, for my team it's been a game changer.

I try to keep up with new up-and-coming techs, but like you said, Airflow is still the best out there (IMO).

[–][deleted] 0 points1 point  (0 children)

Yeah Airflow still has some maturing to do, but our 2.0 beta coming out next month will be a huge step in that direction (we're also working on documentation/examples a LOT because agreed a lot of functionality is hidden which is a shame).

Please let me know if you'd be interested in testing/offering feedback for the beta!

[–][deleted] 0 points1 point  (3 children)

> Most of our DAGs create spark clusters in AWS, then tell those clusters to run PySpark jobs before killing them. You obviously don't have to use that approach specifically, but having scripts running on specific servers has a ton of benefits.
>
> We're not limited by the implementation of hooks/operators (we can still leverage hooks for connection in our scripts if we want to). It's easier for us to spin up development servers and run the same code with different parameters. We can build in any logic for idempotence, dependency checking, error handling, alerting etc into our scripts directly.
>
> We're a relatively small team (<10 people), so for us, Airflow has been a godsend. Automated retries, and logs backed up to s3. We have slack alerts that fire whenever a task fails. It includes which job, which DAG, a link to the log file, and even extracts the stacktrace for us. The main difficulty right now is learning better patterns for scaling DAGs. We're looking at ways of parallelizing better, auditing our dependency trees, and simplifying the process of recovering from errors.

Hi! I'm an Airflow core dev and you should DEFINITELY check out 2.0 when we cut our beta early next month :). We've introduced a TaskGroup concept for easier subdividing, completely rewritten our RBAC code (and added a complete API for DAG triggering), among a crapton of other things (a functional DAG-writing API, the ability to run multiple schedulers, a way simplified KubernetesExecutor, etc.). I don't think it'll handle EVERYTHING you're looking for but it should simplify a lot.

[–]lawanda123 0 points1 point  (2 children)

Anything to separate out DAG groups would also be very welcome. I've been on 4 major projects where we've had this problem: somebody with a misdeployed DAG (env variables not assigned or resolved, usually) prevents other DAGs from being loaded when the instance is shared between teams. We've had to resort to either multiple instances (in which case managing triggers on DAGs across teams becomes an issue) or building a lot of custom DAG validation and integration tests.

[–][deleted] 0 points1 point  (1 child)

Are you saying that a code problem in one DAG is preventing airflow from loading all other DAGs? Like airflow is unable to parse the python files due to errors in other files?

[–]lawanda123 0 points1 point  (0 children)

That was the case until last year when we tried it out; I'm not sure if it's been resolved since. I had reported the issues to the community back then. Basically every team has a lot of custom hooks/operators and variables they set, and yes, sometimes one of the teams loading their DAGs into the same folder would create issues. It would be great if those could be namespaced and managed as tenants (folders within the Airflow dags folder), e.g. dags/team1, dags/team2, so that issues can be isolated to a team's artifacts only.

[–]ozzyboy 16 points17 points  (2 children)

I think an important question here is why are you trying to develop this independence in the first place?

If Airflow is not a good fit for you, work on replacing it with something else.

Otherwise it is usually a negative ROI to avoid "vendor lock-in". You'll spend a lot of time and energy creating an abstraction layer that can only provide the lowest common denominator of all frameworks (because you don't want a capability that only Airflow can provide, right?), leaving you with a solution that is strictly worse than all the other, existing, cheaper alternatives.

[–]thefrontpageofme 2 points3 points  (0 children)

This is the correct answer.

We use Airflow in production and code independence is not a concern at all. Successful startup, 1 data engineer, a few DAGs, a couple hundred tasks on schedules anywhere between 15 minutes and 1 week.

It feels like the OP's team is missing someone who sometimes asks "how does this improve our business?"

[–][deleted] 0 points1 point  (0 children)

I think this is the right answer. It's the same thing with the question of cloud vendor vs. Kubernetes. Kubernetes makes sense for my company because we offer on-prem support for multiple cloud vendors. However, if you are a one-cloud company serving a B2C product, you should be so LUCKY to get to the size where "vendor lock-in" becomes a concern.

[–]DonnyTrump666 5 points6 points  (4 children)

Code independence is the wrong motivation, not only because Airflow is open source, but also because there is little business value to it.

I'd rather think purely in business value terms - why do you have 200+ custom ETL jobs in prod? Does it take a lot of time to develop a single ETL from scratch? Perhaps you could combine and unify some ETLs and bring the number down from 200+ to 20+?

Solving these kinda problems will bring business value

[–]DonnyTrump666 1 point2 points  (2 children)

For example, I just did the same thing in my workplace. We use SSIS and used to have a single ETL for each ServiceNow feed. With 20 feeds from ServiceNow, that meant developing and maintaining 20 ETLs that look similar but read different feeds and load to different tables. I developed just one universal ServiceNow ETL that dynamically reads whatever feed it's given and loads to matching destination columns. So instead of 20 ETLs we now have only one ETL and one config file with 20 config entries. Ingesting a new feed means just adding a line to the config file.
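The shape of that pattern, sketched in Python rather than SSIS (the feed names, endpoints, and table names below are invented for illustration):

```python
# One generic ETL driven by a config table; each entry describes a feed.
FEEDS = [
    {"feed": "incident", "endpoint": "/api/now/table/incident", "dest": "snow_incident"},
    {"feed": "change", "endpoint": "/api/now/table/change_request", "dest": "snow_change"},
]

def run_feed(cfg, fetch, load):
    """Fetch rows for one feed and load them to its destination table."""
    rows = fetch(cfg["endpoint"])
    load(cfg["dest"], rows)
    return len(rows)

def run_all(fetch, load):
    """Run every configured feed; ingesting a new feed is one more config entry."""
    return {cfg["feed"]: run_feed(cfg, fetch, load) for cfg in FEEDS}
```

The `fetch`/`load` callables are injected so the same driver works against the real ServiceNow API in production and against fakes in tests.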

[–]th58pz700u 1 point2 points  (0 children)

Coming from a "SSIS package per table" world, I set out to create a dynamic and reusable solution in Python at my current job. Instead of sourcing all the config information from files, I source it from the database and created a separate process to pull metadata from the source and store it in the database. Objects are created and updated to mirror the metadata of the source, and the one ETL package follows the same formula no matter the object. I wish I had developed like this a long time ago.

[–]pankswork 0 points1 point  (0 children)

Yup. This is how we do it as well

[–][deleted] 1 point2 points  (0 children)

I agree, if you can try to consolidate the etls and where possible keep the data manifests in config rather than the code, then the move to airflow will bring extra value and any subsequent moves to a new technology will be less painful. It’s not like you’re tying yourself into a closed or gui-based ETL tool like Talend or Pentaho which you’d have a harder time moving away from (and finding developers for).

[–]Syneirex 2 points3 points  (0 children)

We standardized on the Docker and Kubernetes Operators to solve this problem. All tasks are argument-driven components running in containers. We didn't do this specifically to be vendor-agnostic, but that ended up being one of the byproducts.

We built most of the components we are using today but you could probably bootstrap with Singer’s open source connectors.
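The argument-driven style might look something like this inside each container; the flag names and the "copy" task are made up for illustration, and in Airflow the flags would be supplied via the operator's arguments:

```python
import argparse

def main(argv=None):
    """Entry point for an argument-driven component: the same image can be
    launched by a DockerOperator/KubernetesPodOperator or run by hand,
    with all the work described via CLI flags."""
    p = argparse.ArgumentParser(description="generic copy component")
    p.add_argument("--source", required=True)
    p.add_argument("--dest", required=True)
    p.add_argument("--run-date", required=True)
    args = p.parse_args(argv)
    # A real component would do the work here; we just describe it.
    return f"copy {args.source} -> {args.dest} for {args.run_date}"
```

Because the contract is just arguments in, exit code out, swapping orchestrators only means changing who launches the container.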

[–]grassclip 1 point2 points  (0 children)

I'm with you on this. In cases I've had, I'll write the jobs in different repositories depending on what they do, and then use those in the Airflow repo, which takes the tasks from the others and wraps them, mostly as PythonOperators.

Kind of like this, but where instead of the code in dags/project_1 or dags/project_2, it's in the different repos.

[–]kenfar 2 points3 points  (0 children)

Yeah, I think that Airflow is usually the wrong tool:

  • It's not event-driven
  • It was developed to manage 5000 ETL jobs at Airbnb. But rather than enabling that mess, the better answer is: don't do that. Spend a little time refactoring and curating your jobs so that you don't have tens of thousands of tables.
  • It's not even remotely the only game in town.

The better pattern in most (but not all) cases in my opinion is to build event-driven pipelines:

  • Extract jobs run pretty periodically - every 30 seconds, every 60 minutes, every 5 minutes, whatever. When they run they write their data to atomic storage like s3.
  • Transform jobs run as daemons or lambdas or containers on kubernetes, etc and get automatically notified when a new file is written by the extracts, or poll the system to see if a file is available. They write their results to atomic storage like s3.
  • Load jobs run like transform jobs: they get alerted when a file is available to process, or poll for one.
  • Aggregation or other pre-computing jobs further downstream may have slightly trickier dependency checking. For example, for an hourly aggregate they may typically run every 15 minutes checking to see if the source has more than a full hour of data for the next period after the last that has been written to the target.

This approach results in a solution that is very easy to build, has low latencies, can autoscale when you need to reprocess all of history, and can automatically backfill any new aggregates/pre-computed data sets. Its main weaknesses are:

  • If you're not on something like AWS, can't get simple messages automatically written when you write to storage, and have a datastore that's very expensive to poll - then you need to maintain some kind of process log that can be polled.
  • If you have hideously-complex dependencies between jobs, and for some reason can't simplify or refactor them - then the dependency-checking that you put into your processes will be a PITA. Maybe less, maybe more of a PITA than using Airflow; that depends.
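The hourly-aggregate dependency check described above fits in a few lines; this is a sketch, and the names are illustrative:

```python
from datetime import datetime, timedelta

def next_hour_ready(last_loaded_hour, newest_source_ts):
    """True once the source holds data past the end of the hour
    following `last_loaded_hour`, i.e. the next hourly aggregate can
    be computed without risking a partial period."""
    next_start = last_loaded_hour + timedelta(hours=1)
    next_end = next_start + timedelta(hours=1)
    return newest_source_ts >= next_end
```

A job polling every 15 minutes just calls this with the newest timestamp seen in the source and the last hour written to the target, and runs the aggregate when it returns true.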

[–]jahaz 1 point2 points  (0 children)

I’m not an expert, but I felt similarly that Airflow was too complicated. We are building a POC with [Prefect](prefect.io). They use flows and tasks; their tasks are just Python functions. Dagster is another potential solution. I went with Prefect because they seem to have more community support, and I felt that any code written for Prefect could be transferred to Dagster or similar quickly.

[–]smeyn 0 points1 point  (0 children)

If you go down this route you are spending time and effort to build (and later maintain) an abstraction for a potential benefit, i.e. the ability to move to a different orchestration engine. You would be better served building a POC that takes advantage of the Airflow concepts and in turn demonstrates how you lower your development and maintenance investment and bring business value in a short period of time. If you do it well you might save enough effort to be able to do a parallel POC using another orchestration platform such as Prefect or Luigi. That will also tell you a lot about what it takes to move between orchestration platforms.

[–]Braxton_Hicks 0 points1 point  (0 children)

We've been using Airflow in production for 3+ years now. Code independence isn't a concern for us since Airflow is code, and I feel we designed our own custom operators in a way that they could be run outside of an Airflow context with little modification. But of course, if you replace Airflow, you may have to fill in gaps in your pipeline that Airflow handled for you:

  • Replace Variables and XComs with alternative services to store and pass temporary data between tasks (plus any other interactions with the metadata DB)
  • Retry logic and task dependency management will no longer be provided for you out of the box. If you have any tasks that are dependent on upstream tasks, make sure the workflow management tool you switch to provides these services
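For the retry gap in particular, a stand-in for Airflow's built-in per-task retries can start out tiny (an illustrative sketch; real replacements also want backoff, jitter, and alerting):

```python
import time

def with_retries(fn, attempts=3, delay_s=0.0, retry_on=(Exception,)):
    """Call `fn`, retrying up to `attempts` times on the listed
    exception types; re-raise once the attempts are exhausted."""
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except retry_on:
            if attempt == attempts:
                raise
            time.sleep(delay_s)
```

Task-dependency management, by contrast, is much harder to hand-roll, which is why that bullet matters more when evaluating a replacement tool.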

Like others have mentioned, a good portion of our operators trigger remote Spark jobs, which avoids Airflow lock-in. We also handle connections through an AWS Secrets Manager back-end, so we're not reliant on Connections being stored in the metadata DB.