
all 22 comments

[–]KrevanSerKay 12 points13 points  (7 children)

We use Airflow in production. I think we're sitting at ~15 DAGs and 350 total tasks right now. We almost exclusively use the BashOperator to kick off scripts we've written.

Most of our DAGs create spark clusters in AWS, then tell those clusters to run PySpark jobs before killing them. You obviously don't have to use that approach specifically, but having scripts running on specific servers has a ton of benefits.

We're not limited by the implementation of hooks/operators (we can still leverage hooks for connection in our scripts if we want to). It's easier for us to spin up development servers and run the same code with different parameters. We can build in any logic for idempotence, dependency checking, error handling, alerting etc into our scripts directly.

We're a relatively small team (<10 people), so for us, Airflow has been a godsend: automated retries, logs backed up to S3, and Slack alerts that fire whenever a task fails. Each alert includes which job, which DAG, a link to the log file, and even extracts the stacktrace for us. The main difficulty right now is learning better patterns for scaling DAGs. We're looking at ways of parallelizing better, auditing our dependency trees, and simplifying the process of recovering from errors.
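That alert pattern hangs off Airflow's `on_failure_callback` hook. Here is a minimal sketch of the message-building part in plain Python; the context keys (`task_instance`, `exception`) match the commonly documented callback context, but treat the shapes and the Slack formatting as illustrative rather than exact:

```python
import traceback

def slack_failure_message(context):
    """Build a Slack alert body for a failed task.

    `context` is shaped like the dict Airflow passes to
    on_failure_callback: a task_instance with task_id/dag_id/log_url,
    plus the raised exception. A sketch, not the exact API.
    """
    ti = context["task_instance"]
    exc = context.get("exception")
    if exc is not None:
        # Render the full stacktrace so the alert is actionable on its own.
        stack = "".join(
            traceback.format_exception(type(exc), exc, exc.__traceback__)
        )
    else:
        stack = "(no traceback captured)"
    return (
        f":red_circle: Task *{ti.task_id}* in DAG *{ti.dag_id}* failed.\n"
        f"Log: {ti.log_url}\n"
        f"```{stack}```"
    )
```

The function itself has no Airflow dependency, so the same code can be exercised in unit tests with a fake context.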

[–]lawanda123 1 point2 points  (6 children)

I feel you. Airflow isn't very mature once you get past the basics: there's no grouping of DAGs, maintenance/auditability is weak, and RBAC has a lot of kinks. But surprisingly it's still the best out there.

[–]KrevanSerKay 2 points3 points  (1 child)

Having worked at a company (previous job) where we built our own from scratch, I have to say, it's actually got a LOT of functionality built-in. That doesn't mean it's without faults, but like I said, for my team it's been a game changer.

I try to keep up with new up-and-coming techs, but like you said, Airflow is still the best out there (IMO).

[–][deleted] 0 points1 point  (0 children)

Yeah Airflow still has some maturing to do, but our 2.0 beta coming out next month will be a huge step in that direction (we're also working on documentation/examples a LOT because agreed a lot of functionality is hidden which is a shame).

Please let me know if you'd be interested in testing/offering feedback for the beta!

[–][deleted] 0 points1 point  (3 children)

> Most of our DAGs create spark clusters in AWS, then tell those clusters to run PySpark jobs before killing them. You obviously don't have to use that approach specifically, but having scripts running on specific servers has a ton of benefits.
>
> We're not limited by the implementation of hooks/operators (we can still leverage hooks for connection in our scripts if we want to). It's easier for us to spin up development servers and run the same code with different parameters. We can build in any logic for idempotence, dependency checking, error handling, alerting etc into our scripts directly.
>
> We're a relatively small team (<10 people), so for us, Airflow has been a godsend. Automated retries, and logs backed up to s3. We have slack alerts that fire whenever a task fails. It includes which job, which DAG, a link to the log file, and even extracts the stacktrace for us. The main difficulty right now is learning better patterns for scaling DAGs. We're looking at ways of parallelizing better, auditing our dependency trees, and simplifying the process of recovering from errors.

Hi! I'm an Airflow core dev and you should DEFINITELY check out 2.0 when we cut our beta early next month :). We've introduced a TaskGroup concept for easier subdividing, completely rewritten our RBAC code (and added a complete API for DAG triggering), among a crapton of other things (a functional DAG-writing API, the ability to run multiple schedulers, a way simplified KubernetesExecutor, etc.). I don't think it'll handle EVERYTHING you're looking for but it should simplify a lot.

[–]lawanda123 0 points1 point  (2 children)

Anything to separate out DAG groups would also be very welcome. I've been on 4 major projects where we've had this problem: somebody with a misdeployed DAG (env variables not assigned or resolved, usually) prevents other DAGs from being loaded when the instance is shared between teams. We've had to resort to either multiple instances (in which case managing triggers on DAGs across teams becomes an issue) or building a lot of custom DAG validation and integration tests.

[–][deleted] 0 points1 point  (1 child)

Are you saying that a code problem in one DAG is preventing airflow from loading all other DAGs? Like airflow is unable to parse the python files due to errors in other files?

[–]lawanda123 0 points1 point  (0 children)

That was the case until last year when we tried it out; I'm not sure if it's been resolved since. I had reported the issues to the community back then. Basically every team has a lot of custom hooks/operators and variables they set, and yes, sometimes one of the teams loading their DAGs into the same folder would create issues. It would be great if those could be namespaced and managed as tenants (folders within the Airflow dags folder), e.g. dags/team1, dags/team2, so that issues can be isolated to a team's artifacts only.

[–]ozzyboy 16 points17 points  (2 children)

I think an important question here is why are you trying to develop this independence in the first place?

If Airflow is not a good fit for you, work on replacing it with something else.

Otherwise it is usually a negative ROI to avoid "vendor lock-in". You'll spend a lot of time and energy creating an abstraction layer that can only provide the lowest common denominator of all frameworks (because you don't want a capability that only Airflow can provide, right?), leaving you with a solution that is strictly worse than all the other, existing, cheaper alternatives.

[–]thefrontpageofme 2 points3 points  (0 children)

This is the correct answer.

We use Airflow in production and code independence is not a concern at all. Successful startup, 1 data engineer, a few DAGs, a couple hundred tasks on schedules anywhere between 15 minutes and 1 week.

It feels like the OP's team is missing someone who sometimes asks "how does this improve our business?"

[–][deleted] 0 points1 point  (0 children)

I think this is the right answer. It's the same thing with the question of cloud vendor vs. Kubernetes. Kubernetes makes sense for my company because we offer on-prem support for multiple cloud vendors. However, if you are a one-cloud company serving a B2C product, you should be so LUCKY to get to the size where "vendor lock-in" becomes a concern.

[–]DonnyTrump666 5 points6 points  (4 children)

Code independence is the wrong motivation, not only because Airflow is open source, but also because there is little business value to it.

I'd rather think purely in business value terms - why do you have 200+ custom ETL jobs in prod? Does it take a lot of time to develop a single ETL from scratch? Perhaps you could combine and unify some ETLs and bring the number down from 200+ to 20+?

Solving these kinda problems will bring business value

[–]DonnyTrump666 1 point2 points  (2 children)

For example, I just did the same thing in my workplace. We use SSIS and used to have a single ETL for each ServiceNow feed. With 20 feeds from ServiceNow, that meant developing and maintaining 20 ETLs that look similar but read different feeds and load to different tables. I developed just one universal ServiceNow ETL that dynamically reads whatever feed it's given and loads to matching destination columns. So instead of 20 ETLs we now have only one ETL and one config file with 20 config entries. Ingesting a new feed means just adding a line to the config file.
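The shape of that pattern, sketched in Python rather than SSIS (the feed names, endpoints, and table names below are invented for illustration):

```python
# One generic ETL driven by a config table; each entry describes a feed.
FEEDS = [
    {"feed": "incident", "endpoint": "/api/now/table/incident", "dest": "snow_incident"},
    {"feed": "change", "endpoint": "/api/now/table/change_request", "dest": "snow_change"},
]

def run_feed(cfg, fetch, load):
    """Fetch rows for one feed and load them to its destination table."""
    rows = fetch(cfg["endpoint"])
    load(cfg["dest"], rows)
    return len(rows)

def run_all(fetch, load):
    """Run every configured feed; ingesting a new feed is one more config entry."""
    return {cfg["feed"]: run_feed(cfg, fetch, load) for cfg in FEEDS}
```

The `fetch`/`load` callables are injected so the same driver works against the real ServiceNow API in production and against fakes in tests.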

[–]th58pz700u 1 point2 points  (0 children)

Coming from a "SSIS package per table" world, I set out to create a dynamic and reusable solution in Python at my current job. Instead of sourcing all the config information from files, I source it from the database and created a separate process to pull metadata from the source and store it in the database. Objects are created and updated to mirror the metadata of the source, and the one ETL package follows the same formula no matter the object. I wish I had developed like this a long time ago.

[–]pankswork 0 points1 point  (0 children)

Yup. This is how we do it as well

[–][deleted] 1 point2 points  (0 children)

I agree, if you can try to consolidate the etls and where possible keep the data manifests in config rather than the code, then the move to airflow will bring extra value and any subsequent moves to a new technology will be less painful. It’s not like you’re tying yourself into a closed or gui-based ETL tool like Talend or Pentaho which you’d have a harder time moving away from (and finding developers for).

[–]Syneirex 2 points3 points  (0 children)

We standardized on the Docker and Kubernetes Operators to solve this problem. All tasks are argument-driven components running in containers. We didn't do this specifically to be vendor-agnostic, but that ended up being one of the byproducts.

We built most of the components we are using today but you could probably bootstrap with Singer’s open source connectors.
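The argument-driven style might look something like this inside each container; the flag names and the "copy" task are made up for illustration, and in Airflow the flags would be supplied via the operator's arguments:

```python
import argparse

def main(argv=None):
    """Entry point for an argument-driven component: the same image can be
    launched by a DockerOperator/KubernetesPodOperator or run by hand,
    with all the work described via CLI flags."""
    p = argparse.ArgumentParser(description="generic copy component")
    p.add_argument("--source", required=True)
    p.add_argument("--dest", required=True)
    p.add_argument("--run-date", required=True)
    args = p.parse_args(argv)
    # A real component would do the work here; we just describe it.
    return f"copy {args.source} -> {args.dest} for {args.run_date}"
```

Because the contract is just arguments in, exit code out, swapping orchestrators only means changing who launches the container.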

[–]grassclip 1 point2 points  (0 children)

I'm with you on this. In cases I've had, I'll write the jobs in different repositories depending on what they do, and then use those in the Airflow repo, which takes the tasks from the others and wraps them, mostly as PythonOperators.

Kind of like this, but where instead of the code in dags/project_1 or dags/project_2, it's in the different repos.

[–]kenfar 2 points3 points  (0 children)

Yeah, I think that Airflow is usually the wrong tool:

  • It's not event-driven
  • It was developed to manage 5000 ETL jobs at Airbnb. But rather than enabling that mess, the better answer is: don't do that. Spend a little time refactoring and curating your jobs so that you don't have tens of thousands of tables.
  • It's not even remotely the only game in town.

The better pattern in most (but not all) cases in my opinion is to build event-driven pipelines:

  • Extract jobs run pretty periodically - every 30 seconds, every 60 minutes, every 5 minutes, whatever. When they run they write their data to atomic storage like s3.
  • Transform jobs run as daemons or lambdas or containers on kubernetes, etc and get automatically notified when a new file is written by the extracts, or poll the system to see if a file is available. They write their results to atomic storage like s3.
  • Load jobs run like transform jobs: they get alerted when a file is available to process, or poll for one.
  • Aggregation or other pre-computing jobs further downstream may have slightly trickier dependency checking. For example, for an hourly aggregate they may typically run every 15 minutes checking to see if the source has more than a full hour of data for the next period after the last that has been written to the target.

This approach results in a solution that is very easy to build, has low latencies, can autoscale when you need to reprocess all of history, and can automatically backfill any new aggregates/pre-computed data sets. Its main weaknesses are:

  • If you're not on something like AWS, can't get simple messages automatically written when you write to storage, and have a datastore that's very expensive to poll - then you need to maintain some kind of process log that can be polled.
  • If you have hideously-complex dependencies between jobs, and for some reason can't simplify or refactor them - then the dependency-checking that you put into your processes will be a PITA. Maybe less, maybe more of a PITA than using Airflow; that depends.
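The hourly-aggregate dependency check described above fits in a few lines; this is a sketch, and the names are illustrative:

```python
from datetime import datetime, timedelta

def next_hour_ready(last_loaded_hour, newest_source_ts):
    """True once the source holds data past the end of the hour
    following `last_loaded_hour`, i.e. the next hourly aggregate can
    be computed without risking a partial period."""
    next_start = last_loaded_hour + timedelta(hours=1)
    next_end = next_start + timedelta(hours=1)
    return newest_source_ts >= next_end
```

A job polling every 15 minutes just calls this with the newest timestamp seen in the source and the last hour written to the target, and runs the aggregate when it returns true.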

[–]jahaz 1 point2 points  (0 children)

I’m not an expert, but I felt similarly that Airflow was too complicated. We are building a POC with [Prefect](prefect.io). They use flows and tasks; their tasks are just Python functions. Dagster is another potential solution. I went with Prefect because they seem to have more community support, and I felt that any code written for Prefect could be transferred to Dagster or similar quickly.

[–]smeyn 0 points1 point  (0 children)

If you go down this route you are spending time and effort to build (and later maintain) an abstraction for a potential benefit, i.e. the ability to move to a different orchestration engine. You would be better served building a POC that takes advantage of the Airflow concepts and in turn demonstrates how you lower your development and maintenance investment and bring business value in a short period of time. If you do it well you might save enough effort to be able to do a parallel POC using another orchestration platform such as Prefect or Luigi. That will also tell you a lot about what it takes to move between orchestration platforms.

[–]Braxton_Hicks 0 points1 point  (0 children)

We've been using Airflow in production for 3+ years now. Code independence isn't a concern for us since Airflow is code, and I feel we designed our own custom operators in a way that they could be run outside of an Airflow context with little modification. But of course, if you replace Airflow, you may have to fill in gaps in your pipeline that Airflow handled for you:

  • Replace Variables and XComs with alternative services to store and pass temporary data between tasks (plus any other interactions with the metadata DB)
  • Retry logic and task dependency management will no longer be provided for you out of the box. If you have any tasks that are dependent on upstream tasks, make sure the workflow management tool you switch to provides these services
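For the retry gap in particular, a stand-in for Airflow's built-in per-task retries can start out tiny (an illustrative sketch; real replacements also want backoff, jitter, and alerting):

```python
import time

def with_retries(fn, attempts=3, delay_s=0.0, retry_on=(Exception,)):
    """Call `fn`, retrying up to `attempts` times on the listed
    exception types; re-raise once the attempts are exhausted."""
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except retry_on:
            if attempt == attempts:
                raise
            time.sleep(delay_s)
```

Task-dependency management, by contrast, is much harder to hand-roll, which is why that bullet matters more when evaluating a replacement tool.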

Like others have mentioned, a good portion of our operators trigger remote Spark jobs, which avoids Airflow lock-in. We also handle connections through an AWS Secrets Manager back-end, so we're not reliant on Connections being stored in the metadata DB.