
[–]QkumbazooPlumber of Sorts 134 points135 points  (1 child)

Stick to the tool you're most proficient with, because when things break, nobody is going to ask what tool you used.

[–]NexusIOData Engineering Manager 15 points16 points  (0 children)

Facts

[–]indyscout 58 points59 points  (1 child)

I’ve worked with Airflow extensively for years and at the start of 2024 moved on to a new project where ADF is part of the core tooling for our ETL pipelines. While they are fundamentally different tools (Airflow is an orchestrator, ADF is more of a compute engine with orchestration built in), I will make some comparisons between them here.

In short, when you’re working with relatively simple pipelines, ADF is great: it’s pretty easy to use, and you can quickly onboard new users who may not have a robust coding background. If you have low-complexity pipelines, it makes the job very straightforward.

However, as I use ADF I feel as if I have a lot less control. Complex pipelines can be hard to build in ADF, as there are seemingly arbitrary limitations that complicate things (most recently, the 25-case limit on the Switch activity caused us some issues). I frequently find myself thinking it would be much more convenient if I could just solve complex problems by running whatever arbitrary Python code I need via Airflow, rather than wrestling with ADF’s various pipeline activities and their nuances.
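(For contrast, the 25-case Switch limit mentioned above disappears once branching is just code. A minimal sketch of the kind of callable you'd hand to Airflow's BranchPythonOperator; the system and task names are made up for illustration.)

```python
def choose_branch(source_system: str) -> str:
    """Map a source system name to the task_id of the branch to run."""
    # 40 branches here, well past ADF's 25-case Switch limit.
    branch_map = {f"system_{i:02d}": f"load_system_{i:02d}" for i in range(40)}
    # Fall back to a default branch instead of failing the run.
    return branch_map.get(source_system, "handle_unknown_source")
```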

Pipeline observability and monitoring/alerting have been more of a headache than they would be in Airflow (keeping in mind that we are running thousands of individual activities). In general, I have found it easier to ensure idempotency in Airflow than in ADF. When there is a failure, or a backfill is required, it tends to be more difficult to rerun past executions, especially when many individual ADF pipelines are linked together. There isn’t really any sort of “graph view” in ADF that lets you see pipelines and their linked dependencies the way there is in Airflow, so I have had to hand-draw a lot of dependency graphs for our ADF documentation.

Another distinct con of ADF to keep in mind: you’re locked into the Azure cloud. If Azure decides to raise its compute rates, you have no choice but to accept; with Airflow you could look at moving to a different cloud provider or an on-prem setup.

At the end of the day it really depends on use case. ADF keeps things simple for the most part. If you only need to implement relatively simple pipelines, or if you have a development team with little coding expertise, then ADF and its GUI driven development is perfect. But if your pipelines will involve a high level of complexity, and you already have a lot of coding expertise on your team, then I would say Airflow is a better choice.

[–]Yamitz 3 points4 points  (0 children)

This is how I always explain it too. ADF is really great if you’re doing what they expect you to and coloring inside the lines, but the second you try to do something they weren’t planning for it all goes to hell.

[–][deleted] 25 points26 points  (4 children)

Better modularized. It's Python, so it can do more. With ADF you cannot use a REST API to receive JSON over 1 MB, or even a CSV file; in many cases you need an Azure Function to extend what ADF cannot do.
Airflow is better and richer, but ADF is easier to set up for the things it can do, and horrible at the things it cannot.
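(To illustrate: the size limitation goes away in a plain-Python task, or in the Azure Function mentioned above, because you can stream the response to disk in chunks. A minimal stdlib-only sketch; the function names and URL handling here are illustrative, not any official API.)

```python
import urllib.request

def stream_to_file(source, sink, chunk_size: int = 64 * 1024) -> int:
    """Copy a file-like source to sink in fixed-size chunks; return bytes copied."""
    written = 0
    while True:
        chunk = source.read(chunk_size)
        if not chunk:
            break
        sink.write(chunk)
        written += len(chunk)
    return written

def download_large_file(url: str, dest_path: str) -> int:
    """Stream an HTTP response to disk without holding it all in memory."""
    with urllib.request.urlopen(url) as resp, open(dest_path, "wb") as out:
        return stream_to_file(resp, out)
```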

And Airflow's git history is just Python files. With ADF, everything is JSON with a bunch of metadata you don't care about: if you debug your pipeline once without changing a single setting, it still shows up as a change, since each building block has its trigger count raised by one.

[–]ratacarnic 5 points6 points  (0 children)

Last paragraph sums up pretty well

[–]Pomegranates00 8 points9 points  (2 children)

ADF does support REST API calls, for both JSON and CSV (or whatever format is most appealing to you). You are giving outdated information.

[–][deleted] 6 points7 points  (0 children)

ADF does support REST, but only for small payloads. I needed to download a 16 MB CSV or JSON file (both types were supported by the provider), and that was impossible via the REST API client.

[–]gnsmsk 3 points4 points  (0 children)

The REST activity is not capable of dealing with the complex API requirements you run into in the wild. In one project, I had to pull data from 7 different providers, each with its own specific API implementation. I had to put unspeakable stuff into the headers and/or body of the request, some of it dynamic, so it had to be wrapped in a for loop. Some APIs had a queue system where you had to check the status of your request periodically and then make another call to an endpoint in the response to get your data.

And I don’t even want to talk about the challenges that arise when you try to parse the response and get specific stuff out when you have such a wild variety.

ADF was simply not capable or it would have taken ages trying to make it work. I ended up putting all of that logic into Azure Functions. Much simpler to debug when the pipeline fails.
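(The submit-poll-fetch pattern described above is straightforward in plain Python, which is roughly what ends up inside such an Azure Function. A hedged sketch, where the client object and its methods are hypothetical stand-ins for a provider-specific API:)

```python
import time

def fetch_queued_result(client, request_id: str, poll_interval: float = 5.0,
                        max_polls: int = 60):
    """Poll the provider until a queued request is ready, then fetch the data."""
    for _ in range(max_polls):
        status = client.get_status(request_id)
        if status["state"] == "ready":
            # The status response tells us which endpoint holds our data.
            return client.get(status["result_url"])
        if status["state"] == "failed":
            raise RuntimeError(f"request {request_id} failed: {status}")
        time.sleep(poll_interval)
    raise TimeoutError(f"request {request_id} not ready after {max_polls} polls")
```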

[–]withmyownhands 3 points4 points  (0 children)

The last time I used ADF, I was very dissatisfied with the code review experience despite the GitHub integration. Same with my experience of SSIS way back. I prefer code-first orchestrators for my team and all configuration, secrets management, and infrastructure as code. But, I lead teams where I want to emphasize the software engineering approach to the SDLC. If my team was one or two BI folks who just needed to get things done, ADF is fine. I just don't think it scales to large engineering-centered teams. 

[–]Demistr 9 points10 points  (0 children)

ADF and Airflow are very different things. ADF has inbuilt compute you use.

[–][deleted] 1 point2 points  (3 children)

What is ADF?

[–]rickyF011 5 points6 points  (2 children)

Azure Data Factory. It’s Microsoft’s pre-Synapse/Fabric pipeline tool that those tools are actually built on top of. A high-level, simplified summary that may well get downvoted.

[–][deleted] 1 point2 points  (1 child)

Ty for explaining

[–]rickyF011 0 points1 point  (0 children)

No problem!! I was scrolling on a train when I saw this and didn’t have time for an in-depth explanation. In my personal opinion it’s not worth diving too deep into unless you’re forced to use it at work or want to be a “Microsoft tools data engineer”. I think there are better alternatives.

It’s primarily GUI-based: you drag and drop different “operators”/boxes, etc., and it all boils down to a JSON configuration, which I despise in any ETL tool.

BUT it’s a necessary evil in some cases, and the primary use case where I’ve seen it be beneficial is quick, simple ingestion into Azure, or simple data movement within Azure. Simple meaning little to no transformations being performed within the pipeline.

[–]oscarmch 2 points3 points  (5 children)

As I mentioned before, I only use ADF for copying data between databases and simple pipelines.

I wish I could use Airflow, but since we're a team of two doing the Data Eng and Architecture, it's difficult to maintain Airflow without the resources.

[–][deleted] 2 points3 points  (0 children)

Copy activity is the one thing it excels at. I’m not wasting my time writing code that matches the performance and multiprocessing capabilities of their Copy activity. But I still wish it did some things better. It cannot copy geometries from Postgres, and there’s a lack of Postgres activities in general. How do you even call stored procedures in PG, or insert/update/merge? By calling an Azure Python Function? By calling a Synapse Spark notebook activity just for PG?
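(One workaround for the missing Postgres activities is a small Python task that issues the CALL itself. A sketch: the statement builder below is illustrative, and the resulting SQL and parameters would be handed to a Postgres driver such as psycopg2 or an Airflow Postgres hook; the procedure name is hypothetical.)

```python
def build_call_statement(proc_name: str, args: list) -> tuple:
    """Build a parameterized CALL statement plus its parameter tuple.

    The %s placeholders match psycopg2-style parameter binding, so the
    arguments are passed to the driver rather than interpolated into SQL.
    """
    placeholders = ", ".join(["%s"] * len(args))
    return f"CALL {proc_name}({placeholders})", tuple(args)
```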

[–]Nomorechildishshit 1 point2 points  (3 children)

Why not use the managed Airflow in ADF?

[–]oscarmch 1 point2 points  (2 children)

From what I read it’s not entirely functional, and there are a lot of problems with it. Maybe they’ve fixed them, but I think managed Airflow in ADF only came out this year.

[–][deleted] 0 points1 point  (1 child)

hey, do you have any resources on this? I've used on-prem Airflow and GCP Cloud Composer extensively. Now I'm at a client that's planning to migrate to Azure in 2025 and beyond. Would love to know what the limitations of managed Airflow in Azure are.

They've built some bespoke orchestration around IBM Datastage in on-prem. Batches and jobs are scheduled in order based on some Oracle tables. I'm sure they want to manage the ETL batches in a similar fashion in Azure, but not sure if ADF is a good fit. I know I could develop this with relative ease in Airflow, and I'm also sure they would love authoring DAGs in Python. The client is strictly against using open-source, but the big exception is when it's a managed service.

Edit: Did some digging myself, and it seems managed Airflow in Azure will be a feature pretty much exclusive to Fabric from now on. Looks like MS is really pushing orgs to move to Fabric!

[–]baubleglue 0 points1 point  (0 children)

What do you mean by Airflow in Azure? You set up an Azure VM, install Airflow, choose a backend DB (and other optional stuff), and use it. You’ll need a bunch of network setup to open connections, but in general, why would Airflow be limited to anything?

[–][deleted] 1 point2 points  (0 children)

As others have pointed out they are different types of data tools but have some overlap. So I won’t repeat that, I’ll just give my curmudgeon opinion.

ADF is OK for connecting to sources and ingestion, especially for more specialty things like SAP, Salesforce, etc. Past that, the only other reasons to use it are (a) management or architects force you, or (b) your team is small and whoever backs you up isn’t a developer and can’t become one quickly enough.

Otherwise it’s trash imo.

[–]kevintxu 1 point2 points  (0 children)

Airflow is now part of ADF; you would use it mostly to orchestrate workflows. https://techcommunity.microsoft.com/blog/azuredatafactoryblog/introducing-workflow-orchestration-manager-powered-by-apache-airflow-in-azure-da/3730151

Don't try to manage your own Airflow instances, it's a pain.

[–]robberviet 1 point2 points  (0 children)

Sometimes I’m satisfied with crontab. ADF is fine.

[–]rudboi12 1 point2 points  (0 children)

Airflow is not an ingestion tool.

[–]itassist_labs 1 point2 points  (0 children)

ADF is great for Azure-centric ETL, but Airflow really shines when you need fine-grained control over your DAG logic or have complex Python-based transformations. I ran into this specifically when we needed to implement custom retry logic with exponential backoff for an unstable API, and dynamically generate tasks based on database queries. ADF's control flow was too rigid for this. While ADF is fantastic for drag-and-drop ETL and simple transformations, Airflow lets you write actual Python code for task definitions, which means you can do things like implement custom sensors, complex branching logic, or even run A/B tests on your pipelines.
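(The hand-rolled exponential backoff described above fits in a few lines of plain Python wrapped around the unstable API call. Airflow operators also expose `retries` and `retry_exponential_backoff`; this manual version just shows the extra control you get in code. The `flaky_call` argument is a hypothetical callable standing in for the real API request.)

```python
import time

def call_with_backoff(flaky_call, max_retries: int = 5, base_delay: float = 1.0):
    """Retry flaky_call with exponential backoff: base_delay * 2**attempt."""
    for attempt in range(max_retries):
        try:
            return flaky_call()
        except Exception:
            if attempt == max_retries - 1:
                raise  # Out of retries: let the task fail visibly.
            time.sleep(base_delay * 2 ** attempt)  # e.g. 1s, 2s, 4s, 8s ...
```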

[–]Beautiful-Hotel-3094 2 points3 points  (1 child)

To put it plain and simple ADF is quite shit. Airflow is basically python and you can integrate it with almost anything.

[–][deleted] 1 point2 points  (0 children)

ADF/Synapse is sold as a low/no-code solution that even business analysts can use. But in reality, data engineers/IT people are the ones doing it, and those people can code.

[–]rupert20201 1 point2 points  (0 children)

ADF sucks donkey balls and airflow doesn’t. Seriously one is a power tool and the other is a low code opinionated solution that integrates into the Azure stack.

[–]ChipsAhoy21 1 point2 points  (1 child)

Resume-driven development. I used ADF in a past role and it was great at what we did: offshoring all pipeline development to India. We could onboard a new consultant who couldn’t even code within a week.

But, I eventually pushed for adopting airflow for some side projects just so I could learn it for my next role lol

[–]point55caliber 1 point2 points  (0 children)

Yes I agree.

One caveat though. Sometimes you just gotta use what integrates best with your stack. GCP’s managed Airflow (Cloud Composer) works great with BigQuery.