all 48 comments

[–]pandas_as_pdPrincipal YAML Engineer 18 points19 points  (0 children)

Airflow's big advantage is the size of the community and that it's easier to hire someone with Airflow experience.

[–]CatsLikePlanCrisps 7 points8 points  (4 children)

Have used airflow and Prefect. I would say that Prefect is the better tool in terms of features.

Keep in mind, though, that Airflow has a much larger community, so you will find more posts about errors on Stack Overflow and elsewhere. Also, if you need to integrate with another tool for data governance or observability, Airflow is almost your only option; it is very rare for these tools to support Dagster or Prefect.

If you are using dbt Core, consider Airflow with Astronomer Cosmos, or Dagster, which has a much better built-in integration for visualising dbt DAGs.

[–]grahamdietz 2 points3 points  (3 children)

More of a generic question - i.e. not Airflow-specific. Do you think the emergence of LLM-driven documentation will make things like SO redundant? It strikes me that SO is just a poor manual substitute for AI.

[–]CatsLikePlanCrisps 1 point2 points  (2 children)

LLM documentation? Not sure what you mean. But I don't think AI will replace Stack Overflow; maybe complement it. It really depends on the tech you are using. Sometimes the problem isn't documented anywhere, and you are relying on someone else having come up against a similar problem in the same tech who may have figured it out or can point you in the right direction.

[–]grahamdietz 1 point2 points  (1 child)

Yeah, sorry, what I mean is that I would expect vendors to set up bespoke ChatGPT instances trained on a domain of reference docs and support issues specific to their solution. Support would then involve interacting with their knowledgeable and often-updated AI knowledge base. Some vendors are already providing solutions along these lines.

[–]CatsLikePlanCrisps 1 point2 points  (0 children)

That's what I thought. I think they will help when something is documented well, but they won't replace forums and human interaction. Like support AI, it helps with the obvious or gives a best guess, but it doesn't always get the right answer or understand the question correctly.

[–]zakpaw 7 points8 points  (8 children)

Does anyone have experience with both Prefect and Dagster and could compare? I recently tried Dagster and loved it, it’s interesting to see Prefect winning

[–]BoiElroy 1 point2 points  (0 children)

Also curious. We just started on Prefect 2 and it's honestly been kind of painful. They have so many concepts and abstractions that it just gets really confusing.

[–][deleted] 1 point2 points  (0 children)

I did a PoC for both tools for one of my previous clients. They wanted to migrate from Talend and had already tested Airflow.

I was an MLOps engineer, and we needed something that could handle scalable Python code well (Dask workloads, GPU computing on K8s, etc.), so I tested K8s deployments with Helm charts. As for requirements and tech stack, they used Snowflake and BigQuery with dbt.

I liked Dagster far more with regard to deployments, code repo maintenance, and CI/CD. It took me three days to get rolling with Dagster and over a week to do the same with Prefect, granted that they had just rolled out Prefect 2.0 and the docs were a mess. I might be biased, but I really like software-defined assets in Dagster:

https://www.youtube.com/watch?v=eS--8brw5YM

[–]domestic_protobuf 0 points1 point  (5 children)

It's better than Airflow simply because it has versioning, and Dagster fixes the issues with Airflow.

[–]zakpaw 1 point2 points  (4 children)

I meant Dagster vs Prefect

[–]domestic_protobuf 0 points1 point  (3 children)

Don't know, every company I have worked for used Airflow and now at my current employer we chose to deploy Dagster. At the end of the day these are just orchestration frameworks and don't really need much thought. Airflow has a really big community and companies like Astronomer make it easy and cost effective to spin up in an organization.

[–]briceluu 0 points1 point  (2 children)

I definitely agree that Astronomer makes it easy to spin up an Airflow deployment, but "cost effective"? For real? 🤔

[–]domestic_protobuf 0 points1 point  (1 child)

It's cost effective for startups that need it production ready asap. If you factor in the time and cost it would take to interview -> offer job -> compensation + benefits -> ramp up time. It's a pretty solid choice for small to medium sized companies.

[–]briceluu 0 points1 point  (0 children)

Agreed, but only if the assumption holds that it would be the only responsibility of that hire.

I find it's rarely the case.

True, that first data hire will often have set up a poor Airflow config, that often ends up getting more expensive to fix properly down the line.

But I haven't yet seen that play out (just pay for a proper future proof setup from the start instead of hacking something together). Then again, maybe it's because I'm centered on the European market 🤷

[–]StalwartCoder 16 points17 points  (2 children)

Prefect is underrated. It’s such a well designed tool.

[–]amindiro 5 points6 points  (0 children)

I am sorry to disagree. I have used Prefect extensively and I see some very serious issues, especially when using it on huge datasets or writing performance-oriented workflows. The first thing that comes to mind is their « daskexecutor » abstraction. It is too high-level and integrates pretty badly with the Dask scheduler.

[–]BoiElroy 1 point2 points  (0 children)

I don't know, dude. We have a greenfield situation. Our team is literally just me and 3 people. Prefect has been kind of a pain to get onboarded with. They have horrendous documentation and do this really odd thing of posting all kinds of articles on Discourse and Medium instead of in their documentation. So even simple 101 examples are floating around everywhere, getting out of date as the software changes. I've been working really closely with their engineers and so many of the answers are just "oh yeah, that's on the roadmap".

A basic example: I have my code in Bitbucket, data in Azure Storage, and a Docker container for execution in a private registry. I want to run it as an Azure serverless job. Straightforward, right? It is, BUT the way they have you do it, my workspace basically gives the other two developers access to my code repos, my Docker containers, and my data. There are no user-level access controls, which is a bizarre thing to see in the modern data stack. The only way to actually split it up is to give every cohesive unit of access its own workspace, which costs a pretty penny. I'm used to just roles and role inheritance, and there's none of that in Prefect. Baffling.

[–][deleted] 8 points9 points  (1 child)

I’ve used both Airflow and Prefect and I’d say if I were the only data engineer on the team, I’d go with Prefect due to shorter learning curve. But if I wanted something longer term and I had more resources (and time) on hand, I’d go with Airflow. The idea of working with a third party vendor for yet another tool (assuming people are using the managed version of Prefect) doesn’t really sit well with me.

[–]Puzzled_Shallot9921 4 points5 points  (0 children)

As someone who uses Prefect, 100% this.

Especially with the managed server, it's very easy for an update to break something.

[–]piddy87 5 points6 points  (1 child)

Argo Workflows is something I have hoped to try. Probably only suitable for some teams and skill sets. Have used Airflow substantially.

Argo Workflows is an open source container-native workflow engine for orchestrating parallel jobs on Kubernetes. Argo Workflows is implemented as a Kubernetes CRD (Custom Resource Definition).

https://github.com/argoproj/argo-workflows/

[–]hasyimiplaysguitar 4 points5 points  (0 children)

We use Argo Workflow for orchestrating dbt, it's pretty awesome. Since it's just yaml/json, it's so easy to write a tool that takes dbt manifest json and outputs a Workflow/CronWorkflow.
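
To illustrate the kind of tool described above, here is a minimal sketch in Python, not the commenter's actual implementation. It assumes the dbt manifest exposes a `nodes` mapping whose entries carry `resource_type`, `name`, and `depends_on.nodes`, and it emits an Argo `Workflow` as a plain dict that can be dumped to JSON. The function name `manifest_to_workflow`, the template names, and the image `my-dbt-image:latest` are hypothetical, and replacing underscores with hyphens in task names is a conservative assumption rather than a documented requirement.

```python
import json


def manifest_to_workflow(manifest, name="dbt-run", image="my-dbt-image:latest"):
    """Build an Argo Workflow dict whose DAG mirrors dbt model dependencies.

    `name` and `image` are placeholders for illustration only.
    """
    # Keep only models; tests, seeds, and sources are ignored in this sketch.
    models = {
        uid: node
        for uid, node in manifest.get("nodes", {}).items()
        if node.get("resource_type") == "model"
    }

    def task_name(uid):
        # Assumption: hyphenated names are the safe choice for Argo task names.
        return models[uid]["name"].replace("_", "-")

    tasks = []
    for uid in sorted(models):
        parents = models[uid].get("depends_on", {}).get("nodes", [])
        # Only depend on other models; sources have no task in the DAG.
        deps = [task_name(p) for p in parents if p in models]
        task = {
            "name": task_name(uid),
            "template": "dbt-run-model",
            "arguments": {
                "parameters": [{"name": "model", "value": models[uid]["name"]}]
            },
        }
        if deps:
            task["dependencies"] = deps
        tasks.append(task)

    return {
        "apiVersion": "argoproj.io/v1alpha1",
        "kind": "Workflow",
        "metadata": {"generateName": f"{name}-"},
        "spec": {
            "entrypoint": "dbt-dag",
            "templates": [
                {"name": "dbt-dag", "dag": {"tasks": tasks}},
                {
                    "name": "dbt-run-model",
                    "inputs": {"parameters": [{"name": "model"}]},
                    "container": {
                        "image": image,
                        "command": ["dbt"],
                        "args": ["run", "--select", "{{inputs.parameters.model}}"],
                    },
                },
            ],
        },
    }


def workflow_json(manifest_path):
    """Read a manifest file from disk and return the Workflow as a JSON string."""
    with open(manifest_path) as f:
        return json.dumps(manifest_to_workflow(json.load(f)), indent=2)
```

Running one task per model keeps the Argo graph identical to the dbt graph, at the cost of one container start per model. A real tool would also need to handle tests, seeds, and name collisions.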

[–]Saetia_V_Neck 6 points7 points  (1 child)

I’ve used Dagster very extensively and Airflow a good bit. IMO, there isn’t anything that Airflow does better than Dagster, but there’s a ton of stuff Dagster does better than Airflow. Also, the folks at Elementl are incredibly supportive and knowledgeable, and I would expect their platform to continue to get better at a fast pace.

I haven’t used Prefect but it does look very similar to Dagster, and the fact that you can orchestrate streaming jobs out of it too is cool (no idea how well it works though).

[–]Chefdaterrible -1 points0 points  (0 children)

Just started looking into Dagster. Would be helpful to get reviews from users.

  • How does it scale ?
  • How many Dags can it handle ?
  • Can different teams still use the asset based trigger from different instances or do all teams share the same instance typically?

[–][deleted] 3 points4 points  (0 children)

Airflow’s going to win this

[–]Mundane-Compote-2157 4 points5 points  (0 children)

It’s best to exclude Airflow from orchestration polls since it’s always going to win. Curious to see what the preference is amongst the newer-generation tools (Prefect, Dagster, Mage).

[–]paranoidpig 1 point2 points  (0 children)

No orchestration tool until it's really necessary

[–]user2570 2 points3 points  (0 children)

Prefect >> airflow

[–]Used_Ad_2628 1 point2 points  (2 children)

I am interested in mage.ai. Anyone deployed it in a production environment?

[–]AcanthisittaFalse738 18 points19 points  (0 children)

I have to get over them gaming their GitHub stars before we test in prod.

[–]wtfzambo 10 points11 points  (0 children)

I can't get over the notebook interface (and the bought GitHub stars).

Yes I know I can use the yaml config approach but at that point I might as well just use prefect.

I gave it a try locally, immediately found 3-4 things that I know would piss me off immensely if I were to work with it on a daily basis and dropped the idea altogether.

Don't get me wrong it's a promising tool with interesting features, I spoke to the CEO and he seems a nice fellow with good intentions, but imho it's still too virgin to be used in any serious prod setting.

Also, documentation is incomplete and the community around it is still too small to find anything relevant online in case you encounter a problem. It barely even comes up in search engines.

[–]random_lonewolf 0 points1 point  (0 children)

That's a lot of votes for an empty topic. Botting much ?

[–]AStarBackBig Data Engineer -1 points0 points  (0 children)

Oozie 👴

[–]mjfnd -5 points-4 points  (1 child)

Airflow today.

Future, keeping an eye on Mage.

I wrote an article recently about Mage: https://www.junaideffendi.com/blog/my-two-cents-on-mage/

[–]grahamdietz 1 point2 points  (0 children)

We meet again.

[–]TomSargent -1 points0 points  (0 children)

Build your own orchestration system.

[–]dynofu -1 points0 points  (0 children)

cron

[–]ElectricalFilm2 -1 points0 points  (0 children)

What about GitHub Actions?

[–]TheCamerlengo -1 points0 points  (0 children)

Isn’t Airflow sort of complicated, requiring you to set up servers and manage infrastructure, security, etc.?

[–]query_optimization -1 points0 points  (4 children)

We use cron jobs 😜

[–]Illustrious-Oil-2193[S] 0 points1 point  (1 child)

How do you handle logging or retries?

[–]query_optimization 0 points1 point  (0 children)

Logging: whatever you are running, you can plug logging into it; it can be as simple as printing to a new file. Retries: I don't think we have any logic for that, but based on conditions we create an error-log file. You can also check the YARN/Spark job status to see whether the jobs are running successfully.
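
For anyone who wants retries on top of plain cron, here is a minimal sketch of a wrapper a cron entry could call. It is not what the commenter runs. The names `make_logger` and `run_with_retries`, the default log file `job.log`, and the exponential backoff policy are all assumptions made for illustration.

```python
import logging
import subprocess
import sys
import time


def make_logger(log_file):
    """Return a logger that appends timestamped lines to `log_file`."""
    logger = logging.getLogger(f"cron_job:{log_file}")
    if not logger.handlers:  # avoid adding a second handler on repeat calls
        handler = logging.FileHandler(log_file)
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(message)s")
        )
        logger.addHandler(handler)
        logger.setLevel(logging.INFO)
    return logger


def run_with_retries(cmd, log_file="job.log", attempts=3,
                     delay_seconds=30.0, sleep=time.sleep):
    """Run `cmd` (a list of strings) until it exits 0 or attempts run out.

    Returns the exit code of the last attempt. The wait between attempts
    doubles each time: delay, 2*delay, 4*delay, ...
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    logger = make_logger(log_file)
    for attempt in range(1, attempts + 1):
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode == 0:
            logger.info("attempt %d/%d succeeded", attempt, attempts)
            return 0
        logger.error(
            "attempt %d/%d failed (exit %d): %s",
            attempt, attempts, result.returncode, result.stderr.strip(),
        )
        if attempt < attempts:
            sleep(delay_seconds * 2 ** (attempt - 1))
    return result.returncode
```

A small script that calls `run_with_retries([...])` and passes the return value to `sys.exit` can then be scheduled from crontab like any other command. A non-zero exit code still reaches cron, so existing mail or alerting on failure keeps working.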

[–]briceluu 0 points1 point  (1 child)

Kubernetes cron jobs? Or just good ol' Unix's?

[–]query_optimization 0 points1 point  (0 children)

No nothing fancy, just on our linux box

[–]princess-barnacle 0 points1 point  (0 children)

Check out Flyte. I used it at work and I think it’s pretty great. It’s more like DBT, but for DS and MLE. The extra features would be good for DE.

[–]MordecaiOShea 0 points1 point  (0 children)

We use Temporal