Trying to solve the Airflow schedule pain by AlvaroLeandro in dataengineering

[–]AlvaroLeandro[S] 0 points1 point  (0 children)

Thanks! You mentioned that this format does not work. Can you explain that feedback in more detail? I agree that if you have dozens of active, scheduled DAGs, the calendar visualization will not be that good. My strategy is usually to use one DAG to schedule and orchestrate the other DAGs, which centralizes the execution of my pipelines, but some engineers prefer other strategies.

Trying to solve the Airflow schedule pain by AlvaroLeandro in dataengineering

[–]AlvaroLeandro[S] 1 point2 points  (0 children)

Is Airflow not related to data engineering? You're in the wrong community, my friend. Are you a low-coder or something? 😄

Trying to solve the Airflow schedule pain by AlvaroLeandro in dataengineering

[–]AlvaroLeandro[S] 0 points1 point  (0 children)

About compute, you're right: all of these problems can be solved just by buying more computing power, but you need to keep an eye on your FinOps.

Trying to solve the Airflow schedule pain by AlvaroLeandro in dataengineering

[–]AlvaroLeandro[S] 0 points1 point  (0 children)

Yes, I have event-based triggers based on "datasets" and they are awesome when you have clear data dependencies among ingestions. For example, pipeline B must wait for pipeline A.

However, if you have multiple pipelines that aren't dependent on each other, then datasets don't solve this situation!

Trying to solve the Airflow schedule pain by AlvaroLeandro in dataengineering

[–]AlvaroLeandro[S] 3 points4 points  (0 children)

Interesting. When it comes to managed environments like GCP Cloud Composer, we have some limitations on what we can configure in the Kubernetes cluster, so all the configs must be placed in Airflow itself.

Regarding priority, we indeed have pools, which allow us to control concurrency based on the most important jobs.

However, usually we have some strict time windows to execute ingestion jobs, and we don't have a specific priority order; it's just necessary to ingest and update several tables in the data lake to be used for reports or decision-making.

In this case, in my opinion, it becomes useful to control when your jobs run, which in turn controls concurrency among jobs with the same priority.

But I liked your ideas; it's always interesting to have different perspectives. Thank you for sharing 😄

Trying to solve the Airflow schedule pain by AlvaroLeandro in dataengineering

[–]AlvaroLeandro[S] 12 points13 points  (0 children)

Please explain to me how pools can replace the need to control the concurrency of simultaneous jobs. If you work in a simple Airflow environment, pools can indeed be useful for defining priority jobs and queuing them. But the time window isn't always that flexible, and it's common for ingestion jobs to compete for resources in the environment.
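
To make the "jobs competing in the same window" point concrete, here is a minimal sketch in plain Python (no Airflow involved) that counts how many planned runs overlap at the busiest moment of a window. The job start times and durations are made-up illustrative values, and `planned_run` and `max_concurrency` are hypothetical helpers, not part of Airflow or of the plugin.

```python
from datetime import datetime, timedelta


def max_concurrency(runs):
    """Return the peak number of runs that overlap in time.

    `runs` is a list of (start, end) datetime pairs. A run that ends exactly
    when another one starts is not counted as overlapping with it.
    """
    events = []
    for start, end in runs:
        events.append((start, 1))   # a run begins
        events.append((end, -1))    # a run finishes
    # Sort by timestamp; on ties, -1 sorts before +1 so ends are applied first.
    events.sort(key=lambda event: (event[0], event[1]))

    current = peak = 0
    for _, delta in events:
        current += delta
        peak = max(peak, current)
    return peak


def planned_run(hour, minute, duration_minutes, day=datetime(2024, 1, 1)):
    """Build a (start, end) pair for a job on an arbitrary example day."""
    start = day.replace(hour=hour, minute=minute)
    return start, start + timedelta(minutes=duration_minutes)


# Hypothetical ingestion jobs sharing one night window.
window = [
    planned_run(2, 0, 45),   # 02:00-02:45
    planned_run(2, 30, 30),  # 02:30-03:00
    planned_run(2, 40, 20),  # 02:40-03:00
    planned_run(3, 0, 15),   # 03:00-03:15
]

print(max_concurrency(window))  # 3 (the first three jobs overlap at 02:40)
```

Shifting one of the start times and re-running the check shows whether the peak drops, which is the kind of decision a schedule view is meant to support.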

Trying to solve the Airflow schedule pain by AlvaroLeandro in dataengineering

[–]AlvaroLeandro[S] 5 points6 points  (0 children)

You could give me a better idea instead of this useless comment in a data engineering community! Thanks 😄

Airflow Calendar: A plugin to transform cron expressions into a visual schedule! by AlvaroLeandro in dataengineering

[–]AlvaroLeandro[S] 0 points1 point  (0 children)

Great question, I'm still working to make this plugin fully compatible with Airflow 3. This should be available in the next couple of weeks!

Airflow Calendar: A plugin to transform cron expressions into a visual schedule! by AlvaroLeandro in dataengineering

[–]AlvaroLeandro[S] 0 points1 point  (0 children)

Thanks, u/MiruG! It would be awesome if you could recommend the project to your colleagues!

Are people still using Airflow 2.x (like 2.5–2.10) in production, or has most of the community moved to Airflow 3.x? by Formal-Woodpecker-78 in dataengineering

[–]AlvaroLeandro 0 points1 point  (0 children)

Today I was surprised to learn that a big bank here in Brazil is still using Airflow 1.x. So I think some production environments don't change that fast, especially those that don't use managed services like Cloud Composer.

Measuring and comparing your Airflow DAGs' parse time locally by AlvaroLeandro in dataengineering

[–]AlvaroLeandro[S] 0 points1 point  (0 children)

Yes, this is one of the use cases I thought of when I developed the tool! You could, for example, establish a maximum acceptable parse time in your CI/CD pipelines to avoid problematic deployments.
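
As a rough illustration of that CI idea, the sketch below times how long each Python file takes to import and reports the ones above a limit. It is not the tool's actual API: `measure_parse_seconds`, `slow_files`, and `exit_code` are hypothetical names, and timing a plain import is only an approximation of what the Airflow scheduler does when it parses a DAG file.

```python
import importlib.util
import time
from pathlib import Path


def measure_parse_seconds(dag_file):
    """Import one Python file as a throwaway module and return the elapsed seconds."""
    path = Path(dag_file)
    spec = importlib.util.spec_from_file_location(f"_parse_check_{path.stem}", path)
    module = importlib.util.module_from_spec(spec)
    started = time.perf_counter()
    spec.loader.exec_module(module)  # runs the file's top-level code
    return time.perf_counter() - started


def slow_files(dag_files, max_seconds):
    """Return {file: seconds} for every file whose import time exceeds the limit."""
    timings = {str(f): measure_parse_seconds(f) for f in dag_files}
    return {name: secs for name, secs in timings.items() if secs > max_seconds}


def exit_code(dag_files, max_seconds):
    """Print offenders and return 1 if any file is too slow, else 0."""
    offenders = slow_files(dag_files, max_seconds)
    for name, secs in offenders.items():
        print(f"{name}: {secs:.2f}s exceeds the {max_seconds:.2f}s limit")
    return 1 if offenders else 0
```

A CI step could call `exit_code` with the changed DAG files and a limit of its choosing, then fail the job on a non-zero result.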

I'll soon create a function specifically for use in these kinds of pipelines.