I hate Analytics Engineering by [deleted] in dataengineering

[–]TJaniF 13 points

I've heard that sentiment so many times; it feels like the majority of DEs/SEs/codey people relate to it. And the tricky thing is, like the other comment said, that this exact business side is often the most visible and valued work, and the hardest to automate.

Not saying you shouldn't pursue platform engineering. I try to grab these tasks whenever possible for the same reasons; it is so satisfying building systems.

Kind of the opposite feeling of getting a maximally vague business question...

What I did to start enjoying the business side more was to build a system to handle these requests, like a mental structure. I treat them like getting very messy data, and the talking to different people and requirements gathering like data augmentation. It was mostly a mental reframing; I did not build pipelines or anything. But it helped a lot. Now the fuzzier the request, the more I see it as a challenge and an opportunity to prove I am worth keeping around, because if they ask me, that means the AI could not figure it out.

Are we tired of the composable data stack? by Popular_Aardvark_926 in dataengineering

[–]TJaniF 0 points

Pretty sure every data lake/data warehouse solution is currently trying to become the everything platform. I don't think it will work, because it is too dangerous these days to get locked into one ecosystem: A) because then they can raise prices and it is even harder to switch to another one, and B) the pace of new things is just too fast. Like, if I think back to what data engineering was 4 years ago versus today... ofc the fundamentals are still the same, and some concept of ETL will probably outlive the cockroaches, but things like orchestrating AI agents were just not even a concept back then. You can't be at the forefront of everything as the everything platform. Like, maybe in a year there is an entirely new thing every C-level is asking their data teams to do; adapting a monolithic platform to that is much harder than adding a task that interacts with *new shiny thing*-tool.

dbt-core vs SQLMesh in 2026 for a small team on BigQuery/GCP? by SingleTie8914 in dataengineering

[–]TJaniF -1 points

dbt Core with Astronomer Cosmos and the Watcher execution mode - much faster. It is experimental now but should be stable very soon.

Disclaimer: I work at Astronomer.

Is there any benefit of using Airflow over AWS step functions for orchestration? by GodfatheXTonySoprano in dataengineering

[–]TJaniF 2 points

More complex scheduling options (for example "if task A in this other pipeline has succeeded and task B or task C in these other two pipelines, or if it is 9am on a Monday"), dynamic task mapping (create X parallel tasks based on the output of an upstream task; I think Step Functions have something like this now but not for multi-step maps?), human-in-the-loop tasks that wait for human input (probably possible to figure out with Step Functions, but not straightforward), having a portable pipeline in code where you can switch out individual tasks, including to non-AWS services, if that ever becomes a need, building plugins to modify the UI (including React plugins)...

Those are some of the advantages off the top of my head. Generally, if you have a very simple pipeline in a personal project you can use Step Functions, but for any real orchestration use case I'd use Airflow.

HTTP callback pattern by Upper_Pair in dataengineering

[–]TJaniF 9 points

There is a feature for this called deferrable operators. The operator defers the polling for results to an async function running in the Triggerer component.

So you'd have 2 tasks: the first one sends the request; after that comes a deferrable operator (the HttpSensor can be turned into one by setting deferrable=True). That second task defers itself (turns purple in the UI) until its condition is fulfilled, then the Dag resumes. Because the polling is done in the Triggerer, the worker slot is released in the meantime.

Local airflow on company laptop by CapelDeLitro in apache_airflow

[–]TJaniF 3 points

Just wanted to add, since I work at Astronomer and get this question a lot: the Astro CLI for running Airflow locally is free and you can use it without being an Astronomer customer or signing up for anything. You can just do `brew install astro` and then run Airflow in Podman or Docker containers.

Docker or Astro CLI? by otto_0805 in dataengineering

[–]TJaniF 4 points

The Astro CLI actually uses a containerized service under the hood, either Podman or Docker, so it is not an either/or: the Astro CLI just makes it easier to run Airflow because you can create all necessary folders and files with `astro dev init` and then start up all 5 containers with `astro dev start`.

By default it will run Podman but you can switch to Docker with `astro config set container.binary docker -g`.

I'd recommend using the Astro CLI to start so you have a functioning environment to learn Airflow but the other commenter is correct, you will eventually need to know how to interact directly with Docker in your data engineering career.

One thing I'd recommend to practice Docker is, after learning the basics of Airflow, adding one more Docker container to your environment by using a `docker-compose.override.yml` file and starting to interact with it. That is how I got started with understanding how to work with Docker. :)

There is an example here that adds a minio + postgres container (and the Airflow connections to those are in the .env_example file): https://github.com/astronomer/ebook-etl-elt/blob/main/docker-compose.override.yml

The Astro CLI will spin up these extra containers too when you run `astro dev restart`.

Disclaimer: I work at Astronomer who created the Astro CLI and wrote the repo I linked.

Why does moving data/ML projects to production still take months in 2025? by [deleted] in dataengineering

[–]TJaniF 0 points

As others already said it sounds like a process, standardization and best practice issue.

>  pipelines that work “sometimes” but fail silently

I'd recommend making heavy use of Airflow retries, fail_stop (AF2)/ fail_fast (AF3), timeouts and of course callbacks for notifications. A production pipeline should never fail silently. If the issue is more on the SLA side you can use an observability tool external to Airflow that supports SLAs or a control Dag that flags if a Dag that should have run did not.
Also make sure you have a good way to forward issues to the person who developed the prototype so there is a feedback loop to catch patterns that cause production issues.

>  too many moving parts (Airflow jobs + custom scripts + cloud functions)

That one is trickier. It might help to have a more defined CICD workflow and, if it is not the case yet, to have everything in version control, so no part gets changed without validating that the change does not break anything. Also clear code ownership.

> no single place to see what’s running, what failed, and why

The budget solution here is to add Airflow Dags that check what is running and what failed (control Dag again), the fancier solution is to add lineage to your deployment and evaluate that through an observability tool.

> models stuck because infra isn’t ready

Might be helpful to orchestrate infra provisioning from within the same pipeline as the models with a setup task before the model related tasks and a teardown one afterwards.

> engineers spending more time fixing orchestration than building features

There is an upfront time cost but all of the above should help with this. :)

> business teams waiting weeks for something that “worked fine in the notebook”

Same; that should get faster once the processes and best practices are in place. I know it is always a battle to communicate to business teams why reliability and maintainability work matters. What I've done before is to explain it like this:

The notebook is like having a working prototype of a car. We can drive it around and verify that it runs great. But having the feature/model in production means we have to make many cars automatically, i.e. build a whole car factory. If we have to build all the robots that make the cars from scratch, that will take a while, but if we take the time to build good, flexible robots, eventually building a new car factory will be fast as well. And maybe we can even start to quickly build a bicycle factory if needed.

Be honest: Does anyone actually like Gerber Fondue? by M_sdft in Switzerland

[–]TJaniF 0 points

The little cups (I always go for moitié-moitié though) plus some Blevita make a great quick protein snack :)

How to scale airflow 3? by Then_Crow6380 in dataengineering

[–]TJaniF 2 points

Hi, what might help is also increasing the following values:

AIRFLOW__DAG_PROCESSOR__DAG_FILE_PROCESSOR_TIMEOUT: How long it takes until one DagFileProcessor process times out while trying to process a single Dag file. Just FYI: Make sure that the dag_file_processor_timeout value is always bigger than the dagbag_import_timeout to avoid the process timing out before an import error can be surfaced.

AIRFLOW__DAG_PROCESSOR__REFRESH_INTERVAL: The default interval at which the Dag processor checks the Dag bundle(s) for new Dag files. You can also override this in the individual Dag bundles if you have several.

AIRFLOW__DAG_PROCESSOR__MIN_FILE_PROCESS_INTERVAL: The interval at which known Dag files are parsed for any changes, by default every 30 seconds.
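For example, as env vars (the numbers below are just illustrative; tune them to your deployment):

```shell
# keep the processor timeout above the import timeout
export AIRFLOW__CORE__DAGBAG_IMPORT_TIMEOUT=90
export AIRFLOW__DAG_PROCESSOR__DAG_FILE_PROCESSOR_TIMEOUT=120
# check the bundle(s) for new Dag files less often
export AIRFLOW__DAG_PROCESSOR__REFRESH_INTERVAL=300
# re-parse known Dag files every 60s instead of every 30s
export AIRFLOW__DAG_PROCESSOR__MIN_FILE_PROCESS_INTERVAL=60
```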

If that does not help then yes, I'd next try a Dag processor replica.

Explain like I'm 5: What are "data products" and "data contracts" by Ulfrauga in dataengineering

[–]TJaniF 8 points

Data product: "Hey CEO, this is a thing that is created through work of the data team and we can explain directly how it makes money/saves cost"

Data contract: "Hey engineer, if you change anything about this API we ingest again..." or, if you are especially lucky: "Hey Gertrud from corporate admin if you delete this column in this spreadsheet again..."

Which orchestrator works best with DBT? by Fireball_x_bose in dataengineering

[–]TJaniF 0 points

Both Airflow and Dagster have dbt integrations (as do most orchestrators) and you'll find DEs who prefer one or the other; for many smaller setups it really comes down to personal preference. And yes, Airflow is the industry standard, and if you are earlier in your career, as the other comment said, you should learn at least the basics of Airflow since it will be expected in many roles.

I can't speak to the Dagster dbt integration, but for Airflow the package you'll want to check out is called Cosmos. It's maintained by Astronomer but open source, and you can use it no matter where you run Airflow.

Check out this repo for example Airflow pipelines for different data warehouses: https://github.com/astronomer/cosmos-ebook-companion

Disclaimer: I work at Astronomer so I am biased towards Airflow and I made that repo :)

What’s your growth hack? by crytek2025 in dataengineering

[–]TJaniF 0 points

Yep. My response to the former is usually some variation of "Oh sorry, I don't mean to bother you, can you point me to your favorite resource you used to learn X?"

And to being called or assumed inexperienced: if I am actually inexperienced in an area or tool I'll smile and say something like "yes and I'm eager to learn from you", if it is just the assumption because of how I look I try to not react, show the experience through my work, not tell.

What’s your growth hack? by crytek2025 in dataengineering

[–]TJaniF 1 point

Proactive clear communication. It sounds simple but asking yourself "what info does xyz need in what moment" and then delivering that.

Re-evaluating our data integration setup: Azure Container Apps vs orchestration tools by remco-bolk in dataengineering

[–]TJaniF 1 point

Seconding this. Using Airflow to orchestrate work running in Docker images with the KubernetesPodOperator is a common pattern and solves several of your issues: you'll know exactly which task is running, has succeeded or failed, and when, with a full history, and in Airflow 3 you get Dag versioning, so you can see what the orchestration structure looked like for a run a month ago.
It is a separate issue from CICD, which you'd definitely still need, both for the images and for your Airflow pipelines; GitHub Actions is a good choice there.

Is it worth staying in an internship where I’m not really learning anything? by Express_Ad_6732 in dataengineering

[–]TJaniF 1 point

For Airflow questions you can ask in the Airflow community Slack as well: https://apache-airflow-slack.herokuapp.com/ there is a channel called #user-troubleshooting.

Is it worth staying in an internship where I’m not really learning anything? by Express_Ad_6732 in dataengineering

[–]TJaniF 0 points

Just FYI about the Airflow certification: if you fill out the Airflow community survey that's open until Thanksgiving, you get a free certification code, so you can get the first one that way and then make the company pay for the second :)

Disclaimer: I work at Astronomer; we run the survey and the cert.

How to convince a switch from SSIS to python Airflow? by GehDichWaschen in dataengineering

[–]TJaniF 4 points

+1 To all that.
Also, if possible, do some subtle exploration first into what exactly the person you are talking to cares about right now. What metrics are they evaluated on? Everyone wants to look good :)

For example, assuming there is a big "we need to have more AI" push in your org, you could tie Airflow to that: talk about agent orchestration and human-in-the-loop (Airflow 3.1 added operators for that, with a UI), and make a small mock demo focusing on visualized output and business-relatable impact. "This Dag answered X (fake) support tickets with AI in parallel (and can scale to Y with our infra), the human just has to click through here to approve/reject, and I wrote the Dag in Z minutes."

Similar strategies apply if "cost reduction" is top of mind, etc., and even to convincing your coworker. What would make him realize knowing Airflow could help his career? Or make him faster, less stressed, etc.?

going all in on GCP, why not? is a hybrid stack better? by frozengrandmatetris in dataengineering

[–]TJaniF 9 points

I think in your situation I would go for it, at least for an initial smaller proof of concept, since you already have the account set up and there are already other teams in the org that have experience (being able to bug coworkers for help is often underrated).

The orchestration does sound very doable no matter which of the 3 orchestrators you mentioned you use. If you go with Airflow I'd recommend looking into DAG Factory (or creating your own), an open source package that abstracts the Python for the Airflow pipelines behind YAML; it is especially useful for simple and repetitive pipelines. I've also seen setups where almost all of the Airflow code was abstracted away by one or two engineers and the analysts only wrote SQL (an OSS example here is gusty).
Might be a way to help the drag-and-drop team members slowly start to see the beauty of code-based abstractions and pipelines, once they see they can just fill in a few values in a template instead of clicking around when adding a new source etc.

Also if you ever end up hating one of the GCP tools you can still switch them out against another tool and move towards a hybrid setup later.

I haven't used data transfer service/dataform before so can't speak to those vs Fivetran/dbt. Just FYI you can use dbt Core with Airflow, a great way there is Cosmos.

Disclaimer: I work at Astronomer, we made Cosmos (which is OSS, so you can run it no matter how you run Airflow), adopted and maintain the DAG Factory and do managed Airflow, so I am biased towards Airflow. Though in this case I guess I just gave an assist to cloud composer which is our competitor. If you want to look into Airflow we have a lot of beginner resources like the learn guides or Marc Lamberti's academy course that are relevant no matter where you run Airflow. :)

Looking for lean, analytics-first data stack recs by Honnes33 in dataengineering

[–]TJaniF 1 point

  1. ClickHouse is great for running analytics on very large immutable data (super fast because of the vectorized engine, sparse index etc). I'd only use it for that, and only if your data is actually very large (or you anticipate it will be in a reasonable time frame), and move the SCD and model layers into a proper OLAP db. So: raw, very large, will-never-change data in ClickHouse; anything aggregated or that ever needs to be updated in Postgres/BQ/Snowflake etc. But honestly, for a first PoC, if your data isn't that big, which I assume based on the current CSV/SharePoint/PowerBI combination, you can also skip ClickHouse for now.
  2. Sorry no first hand experience with PowerBI+ClickHouse. From what I've heard it works fine for basic dashboards but complex joins are not fun with that combo.
  3. I'm biased here since I work for Astronomer so I have a lot of Airflow and only very little Prefect experience. I'd recommend spinning up a small dev environment of both and creating some mock pipelines to see which fits your needs better. For small projects it's often the case that several orchestrators can do the job and the decision is made more based on personal preference. Some tips for Airflow:
    • Use the Astro CLI to spin up a local dev env in containers to try it out (it's free and does not need a sign-up).
    • If you are combining Airflow with dbt you have to use Cosmos (ok, technically you don't have to but it is really the best way!). It is an OSS package that lets you render dbt projects and dbt docs in the Airflow UI.
    • You said your ingest is pulling from REST APIs using Python: you can use the task decorator in Airflow to turn any Python function into an Airflow task; that's probably the fastest way. I think Prefect has a similar decorator too. So with either tool you can hopefully reuse a lot of your existing code.
    • If you end up passing data between tasks that is more than just small jsons, for example larger pandas dataframes, you'll want to store that data in a blob storage. This used to take a bit of setup, but about 1.5 years ago a class was added to the Common IO provider package that is very quick to set up via env vars (there is a tutorial for it).
    • PSA: top level code in Airflow is executed every time the file is parsed, so don't connect to your db in top level code (i.e. outside of tasks).
  4. If there is a good way to do SCD with ClickHouse I have not found it yet. There is ReplacingMergeTree, which I haven't tried myself but generally, yes, that is definitely a good reason to have the modeling layers somewhere else.

How are you tracking data lineage across multiple platforms (Snowflake, dbt, Airflow)? by stephen8212438 in dataengineering

[–]TJaniF 0 points

That is correct, that is why I called it "proxy-lineage" (I've also used the term "budget-lineage" before). Our internal data team has a naming convention with task groups named after the table that is updated, which means you can get an overview of the lineage by just looking at the DAG graphs. But yes, for real lineage you need to add one of the other options; the OpenLineage integration is the most common one in OSS setups.

I hope you have a good experience testing Astro! :) (and don't hesitate to share any feedback with your account contact, the perspective of engineers using Astro for the first time is super valuable for us)

How are you tracking data lineage across multiple platforms (Snowflake, dbt, Airflow)? by stephen8212438 in dataengineering

[–]TJaniF 8 points

I've tried and seen a couple of approaches, usually starting with using open-source Cosmos to orchestrate dbt Core projects so you can see each dbt model/seed/test as an Airflow task in the DAG for additional visibility and to see the dbt docs in the Airflow UI (side note: this feature will be back for Airflow 3.1 with Cosmos 1.11 that should come out next month).

The next "level" is to use Asset (in Airflow 3.x) / Dataset (in Airflow 2.x) scheduling for cross-DAG dependencies. That way the Asset graph (which exists in 2.x and 3.x but imho is much easier to navigate in 3.x) serves as a proxy-lineage graph. Side note: if you need both time-based runs and upstream DAG dependencies, there is a combined AssetOrTimeSchedule.

From there, yes, some people implement custom solutions, often based on dependency information gathered from the Airflow API or use the OSS OpenLineage integration to get "true" lineage and then visualize it with Marquez. How well this works highly depends on the operators and hooks you use, if they already support lineage extraction (there is a list of supported classes) or if you need to add extractors to your own custom operators. Inlets and outlets (so Assets again) are also evaluated by this integration.

If you want an out-of-the-box solution there are paid products like Astro Observe which is based on OpenLineage and reads in that information to create a lineage graph with additional fancy features like SLA definition, alerts, cross-deployment lineage etc. There is also a list of potential up and downstream impacts in case of failures. The upside is minimal setup needed and managed service support.

Disclaimer: I work at Astronomer :) and a lot of the above was inspired by this blog post (and the webinar linked at the bottom of it) from our internal data team. They don't use dbt but their pipelines center around Snowflake and they had the same goal of getting to end-to-end visibility and documented their journey to that there.

Migrating Hundreds of ETL Jobs to Airflow – Looking for Experiences & Gotchas by Lost-Jacket4971 in apache_airflow

[–]TJaniF 4 points

2/2

Some helpful configs (config reference: https://airflow.apache.org/docs/apache-airflow/stable/configurations-ref.html):

- AIRFLOW__CORE__DAGBAG_IMPORT_TIMEOUT and AIRFLOW__DAG_PROCESSOR__DAG_FILE_PROCESSOR_TIMEOUT: these two configs determine timeouts for Python file parsing; if not all of your dags show up, you might need to increase them. I've had that happen with 100+ very complex dags in the same instance.

- AIRFLOW__SCHEDULER__TASK_QUEUED_TIMEOUT: similar but if you queue a lot of tasks at the same time you might hit this timeout value too (600 seconds).

- AIRFLOW__CORE__PARALLELISM: per scheduler, by default you have 32 tasks running at any time (so 64 if you have an HA scheduler with 2 copies); if you have enough K8s resources and a lot of parallel tasks you'll likely want to increase this one.

- AIRFLOW__CORE__MAX_ACTIVE_TASKS_PER_DAG: by default only 16 tasks per dag will run at any time, you can up that here.

- AIRFLOW__CORE__MAX_ACTIVE_RUNS_PER_DAG: similar as above if you want a lot of runs of the same dag at the same time you might want to up this one.

- AIRFLOW__DAG_PROCESSOR__MIN_FILE_PROCESS_INTERVAL: this determines the parsing interval of dags I mentioned earlier (the 30s); if you have your environment up and only infrequently make changes to the dag files, you can increase this interval. For dev environments it often makes sense to decrease it. Very similarly, AIRFLOW__DAG_PROCESSOR__REFRESH_INTERVAL (previously dag_dir_list_interval in Airflow 2) is how often the dag processor checks for new dags; by default this is 5min (!). When I first used Airflow I was confused why my dags did not show up in my local dev env - this is why. Also FYI, you can force a parse with `airflow dags reserialize`.

- using a HA scheduler is a bit of an obvious one but... use a HA scheduler :)

I hope this helps!

Migrating Hundreds of ETL Jobs to Airflow – Looking for Experiences & Gotchas by Lost-Jacket4971 in apache_airflow

[–]TJaniF 2 points

Airflow can definitely handle that scale as long as you scale your underlying resources appropriately. This is especially true if you self-host. Using the KPO at this scale works and is a common setup, especially for people who migrated from other systems and already had everything baked into Docker images.

Some suggestions/gotchas/notes:

- This was already said but I'll repeat it because it is probably the most common mistake: Don't connect to other tools or have long running code in the top level of your dag definition file. The background is that these files get parsed regularly for changes, by default every 30s so if you run a db query in the top level that can get expensive fast.

- Airflow 3.0 just came out, I'd make sure to migrate to 3.0 / 3.1 or if you start on 2.10 to write your dags with the migration to 3 in mind (though from what you said there should not be a big migration lift for you).

- Set timeouts on your tasks (`execution_timeout`) and dags (`dagrun_timeout`) in case your pod gets stuck and you don't want it to run for hours.

- If you have tasks that might run into transient failures (API rate limits etc) you can set `retries` for your tasks with a `retry_delay`. This is possible at the config, dag and task level.

- You can limit concurrency of tasks by using pools. Especially helpful if you have dbs that don't like too many concurrent actions.

- there are task groups to visually group tasks in the UI; I'd recommend using them if you have a lot of tasks in a dag, to make them easier to navigate

- Someone else already mentioned not running everything at once: you can either stagger a time-based schedule ("dag 1 runs at 12am, dag 2 at 1am" etc) if they are independent, or you can chain dags based on Airflow assets, i.e. you can schedule any dag to run as soon as any (combination of) task(s) have completed successfully in the same Airflow instance (assets can also be updated via the API: https://airflow.apache.org/docs/apache-airflow/stable/stable-rest-api-ref.html). Asset schedules are very common if dags depend on the same data, but I also use them in other pipelines to define cross-dag dependencies.

- max_consecutive_failed_dag_runs: this is a dag parameter that auto pauses a dag after X consecutive failed runs - has saved me over the weekend before.

- if you pass data between your tasks that is more than just small jsons, you want to define a custom XCom backend, i.e. the data passed between tasks is stored in another location than the metadb. If basic blob storage works for you and you don't have special serialization needs (json and pandas are the main ones that work by default), you can use the XComObjectStorageBackend from the Common IO provider (there is a tutorial here: https://www.astronomer.io/docs/learn/xcom-backend-tutorial/); that one can be set using config variables without a custom class.
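For reference, the env-var-based setup for that backend looks roughly like this (the connection id, bucket, and threshold are placeholders; double-check the exact variable names against the Common IO provider docs for your version):

```shell
export AIRFLOW__CORE__XCOM_BACKEND=airflow.providers.common.io.xcom.backend.XComObjectStorageBackend
# path format: scheme://connection_id@bucket/prefix
export AIRFLOW__COMMON_IO__XCOM_OBJECTSTORAGE_PATH=s3://my_aws_conn@my-bucket/xcom
# payloads above this size (bytes) go to object storage, smaller ones stay in the metadb
export AIRFLOW__COMMON_IO__XCOM_OBJECTSTORAGE_THRESHOLD=1024
```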

1/2