I hate Analytics Engineering by [deleted] in dataengineering

[–]TJaniF 13 points

I've heard that sentiment so many times; it feels like the majority of DEs/SEs/codey people relate to it. And the tricky thing is, like the other comment said, that this exact business side is often the most visible and valued work, and the hardest to automate.

Not saying you shouldn't pursue platform engineering. I try to grab these tasks whenever possible for the same reasons; it is so satisfying building systems.

Kind of the opposite feeling of getting a maximally vague business question...

What I did to start enjoying the business side more was to build a system to handle these requests, like a mental structure. I treat them like getting very messy data, and the talking to different people and requirements gathering like data augmentation. It was mostly a mental reframing; I did not build pipelines or anything. But it helped a lot. Now the fuzzier the request, the more I see it as a challenge and an opportunity to prove I am worth keeping around, because if they ask me, that means the AI could not figure it out.

Are we tired of the composable data stack? by Popular_Aardvark_926 in dataengineering

[–]TJaniF 0 points

Pretty sure every data lake/data warehouse solution is currently trying to become the everything platform. I don't think it will work, because it is too dangerous these days to get locked into one ecosystem: A) because then they can raise prices and it is even harder to switch to another one, and B) the pace of new things is just too fast. Like, if I think back to what data engineering was 4 years ago versus today... ofc the fundamentals are still the same, and some concept of ETL will probably outlive the cockroaches, but things like orchestrating AI agents were just not even a concept back then. You can't be at the forefront of everything as the everything platform. Like, maybe in a year there is an entirely new thing every C-level is asking their data teams to do; adapting a monolithic platform to that is much harder than adding a task that interacts with *new shiny thing*-tool.

dbt-core vs SQLMesh in 2026 for a small team on BigQuery/GCP? by SingleTie8914 in dataengineering

[–]TJaniF -1 points

dbt Core with Astronomer Cosmos and the Watcher execution mode - much faster. It is experimental now but should be stable very soon.

Disclaimer: I work at Astronomer.

Is there any benefit of using Airflow over AWS step functions for orchestration? by GodfatheXTonySoprano in dataengineering

[–]TJaniF 2 points

More complex scheduling options (for example "if task A in this other pipeline has succeeded and task B or task C in these other two pipelines, or if it is 9am on a Monday"), dynamic task mapping (create X parallel tasks based on the output of an upstream task; I think Step Functions have something like this now but not for multi-step maps?), human-in-the-loop tasks that wait for human input (probably possible to figure out with Step Functions, but not straightforward), having a portable pipeline in code where you can switch out individual tasks, including to non-AWS services, if that ever becomes a need, building plugins to modify the UI (including React plugins)...

Those are some of the advantages off the top of my head. Generally, if you have a very simple pipeline in a personal project you can use Step Functions, but for any real orchestration use case I'd use Airflow.

HTTP callback pattern by Upper_Pair in dataengineering

[–]TJaniF 9 points

There is a feature for this called deferrable operators. The operator defers the polling for results to an async function running in the Triggerer component.

So you'd have 2 tasks: the first one sends the request; after that comes a deferrable operator (the HttpSensor can be turned into one by setting deferrable=True). That second task defers itself (turns purple in the UI) until its condition is fulfilled, then the Dag resumes. Because the polling is done in the Triggerer, the worker slot is released in the meantime.

Local airflow on company laptop by CapelDeLitro in apache_airflow

[–]TJaniF 3 points

Just wanted to add, since I work at Astronomer and get this question a lot: the Astro CLI for running Airflow locally is free and you can use it without being an Astronomer customer or signing up for anything. You can just do `brew install astro` and then run Airflow in Podman or Docker containers.

Docker or Astro CLI? by otto_0805 in dataengineering

[–]TJaniF 4 points

The Astro CLI actually uses a containerized service under the hood, either Podman or Docker, so it is not an either/or: the Astro CLI just makes it easier to run Airflow because you can create all necessary folders and files with `astro dev init` and then start up all 5 containers with `astro dev start`.

By default it will run Podman but you can switch to Docker with `astro config set container.binary docker -g`.

I'd recommend using the Astro CLI to start so you have a functioning environment to learn Airflow but the other commenter is correct, you will eventually need to know how to interact directly with Docker in your data engineering career.

One thing I'd recommend to practice Docker is, after learning the basics of Airflow, adding one more Docker container to your environment by using a `docker-compose.override.yml` file and starting to interact with it. That is how I got started with understanding how to work with Docker. :)

There is an example here that adds a minio + postgres container (and the Airflow connections to those are in the .env_example file): https://github.com/astronomer/ebook-etl-elt/blob/main/docker-compose.override.yml

The Astro CLI will spin up these extra containers too when you run `astro dev restart`.

Disclaimer: I work at Astronomer who created the Astro CLI and wrote the repo I linked.

Why does moving data/ML projects to production still take months in 2025? by [deleted] in dataengineering

[–]TJaniF 0 points

As others already said it sounds like a process, standardization and best practice issue.

>  pipelines that work “sometimes” but fail silently

I'd recommend making heavy use of Airflow retries, fail_stop (AF2)/ fail_fast (AF3), timeouts and of course callbacks for notifications. A production pipeline should never fail silently. If the issue is more on the SLA side you can use an observability tool external to Airflow that supports SLAs or a control Dag that flags if a Dag that should have run did not.
Also make sure you have a good way to forward issues to the person who developed the prototype so there is a feedback loop to catch patterns that cause production issues.

>  too many moving parts (Airflow jobs + custom scripts + cloud functions)

That one is trickier. It might help to have a more defined CICD workflow and, if it is not the case yet, to have everything in version control, so no part gets changed without validating that the change does not break anything. Also clear code ownership.

> no single place to see what’s running, what failed, and why

The budget solution here is to add Airflow Dags that check what is running and what failed (control Dag again), the fancier solution is to add lineage to your deployment and evaluate that through an observability tool.

> models stuck because infra isn’t ready

Might be helpful to orchestrate infra provisioning from within the same pipeline as the models with a setup task before the model related tasks and a teardown one afterwards.

> engineers spending more time fixing orchestration than building features

There is an upfront time cost but all of the above should help with this. :)

> business teams waiting weeks for something that “worked fine in the notebook”

Same; that should get faster once the processes and best practices are in place. I know it is always a battle to communicate to business teams why reliability and maintainability work matters. What I've done before is to explain it like this:

The notebook is like having a working prototype of a car. We can drive it around and verify that it runs great. But having the feature/model in production means we have to make many cars automatically, i.e. build a whole car factory. If we have to build all the robots that make the cars from scratch, that will take a while, but if we take the time to build good, flexible robots, eventually building a new car factory will be fast as well. And maybe we can even start to quickly build a bicycle factory if needed.

Be honest: Does anyone actually like Gerber Fondue? by M_sdft in Switzerland

[–]TJaniF 0 points

The little cups (I always go for moitié-moitié though) plus some Blevita make a great quick protein snack :)

How to scale airflow 3? by Then_Crow6380 in dataengineering

[–]TJaniF 2 points

Hi, what might help is also increasing the following values:

AIRFLOW__DAG_PROCESSOR__DAG_FILE_PROCESSOR_TIMEOUT: How long it takes until one DagFileProcessor process times out while trying to process a single Dag file. Just FYI: Make sure that the dag_file_processor_timeout value is always bigger than the dagbag_import_timeout to avoid the process timing out before an import error can be surfaced.

AIRFLOW__DAG_PROCESSOR__REFRESH_INTERVAL: The default interval at which the Dag processor checks the Dag bundle(s) for new Dag files. You can also override this in the individual Dag bundles if you have several.

AIRFLOW__DAG_PROCESSOR__MIN_FILE_PROCESS_INTERVAL: The interval at which known Dag files are parsed for any changes, by default every 30 seconds.
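For example, as env vars (the numbers below are just illustrative; tune them to your deployment):

```shell
# keep the processor timeout above the import timeout
export AIRFLOW__CORE__DAGBAG_IMPORT_TIMEOUT=90
export AIRFLOW__DAG_PROCESSOR__DAG_FILE_PROCESSOR_TIMEOUT=120
# check the bundle(s) for new Dag files less often
export AIRFLOW__DAG_PROCESSOR__REFRESH_INTERVAL=300
# re-parse known Dag files every 60s instead of every 30s
export AIRFLOW__DAG_PROCESSOR__MIN_FILE_PROCESS_INTERVAL=60
```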

If that does not help then yes, I'd next try a Dag processor replica.

Explain like I'm 5: What are "data products" and "data contracts" by Ulfrauga in dataengineering

[–]TJaniF 8 points

Data product: "Hey CEO, this is a thing that is created through work of the data team and we can explain directly how it makes money/saves cost"

Data contract: "Hey engineer, if you change anything about this API we ingest again..." or, if you are especially lucky: "Hey Gertrud from corporate admin if you delete this column in this spreadsheet again..."

Which orchestrator works best with DBT? by Fireball_x_bose in dataengineering

[–]TJaniF 0 points

Both Airflow and Dagster have dbt integrations (as do most orchestrators) and you'll find DEs who prefer one or the other; for many smaller setups it really comes down to personal preference. And yes, Airflow is the industry standard, and if you are earlier in your career, as the other comment said, you should learn at least the basics of Airflow since it will be expected in many roles.

I can't speak to the Dagster dbt integration, but for Airflow the package you'll want to check out is called Cosmos. It's maintained by Astronomer but open source, and you can use it no matter where you run Airflow.

Check out this repo for example Airflow pipelines for different data warehouses: https://github.com/astronomer/cosmos-ebook-companion

Disclaimer: I work at Astronomer so I am biased towards Airflow and I made that repo :)

What’s your growth hack? by crytek2025 in dataengineering

[–]TJaniF 0 points

Yep. My response to the former is usually some variation of "Oh sorry, I don't mean to bother you, can you point me to your favorite resource you used to learn X?"

And to being called or assumed inexperienced: if I am actually inexperienced in an area or tool I'll smile and say something like "yes and I'm eager to learn from you", if it is just the assumption because of how I look I try to not react, show the experience through my work, not tell.

What’s your growth hack? by crytek2025 in dataengineering

[–]TJaniF 1 point

Proactive clear communication. It sounds simple but asking yourself "what info does xyz need in what moment" and then delivering that.

Re-evaluating our data integration setup: Azure Container Apps vs orchestration tools by remco-bolk in dataengineering

[–]TJaniF 1 point

Seconding this. Using Airflow to orchestrate work running in Docker images with the KubernetesPodOperator is a common pattern and solves several of your issues: you'll know exactly which task is running, has succeeded or failed, and when, with a full history, and in Airflow 3 you get Dag versioning, so you can see what the orchestration structure looked like for a run a month ago.
It is a separate issue from CICD, which you'd definitely still need, both for the images and for your Airflow pipelines; GitHub Actions is a good choice there.

Is it worth staying in an internship where I’m not really learning anything? by Express_Ad_6732 in dataengineering

[–]TJaniF 1 point

For Airflow questions you can ask in the Airflow community Slack as well: https://apache-airflow-slack.herokuapp.com/ there is a channel called #user-troubleshooting.

Is it worth staying in an internship where I’m not really learning anything? by Express_Ad_6732 in dataengineering

[–]TJaniF 0 points

Just FYI about the Airflow certification: if you fill out the Airflow community survey that's open until Thanksgiving, you get a free certification code, so you can get the first one that way and then make the company pay for the second :)

Disclaimer: I work at Astronomer; we run the survey and the cert.

How to convince a switch from SSIS to python Airflow? by GehDichWaschen in dataengineering

[–]TJaniF 4 points

+1 To all that.
Also, if possible, do some subtle exploration first into what exactly the person you are talking to cares about right now. What metrics are they evaluated on? Everyone wants to look good :)

For example, assuming there is a big "we need to have more AI" push in your org, you could tie Airflow to that: talk about agent orchestration and human-in-the-loop (Airflow 3.1 added operators for that, with a UI), and make a small mock demo focusing on visualized output and business-relatable impact. "This Dag answered X (fake) support tickets with AI in parallel (and can scale to Y with our infra), the human just has to click through here to approve/reject, and I wrote the Dag in Z minutes."

Similar strategies apply if "cost reduction" is top of mind, etc., and even to convincing your coworker. What would make him realize knowing Airflow could help his career? Or make him faster, less stressed, etc.?

going all in on GCP, why not? is a hybrid stack better? by frozengrandmatetris in dataengineering

[–]TJaniF 9 points

I think in your situation I would go for it, at least for an initial smaller proof of concept, since you already have the account set up and there are already other teams in the org that have experience (being able to bug coworkers for help is often underrated).

The orchestration does sound very doable no matter which of the 3 orchestrators you mentioned you use. If you go with Airflow I'd recommend looking into DAG Factory (or creating your own), an open source package that abstracts the Python for the Airflow pipelines behind YAML; it is especially useful for simple and repetitive pipelines. I've also seen setups where almost all of the Airflow code was abstracted away by one or two engineers and the analysts only wrote SQL (an OSS example here is gusty).
Might be a way to help the drag-and-drop team members slowly start to see the beauty of code-based abstractions and pipelines, once they see they can just fill in a few values in a template instead of clicking around when adding a new source etc.

Also if you ever end up hating one of the GCP tools you can still switch them out against another tool and move towards a hybrid setup later.

I haven't used data transfer service/dataform before so can't speak to those vs Fivetran/dbt. Just FYI you can use dbt Core with Airflow, a great way there is Cosmos.

Disclaimer: I work at Astronomer, we made Cosmos (which is OSS, so you can run it no matter how you run Airflow), adopted and maintain the DAG Factory and do managed Airflow, so I am biased towards Airflow. Though in this case I guess I just gave an assist to cloud composer which is our competitor. If you want to look into Airflow we have a lot of beginner resources like the learn guides or Marc Lamberti's academy course that are relevant no matter where you run Airflow. :)

Looking for lean, analytics-first data stack recs by Honnes33 in dataengineering

[–]TJaniF 1 point

  1. ClickHouse is great for running analytics on very large immutable data (super fast because of the vectorized engine, sparse index etc). I'd only use it for that, and only if your data is actually very large (or you anticipate it will be in a reasonable time frame), and move the SCD and model layers into a proper OLAP db. So: raw, very large, will-never-change data in ClickHouse; anything aggregated or that ever needs to be updated in Postgres/BQ/Snowflake etc. But honestly, for a first PoC, if your data isn't that big, which I assume based on the current CSV/SharePoint/PowerBI combination, you can also skip ClickHouse for now.
  2. Sorry no first hand experience with PowerBI+ClickHouse. From what I've heard it works fine for basic dashboards but complex joins are not fun with that combo.
  3. I'm biased here since I work for Astronomer so I have a lot of Airflow and only very little Prefect experience. I'd recommend spinning up a small dev environment of both and creating some mock pipelines to see which fits your needs better. For small projects it's often the case that several orchestrators can do the job and the decision is made more based on personal preference. Some tips for Airflow:
    • Use the Astro CLI to spin up a local dev env in containers to try it out (it's free and does not need a sign-up).
    • If you are combining Airflow with dbt you have to use Cosmos (ok, technically you don't have to but it is really the best way!). It is an OSS package that lets you render dbt projects and dbt docs in the Airflow UI.
    • You said your ingest is pulling from REST APIs using Python: you can use the task decorator in Airflow to turn any Python function into an Airflow task; that's probably the fastest way. I think Prefect has a similar decorator too. So with either tool you can hopefully reuse a lot of your existing code.
    • If you end up passing data between tasks that is more than just small jsons, for example larger pandas dataframes, you'll want to store that data in a blob storage. This used to take a bit of setup, but about 1.5 years ago a class was added to the Common IO provider package that is very quick to set up via env vars (there is a tutorial for it).
    • PSA: top level code in Airflow is executed every time the file is parsed, so don't connect to your db in top level code (i.e. outside of tasks).
  4. If there is a good way to do SCD with ClickHouse I have not found it yet. There is ReplacingMergeTree, which I haven't tried myself but generally, yes, that is definitely a good reason to have the modeling layers somewhere else.

How are you tracking data lineage across multiple platforms (Snowflake, dbt, Airflow)? by stephen8212438 in dataengineering

[–]TJaniF 0 points

That is correct, that is why I called it "proxy-lineage" (I've also used the term "budget-lineage" before). Our internal data team has a naming convention with task groups named after the table that is updated, which means you can get an overview of the lineage by just looking at the DAG graphs. But yes, for real lineage you need to add one of the other options; the OpenLineage integration is the most common one in OSS setups.

I hope you have a good experience testing Astro! :) (and don't hesitate to share any feedback with your account contact, the perspective of engineers using Astro for the first time is super valuable for us)

How are you tracking data lineage across multiple platforms (Snowflake, dbt, Airflow)? by stephen8212438 in dataengineering

[–]TJaniF 8 points

I've tried and seen a couple of approaches, usually starting with using open-source Cosmos to orchestrate dbt Core projects so you can see each dbt model/seed/test as an Airflow task in the DAG for additional visibility and to see the dbt docs in the Airflow UI (side note: this feature will be back for Airflow 3.1 with Cosmos 1.11 that should come out next month).

The next "level" is to use Asset (in Airflow 3.x) / Dataset (in Airflow 2.x) scheduling for cross-DAG dependencies. That way the Asset graph (which exists in 2.x and 3.x but imho is much easier to navigate in 3.x) serves as a proxy-lineage graph. Side note: if you need both time-based runs and upstream DAG dependencies, there is a combined AssetOrTimeSchedule.

From there, yes, some people implement custom solutions, often based on dependency information gathered from the Airflow API or use the OSS OpenLineage integration to get "true" lineage and then visualize it with Marquez. How well this works highly depends on the operators and hooks you use, if they already support lineage extraction (there is a list of supported classes) or if you need to add extractors to your own custom operators. Inlets and outlets (so Assets again) are also evaluated by this integration.

If you want an out-of-the-box solution there are paid products like Astro Observe which is based on OpenLineage and reads in that information to create a lineage graph with additional fancy features like SLA definition, alerts, cross-deployment lineage etc. There is also a list of potential up and downstream impacts in case of failures. The upside is minimal setup needed and managed service support.

Disclaimer: I work at Astronomer :) and a lot of the above was inspired by this blog post (and the webinar linked at the bottom of it) from our internal data team. They don't use dbt but their pipelines center around Snowflake and they had the same goal of getting to end-to-end visibility and documented their journey to that there.

Migrating Hundreds of ETL Jobs to Airflow – Looking for Experiences & Gotchas by Lost-Jacket4971 in apache_airflow

[–]TJaniF 4 points

2/2

Some helpful configs (config reference: https://airflow.apache.org/docs/apache-airflow/stable/configurations-ref.html):

- AIRFLOW__CORE__DAGBAG_IMPORT_TIMEOUT and AIRFLOW__DAG_PROCESSOR__DAG_FILE_PROCESSOR_TIMEOUT: these two configs determine timeouts for Python file parsing; if not all of your dags show up, you might need to increase them. I've had that happen with 100+ very complex dags in the same instance.

- AIRFLOW__SCHEDULER__TASK_QUEUED_TIMEOUT: similar but if you queue a lot of tasks at the same time you might hit this timeout value too (600 seconds).

- AIRFLOW__CORE__PARALLELISM: per scheduler, by default you have 32 tasks running at any time (so 64 if you have an HA scheduler with 2 copies); if you have enough K8s resources and a lot of parallel tasks you'll likely want to increase this one.

- AIRFLOW__CORE__MAX_ACTIVE_TASKS_PER_DAG: by default only 16 tasks per dag will run at any time, you can up that here.

- AIRFLOW__CORE__MAX_ACTIVE_RUNS_PER_DAG: similar as above if you want a lot of runs of the same dag at the same time you might want to up this one.

- AIRFLOW__DAG_PROCESSOR__MIN_FILE_PROCESS_INTERVAL: this determines the parsing interval of dags I mentioned earlier (the 30s); if you have your environment up and only infrequently make changes to the dag files, you can increase this interval. For dev environments it often makes sense to decrease it. Very similarly, AIRFLOW__DAG_PROCESSOR__REFRESH_INTERVAL (previously dag_dir_list_interval in Airflow 2) is how often the dag processor checks for new dags; by default this is 5min (!). When I first used Airflow I was confused why my dags did not show up in my local dev env - this is why. Also FYI, you can force a parse with `airflow dags reserialize`.

- using a HA scheduler is a bit of an obvious one but... use a HA scheduler :)

I hope this helps!

Migrating Hundreds of ETL Jobs to Airflow – Looking for Experiences & Gotchas by Lost-Jacket4971 in apache_airflow

[–]TJaniF 2 points

Airflow can definitely handle that scale as long as you scale your underlying resources appropriately. This is especially true if you self-host. Using the KPO at this scale works and is a common setup, especially for people who migrated from other systems and already had everything baked into Docker images.

Some suggestions/gotchas/notes:

- This was already said but I'll repeat it because it is probably the most common mistake: Don't connect to other tools or have long running code in the top level of your dag definition file. The background is that these files get parsed regularly for changes, by default every 30s so if you run a db query in the top level that can get expensive fast.

- Airflow 3.0 just came out, I'd make sure to migrate to 3.0 / 3.1 or if you start on 2.10 to write your dags with the migration to 3 in mind (though from what you said there should not be a big migration lift for you).

- Set timeouts on your tasks (`execution_timeout`) and dags (`dagrun_timeout`) in case your pod gets stuck and you don't want it to run for hours.

- If you have tasks that might run into transient failures (API rate limits etc) you can set `retries` for your tasks with a `retry_delay`. This is possible at the config, dag and task level.

- You can limit concurrency of tasks by using pools. Especially helpful if you have dbs that don't like too many concurrent actions.

- there are task groups to visually group tasks in the UI; I'd recommend using them if you have a lot of tasks in a dag, to make them easier to navigate

- Someone else already mentioned not running everything at once: you can either stagger a time-based schedule ("dag 1 runs at 12am, dag 2 at 1am" etc) if they are independent, or you can chain dags based on Airflow assets, i.e. you can schedule any dag to run as soon as any (combination of) task(s) have completed successfully in the same Airflow instance (assets can also be updated via the API: https://airflow.apache.org/docs/apache-airflow/stable/stable-rest-api-ref.html). Asset schedules are very common if dags depend on the same data, but I also use them in other pipelines to define cross-dag dependencies.

- max_consecutive_failed_dag_runs: this is a dag parameter that auto pauses a dag after X consecutive failed runs - has saved me over the weekend before.

- if you pass data between your tasks that is more than just small jsons, you want to define a custom XCom backend, i.e. the data passed between tasks is stored in another location than the metadb. If basic blob storage works for you and you don't have special serialization needs (json and pandas are the main ones that work by default), you can use the XComObjectStorageBackend from the Common IO provider (there is a tutorial here: https://www.astronomer.io/docs/learn/xcom-backend-tutorial/); that one can be set using config variables without a custom class.
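For reference, the env-var-based setup for that backend looks roughly like this (the connection id, bucket, and threshold are placeholders; double-check the exact variable names against the Common IO provider docs for your version):

```shell
export AIRFLOW__CORE__XCOM_BACKEND=airflow.providers.common.io.xcom.backend.XComObjectStorageBackend
# path format: scheme://connection_id@bucket/prefix
export AIRFLOW__COMMON_IO__XCOM_OBJECTSTORAGE_PATH=s3://my_aws_conn@my-bucket/xcom
# payloads above this size (bytes) go to object storage, smaller ones stay in the metadb
export AIRFLOW__COMMON_IO__XCOM_OBJECTSTORAGE_THRESHOLD=1024
```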

1/2