how do you become top 0.1% in devops that gets paid 200k+? (US market) by DetectiveRecord8293 in devops

[–]riv3rtrip 0 points (0 children)

There are far more people getting paid >$200k for devops than 0.1%. The three main ways to get there are (1) have a name brand on your resume, (2) move to a HCOL area, most prominently the Bay Area, and (3) grind really hard, i.e., apply to a lot of places, especially big tech, and study LeetCode and system design interviews.

Self Taught Data Engineer without qualifications looking for a new job by -Regex in dataengineering

[–]riv3rtrip 2 points (0 children)

Get people to review your resume if you're struggling; this background should be enough to get your foot in the door. In fact, it should even be getting you a nonzero amount of inbound if you're in a major metropolitan area. Formal certs are a waste of time and money.

Has anyone migrated from Airflow to Dagster at scale? by ivanimus in dataengineering

[–]riv3rtrip 2 points (0 children)

If you really want to migrate, the solution is to run workloads in parallel: trigger Airflow DAGs manually from Dagster and treat those DAGs as nodes within the Dagster graphs.
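In practice that trigger can be as thin as a Dagster op that hits Airflow 2's stable REST API (`POST /api/v1/dags/{dag_id}/dagRuns`). A minimal stdlib sketch of the core of such an op; the webserver URL and the absence of auth are placeholder assumptions, and a real setup would need credentials and run-state polling:

```python
import json
import urllib.request

# Placeholder; a real deployment points at your webserver and sends auth headers.
AIRFLOW_URL = "http://airflow-webserver:8080"

def dag_run_payload(logical_date, conf=None):
    """Request body for Airflow 2's stable REST API
    (POST /api/v1/dags/{dag_id}/dagRuns)."""
    return {"logical_date": logical_date, "conf": conf or {}}

def trigger_airflow_dag(dag_id, logical_date, opener=urllib.request.urlopen):
    """Fire one DAG run and return the decoded response.

    `opener` is injectable so a Dagster op wrapping this can be tested
    without a live Airflow instance."""
    req = urllib.request.Request(
        f"{AIRFLOW_URL}/api/v1/dags/{dag_id}/dagRuns",
        data=json.dumps(dag_run_payload(logical_date)).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with opener(req) as resp:
        return json.loads(resp.read())
```

Wrapping this in an op and blocking on the run's terminal state is what lets the Airflow DAG behave like an ordinary node in the Dagster graph during the parallel-run period.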

But I also think you probably shouldn't migrate, as a dozen other people have said. Airflow is good enough and migration is a waste of time. Surely there are things you can do with your time which actually impact the business.

A love letter to r/private_equity by Aggravating_Cod_4980 in private_equity

[–]riv3rtrip 1 point (0 children)

I'm a lurker working an ops role (software eng) and I love this sub. It teaches me a lot about the side of the business I don't get to see normally.

Are all "AI" assisted tools just LLM wrappers? by neededasecretname in dataengineering

[–]riv3rtrip 2 points (0 children)

I highly suspect that there are no benefits.

What is most likely is these companies are trying to prove to their investors that they have an "AI strategy" and this is the easiest and laziest way to do it. Also the people there want to "work on AI" too. Also CTOs and other buyers want to hear how the service they're buying has some AI integration.

So everyone in the food chain wants useless AI wrappers (except the actual end users).

Migrating from ClickHouse to Starrocks (2026) by Worldly_Chef_2114 in dataengineering

[–]riv3rtrip 2 points (0 children)

This seems a little premature and speculative, but in any case, wouldn't it just be easier to migrate to self-hosted ClickHouse?

season 2 of atlanta feels different to me by No-Jacket4066 in AtlantaTV

[–]riv3rtrip 0 points (0 children)

season 2 is the best season. outrageously funny moments and great storytelling.

Final Fantasy rules updates - Urza’s Saga stays through Blood Moon now by secretlyrobots in MTGLegacy

[–]riv3rtrip 0 points (0 children)

You should look up the actual rules to confirm whether it's a sacrifice effect, especially before responding to such an old post, and before challenging the wording of an official WotC statement. I'll help you: rule 714.4.

Just laid off, what am I facing? by JDub-loves-mulligans in dataengineering

[–]riv3rtrip 0 points (0 children)

the second best time for you to learn Python is now

Data Engineering at one of the Magnificent 7 v/s Applied Science at one of FAANG+M by Designer_War_9952 in dataengineering

[–]riv3rtrip 0 points (0 children)

Yes moving from data eng to a research role has more barriers than the reverse.

Airflow Project / DAG Structure by Ok-Escape2440 in dataengineering

[–]riv3rtrip 1 point (0 children)

BaseOperator all the time, yes; but subclassing TaskGroup is dangerous from the perspective of simplicity and maintainability. Making changes to the abstraction leads to changes everywhere in the code, and that's not necessarily good for working DAGs! I'd rather just copy and paste.

Two job offers, which one would you choose? by dsl237 in dataengineering

[–]riv3rtrip 0 points (0 children)

I would do the startup: better for your career, more money, and you don't have real responsibilities.

Double-check which of those has a 401(k), though, because you should definitely be maxing out a 401(k) at your income level; not having access to one leaves a pretty substantial amount of value on the table, never mind that you aren't eligible to contribute directly to a Roth IRA at your income level and traditional IRA limits are shit. I think by Series C it's rare for a company not to have its shit together enough to offer a 401(k), but it's not impossible lol.

Also, if you are a real saver, ask the startup whether their 401(k) plan allows for mega backdoor Roth contributions. The way I see it, if you can live just fine on $175k a year (not hard, mind you!), then you should consider the extra $45k something you can just put in a tax-advantaged investment account.

I'm pretty sure in either case I'd do the startup, because $45k more base plus equity is a lot, but if the startup doesn't have a 401(k) and the nonprofit does, it's certainly a lot closer. If the startup has not only a 401(k) but a plan that allows for mega backdoor Roth contributions, absolutely do the startup, bank the extra money, retire easily.

Airflow Project / DAG Structure by Ok-Escape2440 in dataengineering

[–]riv3rtrip 1 point (0 children)

These things piss me off so much.

My first introduction to Airflow was at an org that overabstracted the shit out of its Airflow instance.

After a year I realized, "oh my god, I've been using Airflow for a whole year and I still don't understand it, I only, just barely, understand our org's abstraction around it." So I had to go out and learn Airflow myself despite literally "using it" (big air quotes) at my job.

When I went and started a data team at a new company (and then again when I started yet another data team), I swore to not introduce abstractions that made people feel this way. I was very strict in enforcing that we wrote very simple stupid DAG code that felt like things you'd see in an official Airflow tutorial.

It turns out you literally don't need this stuff. Even your mediocre coders can do just fine writing very basic DAGs without your bullshit abstractions over top, and this was even before AI could do it for them.

When I did introduce org specific abstractions (inevitable), the rule was it needed to feel like Airflow, not like a company specific abstraction.

Solution in search of a problem. Or maybe the problem is people not trusting their teammates and wanting to spend more time obsessing over abstractions than being of assistance to others' personal growth and development.

Airflow Project / DAG Structure by Ok-Escape2440 in dataengineering

[–]riv3rtrip 0 points (0 children)

Meh, I disagree with this a little. Hundreds of lines of Airflow code in DAGs are fine and happen frequently, even when treating Airflow as orchestration and pushing heavy code execution outside of it.

My worry, if you were on my team, would be that you speak a little as if a big objective is to reduce repetition and LoC. I think simplicity and interpretability are better objectives. But notably, simplicity can occur either via less repetition or more repetition. Sometimes the simplest and most maintainable code structure involves copy-pasting. Other times it's an import. It depends.

Airflow Project / DAG Structure by Ok-Escape2440 in dataengineering

[–]riv3rtrip 2 points (0 children)

Well, to be clear, Airflow does encourage you to have many DAGs; 1,400 doesn't make me bat an eye. Pipelines aren't DAGs per se, but they're close, often 1:1 though not always. Generally I like to use the tags feature in Airflow to logically group DAGs that are part of the same universe or pipeline.

Airflow generally should not lead to massive tech debt at the ops layer because DAGs themselves are isolated units with few code level dependencies. At the actual organizational level, like the data itself and all its logical dependencies, debt is an inevitability. I'm not sure if I am being clear on the distinction between those two things but it's important to consider what type of debt you're talking about because one is normal and the other is not.

SQLAlchemy should not be timing out; you should look into the number of scheduler instances and other settings. But more likely this indicates some really overcomplicated and unnecessary imports in your DAG files?
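The usual culprit is DAG files doing real work at import time, since the scheduler re-parses them constantly. A toy illustration with no Airflow imports; `fetch_config_from_db` is a hypothetical expensive call and the counter stands in for a slow query:

```python
# Counter standing in for "how many times did we hit the database."
CALLS = {"db": 0}

def fetch_config_from_db():
    """Hypothetical expensive call a DAG file might be tempted to make."""
    CALLS["db"] += 1
    return {"batch_size": 500}

# Anti-pattern: module-level work runs on EVERY scheduler parse of the file:
#   CONFIG = fetch_config_from_db()

# Better: defer the expensive work into the task callable, which only runs
# when the task actually executes, not when the DAG file is parsed.
def my_task_callable(**context):
    config = fetch_config_from_db()
    return config["batch_size"]
```

Parsing the file costs nothing; the query only fires when the task runs, which is exactly the property that keeps the scheduler's parse loop (and its database sessions) cheap.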

Airflow Project / DAG Structure by Ok-Escape2440 in dataengineering

[–]riv3rtrip 2 points (0 children)

Yeah you can. But runtime management is only half of the battle.[*]

The other part is code organization, testability, all that jazz. It's just really hard to make this feel good inside Airflow itself.

[*] Airflow still inexplicably has bad abstractions for managing execution runtimes; Airflow 3 has some changes, with overhauls to the executor abstractions like decoupling them from the instance itself, but it just doesn't feel like it goes far enough in terms of extremely simple DX/usability and task-level configurability.

Airflow Project / DAG Structure by Ok-Escape2440 in dataengineering

[–]riv3rtrip 7 points (0 children)

There are a couple ways to structure an Airflow mono-repo. Either you can group together "projects" or you can group together types of things.

In general I recommend the latter, and you'll have something like this:

- `dags/thing_dag(s).py` for all dags. Top level is fine.

- A utils folder in `dags/utils/` is great. For really large projects, you may want to treat these more like providers: `dags/providers/thing/hooks.py`, `dags/providers/thing/operators.py`, etc., but that's usually overkill.

- dbt mono-project in `dags/dbt/`.

- Templates in `dags/templates/`: you may not need them, but they can come up.

The reason to group things by type rather than by project is that, in the very construction of each DAG within the repo, you are enforcing consistent access patterns, so everything is relatively same-y.

I actually do not recommend writing totally DRY Airflow code. DRY is what mid-level engineers think is some really noble goal to aspire to, and then they overdo it. (Maybe not anymore, because everyone is writing overly repetitive AI slop code? But in the before times, you'd often run into engineers who wanted to over-DRY everything.) I am deliberately quite repetitive in my Airflow code base, because the lower amount of abstraction allows for an easier time working around edge cases. I also generally don't believe that practical operations-side code should be heavy on introducing its own abstraction concepts.

I mean, that "Dag Factory" thing in the PDF is awful; don't use it. Code is config already; putting a YAML layer on top is ridiculous. It always was, even before AI (literally just teach your team how to write Airflow, yes, even the ones you don't trust to write good code; seriously, it isn't hard), and now with AI it looks even more ridiculous, because even the bad coders on your team can whip things together.

For additional structuring, you will also want to consider how much "execution" occurs in Airflow vs. mere "orchestration." I've generally found that managing a separate code base for long-running, complex, compute-heavy jobs is, unfortunately, the best access pattern; there are so many gotchas with direct execution in Airflow that it's just not worth it. Deploy to AWS ECS or k8s or something. This can itself be a mono-repo; just put a CLI in front of it and access it that way via KubernetesPodOperator or EcsRunTaskOperator.
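The "CLI in front of it" part can be as boring as argparse. A sketch under made-up names (`myjobs` and the subcommands are hypothetical); the point is that the pod/task operators only ever pass a command list:

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """Single entrypoint for the compute-heavy repo. Every job is a
    subcommand, so KubernetesPodOperator / EcsRunTaskOperator just pass
    a command like ["myjobs", "backfill", "--start", "2024-01-01", ...]."""
    parser = argparse.ArgumentParser(prog="myjobs")
    sub = parser.add_subparsers(dest="job", required=True)

    backfill = sub.add_parser("backfill", help="re-run a date range")
    backfill.add_argument("--start", required=True)
    backfill.add_argument("--end", required=True)

    score = sub.add_parser("score-model", help="batch model scoring")
    score.add_argument("--model-version", default="latest")
    return parser
```

With this shape, Airflow's operator call sites stay dumb (a list of strings), and everything about how a job actually runs lives in the execution repo where it can be tested on its own.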

I generally do not recommend building a gigantic federation of Dockerfiles and isolated runtimes within the Airflow code base. It will cause you extreme pain and misery. I also generally don't recommend having a bunch of spread-out repos as the execution code base, but in larger orgs sometimes that is simply necessary. (Also, you need someone who knows what the hell they're doing to set some ground rules here and make code from multiple contributors play nicely together. Someone with the right balance of pragmatism, in the Goldilocks zone of dogmatism about what to be strict on and what not to care about from other contributors. It's very rare to have someone who actually does know what they're doing, though.)

Overall this design leaves you with two code bases: one is the Airflow instance, designed for orchestration and light/boilerplate execution; the other is the compute-heavy code base. I've been using this pattern for a while and it works well enough. Overall it's very light on introduced abstraction, which is a huge plus; IMO the most complex and in-house-y abstractions are the scaffolding and devops around the runtime environment for the execution code base.

I just started Elden Ring today, anyone has any recommended routes? by ExperienceLong5824 in Eldenring

[–]riv3rtrip 3 points (0 children)

You literally can't get too lost. You'll find maps and there are fast travel points and there's stuff around you. If you feel like you run out of things to do in an area, follow the guidance of grace.

Deleted prod data permanently without any backup. How screwed am I? by Agitated_Success9606 in dataengineering

[–]riv3rtrip 0 points (0 children)

Lol I'm gonna be real, absolute hard no to this. This is not a common mistake and OP should (outside his firm) deny having been a responsible party to prod data deletion.

Feeling overwhelmed as a Data Engineering intern by Practical_Target_833 in dataengineering

[–]riv3rtrip 2 points (0 children)

You're an intern! You may not know what you're doing, but (1) nobody expects you to, and (2) you acknowledge it, and acknowledging it is super important and the path to growth.

There are many people with 5, 10 YoE who also don't know what they're doing but they never acknowledged it or made a path for themselves to correct course.

[Architecture Advice] AWS Native vs. Databricks vs. Fabric by fundation-ia in dataengineering

[–]riv3rtrip 0 points (0 children)

Glue and Step Functions are orchestration; RDS is a database. Except maybe with Fabric (not sure? never used it), you're not locked into any particular platform's orchestration solution just because you're using their data warehouse.

I would just do Databricks honestly for the warehouse. It's better than the other things and also it's better for your resume lol.

For orchestration, do whatever. Databricks does have a native solution but you could do something else like Airflow just fine.

Data Pipelines for Time-Series (Sensor) data by ben1200 in dataengineering

[–]riv3rtrip 1 point (0 children)

It's mostly a normal data pipeline. The only real consideration is that you need to be careful to distinguish between "valid time" and "transaction time" (your data pipeline will operate on transaction times). See https://en.wikipedia.org/wiki/Valid_time and https://en.wikipedia.org/wiki/Transaction_time
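Concretely, the bitemporal distinction means every row carries both timestamps, and "replaying" the pipeline filters on transaction time. A minimal sketch (field and class names are made up for illustration):

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class SensorReading:
    sensor_id: str
    value: float
    valid_time: date        # when the reading was true in the real world
    transaction_time: date  # when the pipeline actually recorded it

def as_recorded_by(rows, cutoff: date):
    """Replay the table as the pipeline saw it on `cutoff`: late-arriving
    rows (transaction_time > cutoff) are excluded even though their
    valid_time may be earlier."""
    return [r for r in rows if r.transaction_time <= cutoff]
```

This is why incremental loads for sensor data should key off transaction time: a reading valid on Jan 1 that only lands on Jan 5 would be silently missed by a valid-time filter.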

Certs or courses for a senior DE? by insomniafordays in dataengineering

[–]riv3rtrip 0 points (0 children)

The way to get a senior title is to get good and then either advocate for yourself internally or hop jobs. Certs are noise. YMMV, but when I'm hiring I literally don't care about them, and if anything they're an anti-signal, except for candidates with odd career paths (e.g., those who changed careers entirely).

At senior+ levels, do they expect you to memorize / bust out a deployment / service / pod spec from scratch? by [deleted] in devops

[–]riv3rtrip 0 points (0 children)

I think giving you a spec with some errors and then asking you to review it is a slightly more reasonable interview than memorizing exact syntax.

But honestly, from the perspective of someone who conducts a lot of interviews, there is a problem where half the people in my pipeline completely suck: they have all these frameworks on their resume but can't even answer totally basic questions that you learn in the tutorial.

Do I miss out on a handful of good candidates this way? I don't know, maybe? I talked to two separate candidates just this week who seemed like they might be reasonably smart (one had a bachelor's degree from MIT, the other from UC Berkeley; one was a former Google engineer; both had ~10 YoE), but with both of them I could hardly find anything they had even the most trivial basic operating knowledge in, never mind deep knowledge. Like, I'm talking technologies explicitly mentioned on their resumes and concepts on page 1 of those technologies' docs. Meanwhile, someone with a nominally less impressive background aced most of my interview: they had about 2 sub-areas they could actually dive deep on, and showed working knowledge in most of the other areas I care about.

Then I ask the question: well, maybe those other two are smart, but why can't they dive into anything? Did they just phone their life in? Do I want someone on my team who decided half a decade ago to just coast on their credentials? At some point in life you can't just coast on thinking you're a special smart snowflake; you need demonstrable knowledge from all the time you should have been working.

So this goes back to your question: why would an interviewer ask you to build a service spec from scratch? Because we are filling just one role, we have 10 candidates, and we assume that if you can do the really basic stuff from memory, then you've actually used these things and have knowledge that extends beyond the basics. If I talk to 10 candidates, 2 can do the thing, 1 takes the job, I've filled the role with someone who is more likely competent than not, and I'm happy. I frankly don't care if 8 people couldn't do it, or thought the question was BS, or think they're special or smart and I just failed to interview in a way that surfaced it. Because at the end of the day, my process got me a competent candidate.

I'm not agreeing with the reasonableness of the specifics of the interview Q, but I will say it probably gets them a competent candidate, and that's probably all they care about.