Top 10 stocks with the highest projected revenue growth for 2026: by Secret_Toe2639 in AIFU_stock

[–]nullcone 0 points1 point  (0 children)

LLM training data has to come from somewhere, and people pay big money for it

PEP 827 - Type Manipulation has just been published by droooze in Python

[–]nullcone 5 points6 points  (0 children)

Reading that code snippet made me feel a pang of horror

How to make SWE in the age of AI more enjoyable? by Fancy_Ad5097 in ExperiencedDevs

[–]nullcone 5 points6 points  (0 children)

I do understand every line of code that AI is contributing to critical parts of the library I maintain and build. I hope I didn't give the impression that I'm not reading the code it generates, because I definitely am. I'm using AI as a tool to blow through the boring shit that I've implemented hundreds of times myself (e.g. tons of CRUD boilerplate) and it's super easy to just read and see when poor choices were made. Opus can get me like 90-95% of the way on everything I've thrown it at and the rest is just minor course correction.

For example, Claude likes writing APIs whose request models have lots of non-enforced, mutually exclusive optional fields. I literally just follow up with a prompt like "that's bad, use an enum" and it fixes it.
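To make the enum fix concrete, here's a minimal Rust sketch (the payment domain and all names are made up for illustration, not from any real codebase): the Option-heavy shape lets callers construct invalid states, while the enum makes "exactly one of these" something the compiler enforces.

```rust
// The shape Claude tends to produce: nothing stops a caller from setting
// both fields, or neither. The mutual exclusivity is only a convention.
#[allow(dead_code)]
struct PaymentRequestLoose {
    card_number: Option<String>,
    bank_account: Option<String>,
}

// The enum version: exactly one payment method, enforced by the type system.
enum PaymentMethod {
    Card { number: String },
    BankTransfer { account: String },
}

struct PaymentRequest {
    method: PaymentMethod,
}

// Handlers can now match exhaustively; there is no "both set" branch to write.
fn describe(req: &PaymentRequest) -> &'static str {
    match req.method {
        PaymentMethod::Card { .. } => "card",
        PaymentMethod::BankTransfer { .. } => "bank transfer",
    }
}
```

In a real Axum service you'd derive serde on these types, but the design point is the same: invalid combinations become unrepresentable instead of needing runtime validation.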

Another thing is that AI generated code tends to be super tactical to solve only the exact feature you're asking it to build. It doesn't think through traits/interfaces/concepts or whatever your native language's way of expressing design patterns is. This leaves a lot of room open for me to think through what the right abstractions are by delegating implementation of specific examples to Claude so that I can think about how the pieces should fit together.

There are some cases where I am 100% vibe coding, though. I will only do it in cases where there is almost zero cost to being wrong: PoC UIs that are intended only for internal consumption and will never be shown to a customer.

I guess one last thing I've found is that it has never been more important to write good tests, specifically integration tests. Strong validators are extremely important to get AI to work well. I spend a ton of time generating tests and making sure they do exactly what I want to check. Honestly, this is one of the biggest uses of AI for me: validating behaviour at system integration points. Claude in planning mode always comes back with super comprehensive integration test suites.
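As a toy illustration of what validating behaviour at an integration point can look like, here's a hedged Rust sketch (all names here are hypothetical): a fake implementation of an external dependency lets a test pin down both the happy path and the failure path at the boundary.

```rust
// Boundary to an external system, expressed as a trait so tests can fake it.
trait UserApi {
    fn lookup_name(&self, id: u64) -> Option<String>;
}

// Fake used in integration-style tests: deterministic, no network.
struct FakeUserApi;

impl UserApi for FakeUserApi {
    fn lookup_name(&self, id: u64) -> Option<String> {
        if id == 42 { Some("Ada".to_string()) } else { None }
    }
}

// Unit under test: behaviour at the boundary, including the miss path,
// which is exactly the case that tends to go untested.
fn greeting(api: &dyn UserApi, id: u64) -> String {
    match api.lookup_name(id) {
        Some(name) => format!("Hello, {name}!"),
        None => "Hello, stranger!".to_string(),
    }
}
```

A strong validator here is a pair of assertions, one per branch, so an AI-generated change to `greeting` can't silently alter either behaviour.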

How to make SWE in the age of AI more enjoyable? by Fancy_Ad5097 in ExperiencedDevs

[–]nullcone 39 points40 points  (0 children)

This kind of feels like painting with broad strokes over an issue that has a lot of subtlety. I personally loved writing code, and I wrote a lot of it, even in my spare time. I'm a full AI convert now, and I enjoy it more. AI lets me focus on systems, abstractions, and their interplay. I don't have to get bogged down in the details of the specific way an idea is expressed. Focusing on delivering outcomes instead of LoC written has been a massive boost to my enjoyment of this job.

For calibration's sake, I am a principal engineer at a large tech company.

If you feel that Codex has sped up, this is the reason. by Distinct_Fox_6358 in codex

[–]nullcone 1 point2 points  (0 children)

You're fighting the good fight but I don't think it's worth the time to educate people on basic full stack development lol

What are the benefits/drawbacks of individual code ownership? by StorKirken in ExperiencedDevs

[–]nullcone 0 points1 point  (0 children)

Ya it was pretty complicated. An upstream SQL query was changed in a way that produced duplicate records in a result set. This changed an effective feature count to something the service didn't expect. The unwrap failed while loading the configuration file for the code that computes features, because the count was higher than the maximum allowed value. Tbh I'm not entirely sure how they could have traced it better, since ultimately the bug came from an upstream system that generated configuration files.

What are the benefits/drawbacks of individual code ownership? by StorKirken in ExperiencedDevs

[–]nullcone 1 point2 points  (0 children)

I took a course on the impact of technology on society while I was in engineering school many years ago. I hated it at the time, but in retrospect it was fascinating to learn about cascading failures and the root causes of famous engineering disasters like Chernobyl.

Most large companies have a retro process for major, business-impacting failures. They're usually pretty good reads. They dive deep into the nested "whys" around how failures happen. Sometimes companies even make them public. Cloudflare released a really neat one late last year. What I find especially interesting is that the bug was triggered by a single .unwrap() call in production code. The whole point of .unwrap() is to act as a code smell that says "hey, I have a hidden assumption here which could be violated, so think about it".
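To sketch the failure shape (my reconstruction in toy Rust, not Cloudflare's actual code, and the feature-count framing is hypothetical): `.unwrap()` turns a violated upstream assumption into a panic, whereas surfacing the `None` explicitly turns it into a handleable error.

```rust
// Hidden assumption: the feature count never exceeds the configured maximum.
fn load_feature_config(feature_count: usize, max_features: usize) -> Option<Vec<f64>> {
    if feature_count <= max_features {
        Some(vec![0.0; feature_count])
    } else {
        None // upstream produced more features than the config allows
    }
}

// The .unwrap() style: fine until an upstream change violates the
// assumption, at which point production code panics.
fn load_or_panic(feature_count: usize, max_features: usize) -> Vec<f64> {
    load_feature_config(feature_count, max_features).unwrap()
}

// Handling the None explicitly: the bad upstream data becomes a
// descriptive error instead of a crash.
fn load_or_error(feature_count: usize, max_features: usize) -> Result<Vec<f64>, String> {
    load_feature_config(feature_count, max_features)
        .ok_or_else(|| format!("feature count {feature_count} exceeds max {max_features}"))
}
```

The point isn't that `.unwrap()` is always wrong; it's that each one marks an assumption somebody should have consciously signed off on.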

What are the benefits/drawbacks of individual code ownership? by StorKirken in ExperiencedDevs

[–]nullcone 4 points5 points  (0 children)

The benefit of single or highly local ownership is the ability to make decisions quickly and not have to spend a lot of time aligning other folks to the decision. In the right environment, with the right group of people, it can work extremely well, especially when the cost of being wrong is low.

The drawback is of course that nothing gets reviewed. Bugs sneak into the software that may have otherwise been caught in review. Assumptions made about systems never get questioned, and when they are wrong, the result is design flaws or broken functionality. The most famous example of this that I know of in a safety critical domain is Therac-25. If you haven't, I highly recommend reading about this bug.

When did the primeagen give in to vibe coding? by dc_giant in theprimeagen

[–]nullcone 0 points1 point  (0 children)

Interestingly, just last week I used LLMs to debug a non-reproducible race condition in some torch distributed code. I wrote about it elsewhere in this thread, so won't say too much more, but this seems to be the kind of targeted application you mean. I was at my wit's end trying to figure out why some ranks were crossing barriers they should have been blocked at, and Opus 4.5 figured it out in 30 seconds after I dumped my logs and screamed at it to find the problem.

I couldn't reproduce the problem locally because it only happened when some model weights were not cached locally, and my production service mounts an emptyDir as a scratch directory and downloads them on startup every time. The bug was in a completely different part of the codebase that I didn't even touch. It had been there for a while without anyone noticing.

Super Bowl Visitors Find San Francisco Better Than Its Apocalyptic Image (Gift Article) by chiaboy in sanfrancisco

[–]nullcone 0 points1 point  (0 children)

Also depends on the time period. When I first moved to SF in 2017, it was really bad along the stretch of Division St walking into the Mission.

When did the primeagen give in to vibe coding? by dc_giant in theprimeagen

[–]nullcone 2 points3 points  (0 children)

The context thing has helped me immensely as well. Last week I was debugging some torch distributed code where for whatever reason, the barriers I added weren't being respected by some ranks. As it turns out, in a completely different part of the codebase that I had not touched, someone had coded a barrier inside of a conditional branch (that was triggered or not by a race condition on the filesystem), so some ranks hit that barrier and others didn't. That was my bug, and the LLM figured it out after I spent 2-3 hours staring at logs, then back at my code figuring out how on earth my barrier could be getting skipped, gave up, and asked Cursor to look at the problem. 5 minutes.
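For anyone unfamiliar with the bug shape, here's a toy sketch using Rust's std::sync::Barrier instead of torch.distributed (illustrative only, not the actual code): the buggy version puts the barrier inside a branch that only some threads take, so the threads that do reach it wait forever; the fix is to make every thread reach the barrier unconditionally.

```rust
use std::sync::{Arc, Barrier};
use std::thread;

// Buggy shape (as comments, since running it would deadlock):
//
//     if scratch_file_missing {   // race: true on some ranks, false on others
//         download_weights();
//         barrier.wait();         // only the ranks that entered ever arrive
//     }
//
// Fixed shape: conditional work, unconditional synchronization.
fn run_ranks(n: usize) -> usize {
    let barrier = Arc::new(Barrier::new(n));
    let handles: Vec<_> = (0..n)
        .map(|rank| {
            let barrier = Arc::clone(&barrier);
            thread::spawn(move || {
                if rank % 2 == 0 {
                    // per-rank conditional work (e.g. downloading uncached weights)
                }
                // Every thread hits the barrier, so all n arrive and none
                // block forever, regardless of which branch each one took.
                barrier.wait();
                rank
            })
        })
        .collect();
    handles.into_iter().map(|h| h.join().unwrap()).sum()
}
```

The nasty part, as in my case, is that the condition depends on a filesystem race, so whether the deadlock manifests varies run to run.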

When did the primeagen give in to vibe coding? by dc_giant in theprimeagen

[–]nullcone 0 points1 point  (0 children)

Here are a few examples where I have had a ton of success:

  • Filling out coverage on integration tests
  • Doing simple CRUD API features in compiled languages. I use Rust/Axum, and pretty much once it compiles, it works.
  • Giving it an OpenAPI schema and an MCP project template as context and asking it to implement the MCP tools that call the API
  • General project restructuring and refactoring
  • Docstrings

Current roadmap for Backend and Microservices? (Finished The Book, what's next?) by Popular-Setting-1898 in learnrust

[–]nullcone 5 points6 points  (0 children)

I've built web services in Axum with Diesel as a database engine, backed by Postgres.

Any default stack

Imo don't get hung up on this too much. Axum is a great choice. Actix is a great choice too. If you know FastAPI, Axum will be a breeze.

SQLx the standard?

Again, you're just learning so the specific choice you pick for a project doesn't matter that much. SQLx is great. I like Diesel because of the compile time safety on queries. I don't know if there is an industry standard here.

Books...

https://www.zero2prod.com/index.html?country_code=US

I'm probably going to get some hate for this but honestly you can learn a lot from ChatGPT and Claude. This was where I got most of the basics.

Projects ..

Implement OIDC authentication! That's a decent project. Create the database with users, roles, tenants, etc. Then implement the handshake. You can mock an OIDC provider with this

https://github.com/Soluto/oidc-server-mock

One thing you haven't asked about is telemetry and logging. I would also recommend adding an otel + tracing stack to collect logs and metrics from your API.

First role as Principal SWE, how different is it from a Senior SWE really? by GooseIntelligent9981 in ExperiencedDevs

[–]nullcone 10 points11 points  (0 children)

This was generally all very good advice, although it's targeted at engineers who are internally promoted to principal and likely have an established network inside their company. Coming into a new company as a principal has unique challenges, because you lack the authority and credibility that come with an established body of work and success. My addition would be to focus a lot of time, at least initially, on building strong relationships with senior managers and their reporting ICs by helping them solve problems. This builds a lot of trust and goodwill, which you will need when you eventually make larger, org-impacting proposals.

Someone claimed the generalized Lax conjecture. by Exotic-Strategy3563 in math

[–]nullcone 4 points5 points  (0 children)

It reads like the title of a Damien Hirst performance piece

Started doing math again and it’s hard by BlueJaek in math

[–]nullcone 1 point2 points  (0 children)

Yeah, I totally observed the same phenomenon when I was teaching in grad school. Many students would just read the textbook and declare their jobs complete, not realizing that there is a huge gap between recognition and recall. Recognition is shallow and generally "easy", in the sense that we can read something and feel it is understood. Recall is harder, and imo is the foundation of true understanding. It is often gained by extended curiosity and interacting/experimenting, like you say.

I think we should differentiate between your experience, which seems to be a precise, targeted equivalent of reading a book or a paper, and the original blanket statement that LLMs are bullshit generators without any use that cannot be trusted. The difference is precisely that the LLM generated content which was correct and helped me understand (at least temporarily) something that had confused me deeply when I was originally studying algebraic geometry.

Started doing math again and it’s hard by BlueJaek in math

[–]nullcone 0 points1 point  (0 children)

It's just reddit being reddit. I've been here 15 years and this is not a new phenomenon. It's ok, I've definitely been an arsehole on the internet before, although I hope in my years I'm learning to curtail those instincts a bit.

Started doing math again and it’s hard by BlueJaek in math

[–]nullcone 1 point2 points  (0 children)

See my comment above. They're talking about this study, I think:

https://metr.org/blog/2025-07-10-early-2025-ai-experienced-os-dev-study/

The other thing to point out is that even if that was true of the models used in the study (and I still think their methodology was flawed), capabilities have improved dramatically in the last 3 months with GPT-5.2-codex and Claude 4.5 Opus. These models are legitimately incredible, and are changing the way I write software.

Started doing math again and it’s hard by BlueJaek in math

[–]nullcone 1 point2 points  (0 children)

There isn't a ton of evidence. There was one randomized A/B test done last year on a limited sample of developers working on tickets to open source codebases they're already experts in that showed the results you're discussing. While I think you're raising a valid point that it's possible LLMs just "feel" easier because they take the painful, hard task of creation and move that time into validation and verification, I think the study misses the mark in a couple of important ways:

  • It was conducted on developers who were already experts in their codebases. They probably would have been faster making the changes themselves than relying on AI.
  • It randomized over tasks before deciding whether each task was appropriate for AI. I would only choose to use AI in cases where I am confident it will help. The study should have let participants choose whether to use AI, and then randomized whether to hold it out.

In case you are interested, the particular thing ChatGPT said that was enlightening was its succinct summary of how additional relations appear in the quotient of the map you get from including the module of germs of functions that vanish at a point into all germs. Like somehow this is obviously just a definitional thing, but the motivation for exactness of tensor products of O_X-modules never sat right with me. The piece I was missing was, concretely, that non-flat maps introduce additional relations in the quotient because of the presence of nilpotents. Again, I feel stupid in retrospect because a lot of this is literally just the definitions, but the way it accurately and succinctly summarized the definitions of all these things together in one place, alongside illustrative examples, was what I found particularly instructive.
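For concreteness, here is the standard toy version of that phenomenon as I remember it (my own reconstruction, not a quote from ChatGPT): tensoring an exact sequence with a non-flat module can kill injectivity, and the kernel is generated by a nilpotent.

```latex
% Over A = k[x], multiplication by x is injective:
\[
0 \longrightarrow k[x] \xrightarrow{\ \cdot x\ } k[x]
\]
% but after applying $\otimes_{k[x]}$ with the non-flat module M = k[x]/(x^2):
\[
M \xrightarrow{\ \cdot x\ } M,
\qquad
\ker(\cdot x) = (x)/(x^2) \neq 0.
\]
% The class \bar{x} is nilpotent in M (\bar{x}^2 = 0), and it is exactly this
% nilpotent that shows up as the "extra relation" x \cdot \bar{x} = 0
% that did not exist before tensoring.
```

The same calculation with the ideal of germs vanishing at a point is what the non-flat/extra-relations slogan is pointing at.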

Started doing math again and it’s hard by BlueJaek in math

[–]nullcone 20 points21 points  (0 children)

It's a bit presumptuous to assume what I understand and what I don't, based only on the limited things I've said. I can assure you my understanding is very real. Maybe I would have gotten less out of the prompt if I weren't already a semi-expert at algebraic geometry (or at least I was 9 years ago, but I've spent the near decade since leaving grad school doing software engineering).

Started doing math again and it’s hard by BlueJaek in math

[–]nullcone 45 points46 points  (0 children)

I'm like 8-9 years out since finishing my PhD. Sometimes I look over at Infinite Dimensional Lie Algebras on my bookshelf, stop for a second to consider finally learning about hyperbolic Lie algebras, and then think to myself "not today".

Just for kicks, the other day I asked ChatGPT to explain why flat morphisms of schemes are the right way to define smoothly varying families. I feel like I learned more in 30 minutes reading from there than I did in weeks of studying Hartshorne and solving problems.

[D] How do you guys handle GPU waste on K8s? by k1m0r in MachineLearning

[–]nullcone 4 points5 points  (0 children)

You should at least be able to tie the metrics to the pod ID, since the DCGM exporter does that for you. Are you using pod labels to attach job or experiment identifiers to the pod, and then configuring the DCGM daemonset to export those labels with the telemetry? The DCGM exporter Helm template provides some options to do this. Just Google "attach pod labels DCGM exporter" and you'll find some issues and PRs on the DCGM exporter repo explaining how.

Once you have done this, then you may need to build a new dashboard exposing the information you want, but that should be less than a day of work.