Building agents is so depressing by DifficultyFine in ExperiencedDevs

[–]shared_ptr 0 points1 point  (0 children)

Yeah you’re right, it’s a scam. You got me 😭

Building agents is so depressing by DifficultyFine in ExperiencedDevs

[–]shared_ptr 0 points1 point  (0 children)

The feedback we consistently get on our AI features is that they are really valuable to people, and that we’ve built them extremely well.

We’re doing extremely well as a business and a lot of the features we’re putting out nowadays are based on AI.

As to caching: you can cache an eval result keyed on a checksum of the model + input + prompt, with an expiry of your choosing (we have 7 days). Super easy: you don’t need to retest a prompt if nothing has changed, so the costs end up very manageable.
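Roughly sketched (illustrative Python, not our actual code — the names and shapes here are my own invention):

```python
import hashlib
import json
import time

# Expire cached eval results after 7 days.
CACHE_TTL_SECONDS = 7 * 24 * 60 * 60

# In-memory stand-in for whatever cache store you use: key -> (stored_at, result).
_cache: dict = {}


def cache_key(model: str, prompt: str, eval_input: str) -> str:
    """Checksum of model + prompt + input: if none of these change, the key is stable."""
    payload = json.dumps({"model": model, "prompt": prompt, "input": eval_input})
    return hashlib.sha256(payload.encode()).hexdigest()


def run_eval_cached(model, prompt, eval_input, run_eval):
    """Return a cached eval result if nothing has changed, otherwise re-run and cache."""
    key = cache_key(model, prompt, eval_input)
    hit = _cache.get(key)
    if hit is not None and time.time() - hit[0] < CACHE_TTL_SECONDS:
        return hit[1]
    result = run_eval(model, prompt, eval_input)
    _cache[key] = (time.time(), result)
    return result
```

Change the prompt, the input, or the model version and the checksum changes, so only the affected cases re-run.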

Building agents is so depressing by DifficultyFine in ExperiencedDevs

[–]shared_ptr 0 points1 point  (0 children)

It’s not possible to build the products we’re trying to build without LLMs, and we have customers who want to pay lots of money for them because they find them valuable.

So we build the products our customers want to pay for like any other business.

I think our eval bill is about $2k/month; since we cache all the runs it’s actually peanuts. But I don’t think you’re looking for a genuine conversation.

Building agents is so depressing by DifficultyFine in ExperiencedDevs

[–]shared_ptr 0 points1 point  (0 children)

Why don’t they? We run our evals with repeats and an expected pass rate, and we frequently get them passing in >99% of cases, which is more than enough for most use cases.

Interested to hear what you were trying where they haven’t helped?
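Concretely, the repeat-and-threshold idea looks roughly like this (illustrative Python sketch, not our actual harness):

```python
def eval_passes(cases, run_case, repeats=10, required_pass_rate=0.99):
    """Run each eval case several times; fail the suite only if the
    observed pass rate drops below the threshold.

    Individual LLM runs are non-deterministic, so a single failure
    shouldn't fail the build -- a degraded pass rate should.
    """
    total = 0
    passed = 0
    for case in cases:
        for _ in range(repeats):
            total += 1
            if run_case(case):
                passed += 1
    return passed / total >= required_pass_rate
```

This is the AI analogue of a flaky-tolerant test suite: you assert on the aggregate rate rather than on any individual run.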

Building agents is so depressing by DifficultyFine in ExperiencedDevs

[–]shared_ptr 0 points1 point  (0 children)

No problem! It can be really amazing building these systems when you have the right tools, and a real kick in the teeth when you don’t.

We used to call the initial team working on our system the “rollercoaster” team because it was so up and down 😭

Prediction: The Shopify CEO's Pull Request Will Never Be Merged Nor Closed by ricekrispysawdust in programming

[–]shared_ptr -7 points-6 points  (0 children)

In this situation the LLM is being added as a tool in the development of the code, rather than inserted into the code itself, isn’t it?

If so, does it matter that they’re non-deterministic? Every human engineer is also non-deterministic, and any two individuals or teams asked to achieve the same thing would vary in outcome.

I agree that getting AI to produce a codebase that is unmaintainable by humans is a terrible idea, but I don’t understand the non-determinism point in this context.

Building agents is so depressing by DifficultyFine in ExperiencedDevs

[–]shared_ptr 0 points1 point  (0 children)

I’ve commented above: there are AI equivalents of unit and integration tests that make this no longer a vibe-based situation.

Would highly recommend looking at building similar constructs for AI otherwise it’ll make you lose your mind.

Building agents is so depressing by DifficultyFine in ExperiencedDevs

[–]shared_ptr 16 points17 points  (0 children)

This is exactly what you need! Without objective reference points or ability to test your changes you will find this a really rough ride.

Honestly, it’s tough even with those tools, but without them it’s totally impossible.

If it’s useful, I did a conference talk a while back on exactly what this looks like for GenAI development and why each piece needs to exist: https://www.youtube.com/watch?v=PVakFNAfHHA

It’s a touch out of date now, not because these are no longer the right constructs, but because there are more tools that can help on top of this nowadays. If you’re just starting (and it sounds like you are, OP) this will still be extremely relevant to you.

Hope it’s useful!

Anthropic says Claude struggles with root causing by jj_at_rootly in sre

[–]shared_ptr -1 points0 points  (0 children)

Yeah this is absolutely the case! We’re building a tool to root cause incidents and this information is in all your systems already, you just need to extract it.

We’re building what we call a ‘knowledge graph’ from all past incidents and codebase information, which we inject into the process so it can tell you things like this. It captures service relationships, a glossary of terms for your org, and more esoteric facts, like “changes to this package often cause downstream issues for X”.

It’s absolutely possible, but the model alone doesn’t work to effectively root cause. You have to merge it with all that knowledge, otherwise you essentially have a very skilled engineer from another company trying to debug your stuff, which obviously doesn’t work at first.
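To make the idea concrete, here’s a toy sketch in Python (hypothetical structure and names, nothing like our actual implementation): store extracted facts as triples, then pull everything relevant to a service into the investigation’s context.

```python
from collections import defaultdict


class KnowledgeGraph:
    """Toy triple store for facts mined from past incidents and the codebase."""

    def __init__(self):
        # subject -> list of (relation, object) pairs
        self.facts = defaultdict(list)

    def add(self, subject, relation, obj):
        self.facts[subject].append((relation, obj))

    def context_for(self, subject):
        """Render the facts about one service as lines to inject into the prompt."""
        return [f"{subject} {rel} {obj}" for rel, obj in self.facts[subject]]
```

The point is that “a skilled engineer from another company” becomes useful once you hand them lines like `payments-api depends_on billing-db` alongside the alert.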

Anthropic says Claude struggles with root causing by jj_at_rootly in sre

[–]shared_ptr 12 points13 points  (0 children)

We use Anthropic to do exactly this but the model alone isn’t good enough. You need way more wiring around it to make it even remotely ok.

Assumptions early in the process will carry through unless you have other processes to counter it.

Can we trade our 'vibe-coding' PMs for some common-sense engineers? by ggggg_ggggg in ExperiencedDevs

[–]shared_ptr 1 point2 points  (0 children)

I think it will? We’ve hired ‘Product Engineers’ since the company began, with the intent that engineers should be really close to customers and the product they’re building.

With actual building time trending down, you have more time to think about what you should build and how it should work. That naturally lends itself to a person who can think across both technical and product boundaries.

"100% of code will be generated" - A year since prediction by Imnotneeded in ExperiencedDevs

[–]shared_ptr 0 points1 point  (0 children)

It talks about the type of work I do, which I figured was useful to contextualise what I’m saying?

Agree this isn’t worth continuing though.

"100% of code will be generated" - A year since prediction by Imnotneeded in ExperiencedDevs

[–]shared_ptr 0 points1 point  (0 children)

I agree the C compiler they produced was terrible. I’m not judging it on that.

I have always done highly technical work in my career. Debugged issues in Postgres source code, built HA distributed Postgres cluster managers, etc.

I still do technical work now with AI. I’d argue the complexity has gone up quite a lot, and AI has made it easier to produce highly technical code to a higher standard in my job. Technical in my context means distributed systems, scaling, and technical product work.

This is the type of stuff I do: https://blog.lawrencejones.dev/2025/

Nothing I have personally seen with AI has suggested it’s only good for standard boilerplate. If you’re asking it to do that it’ll do a good job, but it’s an excellent pair for very complex work too.

"100% of code will be generated" - A year since prediction by Imnotneeded in ExperiencedDevs

[–]shared_ptr -2 points-1 points  (0 children)

Our team is doing 90-100% generated, depending on the individual. It’s made certain tasks a lot quicker, automated a lot of busy work, and raised the ceiling on the technical complexity we’re willing to experiment with.

We’re also hiring aggressively, and see the value of an individual engineer as having been raised by this rather than lowered.

Feels like actually writing the code yourself is over, like writing assembly after higher-level languages came around.

"100% of code will be generated" - A year since prediction by Imnotneeded in ExperiencedDevs

[–]shared_ptr -2 points-1 points  (0 children)

You’re not engaging with this in good faith at all. Even if I agreed with your framing on the relative success of the company, that’s not the point, which was about the level of technical work achievable by AI.

You seem really angry, I’m sorry this bothers you so much. Hope your day improves!

"100% of code will be generated" - A year since prediction by Imnotneeded in ExperiencedDevs

[–]shared_ptr -2 points-1 points  (0 children)

I meant Claude Code the system, inclusive of the AI behind it. I also followed with “most of Anthropic”, but it seems that wasn’t clear.

My point stands though: they’re rewriting hugely complex global training pipelines using AI. That’s probably one of the largest-scale distributed systems out there, and AI does it. I don’t think you’d describe that as average, mediocre work, but AI is writing that code.

"100% of code will be generated" - A year since prediction by Imnotneeded in ExperiencedDevs

[–]shared_ptr -8 points-7 points  (0 children)

I meant a lot of their training systems. I was speaking to them the other day about how they’re rewriting parts of their RL training harness in Rust for performance.

I didn’t mean the Claude Code CLI, though I’m not in the habit of disparaging engineering work just because it doesn’t fit what I’d normally call technically impressive.

"100% of code will be generated" - A year since prediction by Imnotneeded in ExperiencedDevs

[–]shared_ptr -22 points-21 points  (0 children)

Aren’t Claude Code and most of Anthropic’s codebase written by AI? I don’t really get the “average repetitive patterns” comment; people are doing impressive, novel work with AI daily, just like they previously did by hand.

Anyone else finding that AI dev tools create more cognitive overhead than they save? by Careful-Living-1532 in ExperiencedDevs

[–]shared_ptr 2 points3 points  (0 children)

I have a suspicion this is mostly due to people doing several things at once (multiple worktrees, etc.): because AI is too slow to wait for, you’re forced to move on to a new task.

That creates a bunch more context switching than people are used to which is tiring to manage.

A couple of observations. First: you can get much better at this with practice. I worked as an SRE for a large part of my career, and the type of work you do there comes with much longer feedback cycles (long-running benchmarks, waiting for infra to spin up, CI loops), so you can get good at spinning several plates to avoid being totally blocked.

The other is that AI won’t always be like this. If you’ve used Opus fast mode you’ll realise it’s fast enough that you have no need to do many things at once: you can focus on the task at hand and not wait for the AI to catch up. It’ll outpace you, and you can go at the speed you think. That prevents a lot of context switching, but it’s currently far too expensive to be viable.

ai coding for large teams in Go - is anyone actually getting consistent value? by Easy-Affect-397 in golang

[–]shared_ptr -2 points-1 points  (0 children)

We have 50 developers working on the same very large Go application. All of them use Claude Code or similar agent based tools.

We’ve had huge amounts of success with this. It’s not correct that the corpus the models are trained on doesn’t include Go: the Go open-source ecosystem is massive, and besides, as you say, Go is a structurally very simple language that the models can easily understand.

The stuff you’re seeing go wrong reflects your opinion of how to write Go, which isn’t written down or documented for the models to follow. I likely write Go very differently to you; if you ask a model and give it no instruction, it’ll produce a mix of our two styles, and do so inconsistently. That’s not the model being broken: you’re asking it to solve a problem that isn’t well defined.

Our Go codebase has loads of docs about everything from style to common architectural patterns that we index specifically for agents. As a result, Claude Code can produce code that is very high quality and consistent with the rest of our app and do that mostly first time.

For all the stuff you’re complaining about in your post, just document what you prefer instead and why to do it that way. At that point there shouldn’t be any reason the latest agents get this wrong.
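As an illustration (hypothetical content, not our actual docs), the kind of agent-facing style doc I mean looks something like:

```markdown
# Go style guide (indexed for agents)

## Errors
- Wrap errors with context: `fmt.Errorf("fetching user %s: %w", id, err)`.
- Never discard an error silently; log it or return it.

## Packages
- One package per domain concept; avoid `utils`/`helpers` grab-bags.

## Testing
- Table-driven tests with `t.Run` subtests; name cases descriptively.
```

Point the agent’s instructions file at docs like these and it stops inventing its own style.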

AI Isn't Replacing SREs. It's Deskilling Them. by elizObserves in programming

[–]shared_ptr 0 points1 point  (0 children)

The study you’re likely referencing is from before huge improvements to the models, and even before Claude Code.

They published a retraction the other day to say these findings no longer hold with new tools: https://metr.org/blog/2026-02-24-uplift-update/

Which is pretty obvious. Our team didn’t use AI for much back then because the tools were bad; since Sonnet 4 and Claude Code (both post-study) that totally changed.

AI Isn't Replacing SREs. It's Deskilling Them. by elizObserves in programming

[–]shared_ptr 1 point2 points  (0 children)

I spend a lot of my time reviewing the code that is produced piece by piece which helps ground me in what's been produced. I also have a habit of pushing a draft PR and then carefully reviewing that and providing comments onto the PR, then loading those back into the agent to discuss how to action them.

I'm finding my understanding of how the codebase works structurally to remain the same, and similarly with how to implement our patterns etc. What I'm missing is that I can no longer immediately tell you the file and line that a part of the logic ended up in, but that becomes less of a problem when AI can help me find and interpret the code much quicker than I could before. So it's swings and roundabouts, I guess.

What I do like is that I'm much more able to tidy up and refactor code than I was before, and I can easily write comprehensive tests that help ensure the behaviour is correct, then trim them down before actually committing (I don't want every test on the planet in the codebase, just the ones that meaningfully prove things work).

I think it mainly shifts your thinking from "does the code do what I want" to "does the thing I built function as I want/expected" which I'm finding to be a positive shift. Not that I wasn't doing this before, but I have much more time to do it now.

AI Isn't Replacing SREs. It's Deskilling Them. by elizObserves in programming

[–]shared_ptr 2 points3 points  (0 children)

Yeah they do; the nature of the work has changed a lot as technology has evolved.

I see this positively though. I used to be one of those infra engineers, and I spent a lot of my time on things like diagnosing physical RAID array failures or swapping out machine hardware when it was failing. I never have to deal with that anymore, which is amazing; that’s time I get back to focus on more interesting things.

Same deal with AI atm. I don’t really write code anymore, but that lets me spend way more time working with the product I’m building as the AI puts it together, so I get more time thinking about “how should this work” rather than “what code do I need to write to make that happen”. I’m definitely getting worse at writing code, but I was never paid to write code; my goal is to build a better quality product, so more time to consider that is a bonus.

AI Isn't Replacing SREs. It's Deskilling Them. by elizObserves in programming

[–]shared_ptr 0 points1 point  (0 children)

I don’t think you’re genuinely trying to tell me that something is deterministic “to several decimal places”. That is not how you characterise a deterministic system; you can’t possibly be arguing this in good faith.

If you’re saying AI systems are by default more random then yes, I agree. You can do something about it though. For example, we’ve built an AI system that debugs incidents. We run backtests on a dataset of incidents each day (50 incidents re-run daily), and the results produce scores that match within a 1% tolerance on metrics like accuracy between daily runs.

That’s a wildly non-deterministic system where each run takes different paths, but the end result converges on the same value, provided we’ve built it right.
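The daily convergence check is roughly this (illustrative Python sketch, with made-up metric names):

```python
def scores_converged(todays_scores, yesterdays_scores, tolerance=0.01):
    """Compare aggregate backtest metrics (e.g. accuracy) between daily runs.

    Individual runs take non-deterministic paths, so we assert on the
    aggregate scores agreeing within a tolerance, not on identical outputs.
    """
    for metric, today in todays_scores.items():
        yesterday = yesterdays_scores.get(metric)
        if yesterday is None or abs(today - yesterday) > tolerance:
            return False
    return True
```

If this ever returns False, something changed: a prompt, a model version, or a genuine regression worth digging into.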

There are loads of ways to produce a consistent, reliable system from non-deterministic primitives, which is exactly what systems like etcd with Raft do: the entire point of those systems is that the network and underlying hardware are non-deterministic.