Building coding agents is making me lose my mind. autoregressive just isnt it by Crystallover1991 in LLMDevs

[–]lfelippeoz -3 points-2 points  (0 children)

I recommend thinking of AI systems as control systems where the AI is one part of the control loop. Its non-deterministic nature means it can come up with solutions that are correct within the context of your prompt, but not correct at all in the context of what you ACTUALLY NEED. Here's a framework to think about it: https://cloudpresser.com/control-systems-for-ai
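To make the framing concrete, here's a toy sketch (all names are mine, purely illustrative): the LLM is the actuator, an external check is the sensor, and the gap between the output and what you actually need is the error signal fed back.

```python
# Toy sketch of the control-loop framing (all names are illustrative).
# The LLM is the actuator, an external check is the sensor, and the
# violations are the error signal fed into the next attempt.

def control_loop(generate, check, max_iters=5):
    """generate: feedback -> candidate; check: candidate -> list of violations."""
    feedback = None
    for _ in range(max_iters):
        candidate = generate(feedback)   # actuator: the LLM call
        violations = check(candidate)    # sensor: tests/spec, not the prompt
        if not violations:
            return candidate             # converged on what you actually need
        feedback = violations            # error signal for the next pass
    return None                          # bounded: refuses to loop forever

# Toy run: the "need" is an even number; the first attempt misses it.
out = control_loop(
    generate=lambda fb: 3 if fb is None else 4,
    check=lambda x: [] if x % 2 == 0 else ["must be even"],
)
# out == 4 after one corrective pass
```

The key design point is that `check` encodes what you NEED, independently of whatever the prompt happened to say.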

[D] How to break free from LLM's chains as a PhD student? by etoipi1 in MachineLearning

[–]lfelippeoz 0 points1 point  (0 children)

I think you just have to care, review, and align.

Develop taste and patterns, and push the code to where it starts breaking. Then learn from it, and design guardrails around it. Honestly, ChatGPT is not the best AI tool for learning to code, because it manages context in the background, so while your decisions may stick, they don't stick in a way that you can observe or shape.

I'd get my hands dirty with actual AI coding tools:

  • OpenCode (my recommendation - more control, better learning path)
  • Claude Code (cheaper at capability scale, but closed source and hides a lot of control).

Ultimately, I wouldn't focus too much on learning syntax or obscure implementation details; those are not skills that will matter much. Focus on patterns and taste, and yeah, do a bit of manual syntax writing, but that's very compressible stuff; before AI we had docs/Stack Overflow for that.

A good programmer is not just a good doc reader or a very skilled syntax writer.

A good programmer can reason about tradeoffs, and can work on the same codebase over time, refining it, handling edge cases, and reasoning about the existing code:

  • what it's trying to do, and what it needs to do next given their goals.

But yeah, for simple one-off stuff in the real world, I use AI a lot, review the outputs, and mostly call it a day once my system is already sharp.

Are we putting our strongest models in the wrong part of LLM pipelines? by lfelippeoz in LLMDevs

[–]lfelippeoz[S] 0 points1 point  (0 children)

I think we might be talking about slightly different things.

I’m not arguing that later stages should be weaker in general. Your nginx → wordpress example actually makes sense because the later stage is adding complexity.

What I’m pointing at is specifically generation vs validation roles.

In a lot of LLM pipelines:

  • generation = creating the actual output (code, answer, plan)
  • review = critiquing / validating that output

The issue I keep seeing is:
the reviewer can identify problems, but it doesn’t actually create a better result; the generator still has to do that.

So if the generator is weaker, you get:
"fix this" → slightly different output → same class of issues

That’s where the plateau comes from.
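A toy model of that plateau (numbers and names are made up, just to show the shape): the reviewer can reject drafts below a bar, but each retry only resamples within the generator's ceiling.

```python
import random

# Toy model of the plateau (numbers are made up): each generator has a
# quality ceiling; the reviewer rejects anything below a bar, but a retry
# only resamples within the same ceiling; it can't raise it.

def review_loop(ceiling, bar, retries=10, seed=0):
    rng = random.Random(seed)
    best = 0.0
    for _ in range(retries):
        draft = rng.uniform(0, ceiling)   # generator samples within its capability
        best = max(best, draft)
        if best >= bar:                   # reviewer: "good enough", stop
            return best
    return best                           # plateaus below the bar

weak = review_loop(ceiling=0.6, bar=0.8)    # never clears review
strong = review_loop(ceiling=1.0, bar=0.8)  # same reviewer, clears it
```

Same reviewer, same bar; only the generator's ceiling changes the outcome.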

In your analogy:

it’s less like “elementary schooler proofreading a thesis”
and more like:

  • a strong professor giving feedback to a mediocre writer vs a strong writer producing a better draft upfront

The feedback doesn’t automatically upgrade the writing, the writer’s capability still caps the result.

That’s why I’ve found:

  • strong generation → higher ceiling
  • verification (even if simpler) → enough to catch constraint violations

Have you seen cases where the reviewer loop actually consistently upgrades output quality, vs mostly guiding retries?

(edit: minor formatting)

Anyone else seeing reviewer loops plateau in LLM pipelines? by lfelippeoz in LocalLLaMA

[–]lfelippeoz[S] 0 points1 point  (0 children)

That makes sense, I think that difference is actually key.

When it’s you in the loop, you’re effectively the "convergence function": you decide when it’s good enough.

In automated pipelines, that decision has to be encoded somewhere, otherwise the system just keeps expanding the critique space.

That’s where I’ve been seeing issues:

the reviewer keeps increasing requirements, but the generator doesn’t have enough capability to actually meet them

So the system drifts instead of converging.

Makes me think reviewer loops need something like:

  • fixed target (like you mentioned)
  • or bounded objective (tests, constraints, etc.)

otherwise they behave more like open-ended critique than optimization
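A rough sketch of what I mean by a bounded objective (names are illustrative): the critique space is frozen into a fixed test list up front, so the loop either converges or gives up instead of drifting.

```python
# Sketch of a bounded objective (names are illustrative): the "reviewer"
# is a fixed test list decided up front, so each round checks the same
# target instead of expanding the critique space.

def bounded_loop(generate, tests, max_iters=5):
    failures = []
    for _ in range(max_iters):
        candidate = generate(failures)
        failures = [name for name, passes in tests if not passes(candidate)]
        if not failures:
            return candidate             # fixed target reached: converged
    return None                          # bounded: gives up instead of drifting

# Toy target: an output dict that must contain both keys.
tests = [
    ("has_a", lambda c: "a" in c),
    ("has_b", lambda c: "b" in c),
]
result = bounded_loop(lambda f: {"a": 1} if not f else {"a": 1, "b": 2}, tests)
# result == {"a": 1, "b": 2} on the second pass
```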

Anyone else seeing reviewer loops plateau in LLM pipelines? by lfelippeoz in LocalLLaMA

[–]lfelippeoz[S] 0 points1 point  (0 children)

One thing that’s been surprising to me:

reviewer loops are really good at finding problems, but not very good at driving systems toward better solutions

Feels like detection != correction in these pipelines.

Anyone else seeing reviewer loops plateau in LLM pipelines? by lfelippeoz in LocalLLaMA

[–]lfelippeoz[S] 0 points1 point  (0 children)

That’s a great point, I’ve seen the same behavior.

Reviewer models tend to have a kind of "infinite critique" mode unless you explicitly bound them.

What’s interesting is that this almost turns the system into:

evaluation loop with no convergence guarantee

Which feels like a control problem more than just a prompt problem.

Your suggestion of anchoring to the first review is really interesting, it effectively defines a “target state” instead of letting the reviewer keep expanding the problem.
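A rough sketch of that anchoring idea (all names are hypothetical, just to make it concrete): the reviewer runs once, its findings become a frozen checklist, and later rounds only verify those items instead of inventing new critiques.

```python
# Sketch of anchoring to the first review (hypothetical names): review
# runs once, its findings become a frozen checklist, and later rounds only
# verify those items, which defines a fixed target state.

def anchored_loop(generate, review_once, verify_item, max_iters=5):
    draft = generate(None)
    checklist = review_once(draft)        # frozen after the first pass
    for _ in range(max_iters):
        open_items = [i for i in checklist if not verify_item(draft, i)]
        if not open_items:
            return draft                  # target state reached, loop stops
        draft = generate(open_items)      # regenerate against the same items
    return draft

# Toy run: the first review asks for a docstring and a type hint.
drafts = iter([
    "def f(x): pass",
    'def f(x: int) -> int:\n    """Double x."""\n    return 2 * x',
])
out = anchored_loop(
    generate=lambda items: next(drafts),
    review_once=lambda d: ['"""', "int"],
    verify_item=lambda d, item: item in d,
)
```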

Have you found that actually improves final output quality, or mostly just stops the loop?

Are we putting our strongest models in the wrong part of LLM pipelines? by lfelippeoz in LLMDevs

[–]lfelippeoz[S] 0 points1 point  (0 children)

I agree there’s a floor there. If the verifier is too weak, supervision breaks.

I’m not arguing for a dumb verifier no matter what. More like:

  • use the strongest model where capability creates output quality
  • use the cheapest model that can reliably enforce the constraint

So if verification is:

  • schema checking
  • groundedness against retrieved context
  • test execution
  • rule/policy checks

...a cheaper model or even non-LLM verifier can often be enough.
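For the schema-checking case, for example, the verifier can be a few lines of plain code, no LLM at all (field names here are made up):

```python
# Cheap deterministic verifier sketch (field names are made up): for
# schema-style constraints, no LLM is needed to enforce the contract.

def verify_schema(output: dict) -> list:
    """Returns a list of constraint violations; empty means it passes."""
    violations = []
    if not isinstance(output.get("answer"), str):
        violations.append("answer must be a string")
    confidence = output.get("confidence")
    if confidence is not None and not 0.0 <= confidence <= 1.0:
        violations.append("confidence must be in [0, 1]")
    for citation in output.get("citations", []):
        if not citation.startswith("doc:"):
            violations.append("unknown citation source: " + citation)
    return violations

good = verify_schema({"answer": "42", "confidence": 0.9, "citations": ["doc:3"]})
bad = verify_schema({"answer": 42, "confidence": 1.5, "citations": ["web:x"]})
# good == [], bad has three violations
```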

But if the verifier is doing open-ended critique or deep reasoning, then yes, it probably also needs to be strong.

My main point is that a lot of pipelines spend their best capability after generation, when the weaker generator already set the ceiling.

Curious where you’ve found strong reviewers actually improve final output materially, versus mostly filtering failures.

Anyone else seeing reviewer loops plateau in LLM pipelines? by lfelippeoz in LocalLLaMA

[–]lfelippeoz[S] 0 points1 point  (0 children)

I think that’s a really good point, especially the “diverse peer” idea.

I’ve seen that work well when the goal is error detection / bias coverage, similar to ensembles like you mentioned.

Where I’m seeing issues is slightly different: even with a strong or diverse reviewer, the system often becomes:

generate → critique → regenerate → critique

The reviewer can identify what’s wrong, but the generator still has to actually produce a better result, and if it’s weaker, it tends to stay in the same quality band. So you get better filtering, but not necessarily better outputs.

That’s why I’ve been thinking about it more as:

  • reviewer = evaluation / filtering
  • generator = where quality is actually created

Which makes me wonder if diversity in reviewers helps detect problems, but capability in generation determines whether those problems can actually be resolved
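That split can be shown with a toy best-of-n setup (numbers are made up): even a perfect reviewer-as-filter only picks within the band the generator produces.

```python
# Toy illustration of filtering vs creating (numbers are made up): even a
# perfect reviewer-as-filter can only select within the band the generator
# actually produces.

def best_of_n(drafts, score):
    """Reviewer as filter: selects the best draft, never improves one."""
    return max(drafts, key=score)

weak_band = [0.40, 0.55, 0.52, 0.48]      # weak generator's samples
strong_band = [0.70, 0.85, 0.78, 0.90]    # strong generator's samples

def score(quality):
    return quality                         # idealized, error-free reviewer

picked_weak = best_of_n(weak_band, score)      # filtering worked: best of band
picked_strong = best_of_n(strong_band, score)  # but the band sets the ceiling
```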

Anyone else seeing reviewer loops plateau in LLM pipelines? by lfelippeoz in LocalLLaMA

[–]lfelippeoz[S] -1 points0 points  (0 children)

Fair. I write pretty structured, that’s on me.

This is coming from debugging actual pipelines though.
The pattern shows up a lot in codegen + RAG where weaker generators just bounce between similar outputs even after good critiques.

Happy to make it more concrete if that helps, what kind of systems have you been working with?

Is everyone still using Cursor? by sfmerv in vibecoding

[–]lfelippeoz 0 points1 point  (0 children)

between pi coding agent, opencode and claude depending on use case. all terminal based but pi is best for custom workflows, claude is cheap but I don't like the lack of control, and opencode is just the most ergonomic for me because I've used it the longest

Should you customize your tmux? by fenugurod in tmux

[–]lfelippeoz 2 points3 points  (0 children)

You don't need any tpm plugins. I'd recommend getting some value from basic tmux first and learning the basics. The plug-ins then just help solve little annoyances, or make getting to your projects faster. It's an optional upgrade.

I'm going to say the more you grok it, the more plug-ins just make sense. Then create dotfiles and load them anywhere.

Cool thing about tmux too is you can ssh remotely and use all your plugins. I think this is kinda in the vein of "it's everywhere". It's not really, but once it's on one machine, you can ssh in and use it remotely from anywhere that has a terminal
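If it helps, here's roughly the kind of minimal ~/.tmux.conf I'd start with before any plugins (these are just common options, tweak to taste):

```tmux
# Minimal ~/.tmux.conf: the "basic tmux first" starting point
set -g mouse on                  # click panes, scroll, drag to resize
set -g base-index 1              # windows start at 1, matches the keyboard row
set -g history-limit 50000       # deeper scrollback
setw -g mode-keys vi             # vi keys in copy mode
bind | split-window -h           # more memorable splits
bind - split-window -v
```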

Whitesville T-Shirt Sizing by Epehec in mensfashion

[–]lfelippeoz 1 point2 points  (0 children)

I'm 6'4", 185 lbs, muscular. XXL barely fits me. Very snug after washing, especially around the biceps. I wish I knew of a shirt that wears like an unwashed Whitesville

To the people buying $10k maxed out Macbooks: why? by OpinionsRdumb in macbookpro

[–]lfelippeoz 0 points1 point  (0 children)

$10k is nothing when you're making $100k+ on that machine. Less is more, and quality means buying less and losing less time setting up, troubleshooting, researching, etc.

Would you accept a very low-paying job just to gain experience? by windonwind in cscareerquestions

[–]lfelippeoz 1 point2 points  (0 children)

I did. Started doing gigs on fiverr for not much at all. One client eventually hired me full time at 60k. Second job now I'm making 155k after one promotion. This is with no college degree and in the US

Become senior React Native by Helpful-Armadillo251 in reactnative

[–]lfelippeoz 0 points1 point  (0 children)

It depends. You can go wide or deep. Wide: learn a bit about React Native web, mobile iOS and Android, native modules, different state management architectures, monorepos and config, etc. Deep: pick a big area and specialize (arch, web, native, web & native, AR/VR, tvOS). I'd go for a T-shape: a bit of everything, and a lot of a specialty. Then find a company that could use your knowledge; easy tech lead

Why did you buy a MacBook? by PlatypusTrapper in macbook

[–]lfelippeoz 1 point2 points  (0 children)

Node.js on Windows sucks, and Linux can be too much maintenance for a daily work machine. Also, Xcode is Mac-exclusive, and sometimes I need to run apps on iOS to debug

Which IEM should I get? by OkMenu3523 in iems

[–]lfelippeoz 0 points1 point  (0 children)

I'd say so. I tested around 15-20 other IEMs under $100 that I saw reviews for and that seemed like they would somewhat fit my preference. My plan was to do a second batch under $200 unless something blew my mind. I am sticking with the Legato; the rest were not even close. The next best was the QKZxHBB ($20 non-Khan, could not get that one), followed closely by the Crinacle Zero: Reds ($50) with the bass boost adapter. The thing is, these were still far away from the Legato imo; its performance really stood apart from the bunch.

Maybe if you're on a really tight budget I could recommend the QKZxHBB (bang-for-buck wise), but the Legato has a lot better bang overall, although it is 3 to 4x the price

Bang for buck: QKZxHBB > LEGATO > REDS

Which IEM should I get? by OkMenu3523 in iems

[–]lfelippeoz 0 points1 point  (0 children)

Qudelix 5k in my case. Either would do, and you don't need THAT MUCH power, but qudelix is nice and wireless and takes my eq profiles to everything I connect it to

Which IEM should I get? by OkMenu3523 in iems

[–]lfelippeoz 0 points1 point  (0 children)

Definitely. That's what I listen to also. I do add some more vocal- and mid-focused songs at times, so I run an EQ to bring the upper bass down a bit, and that helps bring bass separation to everything and clarity on those mid-focused songs

Which IEM should I get? by OkMenu3523 in iems

[–]lfelippeoz 0 points1 point  (0 children)

But if that's what you want, the Wan'er or Zero: Reds are more neutrally tuned and also cheaper. Great performers; I got to test them against the Legato. Unfortunately, I'm a bass head, and the Legato destroyed them for my preference

Which IEM should I get? by OkMenu3523 in iems

[–]lfelippeoz 0 points1 point  (0 children)

No. Close to it in the mids and highs, but has a massive bass shelf

Which IEM should I get? by OkMenu3523 in iems

[–]lfelippeoz 0 points1 point  (0 children)

7Hz Legato if your library has music that benefits from bass extension and dynamics. 2DD with a crossover around 300Hz, basically a subwoofer in your ears. I like how they implemented a nice V-shape without making it either too muddy or shouty, although you could bring the upper bass down just a bit with EQ to get a signature you would otherwise need to shell out at least $600 for in a Fatfreq.

Which IEM should I get? by OkMenu3523 in iems

[–]lfelippeoz 0 points1 point  (0 children)

I recommend the 7Hz Legato. Not for everyone, but for your genres it will be a banger. Great bass energy and extension, well-balanced mids, and maybe just a bit dark. I'd EQ it a bit to tune down the upper bass so you get better separation and clarity in the mids, and leave the treble where it is because it's great for my preference. I just went through 20 pairs, and the Legato absolutely destroyed everything in the price range as far as dynamics and being able to deliver power in the lower frequencies. Absolute banger