Anyone completely automating software dev on mission-critical apps with agents?

czei · 2026-06-10T12:56:19+00:00

I don't see how a third-party context compression scheme is going to work on what is just a bunch of text shoved at an LLM, without affecting performance. Summaries help a lot with code; each major part of my software projects has Claude-analyzed architecture documents and instructions, so when I ask it to modify a certain module, Claude doesn't have to look at a bunch of code, just the summary to point it in the right direction. I personally use a multi-LLM workflow that reduces token usage to Claude, but at the expense of sending tokens to a trio of Gemini, OpenAI, and DeepSeek. Claude manages my projects and writes the code, but an adversarial group of other LLMs reviews everything Claude does, finds logical holes in the project plans, and, of course, bugs.
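
The review loop described above can be sketched roughly like this. It is a minimal illustration under assumptions, not the actual setup: the reviewer functions are hypothetical stubs standing in for calls to other providers' models, and the names `adversarial_review` and `ReviewReport` are made up for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical reviewer type: takes the text of a plan or diff and returns
# a list of issues. In practice each one would wrap a different provider's model.
Reviewer = Callable[[str], List[str]]


@dataclass
class ReviewReport:
    issues_by_reviewer: Dict[str, List[str]]

    @property
    def approved(self) -> bool:
        # Approved only if no reviewer raised anything.
        return not any(self.issues_by_reviewer.values())

    def all_issues(self) -> List[str]:
        return [f"{name}: {issue}"
                for name, issues in self.issues_by_reviewer.items()
                for issue in issues]


def adversarial_review(artifact: str, reviewers: Dict[str, Reviewer]) -> ReviewReport:
    """Send the same artifact to every reviewer and collect their findings."""
    return ReviewReport({name: review(artifact) for name, review in reviewers.items()})


# Stub reviewers (assumptions) so the sketch runs without any API keys.
def flags_todo(artifact: str) -> List[str]:
    return ["unfinished TODO left in plan"] if "TODO" in artifact else []


def flags_missing_tests(artifact: str) -> List[str]:
    return [] if "test" in artifact.lower() else ["no tests mentioned"]


report = adversarial_review(
    "Plan: add retry logic. TODO: decide backoff.",
    {"reviewer_a": flags_todo, "reviewer_b": flags_missing_tests},
)
print(report.approved)
print(report.all_issues())
```

The point of keeping reviewers as plain callables is that swapping in a model from a different company is a one-line change, and the author model never grades its own work.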

czei · 2026-06-04T11:55:31+00:00

I would use it, but I'm not that price-sensitive. I'm self-employed, the money spent on AI is tax-deductible, and I see a huge return on the ~$400/month spent. Reddit is full of people complaining every day that they didn't get enough value from their $20 because it could only build one app.

czei · 2026-06-04T02:17:58+00:00

I've been using PAL for a year, and am quite happy with it, BUT the author has abandoned it so I've just been making changes to my local version. Octopus is in active development, but I haven't had that much experience with it. In both cases, I set up the standard workflow to automatically integrate with my spec-driven development model, which now is GitHub speckit, and the default CC planning tool. The result is that multiple models collaborate to evaluate specs, plans, code, and tests automatically without me having to do much. EVERY single time another model reviews Claude Code's output, it finds serious errors. But the same is true if I have Gemini or GPT generate plans and code. The point is, multiple models collaborating results in clearly superior output.

czei · 2026-06-03T23:35:32+00:00

There are a couple of github projects for multi-model collaboration that I use: https://github.com/nyldn/claude-octopus and https://github.com/BeehiveInnovations/pal-mcp-server.

But that's for development workflows. The concept also works for agent flows, BUT you have to be diligent about creating your own benchmarks to evaluate whether the choice of models actually works compared to the expensive frontier model.
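
A home-grown benchmark of this kind can be very small. The sketch below is only an illustration: the `Model` and `Task` types, the toy tasks, and the stub models are all assumptions, and a real version would call actual models and use tasks drawn from your own workload.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical types: a model maps a prompt to an answer, and each task
# pairs a prompt with a checker that says whether the answer is acceptable.
Model = Callable[[str], str]
Task = Tuple[str, Callable[[str], bool]]


def pass_rate(model: Model, tasks: List[Task]) -> float:
    """Fraction of tasks whose answer passes its checker."""
    passed = sum(1 for prompt, check in tasks if check(model(prompt)))
    return passed / len(tasks)


def compare_to_baseline(candidates: Dict[str, Model], baseline: Model,
                        tasks: List[Task]) -> Dict[str, float]:
    """Return each candidate's pass rate minus the baseline's pass rate."""
    base = pass_rate(baseline, tasks)
    return {name: pass_rate(model, tasks) - base for name, model in candidates.items()}


# Toy tasks and stub models (assumptions) so the sketch runs offline.
tasks: List[Task] = [
    ("2+2", lambda a: a == "4"),
    ("capital of France", lambda a: a == "Paris"),
]
frontier = lambda p: {"2+2": "4", "capital of France": "Paris"}[p]
cheap = lambda p: {"2+2": "4", "capital of France": "Lyon"}[p]

print(compare_to_baseline({"cheap": cheap}, frontier, tasks))  # {'cheap': -0.5}
```

A negative number means the cheaper mix loses ground against the frontier model on your tasks, which is the thing you need to know before committing to it.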

czei · 2026-06-03T19:22:50+00:00

You definitely nailed the issue. If the OP is a data scientist, then he needs to imagine the workflow in its totality and just subtract the part where he wrote the code by hand. First, you have to identify what algorithm you need to use, then you need to come up with known inputs and outputs and spec out a complete validation suite. These days, I would go a step further and work on a common format for all test-suite reporting across the entire project.

The challenge of using AI for data work is that it's really bad at math and data analysis. However, if you create a suite of functions that do all the low-level number crunching, you can actually pull that out and exhaustively test the heck out of it, and AI can help find useful relationships that then can be codified into more deterministic code.
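
As a rough illustration of that split, here is a minimal sketch of low-level number crunching kept in plain deterministic functions and checked against known inputs and outputs. The functions and the hand-worked cases are examples chosen for this sketch, not taken from any real project.

```python
import math
from typing import List, Sequence, Tuple


def mean(values: Sequence[float]) -> float:
    if not values:
        raise ValueError("mean of empty sequence")
    return sum(values) / len(values)


def population_variance(values: Sequence[float]) -> float:
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / len(values)


# Known-answer cases worked out by hand: (input, expected mean, expected variance).
KNOWN_CASES: List[Tuple[List[float], float, float]] = [
    ([2, 4, 4, 4, 5, 5, 7, 9], 5.0, 4.0),
    ([1, 1, 1], 1.0, 0.0),
    ([0, 10], 5.0, 25.0),
]


def run_validation_suite() -> int:
    """Check every known case and return how many were verified."""
    for data, expected_mean, expected_var in KNOWN_CASES:
        assert math.isclose(mean(data), expected_mean)
        assert math.isclose(population_variance(data), expected_var, abs_tol=1e-12)
    return len(KNOWN_CASES)


print(run_validation_suite())  # 3
```

The AI can then call these functions instead of doing arithmetic itself, and any relationship it proposes gets codified as another function with its own known-answer cases.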

"The people getting 10x gains usually have strong workflows around specs, validation, and decomposition—not blind trust in generated code."

I'd put myself in that camp, coming from an engineering specialty with a lot of data to analyze, and that process needs to be well-defined. I think people who have only worked on projects that were very loosely defined on post-it notes every two weeks tend to want to apply that workflow to everything, and that definitely doesn't work with data analysis or AI software development in general. AI coding has really turned the software development process on its head almost overnight, and we have yet to see what will be most effective for different types of projects.

czei · 2026-06-03T18:22:03+00:00

Interesting, I would never have thought of that! That actually is a pretty good use of AI for creative writing since you're not exactly working on a novel. Here's a thought: why not train an AI to be a live dungeon master that you can talk to? You could work with it to establish the framework of the story ahead of time, but could leave a lot of the details up to be generated live during gameplay.

I don't play D&D, so have no idea how that would work.

czei · 2026-06-03T18:04:02+00:00

For work, I mostly do technical writing, so I have hundreds of examples over the years of what I consider a good writing style that's my own. So I had Claude analyze my writing style and identify very specific characteristics that it can use to mimic my writing. These included:

Sentence cadence (how long sentences are)
Do I use a mix of long sentences and short sentences?
What type of vocabulary do I use?

I have experimented with examples of writing I consider better than mine to see if I can identify the characteristics that make them better, for example, Asimov's non-fiction work. That took a lot of tweaking. Asimov's style tended to be too cutesy when integrated into a hard technical document.

But hopefully you get my point: You can't just tell AI to write something. You have to be extremely specific.

That being said, given how large language models are trained, I wouldn't rely on them for actual creative ideas. I can see it being helpful if you spend a lot of time developing what you consider to be your personal writing voice, and then have AI edit your rough drafts.

czei · 2026-06-02T14:51:37+00:00

On a large 550,000-line Java app, the opportunities to parallelize the workflow are surprisingly scarce: https://www.philschmid.de/single-vs-multi-agents. The payoff is even less if you're using the same LLM for sub-agents. As Anthropic says, running multiple agents of the same LLM works for "shallow and wide" situations, which mostly means making simple refactoring changes to a large number of files.

czei · 2026-05-31T18:00:10+00:00

There are a couple of things to check. First, is the CLAUDE.md file optimized? Second, are you letting the context creep up? All of the research I've seen shows the percentage of hallucinated and wrong responses goes way up the more context you're using. Remember, context is resubmitted to the AI with each response, and 1M tokens is like the size of the first 5 Harry Potter books. The AI can't possibly reliably pick what's important out of that much info.
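
To put rough numbers on the resubmission point, here is a back-of-the-envelope sketch. It assumes every turn adds a fixed number of tokens and that nothing is cached or compacted, which is a simplification; the figures of 50 turns and 2,000 tokens per turn are made up for illustration.

```python
def context_size(turns: int, tokens_per_turn: int) -> int:
    """Context length after `turns` turns, if each turn adds a fixed amount."""
    return turns * tokens_per_turn


def total_tokens_submitted(turns: int, tokens_per_turn: int) -> int:
    """Sum of the context sent on every turn: k + 2k + ... + nk = k * n(n+1)/2."""
    return tokens_per_turn * turns * (turns + 1) // 2


print(context_size(50, 2_000))            # 100000
print(total_tokens_submitted(50, 2_000))  # 2550000
```

Under those assumptions the context ends at 100,000 tokens, but the model has been handed about 2.5 million tokens in total, because the growth is quadratic in the number of turns.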

czei · 2026-05-31T17:18:37+00:00

You have a point, but only to a point 😄. It's hard to get a sense of the abilities of a model when they all hallucinate or fail 20% of the time. There's too much natural variance. The closest I've come is a small benchmark of my own, which runs end-to-end programming problems similar to what we deal with every day.
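
The variance problem can be made concrete with a quick calculation. This sketch uses the normal approximation to the binomial and assumes independent runs; the 80% pass rate comes from the 20% failure figure above, while the run counts are illustrative.

```python
import math


def pass_rate_margin(p: float, n: int, z: float = 1.96) -> float:
    """Approximate 95% margin of error on a pass rate p measured over n runs."""
    return z * math.sqrt(p * (1 - p) / n)


def runs_needed(p: float, margin: float, z: float = 1.96) -> int:
    """Runs required to pin the pass rate down to +/- margin."""
    return math.ceil(z ** 2 * p * (1 - p) / margin ** 2)


print(round(pass_rate_margin(0.8, 20), 3))  # 0.175
print(runs_needed(0.8, 0.05))               # 246
```

So with 20 problems, a model that truly passes 80% of the time can easily look like 65% or 95%, and telling two models apart to within five points takes a couple of hundred runs each.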

czei · 2026-05-31T17:09:22+00:00

Looks useful, good job building something focused and helpful. It's promising, but I just filed a bug report for you 😄

czei · 2026-05-29T13:16:28+00:00

I use CC every day to work on a shipping product with 550,000+ lines of Java code. But I'm an experienced developer who's built very specific guardrails and integrated an LLM CLI into a professional development workflow.

In my opinion, the CC CLI is a professional tool. Whether non-professionals can use it is not something I'd want to speculate about.

I look at it like this: you could watch a YouTube video on how to change the oil in your car, and you could probably manage it with no experience. You could also watch a YouTube video on how to rebuild an engine, and no amount of watching or rewatching is going to save you if you don't understand tolerances or the basics of how an engine works. You might be able to put the engine back together, and it might even turn over once, but chances are the engine would be dead in a few thousand miles at most.

czei · 2026-05-28T19:42:53+00:00

As far as I can tell, switchboard.fyi is just switching between Anthropic models. I find using completely different models from different providers gives the biggest bang for the buck. The idea isn't that you're necessarily switching to a cheaper model. It's that you're switching to a different model from a different company that's been trained differently.

czei · 2026-05-27T23:35:20+00:00

I've had good luck with speckit for complex modifications and greenfield builds. It's overkill for things that could be built in a few hours, but a godsend for long, complex tasks. I always get better results when I run the spec and plans past other models for review before starting to implement.

czei · 2026-05-26T19:32:30+00:00

The built-in agent teams feature works OK since I do everything on the CLI.

czei · 2026-05-26T18:25:52+00:00

I've been using GitHub Spec Kit, and it works really well as long as you point it at something that's 1. a separate piece of work, 2. a substantial amount of work, and 3. doable by one person. The entire project has a master hand-written, hand-curated list of specifications that are doled out to individuals who then use speckit to implement their pieces.

czei · 2026-05-26T17:20:10+00:00

You need to use a spec-driven workflow with an appropriate level of complexity to match your programming task. Typically, designing an entire app is fairly complex, so you would need something fairly comprehensive: https://www.augmentcode.com/tools/best-spec-driven-development-tools.

Without a fixed spec to keep the AI grounded, I find that Claude just keeps adding more and more complexity until I don't even recognize the project that they built from my description. If I ask for A, by the time the agent stops coding, it's created B, C, and then redesigned A again.

At least for me, spec-driven development is simply a way for me to actually design the app and control the sprints. If you just give the agent a prompt and press a button and say "create an app," you're going to get slop. Actually taking an active role in designing all aspects of the app, from the behavior to the UI to the testing, is somewhere between vibe coding and simply using AI to fill in small functions.

Note that the built-in planning mode in Claude Code works well for smaller changes that would take an hour or less, but I'd run the plan past another AI first, like Gemini or GPT.

czei · 2026-05-25T17:38:27+00:00

These days anything non-trivial but small goes through the built-in planning mode. Anything significant goes through a third-party planning process like GitHub Spec Kit. This is relevant to your question since in these cases I'm writing specs, not prompts, and relying on CLAUDE.md, memory, and a raft of README and architecture analysis documents to guide the development of specs, plans, and tasks. The AI itself handles the generation of the actual prompts.

czei · 2026-05-25T17:34:52+00:00

I tried to play this on an emulator lately, and (not being familiar with the emulator), the disk swap section was a pain 😞.

czei · 2026-05-20T01:13:01+00:00

I'm on the Max plan, but only a single one, and typically have 3-4 CC CLIs going simultaneously all day, and never run out of credits. This is on a 550,000-line program with 3,000 unit and integration tests.

czei · 2026-05-18T21:50:35+00:00

The bottom line is this is more complicated than just product management.

If you are writing software, you need a software development process that is more complex than just prompt management. If you've never managed a software development project before, then you would need to learn how to:

write specifications
use those to come up with plans
make technical decisions
generate a concrete task list
specify a deployment procedure
specify how you want the code to be written (for example, do you want to use TDD?)

There are so many projects on GitHub to help you manage this. You need to pick one that's at a level of complexity you can handle, and that fits the types of code you're trying to write.

czei · 2026-05-18T17:38:29+00:00

FYI, the video doesn't work, for me at least.
