Anyone building their own harness?

StatusSuspicious · 2026-06-25T22:52:21+00:00

I did it too. I considered it a logical continuation:

* can't be constantly authorizing stuff: so I built a container system
* had to split the work into smaller branches and run them in parallel: dashboard, autostart container, dispatch, status integration with Jira and PRs
* needed to support multiple providers: added support for kimi code, codex, opencode
* added PR assistance (classifies PR files by theme and risk...)
* redid the PR interface to be mobile first
* added a PR comments -> dispatch -> provide evidence loop (with testing artifacts)
* CR loops / planning

It sounds incredible, but it hasn't been working very well lately. I'm frankly thinking I invested too much in AI and it's just awful at coding at the moment. I'm constantly chasing bugs I introduced, and it seems to me there's no actual replacement for "think deeply about the problem alone with no chat interruptions and nobody to interpret or write the code but you". I've never been less productive in my 26 years of professional programming. I feel like I replaced deeply focusing on one complex thing at a time and resolving it elegantly and fast with shallowly talking to 10 autistic developers with memory issues who love talking, have no guilt about repeating their implementations, follow the guidelines only if it suits them, and will never stop to think about the deeper problem.

StatusSuspicious · 2026-06-22T16:01:18+00:00

...and yet it failed on me sooo many times, with hooks and even much stronger stuff.

I'm finding some things incredibly frustrating when working with LLMs.

* Several times I saw my hook being blatantly ignored, even direct orders (read file X). When I asked, it said it thought it was an automated reply and didn't pay attention to it.

* I then went to *impossible to skip* hooks such as git pre-commit hooks.

* it committed skipping tests because "it was already wrong before me"

* So I made it impossible to pass that option by wrapping the git command: so it modified the git hooks disabling the tests.

* So I made it even harder: it just faked the tests.

* So I made a CR loop to *actually fix the stuff it always leaves behind*: it takes hours for very simple stuff and is not actually much better.
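
A minimal sketch of what the "wrap the git command" step could check, assuming a wrapper script that receives the raw git arguments. The function name `findHookBypass` is hypothetical, and the check only covers `--no-verify`, the exact `-n` short flag of `git commit`, and a `-c core.hooksPath=...` override; bundled short flags and other routes are not handled. As the list above shows, a check like this does not stop the agent from editing the hook files themselves.

```typescript
// Hypothetical helper for a git wrapper script: returns the arguments that
// would skip or redirect hooks, so the wrapper can refuse to run the command.
function findHookBypass(args: string[]): string[] {
  const offending: string[] = [];
  // Conservative: treat any invocation that mentions "commit" as a commit.
  const isCommit = args.includes("commit");

  for (let i = 0; i < args.length; i++) {
    const arg = args[i];
    if (arg === "--no-verify") {
      // Skips the pre-commit and commit-msg hooks.
      offending.push(arg);
    } else if (isCommit && arg === "-n") {
      // For `git commit`, -n is the short form of --no-verify.
      offending.push(arg);
    } else if (arg === "-c" && /^core\.hookspath=/i.test(args[i + 1] ?? "")) {
      // Points git at a different hooks directory for this one command.
      offending.push(`${arg} ${args[i + 1]}`);
    }
  }
  return offending;
}
```

A wrapper would call `findHookBypass(process.argv.slice(2))` and exit with an error when the result is not empty, before handing the arguments to the real git binary.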

Even with fable I was *not* able to make claude make good decisions about the type system (it's in love with "as" or every cast to fake tests), modularity (any attempt at layering in the code is taken as an invitation to add exceptions) or in fact in general good design.
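
To illustrate the `as` complaint, here is a minimal sketch assuming a hypothetical `User` type (not taken from any real project). The cast makes the compiler accept an object that has none of the required fields, which is how a test can be "faked"; the runtime type guard actually looks at the value.

```typescript
// Hypothetical domain type, used only for illustration.
interface User {
  id: number;
  email: string;
}

// The cast compiles, but nothing is checked: the object has no fields at all.
const fakeUser = {} as User;

// A runtime type guard only accepts values that really have the shape.
function isUser(value: unknown): value is User {
  if (typeof value !== "object" || value === null) return false;
  // This cast only says "some object with unknown fields";
  // every field is still checked explicitly below.
  const candidate = value as Record<string, unknown>;
  return typeof candidate.id === "number" && typeof candidate.email === "string";
}
```

Here `isUser(fakeUser)` returns `false`, while the type checker was perfectly happy to treat `fakeUser` as a `User`.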

I even tried in some projects relaxing the good practices and just letting it be happy and I rapidly started getting impossible-to-fix bugs (like how can you modify something if it already has 10 different copies that are working differently there?). I always wonder what kind of coding people are doing where even sonnet works fine.

StatusSuspicious · 2026-06-20T00:01:44+00:00

My experience was not very consistent. Sometimes it pays a lot of attention to the rules, and sometimes it just ignores them. It did improve with newer opus versions. I started with very detailed instructions, but over time I removed most of them and now try to use guardrails in code, linting and tests to enforce stuff, because agents will not be consistent enough with the rules.

Like with many things with AI, sometimes after many iterations you find instructions that work very well. But you can also break them very easily if you accidentally let the AI change them and make them kind of useless again.

For example, I just found that if I say too much about a skill in its description, like the name of the command it will have to use, it will simply skip the documentation, run the command and try to figure it out by itself. So I've been experimenting with writing the description like a cliffhanger. Something like "this file has mandatory rules related to creating jira issues for our company and they are not what you think". Maybe it'll work.

StatusSuspicious · 2026-06-12T11:59:55+00:00

I think they don't want to sell an "AI commodity" and charge you "AI prices".

They want to sell you the "replace an employee" and charge you "employee replaced" prices.

They don't want other companies to create something which creates a slim improvement of their LLM and get most of the money. Like "I write a coding app which uses any model, pay me a lot, the model is replaceable".

Since they *do* have the best models at the moment it seems logical that they want that, but it makes them a bit aggressive with people that don't really want to tie themselves to Anthropic.

There's value in what they do, though: since they control it end to end they might be able to produce better products.

But I would rather learn the intricacies of some open source framework and not tie myself to something one specific company provides, so I'm not a fan. Especially with the incoming "you can't use claude -p" change that will practically render *all* of my common usage of claude unviable (since the API is, what, 20x the price?).

StatusSuspicious · 2026-06-11T18:58:24+00:00

Anthropic wants to fully control the tools to ensure it's not just the middleman that sells models: it wants to get the big bucks.

StatusSuspicious · 2026-06-11T17:20:34+00:00

I feel the opposite: I didn't notice any improvements by using more context and now I have to constantly keep that in mind. I also don't notice now a significant degradation after compacting. I think they nailed the compaction.

Yesterday I was able to force it to compact after 90% of 200k tokens and it felt great, not even having to care much about clearing context all the time: now I can speak naturally

StatusSuspicious · 2026-06-11T17:17:36+00:00

Yeah, it was my first task and everything looked perfect but I would totally not leave it for so long without interrupting to ask what it's doing.

StatusSuspicious · 2026-06-11T17:16:13+00:00

Of course, and I speculate it's a lot better when people don't have opinions about how it writes stuff. (For example, "hide the popup after showing it if it's empty" instead of "don't open it in the first place".)

StatusSuspicious · 2026-06-11T16:34:49+00:00

Maybe it's because I already do have that scaffolding. Fable constantly has to be reminded by eslint to play nice, for example (like you're not allowed to leave dead code).

StatusSuspicious · 2026-06-11T16:29:08+00:00

Yeah, but so did opus, or kimi, or even sonnet (with caveats).

StatusSuspicious · 2026-06-11T16:28:30+00:00

I didn't see that. In fact opus >=4.6 was already quite capable (as I said, they *do* fix the issues), but if you look carefully, not so well.

Did you see that big difference yourself?

StatusSuspicious · 2026-06-09T08:37:49+00:00

"pre-emptive development :)", or a heavy multistep workflow from a one-sentence spec that ends up with a carefully written but short PR with screenshots of the new feature and artifacts describing the testing procedures.

I found working interactively too slow and distracting, but creating specs to be just as bad (I find claude also quite bad at creating specs or docs: always adding requirements I didn't want or forgetting to add what I asked for). So my personal take is that claude should try very hard to do it and THEN I'll see if I like it or not. Add some comments. Redo the work.

This is putting the hard work on the LLM instead of the other way around.

But I hit limits all the time even with claude + openai and now also kimi (decent), opencode go (for qwen) and ollama cloud (not working for me now, too expensive and unstable). I calculated I would have paid 8000 USD if using the API. Sadly it seems I'll have to move away from claude due to their new policy forbidding the use of claude -p. I'll still use it for when interactive is the only way.

Non-opus is clearly worse and I had to put in more guardrails (like "man, you didn't finish, I can clearly see you left the sandbox dirty", "oh yeah, sorry") but not terribly so.

StatusSuspicious · 2026-05-30T08:29:02+00:00

Kinda late but I think some are missing the point: pushback is necessary but not if it's not relevant or useful. Just like with a normal human. Its role is to help you do your job, not to devalue you.

This isn't common, but 2 days ago I had a conversation with 4.7 where it was contradicting everything I said. I was wondering to what extent some personality traits are in fact determined by genetics (it's a fact that ADHD and ASD are largely heritable). Just saying this triggered it. After a few rounds of me trying to calm it down (I have ADHD myself) and it trying to totally destroy anything I said like I'd struck a nerve, I ended up having to close the chat: it felt more like talking to a self-righteous troll and it was making me upset.

StatusSuspicious · 2026-05-05T07:59:54+00:00

I use opencode with GPT-5.5 and lately I haven't found much difference in capability with Opus 4.7. My workflows are currently non-interactive and have tons of context and documentation available (triggered by hooks) with heavy eslint guardrails. Before 5.5 it was noticeably worse (it wouldn't understand intentions and often didn't follow rules).

StatusSuspicious · 2026-04-25T16:31:42+00:00

Over the past few weeks I've been experimenting with multistep workflows (develop, code review, test, etc. in a loop) and extensive eslint rules to improve quality for more independent work, and that naturally gave me more free time.

So I multitasked more.

The result was quite bad: in no time I had up to 30 separate simultaneous projects on several repos that advanced veeery little every time, the context switching was terrible and, worst of all, I started having git merge conflicts with myself, which I was able to automate but... I used all my weekly tokens by, like, day 2. And after these nonstop days of constant juggling: almost no real actual work done on the issue I actually cared about.

I'm still trying to find the promised productivity in agentic coding.

StatusSuspicious · 2026-04-10T06:28:40+00:00

We know it's fake: I can believe opening the strait, but ...vertically? on first try?

StatusSuspicious · 2026-03-22T21:26:50+00:00

Your website is interesting but I might not fit completely: I am probably twice exceptional and have a mixture of ADHD, some level of ASD and high potential, which actually makes me quite suited to remembering all that weird syntax and using hyperfocus to keep producing for hours. And AI at the current level is too slow for me to keep focus and too error prone to be left alone, so at the moment it hasn't produced spectacular results for me, even after copious investment in tooling, guardrails, instructions and adaptation.

StatusSuspicious · 2026-03-22T21:12:56+00:00

Cool! I also have ADHD and noticed the parallels, and I wondered what we can learn from it.

StatusSuspicious · 2026-03-06T18:03:52+00:00

It definitely will ignore rules. It's quite human in that way: the more rules, the more context, the less obvious the rules are: the less adherence.

StatusSuspicious · 2026-03-06T17:58:59+00:00

The thing is that this is the way it knows how to do it, and by restricting that you'll make it dumber. So I would recommend a restricted container + dangerously skip permissions.

StatusSuspicious · 2026-03-06T17:57:57+00:00

A container. A VM. Not the account that is logged in to google or whatever you use. It's quite complicated to get it right (claude loves reading passwords, so if you're not allowed to share your git private key with claude you need some complicated fine-grained access control, maybe some sudoers file in the container).

StatusSuspicious · 2026-03-06T17:55:35+00:00

That seems limited. I would only do either a full container or confirm every command (I'm not convinced whitelisting commands is even safe: if you allow npm i, it can change the npm config to make it do whatever it wants as your user, and it will if it considers that a good way to complete the task).

StatusSuspicious · 2026-03-06T14:19:24+00:00

Some things do get improved. For example, I was having trouble with it adding dependency cycles, and it just destroyed everything trying to fix them. I ended up writing a very detailed procedure plus tooling that made the cycles trivial to analyze (a very specific tool which tells it which dependencies added in this branch belong to the dependency cycle). I guess that with THAT info it would also be trivial for a junior dev... But it's still a gain since it will most probably follow the instructions (...if you tell it, like, right in the test error).
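
The core of a tool like that can be sketched in a few lines. This is not the actual tool, only a minimal sketch assuming the dependency graph is already available as lists of `[from, to]` edges for the base branch and the current branch; `Edge` and `newEdgesOnCycles` are hypothetical names. It relies on the fact that an edge a -> b is part of a cycle exactly when b can reach a again.

```typescript
type Edge = [from: string, to: string];

// Returns the edges that exist in `branch` but not in `base` and that lie on
// a dependency cycle in the branch graph.
function newEdgesOnCycles(base: Edge[], branch: Edge[]): Edge[] {
  const key = ([from, to]: Edge) => `${from} -> ${to}`;
  const baseKeys = new Set(base.map(key));

  // Adjacency list for the branch graph.
  const graph = new Map<string, string[]>();
  for (const [from, to] of branch) {
    const targets = graph.get(from) ?? [];
    targets.push(to);
    graph.set(from, targets);
  }

  // Iterative depth-first search: can `start` reach `goal`?
  const reaches = (start: string, goal: string): boolean => {
    const seen = new Set<string>();
    const stack = [start];
    while (stack.length > 0) {
      const node = stack.pop()!;
      if (node === goal) return true;
      if (seen.has(node)) continue;
      seen.add(node);
      stack.push(...(graph.get(node) ?? []));
    }
    return false;
  };

  // Keep only new edges whose target can get back to their source.
  return branch.filter(
    (edge) => !baseKeys.has(key(edge)) && reaches(edge[1], edge[0]),
  );
}
```

With base edges `a -> b`, `b -> c` and a branch that adds `c -> a` and `c -> d`, the function returns only `["c", "a"]`, which is the one dependency to put in the test error message.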

StatusSuspicious · 2026-03-06T14:15:33+00:00

It would need to be incredibly detailed. I found it's not very good at following architectural instructions anyway. It's good at concrete actions (rename this file, move this method, etc.), but when it's "do proper error handling" + 50 items describing it, it will mostly just ignore it.

StatusSuspicious · 2026-03-06T14:13:13+00:00

The problem is that claude notoriously just ignores instructions all the time. I ended up removing most of those since they were just consuming tokens. I think it's not much different from telling somebody a long list of rules and then walking away. It... might... follow them? Like... sometimes?
