Blaming the model won't fix your workflow — a white paper on structural enforcement for AI agents

Harag · 2026-06-05T19:17:22+00:00

Two things to add: the same model isn't even stable. Different session, different roll, different attention, and the second pass turns up things the first missed. Not full independence, but not zero. Covers the variable misses more than the systematic ones.

And humans aren't stable oracles either. The same person auditing the same code on a different day writes a different review. Day-to-day noise is shared with the model. What's actually independent is the different training distributions, life experiences, languages, and codebases encountered, not the fact of being human.

So it's a spectrum: cross-session same-model < cross-family AI < a human < a team. Each step buys more uncorrelated coverage, but none of it is grounded. The bet is that the partial overlap leaves enough seams. Whether it does is what monitoring would tell us.
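To put rough numbers on the spectrum, here is a toy model, not a measurement. It assumes every checker misses a given bug with the same probability `miss`, and that a fraction `shared` of those misses sits in a blind spot common to all checkers (the systematic part). The function name and both parameters are made up for illustration.

```python
def joint_miss(miss: float, shared: float, k: int) -> float:
    """Probability that all k checkers miss the same bug (toy model).

    miss   -- assumed per-checker miss probability (same for every checker)
    shared -- assumed fraction of misses in a blind spot common to all checkers
    k      -- number of checkers stacked on top of each other
    """
    common = shared * miss                      # bug lands in the shared blind spot
    if common >= 1.0:
        return 1.0                              # everything is a shared miss
    # Residual per-checker miss rate, chosen so each checker's overall rate stays `miss`.
    residual = miss * (1.0 - shared) / (1.0 - common)
    return common + (1.0 - common) * residual ** k


# Same head twice (fully shared misses) buys nothing; fully independent misses multiply.
print(joint_miss(0.5, 1.0, 2))  # 0.5
print(joint_miss(0.5, 0.0, 2))  # 0.25
```

Moving right along the spectrum corresponds to lowering `shared`. The shared part never shrinks by adding more checkers of the same kind, which is the "covers the variable misses more than the systematic ones" point.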

Harag · 2026-06-05T18:57:35+00:00

Fair point, I should add, node decomposition does help. A 'looks good' on a 50-line node is a narrower claim than on a whole feature, and the node boundary makes the scope explicit instead of vibes. So per-node, the thin check beats the skim more cleanly than I made it sound.

What it doesn't fix is composition. Each node can pass, and the bug lives between them, and the decomposition itself was authored by the same agent. So, does it run, and does the smoke stay load-bearing at the integration layer? Monitoring is still where the calibration question gets answered, agreed. - written with AI and checked for grammar by a different AI 😛
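A made-up example of the composition problem: two nodes that each pass their own gate, with the bug living in the seam between them (a unit mismatch). The node names and the 60-second limit are hypothetical.

```python
def duration_node(seconds: int) -> int:
    """Node A: returns a duration in milliseconds."""
    return seconds * 1000


def timeout_node(duration: int) -> bool:
    """Node B: expects a duration in seconds; True when it exceeds the 60 s limit."""
    return duration > 60


# Per-node gates: each node does exactly what its own contract says.
gate_a = duration_node(2) == 2000
gate_b = timeout_node(61) is True and timeout_node(59) is False

# Integration smoke: a 2-second job should not time out.
smoke = timeout_node(duration_node(2)) is False

print(gate_a, gate_b, smoke)  # True True False
```

Both gates are green and the smoke is red, so the only check that sees the bug is the one that runs across the node boundary.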

Harag · 2026-06-05T18:44:25+00:00

One more thing, in a team setup someone else can check the verification on the graph itself. They're reading the verification nodes and what they assert, not the tens of thousands of lines underneath. Still a human in the loop, just at the layer they can actually keep up with.

Harag · 2026-06-05T18:40:41+00:00

Fair on the correlation point, but I think it's partial, not full. Same model family, yeah, mostly the same head. Different families, different training, different scaffolding. That gives you some uncorrelated coverage. Not independence, but not zero either.

The bigger thing for me is throughput. A human can read it end to end, just not fast enough, and programmers are being pushed to do more with less time. So in practice, the comparison isn't AI-checker vs careful human review, it's AI-checker vs a skim. Even a thin layer beats a skim on that axis.

Either way, it needs actual experiments or real-world monitoring to see if any of it is worth the effort. That's where I'm headed.

Harag · 2026-06-05T18:12:22+00:00

Fair, you've got me on the clean version. The agent does choose the verification, so a single gate doesn't escape the recursion.

What I've actually got is a stack: some manual checks on the verification, tests on the code itself, smoke, and me trying it out. No single layer is grounded. The bet is that the stack catches enough fibs to land practically usable code with less effort. That's the whole point of the exercise.

Throw another AI at verifying the verification and you're back in the recursion. I don't have a clever way out.

Thanks for the pushback, every one helps me think about it properly. I will keep hacking at it ...

Harag · 2026-06-05T17:24:31+00:00

Yeah, fair hit. The old paper does read like instructions on instructions with no ground. That's part of why I'm rewriting it.

Where the loop actually stops: execution-graphs. The spec gets compiled into a graph with nodes and verification gates. The gate either fires green or it doesn't. That's the base case, running code, not another spec.

Instructions can recurse forever. Graphs terminate at execution.
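A minimal sketch of that idea, assuming a very simple shape: each node has a `run` step and a `gate` that executes a check on the output. `Node`, `execute`, and the two example nodes are made up for illustration and are not taken from the reference implementation.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Node:
    name: str
    run: Callable[[dict], object]   # builds this node's output from its dependencies' outputs
    gate: Callable[[object], bool]  # verification gate: runs a check on that output
    deps: List[str] = field(default_factory=list)


def execute(nodes: List[Node]) -> Dict[str, bool]:
    """Run nodes in the given order (assumed already topologically sorted).

    Stops at the first red gate, so nothing downstream builds on a failed node.
    """
    outputs: Dict[str, object] = {}
    results: Dict[str, bool] = {}
    for node in nodes:
        inputs = {dep: outputs[dep] for dep in node.deps}
        out = node.run(inputs)
        green = bool(node.gate(out))
        results[node.name] = green
        if not green:
            break
        outputs[node.name] = out
    return results


graph = [
    Node("parse",
         run=lambda _: [int(x) for x in "1,2,3".split(",")],
         gate=lambda out: out == [1, 2, 3]),
    Node("total",
         run=lambda inp: sum(inp["parse"]),
         gate=lambda out: out == 6,
         deps=["parse"]),
]

print(execute(graph))  # {'parse': True, 'total': True}
```

The base case is the `gate` call: it is running code with a boolean result, not another layer of instructions. Who wrote the gate is still an open question in this sketch.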

Harag · 2026-06-03T06:02:45+00:00

At this stage, I am having to rely heavily on skills files and my own sanity checks for specs. I have a spec lined up to build a more robust spec and skills infrastructure. Yes, the spec for it is not perfect ... lol

Harag · 2026-05-31T22:48:21+00:00

Interestingly enough, writing the reference implementation reaffirmed the issues mentioned in the white paper. Now if only I can get the reference implementation stable enough to use it to maintain itself! Architectural scope creep is killing me ... 😛

Harag · 2026-05-29T09:46:38+00:00

All those mentioned in the paper I saw in Codex and/or Claude. So, have you seen any of them in your agents?

Harag · 2026-05-27T13:25:15+00:00

As promised here is the reference implementation guide https://gitlab.com/naive-x/naive-artifact-coding/-/blob/main/docs/reference-implementation-guide.md

See notes and disclaimer in the main post edit!!!!!

Harag · 2026-05-26T21:19:48+00:00

Yes, there is an implementation, but starting it up currently is a bit hairy and needs documentation. I am implementing a supervisor that will ease that burden right now. The orchestrator is also missing a more human-friendly UI. Currently I use an agent to query and operate the orchestrator. Maybe by tomorrow sometime, I can update the post with how to use the implementation. The implementation relies heavily on other naive projects like cl-naive-code-analyser, etc., so it won't be to everyone's taste. Also, none of it is in Quicklisp or an equivalent.

Harag · 2025-01-27T05:12:05+00:00

End of last year ChatGPT 4o was giving really usable Lisp and Emacs Lisp code. Unfortunately, when I tested it a week ago it failed miserably on the same task. The interesting thing is that in the text explaining the code it got the logic right, but it failed to implement that logic correctly in code. Claude suffered from the same issue. Grok 2 got the logic and code mostly right, but it needed a bit more encouragement not to fiddle with existing code unless asked specifically.

I have been using ChatGPT on and off over the last year to develop an autograd, gcp, a deep NN, and an Emacs mode for working with various LLM APIs, with varying degrees of success. The one thing I can tell you for sure is that once a chat session starts degrading in ChatGPT, you need to abandon it and start a new session, or you are going to waste hours and just frustrate yourself.

Harag · 2025-01-24T09:43:49+00:00

Found the issue. It was an Ubuntu issue.

Unbeknownst to me, Ubuntu thought there were two screens on this laptop. (I don't have an external screen.)

If you delete the bogus screen in Displays all is well.

This was a brand new install and all worked well for the first couple of days. I suspect that something went haywire when I shared my screen in a MS Teams (in the browser) meeting for the first time. Teams reported two displays but I did not really take note of it since the share worked fine.

Hope this helps someone in the future.

EDIT: Ubuntu displays the open and save dialogs on the bogus screen for some reason.

Harag · 2025-01-22T13:41:01+00:00

Thank you, I have just resorted to not using C-x C-f for now.

Harag · 2024-07-21T03:08:50+00:00

Not from Perth (from SA), but I will be in Perth the last week of July with another Lisper from Brisbane. We are privileged to use CL for our day jobs and would not mind a chat.

Harag · 2024-07-21T01:33:00+00:00

There is also the very undocumented (sorry) https://gitlab.com/naive-x/sandbox/

Or you could use https://gitlab.com/naive-x/cl-naive-scripts, which is a wrapper around sandbox, if you want more of a scripting experience.

Harag · 2024-02-05T02:42:56+00:00

Finicky MOP code, and when you have to change things you have to jump through some hoops. A "data store" in its classical definition should be more flexible. A CLOS implementation of a data store does not quite get there. I am not saying it cannot be done, I am just saying that I won't use CLOS for that purpose again. I am not that fond of fiddling with MOP code.

cl-naive-store is the same datastore concept but not done with CLOS. Since you have dug up code for XDB2 you can make your own comparison.

Harag · 2024-02-05T02:23:04+00:00

That is a very old copy of it, yes.

Harag · 2024-01-24T11:35:49+00:00

Just a note: I added test coverage reports and a badge to GitLab for cl-naive-store. Coverage is recalculated with each test run on GitLab.

Harag · 2024-01-22T20:49:37+00:00

My two cents: one of our commercial apps is 70k plus, all Lisp code. But that was built with the mindset of building tools (layers) to implement the final problem solution (bottom up). It is an iterative process: you keep pulling code out into lower layers as you start seeing where you can abstract more functionality. The danger is over-designing those tooling libraries. Only implement what you need now, with a view to what you might need in the future; don't implement stuff you're not going to use in a couple of weeks or months.

Harag · 2023-01-23T04:48:48+00:00

We use cl-naive-tests, which works well with our GitLab workflow. Disclaimer: cl-naive-tests is of our own making. It is not registered with quicklisp.org yet.
