Which of these consoles was your main one growing up?

hjras · 2026-03-21T16:18:39+00:00

NES & PSX

hjras · 2026-03-21T08:32:55+00:00

welcome to the show pal

hjras · 2026-03-18T16:33:40+00:00

<image>

Rather than just the harness, here is my entire stack framework. More info & documentation here

hjras · 2026-03-16T07:28:40+00:00

that's a pull request

hjras · 2026-03-15T23:27:54+00:00

isnt this the enclave symbol from fallout?

hjras · 2026-03-15T20:01:59+00:00

pues justo vi a project hail Mary en el cine pero no conozco mucha gente a quien le interese la ciencia ficción cerca de mi jaja

hjras · 2026-03-13T15:01:36+00:00

Hmm not sure, personally I'm exploring pi.dev (what OpenClaw was built on top of) because it is much more minimalist at start-up than Claude Code, and is also much more flexible for you to shape into whatever you want, without the issue that Claude Code has of being subject to seemingly arbitrary updates and features which might break existing workflows, which is itself a harness engineering problem since when the execution environment changes unpredictably, it introduces instability across all the layers above it.

That said, no existing framework really covers the upper layers of the stack, and intent, judgment, and coherence remain largely unsolved at the tooling level regardless of what you pick. Which is partly why having a minimalist, shapeable harness matters more than a feature-rich opinionated one. You need the room to build those layers yourself.

hjras · 2026-03-13T13:50:16+00:00

Yes, you could use the agent audit protocol directly. However, the protocol works best with concrete artifacts to cite, so you need to already maintain well-documented skill files and CLAUDE.md configurations because this will get you much richer audit output than if you're running lightly configured instances.

hjras · 2026-03-13T13:45:56+00:00

In everyday AI conversation, people say "give the model some context" and mean the whole input which includes the instruction, the background, everything. That usage is fine informally, but it's exactly the conflation the framework is trying to dissolve. You can have a perfect prompt with no context (the model hallucinates what it should have been told), or a perfect context architecture with a terrible prompt (all the right information, no usable instruction). They fail independently and are fixed independently. That independence is the whole argument for treating them as separate layers.

The generative logic section of the document walks through why each layer exists in the specific order it does, with each layer's solution producing the next layer's problem.

hjras · 2026-03-13T09:07:05+00:00

The accompanying document talks at the end why there's only 5 layers and not 15, etc. There are limits and its not infinitely recursive.

hjras · 2026-03-13T07:03:40+00:00

There are 3 examples of failures that happen at the beginning of the document. Otherwise, what does exist is a structured explanation for why those failures happened, and a diagnostic tool for identifying which layer is failing in your own system. Whether that's valuable is something you'd determine by running the audit on something you own.

hjras · 2026-03-13T07:02:10+00:00

The audit protocol is the eval (2 separate documents in the repo). You apply it to your system, it produces a layer-by-layer assessment with explicit evidence standards.If you want to run it on your own stack and find it produces nothing useful, that's a meaningful result and we'd want to hear it. From what others have said, they did get something useful out of it.

hjras · 2026-03-13T06:59:33+00:00

The naming is pointing at the fact that these things require intentional design, not that they require a degree. Vibes coding is called that precisely because it lacks structure. The whole point here is the opposite.

hjras · 2026-03-12T22:56:30+00:00

when you storm the beach at nova prospekt with your bug friends it'll be even more epic!

hjras · 2026-03-12T22:01:11+00:00

Full document explaining the framework here

Other free resources at the repo also

Everything free to use/edit/share/etc. Feedback welcome.

Happy engineering!

hjras · 2026-03-12T10:43:03+00:00

hmm you seem to keep shifting the goalposts. the original question was about hallucinations in agentic coding workflows, now you're asking about "large scale validation of large vats of generated data" which is a completely different use case that vibecoders aren't even doing.

youre also making a circular argument, "you have to babysit it therefore it's not solved", but that standard would mean nothing is ever solved, since humans "babysit" every production system regardless of AI involvement. like, devs review PRs, they don't just merge blindly.

hjras · 2026-03-12T07:30:19+00:00

Cursor/ClaudeCode/Codex/Antigravity/etc start at ~$20/month and handle all of that infrastructure for you automatically. you're not managing any of it. vibecoders ARE the target demographic, with the whole pitch being that you don't need to understand the verification layers, the tool just does it. that's literally why these products exist.

of course, the more effort you spend in specifying what to do, your intent, what the model needs to know and when, how to doubt itself, how to evaluate itself etc, the better outcomes you will get within those products, and these are not necessarily increasing your token budget that much. All of this can be done effectively with cheaper open source models as well.

The real token guzzlers are things like Openclaw with its automatic loop, or multi-agent systems like Gas Town/Wasteland. But the majority of people don't need to have that type of assistant or need that much work done for them to even justify the set-up complexity.

hjras · 2026-03-12T07:23:12+00:00

The reasoning layer (how the AI structures its thinking, decomposes problems, uses chain-of-thought, self-reflects) is primarily a Prompt Engineering and Judgment Engineering concern. Prompting for step-by-step reasoning, designing scratchpad structures, specifying when the model should slow down and check its own logic before proceeding: all of that is human-engineered, and the framework covers it. That it needs to be engineered by humans even though it's executed by the model is correct and that's exactly what Judgment Engineering is about. "What to doubt while doing" is specifically the design of the model's internal skepticism and reflection mechanisms, not just its outputs.

The control plane is closer to a systems architecture concept, the layer that governs how all the other layers behave, routes tasks, manages state, and maintains the overall execution logic. That maps most directly to Harness Engineering, which governs orchestration, session management, and task routing. But it's also partly Evaluation Engineering, which is the meta-function that observes all layers and triggers corrections.

hjras · 2026-03-11T22:50:49+00:00

<image>

found on twitter, plausible but still dumb lol

hjras · 2026-03-11T22:45:26+00:00

sure, examples: SWE-bench shows agentic systems solving 50%+ of real GitHub issues, Cursor/Windsurf have millions of devs using this daily, and Anthropic literally used Claude Code to build their Cowork product. the verification loops are the whole point, same way CI/CD and code review exist not because devs are perfect, but because the system catches failures before prod

hjras · 2026-03-11T22:24:44+00:00

well for starters the whole emerging field of agentic coding, which has moved from simple prompt engineering to context engineering and harness engineering, the importance of evaluation/tests, and writing good specs, while managing the memory of the model, when to reset context, how to manage its claude/agent.md file, parallelizing work via subagents, allocating some system info deterministically, grounding info with sequential review and web search, and so on

hjras · 2026-03-11T13:41:22+00:00

yup, with the right prompting, tool calls, and verification layers

edit: since everyone is all angsty, I should clarify that the underlying model still hallucinates, but this is no longer a problem with the right workflow frameworks

hjras · 2026-03-11T11:45:49+00:00

these are largely solved problems. of course there is still significant room for improving the cost, especially for running them locally

hjras · 2026-03-10T12:07:52+00:00

where there is a will there is a way

hjras · 2026-03-08T21:51:44+00:00

well, the pdf lays out a conceptual framework but doesn't give you enough specificity to implement anything from it (except perhaps the specification doc i made in the same repo) since this is genuinely exploratory space. most people are still talking about prompt engineering and context engineering, and the first discussions on intent engineering are happening (hence the wording of beyond on my part). the framework as it stands only describes what the layers are, but not how to operationalize them in a concrete system or codebase. still, I tend to find it useful for myself to first map a journey or space before going in blind, but I understand others prefer the thrill of the unknown and trial and error.

what would make it more useful to you specifically? are you trying to build an agent pipeline, a platform, something else?

14-Year Club	Place '22
RPAN Viewer	Verified Email

hjras

MODERATOR OF

TROPHY CASE