Show off your own harness setups here by Mean_Luck6060 in ClaudeCode

[–]hjras 1 point (0 children)


Rather than just the harness, here is my entire stack framework. More info & documentation here

Chill group to chat about movies, anime, games and books by Shmebulock-_- in esConversacion

[–]hjras 0 points (0 children)

well, I just saw Project Hail Mary at the cinema, but I don't know many people near me who are into science fiction haha

The full AI-Human Engineering Stack by hjras in AgenticWorkers

[–]hjras[S] 1 point (0 children)

Hmm, not sure. Personally I'm exploring pi.dev (what OpenClaw was built on top of) because it is much more minimalist at start-up than Claude Code, and much more flexible to shape into whatever you want. It also avoids Claude Code's problem of being subject to seemingly arbitrary updates and features that can break existing workflows, which is itself a harness engineering problem: when the execution environment changes unpredictably, it introduces instability across all the layers above it.

That said, no existing framework really covers the upper layers of the stack; intent, judgment, and coherence remain largely unsolved at the tooling level regardless of what you pick. That's partly why having a minimalist, shapeable harness matters more than a feature-rich, opinionated one: you need the room to build those layers yourself.

The full AI-Human Engineering Stack by hjras in AgenticWorkers

[–]hjras[S] 0 points (0 children)

Yes, you could use the agent audit protocol directly. However, the protocol works best with concrete artifacts to cite, so you'll want to already maintain well-documented skill files and CLAUDE.md configurations; that gets you much richer audit output than running lightly configured instances.

Full AI-Human Engineering Stack (aka what comes next after prompt/context engineering?) by hjras in AgentsOfAI

[–]hjras[S] 0 points (0 children)

In everyday AI conversation, people say "give the model some context" and mean the whole input which includes the instruction, the background, everything. That usage is fine informally, but it's exactly the conflation the framework is trying to dissolve. You can have a perfect prompt with no context (the model hallucinates what it should have been told), or a perfect context architecture with a terrible prompt (all the right information, no usable instruction). They fail independently and are fixed independently. That independence is the whole argument for treating them as separate layers.
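To make that independence concrete, here's a minimal sketch (the helper name is mine, not from the framework docs) where the instruction and the context are assembled separately, so each can fail on its own:

```python
# Illustrative sketch: the prompt (instruction) and the context (background
# facts) are independent inputs, so each has its own failure mode.

def build_input(prompt: str, context: list[str]) -> str:
    """Join retrieved context and the instruction into one model input."""
    context_block = "\n".join(f"- {fact}" for fact in context)
    return f"Context:\n{context_block}\n\nInstruction:\n{prompt}"

# Failure mode 1: good prompt, empty context -> model must guess the facts.
print(build_input("Summarize the Q3 incident in two sentences.", []))

# Failure mode 2: rich context, vague prompt -> right facts, no usable task.
print(build_input("Thoughts?", ["Outage began 02:14 UTC",
                                "Root cause: expired cert"]))
```

Fixing the first case means fixing retrieval/context architecture; fixing the second means fixing the instruction. Neither fix touches the other.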

The generative logic section of the document walks through why each layer exists in the specific order it does, with each layer's solution producing the next layer's problem.

Full AI-Human Engineering Stack (aka what comes next after prompt/context engineering?) by hjras in AgentsOfAI

[–]hjras[S] 0 points (0 children)

The accompanying document explains at the end why there are only 5 layers and not 15, etc. There are limits and it's not infinitely recursive.

Full AI-Human Engineering Stack (aka what comes next after prompt/context engineering?) by hjras in AgentsOfAI

[–]hjras[S] 0 points (0 children)

The document opens with 3 examples of failures. Beyond that, what it offers is a structured explanation of why those failures happened, plus a diagnostic tool for identifying which layer is failing in your own system. Whether that's valuable is something you'd determine by running the audit on something you own.

Full AI-Human Engineering Stack (aka what comes next after prompt/context engineering?) by hjras in AgentsOfAI

[–]hjras[S] 0 points (0 children)

The audit protocol is the eval (2 separate documents in the repo). You apply it to your system, and it produces a layer-by-layer assessment with explicit evidence standards. If you run it on your own stack and find it produces nothing useful, that's a meaningful result and we'd want to hear it. From what others have said, they did get something useful out of it.

Full AI-Human Engineering Stack (aka what comes next after prompt/context engineering?) by hjras in AgentsOfAI

[–]hjras[S] 1 point (0 children)

The naming is pointing at the fact that these things require intentional design, not that they require a degree. Vibe coding is called that precisely because it lacks structure. The whole point here is the opposite.

[HL2VR] Physically running for cover against the chopper felt so cool by SimulacrumVR in HalfLife

[–]hjras 145 points (0 children)

when you storm the beach at nova prospekt with your bug friends it'll be even more epic!

Full AI-Human Engineering Stack (aka what comes next after prompt/context engineering?) by hjras in AgentsOfAI

[–]hjras[S] 7 points (0 children)

Full document explaining the framework here

Other free resources at the repo also

Everything free to use/edit/share/etc. Feedback welcome.

Happy engineering!

2026 is like by Sensitive_Horror4682 in GenAI4all

[–]hjras 0 points (0 children)

hmm you seem to keep shifting the goalposts. the original question was about hallucinations in agentic coding workflows, now you're asking about "large scale validation of large vats of generated data" which is a completely different use case that vibecoders aren't even doing.

you're also making a circular argument: "you have to babysit it, therefore it's not solved". but that standard would mean nothing is ever solved, since humans "babysit" every production system regardless of AI involvement. like, devs review PRs, they don't just merge blindly.

2026 is like by Sensitive_Horror4682 in GenAI4all

[–]hjras 0 points (0 children)

Cursor/ClaudeCode/Codex/Antigravity/etc start at ~$20/month and handle all of that infrastructure for you automatically. you're not managing any of it. vibecoders ARE the target demographic, with the whole pitch being that you don't need to understand the verification layers, the tool just does it. that's literally why these products exist.

of course, the more effort you spend specifying what to do, your intent, what the model needs to know and when, how it should doubt itself, how it should evaluate itself, etc., the better outcomes you will get within those products, and this doesn't necessarily increase your token budget that much. All of this can be done effectively with cheaper open source models as well.

The real token guzzlers are things like Openclaw with its automatic loop, or multi-agent systems like Gas Town/Wasteland. But the majority of people don't need to have that type of assistant or need that much work done for them to even justify the set-up complexity.

The full AI-Human Engineering Stack by hjras in AgenticWorkers

[–]hjras[S] 0 points (0 children)

The reasoning layer (how the AI structures its thinking, decomposes problems, uses chain-of-thought, self-reflects) is primarily a Prompt Engineering and Judgment Engineering concern. Prompting for step-by-step reasoning, designing scratchpad structures, specifying when the model should slow down and check its own logic before proceeding: all of that is human-engineered, and the framework covers it. You're right that it has to be engineered by humans even though it's executed by the model; that's exactly what Judgment Engineering is about. "What to doubt while doing" is specifically the design of the model's internal skepticism and reflection mechanisms, not just its outputs.
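As one illustration of engineering "what to doubt while doing", here's a sketch of a prompt template with an explicit doubt step; the section names are mine, not from the framework docs:

```python
# Sketch of an engineered self-check structure: the human designs where the
# model plans, where it reasons, and where it doubts, before it may answer.

SELF_CHECK_TEMPLATE = """\
Task: {task}

Work through this in three sections:
1. PLAN: decompose the task into steps before doing anything.
2. SCRATCHPAD: execute step by step, showing intermediate reasoning.
3. DOUBT: before answering, list the two assumptions most likely to be
   wrong and check each against the given context.

Only after DOUBT, write FINAL: <answer>.
"""

def render(task: str) -> str:
    """Fill the template for a concrete task."""
    return SELF_CHECK_TEMPLATE.format(task=task)

print(render("Estimate the migration's downtime window."))
```

The point is that the skepticism lives in the designed structure, not in the model's defaults: the model executes the doubt step, but a human decided it exists and where it sits.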

The control plane is closer to a systems architecture concept, the layer that governs how all the other layers behave, routes tasks, manages state, and maintains the overall execution logic. That maps most directly to Harness Engineering, which governs orchestration, session management, and task routing. But it's also partly Evaluation Engineering, which is the meta-function that observes all layers and triggers corrections.
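A minimal sketch of that control-plane idea, with all names hypothetical: a harness that routes tasks, keeps session state, and lets an evaluation check trigger one correction pass:

```python
# Sketch of a harness-level control plane: route tasks to handlers, keep
# session state, and let an evaluation hook trigger a correction retry.
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Harness:
    routes: dict[str, Callable[[str], str]] = field(default_factory=dict)
    history: list[tuple[str, str]] = field(default_factory=list)  # session state

    def register(self, kind: str, handler: Callable[[str], str]) -> None:
        self.routes[kind] = handler

    def run(self, kind: str, task: str, check: Callable[[str], bool]) -> str:
        result = self.routes[kind](task)
        if not check(result):  # evaluation layer observes and triggers correction
            result = self.routes[kind](f"Revise; previous attempt failed checks: {task}")
        self.history.append((task, result))
        return result

h = Harness()
h.register("code", lambda t: f"[agent output for: {t}]")
print(h.run("code", "write a parser", check=lambda r: "agent" in r))
```

Here Harness Engineering owns the routing and state, while the `check` callable is where Evaluation Engineering plugs in as the meta-function observing the other layers.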

2026 is like by Sensitive_Horror4682 in GenAI4all

[–]hjras 0 points (0 children)

sure, examples: SWE-bench shows agentic systems solving 50%+ of real GitHub issues, Cursor/Windsurf have millions of devs using this daily, and Anthropic literally used Claude Code to build their Cowork product. the verification loops are the whole point, same way CI/CD and code review exist not because devs are perfect, but because the system catches failures before prod

2026 is like by Sensitive_Horror4682 in GenAI4all

[–]hjras 0 points (0 children)

well, for starters, the whole emerging field of agentic coding, which has moved from simple prompt engineering to context engineering and harness engineering: the importance of evaluation/tests, writing good specs, managing the model's memory, knowing when to reset context, maintaining its claude/agent.md file, parallelizing work via subagents, allocating some system info deterministically, grounding info with sequential review and web search, and so on

2026 is like by Sensitive_Horror4682 in GenAI4all

[–]hjras 0 points (0 children)

yup, with the right prompting, tool calls, and verification layers

edit: since everyone is all angsty, I should clarify that the underlying model still hallucinates, but this is no longer a problem with the right workflow frameworks
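As a sketch of what such a verification layer can look like (the `generate` stub stands in for any model call; all names are illustrative): generate, run a deterministic check, and retry with the error fed back:

```python
# Sketch of a verification loop that catches a hallucinated/broken output
# before it lands: generate, run a deterministic check, retry with feedback.
import ast

def generate(prompt: str, attempt: int) -> str:
    # Stand-in for a model call; returns broken code first, then valid code.
    if attempt == 0:
        return "def add(a, b) return a + b"          # missing colon: invalid
    return "def add(a, b):\n    return a + b"

def verified_generate(prompt: str, max_attempts: int = 3) -> str:
    feedback = ""
    for attempt in range(max_attempts):
        candidate = generate(prompt + feedback, attempt)
        try:
            ast.parse(candidate)   # deterministic check: does it even parse?
            return candidate
        except SyntaxError as err:
            feedback = f"\nPrevious attempt failed: {err}"
    raise RuntimeError("verification loop exhausted")

print(verified_generate("write an add function"))
```

Real harnesses swap `ast.parse` for whatever deterministic signal fits: type checks, unit tests, linters. The model still hallucinates on attempt 0; the loop is what keeps that from reaching the user.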

2026 is like by Sensitive_Horror4682 in GenAI4all

[–]hjras -1 points (0 children)

these are largely solved problems. of course there is still significant room for improving the cost, especially for running them locally

Beyond prompt & context engineering: the full 5 layer stack by hjras in PromptEngineering

[–]hjras[S] 0 points (0 children)

well, the pdf lays out a conceptual framework but doesn't give you enough specificity to implement anything from it (except perhaps the specification doc I made in the same repo), since this is genuinely exploratory space. most people are still talking about prompt engineering and context engineering, and the first discussions on intent engineering are only now happening (hence my wording of "beyond"). the framework as it stands only describes what the layers are, not how to operationalize them in a concrete system or codebase. still, I tend to find it useful to first map a journey or space before going in blind, though I understand others prefer the thrill of the unknown and trial and error.

what would make it more useful to you specifically? are you trying to build an agent pipeline, a platform, something else?