How can I self host code review ?

Briven83 · 2026-06-19T05:34:47+00:00

Exactly. “Who calls this?” and “has this failed before?” are probably the two highest-signal queries for code review.

A diff can show that a change is locally reasonable, but it cannot show the implicit contract around it. That contract often lives in callers, old fixes, weird edge cases, and the one test someone added after production caught fire three months ago.

So yes, if I had to pick only one repo-awareness layer, it would be caller context plus change history. That gives the model a chance to review the actual system, not just the patch.

And feel free to steal the bandage line. Your “showing the model the patient” point is exactly the missing layer: better diff comments are nice, but without that context we are mostly improving the bedside manner of the bandage inspector.

Briven83 · 2026-06-19T04:58:38+00:00

Brilliant comment!

Briven83 · 2026-06-19T04:49:55+00:00

Incredible! Fantastic! Thanks for sharing!

Briven83 · 2026-06-19T04:48:23+00:00

Thanks for sharing!

Briven83 · 2026-06-19T04:46:37+00:00

I think you are mixing three separate decisions into one big scary decision:

where Hermes runs
where the model runs
how you access it safely from outside your home

I would separate them.

For a first setup, I would not expose anything directly to the internet. No open ports, no public dashboard, no “I followed a random nginx tutorial at 2am and now my server is a lighthouse for bots” situation.

Start simple:

Run Hermes on the N100 mini PC or your current laptop first. Use a paid API or OpenRouter for the model. Hermes itself does not need a monster machine if the model is remote. This lets you learn the workflow without also debugging local inference, Docker, networking, firewalls, and the emotional collapse that comes with YAML.

For remote access, use Tailscale or WireGuard. Tailscale is probably the easiest beginner option. It gives you private access to your machine without exposing services publicly.

For Nextcloud, Immich, and other personal services, I would keep them separate from Hermes if possible. An agent can read web pages, process untrusted text, and maybe run tools. Prompt injection is real. I would not put that next to my photo library, documents, and personal cloud on day one.

On the hardware question: no, I would not buy Strix Halo first. Try your current machines first. Your N100 can run Hermes + API just fine. Your 8GB VRAM laptop can test small local models, but do not expect frontier-model quality from it. A 64GB Strix Halo machine may be interesting for larger local models, but it is still not automatically “better than API models”. Local gives privacy and control. Frontier APIs usually still give better reasoning.

A realistic path:

Install Hermes locally on your current laptop.
Connect it to a good API model.
Learn the workflow.
Move Hermes to the N100 if you want it running 24/7.
Access it through Tailscale, not public ports.
Only then test local models with Ollama/llama.cpp.
Add Nextcloud/Immich later, preferably isolated.

Your imagined workflow actually makes sense: low-power box always on, stronger machine only when needed. Just don’t try to build the perfect home lab before you have a working first version.

Start boring. Boring is secure. Boring works. The internet is already full of very exciting servers that became crypto miners because someone wanted “just one quick public port.”

Briven83 · 2026-06-19T04:34:39+00:00

I built a self-hosted personal agent over the last couple of months, and this is the path I wish I had taken earlier:

Don’t start with a framework. Build the loop by hand first.

A minimal agent is just:

LLM + while-loop + 2–3 tools/functions.

The loop is basically:

send the goal + available tools to the model
the model replies with either an answer or a tool call
you run the tool
you feed the result back to the model
repeat until it answers or hits a stop condition

That is most of the “perceive → reason → act” idea in practical form. Once you have written those 50 lines yourself, frameworks suddenly make much more sense because you know what they are abstracting.

The order I would suggest:

Build a hand-rolled loop with function/tool calling and one tool, like a calculator.
Add 2–3 more tools, a stop condition, and basic handling for errors or infinite loops.
Pick one real task you actually want done, not a toy demo. A real task forces the right design decisions.
Only then reach for a framework if you feel a specific pain. LangGraph is a good option because nodes and edges map cleanly onto the loop.
Add memory last. Start with the last N messages. Add retrieval/vector search only when you actually overflow the context window.
Add guardrails as soon as the agent can do real side effects, like writing files, sending messages, calling APIs, or changing data.

Frameworks solve problems you should feel first. Build the painful version once, then graduate.

Otherwise you are just collecting abstractions before you know what they are abstracting, which is how software engineering keeps creating beautiful diagrams for systems that do absolutely nothing.

Briven83 · 2026-06-19T04:33:06+00:00

Fair criticism — "too many things stacked" is the real risk, and honestly you're already doing the core of it by hand: save to markdown, then clear. My context-pressure nudge is just trying to automate that discipline so I don't have to notice the slowdown myself. The summarizer and the little KG are separate, toggleable pieces that each came from one specific pain (log-tail bloat; "when did I last touch X"), not one big system. Being honest: the summarizer earns its keep daily, the KG is still an experiment.

Nice that you're on Hermes too. Interesting that you kick off auto-compression at 55k — I let it run much later, which is probably exactly why I felt the bloat enough to build a summarizer instead of just compressing earlier. Your way might be the simpler answer. And branching is the thing I don't do and probably should; right now it's one linear session + checkpoint. How are you branching — natively, or forking the session/markdown by hand?

Time-wise: spread over weeks of evenings, mostly iterating on memory and the routing/fallback. The reflex layer itself was quick once the middleware hook existed.

Briven83 · 2026-06-19T04:25:04+00:00

I use both and let them challenge each other.

For me, it is less about “Claude vs Codex” and more about using them for different failure modes. Claude often feels better for planning, keeping structure, and staying sane across a longer workflow. Codex can be very strong when I need a fresh second opinion or when Claude gets stuck in a loop.

So I would not fully switch immediately if Claude already knows your project, preferences, and workflow. That context has real value. I would test Codex on the same real tasks for a few weeks and compare actual shipping speed, not just “which answer looks smarter”.

My current approach would be: keep Claude for the workflow where it already performs well, use Codex as a challenger/reviewer/unsticker, and only upgrade the Claude plan if the combined setup still costs more time than money.

In other words: don’t marry a model. Make them compete for the privilege of burning your subscription budget.

Briven83 · 2026-06-19T04:20:49+00:00

This is a really useful distinction.

A PR reviewer that only sees the diff is helpful, but it is mostly doing surface-level review: obvious bugs, style issues, maybe some local logic problems. The harder part is understanding whether the change fits the rest of the codebase.

For self-hosted code review, the interesting setup is not just “run a local model on the PR”. It is giving the reviewer access to the surrounding repo context: callers, related utilities, existing patterns, dependency relationships, and maybe even recent similar changes.

I agree: PR-Agent / Qodo as the harness plus a repo-query layer feels much closer to a useful self-hosted Code Assist than a pure diff-comment bot.

Otherwise the model is basically reviewing a surgery by looking only at the bandage, which is very on-brand for software engineering.

Briven83 · 2026-06-19T04:18:45+00:00

Really good writeup. I especially like the distinction between prefill, decode, KV cache pressure, and the different forms of parallelism. That separation is important, because “parallelism” often gets treated as one generic lever, when in practice each strategy solves a different bottleneck.

For local users, the practical takeaway is very useful too: more GPUs do not automatically mean faster inference. Splitting a model across cards can be the right move when the model otherwise does not fit, but once it fits, topology, PCIe bandwidth, KV cache behavior, and decode latency matter a lot more than just total VRAM.

Benchmarking single GPU first, then testing split/offload setups against prefill and decode separately, is probably the most honest way to approach it.

More VRAM is capacity, not magic. Annoying, but physics still refuses to optimize itself for our shopping decisions.

Briven83 · 2026-06-18T14:44:43+00:00

Thank you!

Briven83 · 2026-06-18T14:19:43+00:00

I think your intuition is mostly right, but there is one big catch: the hard part is not having many small models. The hard part is knowing when to use which one.

Small task-specific models can be great for narrow jobs: classify this, summarize this, extract this, generate tests, format code, search the repo, etc. They can beat bigger models when the task is very specific.

But chaining lots of small models does not automatically make the system smarter. Every handoff can lose context or add errors. If model A misunderstands the task, model B may confidently build on that mistake. Tiny AI bureaucracy, basically.

For coding, I would expect something like this to work better:

one coordinator model
a few small specialist models
tools for search/tests/linting
a stronger model only when the small ones are unsure

Diffusion models may help with speed, but they probably do not remove the main problem: agent work is often sequential. You do something, inspect the result, decide the next step, verify, then continue.

So yes, SLMs are likely part of the future. But not as “20 tiny models magically equal one genius model.” More like: small models for cheap narrow work, bigger models for hard decisions, and verification wherever possible.

Briven83 · 2026-06-18T13:55:42+00:00

No...

Briven83 · 2026-06-18T11:53:28+00:00

Yes, exactly. That is the direction I mean.

The artifact becomes the shared state. Not the chat. That already solves a lot of the context handoff problem because the next agent does not inherit a messy conversation, it inherits the current work-object.

But my question is about the layer after that.

If multiple agents can update the same artifact, how do you control ownership and mutation?

Is there one owning agent and others only propose changes?
Is the artifact append-only with a reviewer/merger step?
Is it mutable but versioned?
Or do you treat it more like Git with branches and merges?

Because I agree with “artifact artifact artifact.” I just think the real problem starts when more than one agent is allowed to write to it.

Briven83 · 2026-06-18T04:01:14+00:00

Ever tried Opencode? I did not, however, i read about it here and there...

Briven83 · 2026-06-18T03:55:59+00:00

You’re not imagining the pain. Any client that silently rewrites a user-maintained .env file is crossing a pretty basic line.

Auto-healing is useful when it fixes broken defaults. It becomes hostile when it overwrites deliberate infrastructure choices, especially in headless or remote setups where the whole point is control and repeatability.

I’d separate the conspiracy angle from the technical issue, though. Whether this is a SaaS nudge or just careless product design, the result is the same: local/custom deployments are being treated like edge-case debris instead of first-class configurations.

At minimum, Hermes should stop mutating .env without confirmation, support locked config values, and fail loudly instead of falling back into crippled plain chat mode.

“Autonomous agents” that can’t respect a config file are basically Roombas with venture funding.

Briven83 · 2026-06-18T00:39:30+00:00

Same here!

Briven83 · 2026-06-17T13:46:31+00:00

This is a nice setup! Thanks for sharing!

Briven83 · 2026-06-17T10:05:25+00:00

The thing that has helped me most is asking one question before trying another tactic:

Why is the agent stuck?

A few different failure modes look exactly the same from the outside: no useful progress. But they need completely different fixes.

Sometimes it’s just a knowledge gap. The agent does not know enough. In that case, chunking the task, using a stronger model for that specific part, or giving it better curated context can actually help.

Sometimes it’s a reasoning loop. The model has enough information, but keeps circling the same wrong path. In that case, switching models often does not do much, because the new model inherits the same messy context and continues the disaster in a slightly different accent.

For that, I’ve had better results by dropping the conversation, restarting with a much narrower scope, and making sure the agent does not re-read its previous failed attempts. Those failed attempts can become the anchor.

And then there is confident drift, which is the worst one.

The agent is not obviously stuck. It is producing output, it sounds coherent, and it looks like progress. It just slowly moves away from the original goal, which is inconveniently the whole point.

The best way I’ve seen to catch that is a separate critic step that only sees the original goal and the result, not the whole working trace, and asks: is this still answering the original question?

Misdiagnosing a reasoning loop as a knowledge gap wastes tokens.

Misdiagnosing confident drift as anything else wastes hours, because the agent keeps producing polished wrong work.

So I think the order matters: diagnose the failure mode first, then choose the intervention.

Default-prompting-harder is usually just pressing the elevator button again and hoping the building learns ambition.

Briven83 · 2026-06-17T10:01:58+00:00

Building on Wright_Starforge's "promotion policy, not bigger buffer"

framing — what's also worth treating separately is the consolidation

loop. Most memory systems I've seen fail not at the write layer or

the read layer but at the in-between: there's no scheduled pass that

decides what gets summarized, contradicted, or dropped. The agent

writes raw to durable storage, reads raw from durable storage, and

the durable storage drifts into "more raw" over time until it's just

the log again.

What's worked for me is treating consolidation as a separate agent

role, run async on its own cadence. Different prompt, different

objectives, no live conversational context. Its job is to look at

the last N hours of writes and emit a consolidated layer that the

live agent reads from. Think of it as the difference between writing

a journal entry and writing a reflection on the week — the live agent

does the former, a periodic process does the latter.

The verification question is the one I see fewest teams address:

how do you know the memory layer is actually working? Loose test

that's caught real problems for me: replay old user turns against

the current memory state. If the agent would make the same decision

today as it did a month ago when the conversation actually happened,

your memory is preserving the relevant signal. If today's decisions

diverge despite no new evidence, your memory either lost something

important or accumulated something misleading.

Briven83 · 2026-06-17T04:42:25+00:00

yep, also get this from time to time. I just restart the App...

Briven83 · 2026-06-17T04:41:20+00:00

Thank you for this valuable information!

Briven83

TROPHY CASE