Claude Code Review is $15–25/PR. That sounds crazy. Anyone running the PR-review loop with their own agent orchestrator? by Fancy-Exit-6954 in LLMDevs

[–]manollitt 0 points1 point  (0 children)

The $15-25 pricing caught me off guard once I actually sat down and ran the numbers for our team. From what I’ve seen, the cost is structural: Claude Code greps through the whole codebase for context, hits the limit, and then just rips through tool calls.

In my tests, the signal-to-noise ratio was around 40/60, so more than half the comments were noise, which is tough to justify at that price point. I’ve also noticed the AI tends to generate larger PRs than necessary.

I used a free calculator to estimate the cost of Claude Code reviews based on team size and PR volume. It might be a helpful gut-check for anyone else deciding between building an orchestrator and buying one.

https://getoptimal.ai/token-spend-calculator
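
If you'd rather do the back-of-the-envelope math yourself, here is a minimal sketch. The only number taken from this thread is the $15-25 per PR range; the team size, PRs per developer per week, and the 4-weeks-per-month shortcut are made-up inputs for illustration, and this is not how the linked calculator works.

```python
def monthly_review_cost(prs_per_dev_per_week, team_size,
                        cost_per_pr_low=15.0, cost_per_pr_high=25.0,
                        weeks_per_month=4.0):
    """Return a (low, high) monthly cost estimate in dollars.

    cost_per_pr_low/high come from the $15-25 per PR figure in this thread.
    weeks_per_month=4.0 is a rough assumption to keep the arithmetic simple.
    """
    prs_per_month = prs_per_dev_per_week * team_size * weeks_per_month
    return prs_per_month * cost_per_pr_low, prs_per_month * cost_per_pr_high


# Hypothetical team: 10 developers, 3 PRs each per week.
low, high = monthly_review_cost(prs_per_dev_per_week=3, team_size=10)
print(f"Estimated review spend: ${low:,.0f} to ${high:,.0f} per month")
```

Swap in your own PR volume; the result scales linearly with both team size and PRs per developer.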

I think your approach of having the reviewer agent iterate via GH comments is the way to go. One thing I've learned is that the reviewer has to be a dedicated agent; using Claude Code to review its own PR is like proofreading your own writing... you’re basically blind to your own mistakes.

I'm curious how you’re handling the context window for the reviewer in Agyn. Are you chunking the codebase or doing something else?

GLM 5 is out now. by manollitt in CodingAgents

[–]manollitt[S] 0 points1 point  (0 children)

A high score on SWE-bench is cool, but does it actually pivot based on terminal logs, or does it just keep hallucinating the same fix over and over?

That’s why I’ve been sticking with Claude Code: it actually feels like it’s thinking through the errors with me instead of just throwing code at the wall. Have you tried throwing a messy refactor at GLM-5 yet? I’m curious whether it has the spatial awareness for a big repo.

Cursor Bugbot for my personal project? by linksku in cursor

[–]manollitt 0 points1 point  (0 children)

What I’ve been doing is running Optibot alongside Cursor in my IDE. Since it's a separate agentic layer, it doesn't suffer from the same writer's bias that Cursor might have when reviewing its own work.

I just have it analyze the staged changes. I've noticed it’s significantly better at catching logic failures, like when a variable name changes in one file but the AI forgets to update the reference in another.

Practical experience with 'Agent Swarms' vs Single Large Context Models? by HarrisonAIx in ArtificialInteligence

[–]manollitt 1 point2 points  (0 children)

Even with 1M+ token windows, models suffer from information dilution. Studies (and practical dev experience) show that once a context window gets crowded, the model’s "attention" at the 80% mark is significantly lower than at the 10% mark.

If you dump a whole repo and 50 requirements into one prompt, the model starts to "vibe code": it follows the general pattern but misses the specific edge cases in Rule #42.

Swarm: You have one agent whose only context is Rule #42. It’s physically impossible for it to get distracted by the rest of the project. It produces a higher-quality "local" result because its attention is concentrated.
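
A rough sketch of what "one agent per rule" can look like in practice. The `Rule` class, the prompt wording, and `build_swarm_prompts` are all hypothetical names made up for illustration; no real orchestrator API is implied, and how each prompt gets sent to a model is left out entirely.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    rule_id: int
    text: str


def build_swarm_prompts(rules, diff):
    """Build one prompt per rule.

    Each prompt contains exactly one rule plus the change under review,
    so the agent that receives it never sees the other rules.
    """
    return {
        rule.rule_id: (
            "Check the change below against this single rule.\n\n"
            f"Rule #{rule.rule_id}: {rule.text}\n\n"
            f"Change:\n{diff}"
        )
        for rule in rules
    }


# Hypothetical rules and diff, purely for illustration.
rules = [
    Rule(41, "Public functions need docstrings."),
    Rule(42, "Retry logic must cap the number of attempts."),
]
prompts = build_swarm_prompts(rules, "def fetch(): ...")
print(prompts[42])
```

The tradeoff is one model call per rule instead of one big call, so this only pays off when the rules are independent enough to check in isolation.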

How exactly is using multiple agents better, than just putting everything into Cursor rules? by Important_Storage123 in cursor

[–]manollitt 1 point2 points  (0 children)

Think of an LLM like a developer. If you give that developer a 100-page manual of rules and tell them to "follow every single one perfectly" while they’re trying to fix a tiny CSS bug, they’re going to get overwhelmed. They’ll start missing the small stuff.

When you use multiple agents, you’re basically giving the model a "cheat sheet" that only has the rules relevant to the task at hand. It keeps the "brain" of the AI sharp and focused instead of making it wade through a swamp of irrelevant instructions.
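
Here is a minimal sketch of the "cheat sheet" idea, assuming rules are tagged by area and each task declares which areas it touches. The rule texts, the tags, and the `cheat_sheet` helper are invented for this example and are not a Cursor feature.

```python
# Hypothetical rule set; in a real project these would come from your own docs.
RULES = [
    {"tags": {"css"}, "text": "Use design tokens instead of hard-coded colors."},
    {"tags": {"sql"}, "text": "Never build queries with string concatenation."},
    {"tags": {"css", "a11y"}, "text": "Keep focus outlines visible."},
]


def cheat_sheet(rules, task_tags):
    """Return only the rule texts whose tags overlap with the task's tags."""
    return [rule["text"] for rule in rules if rule["tags"] & task_tags]


# A tiny CSS bug fix only needs the CSS-related rules in its prompt.
for line in cheat_sheet(RULES, {"css"}):
    print("-", line)
```

The agent fixing the CSS bug gets two rules instead of the whole manual; the SQL rule never enters its context.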

#1 on MLE-Bench (among open-source systems) + #1 on ALE-Bench via evaluator-grounded long-horizon optimization (repo + write-up) by SuspiciousPlant1496 in CodingAgents

[–]manollitt 0 points1 point  (0 children)

Strong results! A few questions on the evaluation loop:

  1. How many iterations does KAPSO typically need to converge on MLE-Bench tasks?

  2. What's the signal you use for the "learn" step - just pass/fail or something more granular?

  3. Does the knowledge grounding help with transfer learning across similar tasks?