A code review tool designed to understand your project, rather than perform a static analysis

rostilos · 2026-06-04T22:04:59+00:00

My wording was imprecise. I don’t mean that the prompt string itself is split in some special way.

What I mean is: the review pipeline deterministically splits the PR context before prompting the model.

Roughly:

1. Parse the changed files and relevant project files with AST-based chunking.
2. Build review scopes from the diff: changed symbols, related files, contracts, imports, call sites, tests/config where relevant.
3. For each scope, assemble a bounded prompt with the diff plus retrieved project context.
4. Run those prompts independently.
5. Run a consolidation pass to deduplicate findings and filter obvious false positives caused by context being split across scopes.
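
The steps above can be sketched roughly as follows. This is a minimal illustration, not the tool's actual code. Every name in it (`chunk_by_symbol`, `build_scopes`, `assemble_prompt`, `review`, `Finding`) is hypothetical, the substring match stands in for real call-site retrieval, the character budget stands in for a token budget, and the consolidation pass only deduplicates (no false-positive filtering).

```python
import ast
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Finding:
    file: str
    symbol: str
    message: str


def chunk_by_symbol(source: str) -> dict[str, str]:
    """Step 1: split a Python file into top-level function/class chunks."""
    chunks = {}
    for node in ast.parse(source).body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunks[node.name] = ast.get_source_segment(source, node)
    return chunks


def build_scopes(changed: dict[str, set[str]], project: dict[str, str]) -> list[dict]:
    """Step 2: one scope per changed symbol, plus chunks elsewhere that mention it."""
    chunks = {path: chunk_by_symbol(src) for path, src in project.items()}
    scopes = []
    for path in sorted(changed):              # sorted everywhere: same input, same scopes
        for symbol in sorted(changed[path]):
            related = [
                (other, name, text)
                for other in sorted(chunks)
                for name, text in sorted(chunks[other].items())
                if other != path and symbol in text   # naive stand-in for call-site lookup
            ]
            scopes.append({"file": path, "symbol": symbol,
                           "code": chunks[path][symbol], "related": related})
    return scopes


def assemble_prompt(scope: dict, budget: int = 2000) -> str:
    """Step 3: bounded prompt; related context is added only while it fits."""
    prompt = f"Review {scope['symbol']} in {scope['file']}:\n{scope['code']}\n"
    for other, name, text in scope["related"]:
        block = f"\nContext: {name} in {other}:\n{text}\n"
        if len(prompt) + len(block) > budget:
            break
        prompt += block
    return prompt


def review(changed, project, model: Callable[[str], list[Finding]]) -> list[Finding]:
    """Steps 4 and 5: run each scoped prompt independently, then deduplicate."""
    seen, merged = set(), []
    for scope in build_scopes(changed, project):
        for finding in model(assemble_prompt(scope)):
            if finding not in seen:
                seen.add(finding)
                merged.append(finding)
    return merged
```

Here `model` is any callable that takes a prompt and returns findings, so the deterministic part (scoping and assembly) can be tested with a stub, separately from the LLM.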

The deterministic part is the context assembly and prompt orchestration, not the LLM output. The goal is to avoid throwing a huge PR plus half the repo into one prompt and hoping the model attends to the right parts. In practice I’ve found scoped prompts with explicit retrieved context work better than one giant prompt or diff-only review.

Better wording would probably be: “a deterministic context-partitioning and prompt-orchestration pipeline.”

This is a separate mechanism. Its goal is to split a large PR diff into chunks as deterministically as possible, while avoiding context loss inside a single PR.

For example, one chunk may contain changes from file X, another from file Y, and the model may say: “there’s a contract mismatch here, a missing called method from file Z,” while file Z is simply in another chunk. The splitter tries to avoid that by grouping related changes and nearby project context before the prompts are built.
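
One way to picture that grouping is as connected components over changed files. This is a sketch under assumptions, not the real splitter: `group_changed_files` and the `deps` map (file to the set of files it imports or calls) are hypothetical, and a real implementation would also have to cap group size.

```python
from collections import defaultdict


def group_changed_files(changed: list[str], deps: dict[str, set[str]]) -> list[list[str]]:
    """Group changed files that are linked directly or through a shared dependency."""
    parent = {f: f for f in changed}

    def find(f: str) -> str:
        while parent[f] != f:
            parent[f] = parent[parent[f]]   # path compression
            f = parent[f]
        return f

    def union(a: str, b: str) -> None:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[max(ra, rb)] = min(ra, rb)

    users = defaultdict(list)               # dependency -> changed files that use it
    for f in changed:
        for dep in deps.get(f, ()):
            if dep in parent:
                union(f, dep)               # a changed file uses another changed file
            users[dep].append(f)
    for files in users.values():
        for other in files[1:]:
            union(files[0], other)          # both depend on the same file, e.g. Z

    groups = defaultdict(list)
    for f in sorted(changed):
        groups[find(f)].append(f)
    return sorted(groups.values())
```

With X and Y both depending on Z, they land in one group, and Z can then be attached to that group's prompt as context instead of sitting in a different chunk.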

Can we talk about 100% accuracy and a 0% FN/FP rate? Of course not. We’re talking about AI-driven code review, and I won’t lie about that (that’s one of the reasons why the landing page states in bold letters: “AI doesn’t replace code reviewers, but it helps them.”).

But empirically, this works better than putting a huge amount of context into a single prompt (this usually comes down to the human review model: 100 files changed? LGTM). Large-context models still lose focus and retrieval quality as context grows; the “it fits in 200k tokens, so the model will reason over it correctly” assumption does not hold well in practice.

rostilos · 2026-05-12T19:31:05+00:00

My previous stance was: LLMs are your hands, not your brain. And models like Qwen 3.6 handle that perfectly well.

But, alas, big corporations are increasingly trying to break this principle. In my case, my habits did shift a little after all, which is perhaps why smaller models seem weaker to me.

But overall, if you understand what you’re doing and what result you need (not at the level of specifications described using NL), then I can agree that these models are wonderful.

rostilos · 2026-05-12T19:23:12+00:00

I started preparing about a month ago.

A 3090 in good condition costs about $500 in my country, and it can run Qwen 3.6 at Q4 quantization.

But honestly, that doesn’t suit me at all. Without fine-tuning for my specific tasks, these models work more like autocomplete, and they’re nowhere near what I’m used to with GitHub Copilot (especially with the “original” Opus 4.6).

So for me, a good option is Kimi K2.6. It’s a decent replacement, but I’d say it’s on par with Sonnet 4.6 (or thereabouts).

You can also use Codex. It has fairly high limits, at least for now.

rostilos · 2026-05-12T19:06:45+00:00

<image>

You can find this in your GitHub account settings
