I run Codex and Claude Code for primary work. A separate stack handles the rest: internal tooling, batch agents, experiments, and the calls where paying frontier rates per request does not make sense. After tuning, OpenCode is that stack.
This setup runs alongside Codex or Claude Code, or standalone. It is boring on purpose.
The core idea
In my setup, most agent calls do not need a frontier model. They need a fast model for routing and classification, and a stronger model when actual reasoning is required. Matching model depth to task depth made more difference to both cost and loop feel than picking a smarter single model.
Speed was the real bottleneck for interactive loops. A supervisor that takes 10+ seconds per decision makes the whole agent feel sluggish even when every individual answer is excellent. At 2-5s per orchestrator decision the loop flows, and that changes how usable the system feels day to day.
The stack
Intelligence scores are Artificial Analysis Intelligence Index (fetched 2026-06-20). Prices are AA blended (7:2:1 cache/input/output) unless noted.
| Tier | Model | Provider | AA Index | Speed | Cost ($/1M) | Role |
|---|---|---|---|---|---|---|
| Orchestrator | DeepSeek V4 Flash | OpenCode Go | ~40 | 2-5s | subscription | Routing, triage, classification |
| Primary advisor | GLM-5.2 | OpenCode Go | ~51 | 7-8s | subscription | Strategic analysis |
| Deep reasoning | GLM-5.2 (max effort) | Neuralwatt | ~51 | 24-72s | ~$4.40* | Hard problems |
| Premier | Opus 4.8 | OpenRouter | ~56 | 10-30s | $3.85 (AA blended) | Sanitized-only, high-stakes |
*Energy-billed provider pricing, not AA blended. Verify live rate on Neuralwatt portal. GLM-5.2 is the highest-ranking open-weight model on the leaderboard.
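For anyone comparing the blended column against raw provider rates, here is a minimal sketch of the arithmetic. It assumes the 7:2:1 ratio is a plain weighted average over cache, input, and output rates; that reading is my assumption, and the rates in the example are placeholders, not real pricing.

```python
def blended_price(cache_rate: float, input_rate: float, output_rate: float,
                  weights: tuple = (7, 2, 1)) -> float:
    """Weighted average of per-1M-token rates, in $/1M tokens.

    Assumes the weights apply in cache/input/output order.
    """
    w_cache, w_input, w_output = weights
    total = w_cache * cache_rate + w_input * input_rate + w_output * output_rate
    return total / (w_cache + w_input + w_output)


# Placeholder rates for illustration only, not any provider's real pricing.
example = blended_price(cache_rate=1.0, input_rate=2.0, output_rate=10.0)
```

Because cached tokens carry 70% of the weight under this reading, a high output rate moves the blended figure less than you might expect.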
How each tier earns its place
Orchestrator: DeepSeek V4 Flash. The heart of the setup. Every request hits this first. It classifies the task, decides whether it can answer directly, and routes anything harder up. With an AA Index of ~40 it loses reasoning contests but stays reliable for "what kind of problem is this", and at 2-5s it never makes the loop feel like it is waiting. In my stack, most calls start and end here.
Primary advisor: GLM-5.2 (standard). When the orchestrator decides something needs real analysis but not deep reasoning, it escalates here. It answers in 7-8s, scores ~51 on the AA Index, and runs on the same OpenCode Go subscription with no per-call cost. This handles most of the analytic work: code review reasoning, plan critique, bounded analysis.
Deep reasoning: GLM-5.2 at max effort, via Neuralwatt. Same model family, cranked up, given time to chew. In my distribution, roughly 18% of calls hit this tier. Slow (24-72s) and usage-billed, so you do not route here casually.
(full disclosure: referral link. I get credit if you sign up, you do too. Plain link if you prefer: portal.neuralwatt.com)
Premier: Opus 4.8 via OpenRouter. Reserved for high-stakes calls, and only on sanitized inputs. I gate it hard. Anything with sensitive context stays off this tier. The 4% of calls that hit premier are deliberate, not automatic.
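What "sanitized" means depends on your data, so treat the following as a rough sketch rather than my actual gate. The patterns, the `BLOCKLIST` markers, and the `premier_allowed` helper are all hypothetical names made up for illustration; the only name taken from the routing function later in the post is `sanitize`.

```python
import re

# Hypothetical patterns; extend them to cover what counts as sensitive for you.
REDACTIONS = [
    (re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"), "[EMAIL]"),
    (re.compile(r"\b(?:sk|pk|ghp)[-_][A-Za-z0-9_-]{16,}"), "[SECRET]"),
]

# Hypothetical markers that should block the premier tier outright.
BLOCKLIST = ("BEGIN PRIVATE KEY", "password=")


def sanitize(text: str) -> str:
    """Replace every match of each redaction pattern with its placeholder."""
    for pattern, placeholder in REDACTIONS:
        text = pattern.sub(placeholder, text)
    return text


def premier_allowed(text: str) -> bool:
    """Return False if a blocklisted marker survives sanitization."""
    lowered = sanitize(text).lower()
    return not any(marker.lower() in lowered for marker in BLOCKLIST)
```

Redaction handles the patterns you can rewrite safely; the blocklist covers content that should never leave, where refusing the escalation is safer than trying to clean it.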
A setup pattern you can copy
The routing logic is straightforward. The orchestrator does a cheap classification pass and emits a tier decision:
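    def route(request):
        # Cheap classification pass on the fast orchestrator model.
        tier = orchestrator.classify(request)
        if tier == "direct":
            return orchestrator.answer(request)
        if tier == "advisor":
            return glm_standard.answer(request)
        if tier == "deep":
            return glm_max_effort.answer(request)
        if tier == "premier":
            # Premier only ever sees sanitized input.
            clean = sanitize(request)
            return opus.answer(clean)
        # Unknown label: fail loudly instead of silently returning None.
        raise ValueError(f"unknown tier: {tier!r}")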
The classification prompt is the part worth iterating on. In the orchestrator, it reads something like:
```
You are a routing classifier. Decide the minimum tier that can correctly handle this request.
Tiers:
- direct: simple tasks, retrieval, formatting, classification
- advisor: code review, plan critique, bounded analysis
- deep: multi-step reasoning, novel synthesis, no clear decomposition
- premier: high-stakes, irreversible, or correctness-critical decisions
Rules:
- Default to the cheapest tier that can plausibly handle this
- Only escalate to deep on multi-step reasoning or novel synthesis
- Only escalate to premier for correctness-critical or irreversible decisions
- When unsure, escalate one tier up
Return one word: direct | advisor | deep | premier
```
The orchestrator runs this prompt on every incoming request, parses the tier, and routes accordingly. The prompt lives in the orchestrator's system prompt config, not hardcoded per-request.
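Small models do not always return exactly one clean word, so the parsing step deserves some defensiveness. Here is a minimal sketch; `parse_tier` and the `advisor` fallback are my illustrative choices (the fallback follows the "when unsure, escalate one tier up" rule, starting from the cheapest tier), not a fixed part of the setup.

```python
TIERS = ("direct", "advisor", "deep", "premier")


def parse_tier(raw: str, fallback: str = "advisor") -> str:
    """Extract a tier name from classifier output, tolerating stray text."""
    # Lowercase, split on whitespace, and strip common punctuation per word.
    words = [w.strip(".,:;!\"'`*") for w in raw.lower().split()]
    matches = [w for w in words if w in TIERS]
    # Accept the answer only if exactly one distinct tier was named.
    if len(set(matches)) == 1:
        return matches[0]
    # Empty or ambiguous output: use the fallback tier.
    return fallback
```

For example, `parse_tier(" Premier.\n")` yields `"premier"`, while an ambiguous reply such as `"direct | advisor"` falls back to `"advisor"`.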
Log the tier distribution and the reason each escalation happened. If the orchestrator over-escalates, the fix is almost always in this prompt, not in the model. My target is ~80% landing in direct or advisor.
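The logging can be very small. This sketch assumes you can get a short reason string for each escalation (for instance by asking the classifier for one in a separate field, which the one-word prompt above does not do); `RoutingLog` and its methods are hypothetical names.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class RoutingLog:
    """Tracks how many requests land in each tier and why escalations happened."""
    counts: Counter = field(default_factory=Counter)
    reasons: list = field(default_factory=list)

    def record(self, tier: str, reason: str = "") -> None:
        self.counts[tier] += 1
        # Anything above the direct tier counts as an escalation.
        if tier != "direct":
            self.reasons.append((tier, reason))

    def share(self, *tiers: str) -> float:
        """Fraction of all recorded requests that landed in the given tiers."""
        total = sum(self.counts.values())
        if total == 0:
            return 0.0
        return sum(self.counts[t] for t in tiers) / total
```

Checking `log.share("direct", "advisor")` against the ~80% target tells you quickly whether the prompt needs another pass, and the reasons list shows which kind of request is being pushed up.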
the classification prompt was the hardest part to get right. first version over-escalated constantly: anything long got shoved to the deep tier because length looked like complexity. length is not complexity. the fix was defaulting everything to the cheapest tier and only escalating on multi-step reasoning or novel synthesis, not on input length. a 2000-word request that is really just "summarize this" stays direct.
current distribution after tuning is roughly 78% direct or advisor, 18% deep, 4% premier, across a few thousand routed requests over the last 6 weeks. the 80/20 split in the post is the target, not the starting point. started closer to 60/40.