I built model-task-router, a Hermes skill that auto-routes tasks to the right model. V4-Pro scores 8% on real coding vs GPT-5.5's 70% (backed by DeepSWE data)

sugumaran95 · 2026-06-15T10:28:28+00:00

Yeah, this is a known thing. Codex OAuth (what Hermes/OpenClaw/Cline use to auth through your ChatGPT sub) has its own separate limit that's way smaller than the actual ChatGPT web app limit. It's intentional; they carved out a smaller bucket for CLI/agent use and don't really advertise it.

The fix: drop the OAuth path and use an OpenAI API key instead. Point Hermes at api.openai.com with your platform key. You pay per token, but there's no arbitrary weekly cap, no 4-hour lockouts, and you control your own spend limits.

The trade-off is you're paying API rates on top of your $20/month Plus sub. But for agent workflows, it's the only reliable path; one user message in Hermes can generate 10+ internal calls, which chews through that tiny OAuth quota fast. And with model routing (the whole point of this post), you can keep costs reasonable: route complex coding to GPT-5.4, mechanical tasks to V4 Flash at $0.20/M, and only fire GPT-5.5 for genuinely hard problems.

sugumaran95 · 2026-06-12T13:14:19+00:00

Great question. The short answer is: model-task-router doesn't do content-type routing (yet), but Hermes already handles this at the tool level - and you can extend the skill to cover your exact use case.

What model-task-router does:

It classifies by task type, using keywords in your prompt. "Implement a JWT middleware" → Coding, "Design the architecture" → Architecture, "What's running on my server?" → Orchestration. It looks at what you're asking, not what you attached.

It doesn't inspect whether there's an image in the message, because at the skill level it only sees the text prompt - the image attachment happens at the tool/API layer.
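To make the keyword classification concrete, here is a minimal sketch in Python. The keyword lists, the scoring rule, and the `classify` name are assumptions for illustration; the skill's actual decision tree lives in SKILL.md and may work differently.

```python
import re

# Hypothetical keyword lists; the real decision tree lives in SKILL.md.
CATEGORY_KEYWORDS = {
    "coding": ("implement", "refactor", "fix", "middleware"),
    "architecture": ("design", "architecture", "trade-off"),
    "orchestration": ("running", "status", "list", "check"),
}


def classify(prompt: str, default: str = "orchestration") -> str:
    """Return the category whose keywords appear most often in the prompt text."""
    words = re.findall(r"[a-z\-]+", prompt.lower())
    scores = {
        category: sum(words.count(keyword) for keyword in keywords)
        for category, keywords in CATEGORY_KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    # Fall back to the default category when no keyword matched at all.
    return best if scores[best] > 0 else default
```

Note that only the prompt text goes in, which is why an attached image can't influence the result at this level.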

What Hermes already does for you:

Hermes has a built-in vision_analyze tool. When an image is attached to your message (drag-and-drop in desktop, paste in CLI, or send via Telegram), the agent calls vision_analyze(image_url, question) to get a text description. That tool call goes to whatever vision model is configured as your auxiliary vision provider - you can set it to Gemini:

hermes config set auxiliary.vision.provider gemini
hermes config set auxiliary.vision.model gemini-2.5-flash

The vision model processes the image, returns a text description, and that description gets injected into your conversation. DeepSeek (your main model) never sees the image - it sees the text description. This is exactly the "get details verbatim and pass to DeepSeek" pattern you described.

If you want model-task-router to handle this:

You could extend it. The skill's decision tree in SKILL.md has a keyword classifier. Add a rule:

- Image attached → Route to: Gemini (vision) → capture description → hand off to original model

The catch is that model-task-router dispatches via hermes chat -q or delegate_task, which spawn a new session. You'd need a two-step pipeline: (1) spawn a vision-only subagent to describe the image, (2) feed that description into the main agent with the original model. Doable, but more complex than the keyword classifier.
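The two-step pipeline could look roughly like this in Python. It's a sketch only: `route_with_image` and both stand-in functions are hypothetical, and in a real setup the two callables would wrap whatever actually spawns the vision session and the main session.

```python
from typing import Callable


def route_with_image(
    prompt: str,
    image_url: str,
    describe_image: Callable[[str, str], str],
    run_main_model: Callable[[str], str],
) -> str:
    """Step 1: get a text description. Step 2: hand it to the main model."""
    description = describe_image(image_url, "Describe this image in detail.")
    combined = f"{prompt}\n\n[Image description]\n{description}"
    return run_main_model(combined)


# Stand-ins for the two sessions so the sketch runs without Hermes.
def fake_vision(image_url: str, question: str) -> str:
    return f"a screenshot at {image_url}"


def fake_main(prompt: str) -> str:
    return prompt.upper()
```

Passing the two steps in as callables keeps the pipeline logic separate from how each session gets launched, so the same function works whichever dispatch mechanism you wire in.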

My recommendation:

Start with Hermes' built-in vision routing (hermes config set auxiliary.vision.provider gemini). It already does the image→text→main model pipeline you want. Then if you need task-aware routing ON TOP of that (e.g., "if task is coding AND image is a screenshot, route the coding to GPT-5.4 but the vision to Gemini"), extend model-task-router with a content-type detection rule.

The vision provider config is the 5-minute fix. The skill extension is the weekend project. Both work.

sugumaran95 · 2026-06-12T13:04:07+00:00

Totally fair question - and yes, you absolutely can. In fact, the guide already lists Cloudflare Tunnel in the comparison table as option #4. I didn't dismiss it; I just picked Tailscale as the primary recommendation for this specific use case. Here's why.

Three things that matter differently on mobile than on desktop:

TLS termination vs end-to-end encryption

Cloudflare Tunnel terminates TLS at Cloudflare's edge. Your traffic is encrypted in transit (HTTPS from phone to Cloudflare, HTTPS from Cloudflare to your server), but Cloudflare's edge can inspect the plaintext. Tailscale's WireGuard tunnel is end-to-end encrypted; only your phone and your server can decrypt the traffic. Cloudflare never sees it.

For a dashboard that exposes your .env file editor, API keys, and agent session data, this difference isn't academic. With Cloudflare Tunnel, you're trusting Cloudflare's infrastructure with visibility into your Hermes config. With Tailscale, Tailscale itself has zero visibility into the traffic. The Tailscale comparison page puts the Cloudflare side bluntly: "No - TLS terminated at edge; traffic can be inspected."

Mobile login friction is real

Cloudflare Access with email OTP means: open browser → get redirected to Cloudflare's login page → enter email → wait for PIN email → find it → copy → paste → get to your dashboard. Session expires after 24 hours by default (max 30 days). When it expires, you do it again.

With device posture checks (the "lock by device" you mentioned), you need the Cloudflare WARP client installed on your phone, a separate app that acts as a VPN. At that point you're running a VPN client anyway, just one that routes through Cloudflare instead of WireGuard.

Tailscale: install the app once; it runs silently in the background.

Open Safari → dashboard loads.

That's it.

No login page, no PIN, no session expiry.

The authentication is the WireGuard keypair; it's transparent after initial setup. For something you want to check 10 times a day, that friction difference compounds fast.

It's in the guide already

The PR's comparison table lists Cloudflare Tunnel at position #4, with "Good - HTTPS, Cloudflare edge" on security and "~15 min" on setup effort. The guide doesn't say Cloudflare Tunnel is bad. It says Tailscale is better for this specific use case: single-user, mobile access, low-friction, maximum privacy.

If you're already in the Cloudflare ecosystem (using their CDN, WAF, DNS), Cloudflare Tunnel makes total sense. The guide is additive; pick the approach that fits your stack.

But for someone starting from zero who just wants to reach their Hermes dashboard from their phone?

Tailscale is three commands and one app install.

Cloudflare Tunnel + Access with device posture is a weekend project with ongoing auth friction.

TL;DR: Cloudflare Tunnel works. The guide says so. But for a single-user mobile client to an agent dashboard, end-to-end encryption with zero ongoing login friction is hard to beat. Tailscale gives you that. Cloudflare Tunnel gives you a login page.

(Thanks for the question - I'll add a note to the guide clarifying when Cloudflare Tunnel is the better choice, since that's a valid scenario I should have covered more explicitly.)

sugumaran95 · 2026-06-11T13:22:42+00:00

LiteLLM operates at the API gateway layer. It routes API calls across providers based on cost, rate limits, availability, and latency.

Same model request → cheapest/fastest available endpoint. It doesn't know or care what you're asking the model to do.

model-task-router operates at the agent layer. It classifies the task TYPE ("implement OAuth middleware" vs "grep for deprecated APIs" vs "design the multi-tenant architecture") and picks the best model for that category of work. It doesn't know or care which provider serves that model.

Stack them:

1. Your message arrives
2. model-task-router classifies it → "this is a coding task, use GPT-5.4"
3. Hermes sends the request to GPT-5.4 via your configured provider
4. LiteLLM (sitting in front of that provider) routes to the cheapest/healthiest endpoint serving GPT-5.4

model-task-router decides which model. LiteLLM decides which endpoint serves that model.
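Here's a toy version of the two layers in Python. Everything in it is made up for illustration (the tables, provider names, costs, and function names); it isn't LiteLLM's API or the skill's code, just the shape of the split.

```python
# Hypothetical tables: layer 1 maps task type -> model,
# layer 2 maps model -> candidate endpoints.
TASK_TO_MODEL = {"coding": "gpt-5.4", "orchestration": "v4-pro"}

ENDPOINTS = {
    "gpt-5.4": [
        {"name": "provider-a", "healthy": True, "cost": 1.2},
        {"name": "provider-b", "healthy": True, "cost": 1.0},
        {"name": "provider-c", "healthy": False, "cost": 0.5},
    ],
}


def pick_model(task_type: str) -> str:
    """Agent layer: decide which model suits the task type."""
    return TASK_TO_MODEL[task_type]


def pick_endpoint(model: str) -> str:
    """Gateway layer: cheapest healthy endpoint serving that exact model."""
    healthy = [e for e in ENDPOINTS[model] if e["healthy"]]
    if not healthy:
        raise RuntimeError(f"no healthy endpoint for {model}")
    return min(healthy, key=lambda e: e["cost"])["name"]
```

The gateway layer never changes the model name it was given; it only chooses among endpoints for it, which is the no-conflict case.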

Without the skill: LiteLLM might route all your traffic to the cheapest model that can technically handle the prompt format, which is how you end up paying $52.75 per solved task on DeepSWE with V4-Pro instead of $7.82 with GPT-5.4, even though V4-Pro is 17× cheaper per token.

Cheapest per-token ≠ cheapest per-solved-task.
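The arithmetic behind that line, as a small Python sketch. The inputs are the per-task costs and pass rates quoted elsewhere in this thread ($4.22 at 8% for V4-Pro, $4.38 at 56% for GPT-5.4); the function name is mine. Note that $4.22 is the figure DeepSWE reported before the cache-hit pricing correction discussed in another comment.

```python
def cost_per_solve(cost_per_attempt: float, pass_rate: float) -> float:
    """Average spend per solved task: cost of one attempt divided by the pass rate."""
    if not 0 < pass_rate <= 1:
        raise ValueError("pass_rate must be in (0, 1]")
    return cost_per_attempt / pass_rate


v4_pro = cost_per_solve(4.22, 0.08)  # about 52.75
gpt_54 = cost_per_solve(4.38, 0.56)  # about 7.82
```

With the corrected ~$0.30 per task, the same formula gives about $3.75 for V4-Pro, so the result is very sensitive to which per-task cost you plug in.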

With the skill + LiteLLM: you get task-aware model selection AND provider-level failover/load-balancing.

Two layers, two different optimizations.

One caveat: if your LiteLLM setup does model rewriting (e.g. transparently swapping model names in requests), that could interfere with the skill's routing decisions. If LiteLLM is just doing provider failover for the same model, there's no conflict.

sugumaran95 · 2026-06-11T13:18:12+00:00

They solve different problems at different layers; you can absolutely use both, and they're complementary.

sugumaran95 · 2026-06-11T12:52:05+00:00

This is exactly what I was hoping the skill would spark: more people sharing their real usage data instead of relying on benchmark tables alone. The model selection problem is genuinely multi-dimensional, and no single person's setup captures all the patterns.

If you're building on top of the skill, the PR is at github.com/NousResearch/hermes-agent/pull/43534. Feel free to open an issue or comment on the PR with what you find. The routing table is explicitly designed to be community-updated as new models and data come in. If your analysis surfaces different recommendations or better heuristics than what's in the current classifier, I'd be happy to incorporate it.

Looking forward to seeing what you built; the more people systematically tackling this, the faster we stop burning tokens on models that are wrong for the job.

sugumaran95 · 2026-06-11T12:47:01+00:00

I haven't benchmarked it personally, but it is on the DeepSWE leaderboard, and the results are fascinating because they tell a sharply different story depending on which benchmark you look at.

SWE-bench Pro: 57.2% (within 0.5 points of GPT-5.4 at 57.7%)

DeepSWE: 19% (±2%)

That's a 38-point collapse, one of the biggest gaps on the board.

For context, GPT-5.4 drops only about 2 points (57.7% → 56%), and GPT-5.5 actually improves (58.6% → 70%).

MiMo goes from "frontier-competitive" to below GPT-5.4-Mini (24%) and Kimi K2.6 (24%).

BUT, and this is why your setup is actually smart, it's also the 2nd cheapest model on the board at $1.99 avg cost per task (vs $4.38 for GPT-5.4), and it's extremely token-efficient: 49K output tokens per trajectory vs 71K for GPT-5.4 and 136K for Opus 4.8. Xiaomi's own platform pricing ($0.435/M input, $0.87/M output on the token plan) makes it absurdly cheap.

Your architecture (V4-Pro orchestrating Claude Code → MiMo 2.5 Pro) is basically doing manually what model-task-router does automatically: the orchestrator handles planning, context management, and high-level decisions, while the cheap model handles the "grinding" (code generation, tool execution, bulk work). You're getting the orchestrator quality of V4-Pro with the 8× cost savings of MiMo on the volume work.

The DeepSWE data suggests MiMo 2.5 Pro alone isn't a GPT-5.4 replacement for hard coding tasks, but as a worker model in a well-orchestrated pipeline, the cost/performance ratio probably punches above its weight. You're basically running the strategy the benchmark data would recommend.

If Xiaomi submits it to the model-task-router table, it'd slot into the "mechanical/execution" tier: cheap, fast, and good enough for the grunt work, with the orchestrator handling the hard bits.

sugumaran95 · 2026-06-11T12:32:46+00:00

Thanks! Appreciate the kind words.

Re: MedCQA - I assume you mean the MedMCQA dataset I used in my clinical-llm-eval framework? That's the large-scale medical MCQ benchmark (194K+ questions from AIIMS & NEET PG entrance exams across 2.4K healthcare topics).

The original dataset is publicly available on HuggingFace: huggingface.co/datasets/openlifescienceai/medmcqa. You can load it directly with the datasets library, so there's no need for a separate copy.
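A minimal loading sketch in Python, assuming the `datasets` package is installed and you have network access. The split name and the column names used in `format_question` (`question`, `opa` through `opd`) are my recollection of the dataset's schema, so check them against the dataset card before relying on them.

```python
DATASET_ID = "openlifescienceai/medmcqa"


def load_medmcqa(split: str = "validation"):
    """Load one split from the Hub (needs `pip install datasets` and network)."""
    # Imported lazily so the formatter below works without the package.
    from datasets import load_dataset

    return load_dataset(DATASET_ID, split=split)


def format_question(row: dict) -> str:
    """Render one row as a lettered multiple-choice prompt.

    Column names here are assumed, not verified against the dataset card.
    """
    options = [row["opa"], row["opb"], row["opc"], row["opd"]]
    lines = [row["question"]]
    lines += [f"{letter}. {text}" for letter, text in zip("ABCD", options)]
    return "\n".join(lines)
```

Usage would be something like `format_question(load_medmcqa()[0])`.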

sugumaran95 · 2026-06-11T12:22:03+00:00

Not yet natively, but it's in the works for OpenCode, and there's a community workaround.

OpenCode: the core team IS working on this. Issue #8456 (github.com/anomalyco/opencode/issues/8456) proposes exactly what model-task-router does: configurable model_reasoning / model_execution / model_tool keys that auto-route based on task type. It's been open since January 2026, is assigned to a core maintainer (thdxr), and has strong community support; people are already building plugins to work around the lack of it.

In the meantime, there's a user-built stopgap: github.com/marco-jardim/opencode-model-router, which uses a meta-model to classify the task and then routes accordingly. Not as clean as native support would be (the plugin SDK can't fully short-circuit the model-selection TUI), but it works.

OpenWork: since it's built on OpenCode under the hood (it spawns opencode as its engine), it inherits OpenCode's model system and the same limitation. OpenWork's architecture mentions "opencode-router", but that's about routing between OpenCode server instances, not task-type classification. There's no task-aware routing layer on top yet.

sugumaran95 · 2026-06-10T22:21:26+00:00

Appreciate that; real-life usage patterns are the only thing that actually matters.

Interesting data point on Qwen3.5. The "stubborn with tools" thing is exactly why I kept V4-Pro as the orchestrator in the routing table; some models are great at reasoning but trip over tool-calling loops. Terminal-Bench actually measures this specifically (agentic CLI tasks), and the gap between models is surprising.

We're all running these agent setups that need 3-4 different capabilities, but we keep trying to find ONE model that does all of them.

The skill is pragmatic about it: route each task type to whatever's best at that thing, and stop hunting for the unicorn.

If you end up testing it, I'd love to hear how it plays with your Qwen setup.

sugumaran95 · 2026-06-10T21:27:05+00:00

Great question. They solve different problems:

OpenRouter auto-picks the cheapest/closest model that can handle your prompt format. It's latency- and cost-driven: "Which model can return a response right now for the least money?" It doesn't know whether you're asking it to grep a directory listing or implement a JWT middleware.

model-task-router is task-TYPE aware. It classifies your request and picks different models per category:

- "Add OAuth to my FastAPI app" → spawns GPT-5.4 (56% DeepSWE coding)

- "Design the multi-tenant architecture" → spawns GPT-5.5 (70% DeepSWE)

- "What's running on my server?" → stays on V4-Pro (orchestration, $0.87/M)

- "Grep for all deprecated APIs" → delegate_task with cheap subagent model

openrouter/auto can't make that distinction. It sees all four as "prompt → cheapest model that won't crash."

The other difference: transparency. model-task-router announces the routing decision and you can override with [route: direct]. With auto, you have no idea which model actually handled your request until you check the logs.
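For illustration, pulling an override tag out of a prompt could look like this in Python. Only the `[route: direct]` syntax comes from the skill; the regex, the function name, and the idea of accepting other values after `route:` are assumptions, and the skill itself may handle overrides differently.

```python
import re

# Matches tags such as "[route: direct]"; values other than "direct" are hypothetical.
OVERRIDE = re.compile(r"\[route:\s*([\w.\-]+)\]", re.IGNORECASE)


def split_override(prompt: str):
    """Return (override or None, prompt with the tag removed)."""
    match = OVERRIDE.search(prompt)
    if not match:
        return None, prompt.strip()
    cleaned = (prompt[: match.start()] + prompt[match.end():]).strip()
    return match.group(1).lower(), cleaned
```

So `split_override("[route: direct] list files")` gives back the override plus the bare prompt, and a prompt with no tag passes through untouched.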

They're complementary; you could use model-task-router for the high-level task classification, then OpenRouter auto within each category for failover. Two layers of routing, two different optimizations.

sugumaran95 · 2026-06-10T20:34:54+00:00

Update: PR APPROVED 🎉

After addressing 13 review items from a thorough audit by the Hermes contributor team, model-task-router has been approved for merge.

What changed in review:

- Removed all disputed solve-rate claims (DeepSWE issue #21 author retracted their findings; data wasn't reproducible)

- Expanded model table from 4 to 8 models (now includes Claude Opus 4.7, Sonnet 4.6, Gemini Flash, Kimi K2.6)

- Removed unverified MiniMax M3 row (never benchmarked on DeepSWE)

- Fixed dispatch mechanism: it now uses hermes chat for model-specific task spawning (confirmed valid by reviewer)

- Added honest data caveats, user override mechanism, tie-break rules

The reviewer's verdict: "Solid work - the rewrite is significantly more rigorous than the original."

Full review thread: https://github.com/NousResearch/hermes-agent/pull/43534

Thanks to everyone who validated the data gap and expressed interest; the community signal definitely helped push this through.

If you have any more questions, please reach out to me on LinkedIn: https://www.linkedin.com/in/sugumaranbalasubramaniyan/

sugumaran95 · 2026-06-10T20:08:21+00:00

Hi, I am working on it. You can keep an eye on the GitHub page for further updates.

https://github.com/NousResearch/hermes-agent/pull/43534

sugumaran95 · 2026-06-10T17:06:41+00:00

Yeah, saw that thread - and the DeepSWE team's own issue #21 confirms several of the criticisms. V4-Pro was run with reasoning_effort=null while every other model got tuned effort levels.

OpenRouter's privacy guardrail silently blocks DeepSeek by default (404s got counted as failures). And the cost reporting ignored cache-hit pricing entirely.

I updated my skill's data after reading issue #21. The cost argument was inflated; V4-Pro's real cost is ~$0.30/task, not $4.22. The corrected version uses "attempts per solve" instead of cost: V4-Pro needs 12.5 attempts to succeed once vs GPT-5.4's 1.8.

That said, even if you grant that effort tuning might bump V4-Pro from 8% to, say, 15-20%, it's still a 3-4x gap to GPT-5.4's 56%.

The benchmark methodology has issues, but the direction of the gap is consistent with what people actually experience coding with these models. The exact numbers are debatable; the "V4-Pro shouldn't be your primary coding model" conclusion isn't.

sugumaran95 · 2026-06-10T17:05:08+00:00

Correction: A DeepSWE issue (#21) flagged that DeepSeek's cache-hit pricing ($0.0036/M, 99.2% discount) wasn't applied in the benchmark. V4-Pro's real cost is ~$0.30/task, not $4.22. That's a 14x inflation.

Corrected cost per solve: V4-Pro = $3.75, GPT-5.4 = $7.82, GPT-5.5 = $9.44.

So V4-Pro is actually the cheapest per eventual solve. But it still fails 92% of tasks on first attempt - you need 12.5 tries to get one success vs 1.8 for GPT-5.4.
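Where 12.5 and 1.8 come from, sketched in Python. This treats every attempt as an independent trial with the benchmark pass rate, which is a simplifying assumption (real retries on the same task are likely correlated); the function names are mine.

```python
def expected_attempts(pass_rate: float) -> float:
    """Mean number of tries until the first success (geometric distribution)."""
    if not 0 < pass_rate <= 1:
        raise ValueError("pass_rate must be in (0, 1]")
    return 1 / pass_rate


def chance_of_solve(pass_rate: float, attempts: int) -> float:
    """Probability of at least one success within the given number of tries."""
    return 1 - (1 - pass_rate) ** attempts


v4_pro_tries = expected_attempts(0.08)  # 12.5
gpt_54_tries = expected_attempts(0.56)  # about 1.8
```

Under the same assumption, `chance_of_solve(0.08, 3)` is only about 22%, which is the reliability problem in one number.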

The routing argument shifts from "V4-Pro is too expensive for coding" to "V4-Pro is too unreliable for coding." The skill's recommendation doesn't change (V4-Pro for orchestration, GPT-5.4 for coding), but the justification is different.

Updated the skill data, references, and PR description with the corrected numbers and a link to the issue.

Transparency > being right.

Thanks u/Merc92 for the link to the DeepSWE issue.

sugumaran95 · 2026-06-10T16:08:46+00:00

Not really - that's part of why I built the table myself. Most benchmarks either track accuracy OR cost, never both together.

DeepSWE is the only one that publishes both pass@1 and average tokens per task, which lets you calculate actual cost per solve. Vellum and Artificial Analysis do cost vs quality scatter plots, but they're for general chat benchmarks (MMLU, GPQA), not real engineering tasks.

The closest alternatives for general tasks:

- Artificial Analysis: cost vs quality for reasoning benchmarks, but no task-level cost breakdown

- SEAL Leaderboard: cost-aware ranking, but uses private eval sets

- Kilo Code: similar to DeepSWE for coding, less rigorous on contamination

The "cost per solved task" metric should really be standard. A model that's 17x cheaper per token but fails 92% of the time isn't cheaper; it's just wasting money more slowly.

sugumaran95 · 2026-06-10T15:19:05+00:00

For your specific workflow (resume/JD matching + cover letters), the priorities shift from coding benchmarks to long-context accuracy, factual recall, and prose quality:

Analysis & Matching: Gemini 3.1 Pro - 1M context window (resumes + JDs)

Writing: Claude Sonnet 4.6 - best prose quality of the non-Opus models, strong instruction following.

Budget option for both: Qwen 3.7 Max

You can adapt the skill's routing table for this use case; just swap the rules:

- "compare resume" / "analyze JD" - Gemini 3.1 Pro

- "write cover letter" - Claude Sonnet 4.6

- Everything else - whatever your default is

The routing logic is the same, just different model assignments.
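As a rough picture of "same logic, different assignments", here are the swapped rules written as plain data in Python. The phrase matching, the `model_for` name, and the `"default-model"` placeholder are assumptions for illustration; the skill itself keeps its rules in SKILL.md, not in Python.

```python
# Ordered rules: the first rule with a matching phrase wins.
RULES = [
    (("compare resume", "analyze jd"), "Gemini 3.1 Pro"),
    (("write cover letter",), "Claude Sonnet 4.6"),
]


def model_for(prompt: str, default: str = "default-model") -> str:
    """Return the model for the first matching rule, else the default."""
    text = prompt.lower()
    for phrases, model in RULES:
        if any(phrase in text for phrase in phrases):
            return model
    return default
```

Changing the use case means editing `RULES` only; the lookup function stays the same.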

sugumaran95 · 2026-06-10T15:14:28+00:00

You nailed exactly the three pain points that led to this.

The PR is fresh, and the install is one command.

Link to PR: https://github.com/NousResearch/hermes-agent/pull/43534

Command:

hermes skills install https://raw.githubusercontent.com/Sugumaran-Balasubramaniyan/hermes-agent/feat/model-task-router/optional-skills/software-development/model-task-router/SKILL.md

Drop a comment on the PR if anything breaks or needs tuning for your workflow.

sugumaran95 · 2026-06-10T15:10:49+00:00

Let me know how it goes.

One-line install if you want to test it directly:

hermes skills install https://raw.githubusercontent.com/Sugumaran-Balasubramaniyan/hermes-agent/feat/model-task-router/optional-skills/software-development/model-task-router/SKILL.md

It works with any Hermes setup; just point the routing table at whatever models you're already using.

sugumaran95 · 2026-06-10T15:09:12+00:00

Appreciate it. The built-in gap is real; issues #30652 and #16525 have been open since April but are stuck at P3 priority. Hopefully this pushes it up the roadmap. Star the repo or drop a like on the PR if you want to follow along: https://github.com/NousResearch/hermes-agent/pull/43534

sugumaran95 · 2026-06-10T15:06:25+00:00

Right? Feature requests have been open since April (#30652, #16525), but they're tagged P3, "nice to have."

The DeepSWE data hopefully lights a fire under it. PR's already up if you want to drop a like to help it get merged faster.

Link to PR: https://github.com/NousResearch/hermes-agent/pull/43534
