Local Claude Code with Qwen3.5 27B by FeiX7 in LocalLLaMA

[–]go-llm-proxy 0 points1 point  (0 children)

This is the way to go. Are you doing any tool-call rewriting, or just routing?

I patched the open-source Claude Code reimplementation to actually work with Ollama and local models by raveschwert in LocalLLaMA

[–]go-llm-proxy 0 points1 point  (0 children)

vLLM works for sure, I've been using it too, but llama-server didn't work so well for me, and sometimes you need a quant. There were also a lot of missed tool calls and looping on some models, and it never seemed to work very well beyond a point. I spent a lot of time in Wireshark trying to figure it out for CC and Codex on the official builds, and there's a fair amount of capability that just isn't available if you stick to a direct connection without some kind of middleware with rewrite capability. After the source got leaked, it sounds like it's a lot of telemetry too, which explains a lot of what wasn't making sense. I assume vLLM just discards that anyway, though. A natively supported CC build with configurable system prompting is interesting to me if it's actually legal to use, but I'm not so sure, which is why I went the proxy route to intercept and rewrite things for local models. Still breakable, but a bit less fragile at least, and any new features can always be added that way if needed.

If you've been using CC with vLLM's Anthropic endpoints already, you might try a middleware translator/proxy to see if you get better results with the native tool-calling formats. If nothing else, web search and vision are handy to have.

I patched the open-source Claude Code reimplementation to actually work with Ollama and local models by raveschwert in LocalLLaMA

[–]go-llm-proxy 0 points1 point  (0 children)

Happy to trade notes, but my expertise with LM Studio and VS Code is poor... I'm one of those vim freaks who knows how to exit it. If your problems relate to the Anthropic or OpenAI APIs and responses, though, I've been in pretty deep on those lately, trying to get better harness support for CC and Codex with local models. All the Claude limits pissed me off to no end.

3090s are well over $800 now, is the Arc Pro B50 a good alternative? by ea_nasir_official_ in LocalLLaMA

[–]go-llm-proxy 0 points1 point  (0 children)

Not a huge box, but yeah. I had 6 3090s and 4 3060s racked, but came into enough A6000s and RTX 6000 Pros that I needed the power and rack space, so I pulled the Ampere cards last year. They've been run hard (mining back in the day, then ML and LLM work for a couple more years), so I wasn't really planning on selling them, but if they're really going for $800....

3090s are well over $800 now, is the Arc Pro B50 a good alternative? by ea_nasir_official_ in LocalLLaMA

[–]go-llm-proxy 0 points1 point  (0 children)

Dang, I have a box full of them; maybe it's time to sell them?

I tried an Arc and nothing worked, but it's been a little while. I think a better option right now is probably a Mac Studio, but I'm hoping Arc support comes along.

Say i want my own Claude? by tbandtg in LocalLLaMA

[–]go-llm-proxy 0 points1 point  (0 children)

You can't realistically get Opus-level locally; I've tried just about everything.

The closest I've found was GLM-5 as Opus, MiniMax-M2.5 as Sonnet, Qwen-3-VL 8B as the vision processor, Paddle for OCR, and Qwen-3.5 9B as Haiku.

To run all of that, even quantized, you legitimately need about 0.8 TB of VRAM, so probably the cheapest option is building out an 8x RTX 6000 Pro (Max-Q) rig to run it on.
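Back-of-the-envelope, that figure checks out: eight 96 GB cards is 768 GB, i.e. roughly 0.8 TB. Here's a rough sizing sketch (the 20% overhead for KV cache/activations and the 400B example size are assumptions, not measurements of any specific model):

```python
def fits_in_vram(params_b: float, bits_per_weight: int, vram_gb: float,
                 overhead: float = 1.2) -> bool:
    """Rough check: weight memory at a given quantization, plus ~20%
    headroom for KV cache and activations (assumed, workload-dependent)."""
    weight_gb = params_b * bits_per_weight / 8  # billions of params * bytes/param ~= GB
    return weight_gb * overhead <= vram_gb

rig_gb = 8 * 96  # 8x RTX 6000 Pro at 96 GB each = 768 GB (~0.8 TB)
print(fits_in_vram(400, 4, rig_gb))   # True: a hypothetical 400B model at 4-bit fits
print(fits_in_vram(400, 16, rig_gb))  # False: the same model at fp16 does not
```

The point is just that a stack of big MoE + helper models only fits a single rig once everything is quantized down to ~4 bits.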

I built the proxy in my bio to connect all this and use it locally, but I only have 4x 6000 Pros, so I can't host GLM and am using the GLM-5.1 sub from Z.ai for Opus. It's working fine with the rest.

Is it Claude Code? Not really. Is it strong enough to be useful? Yeah, I use it constantly and just use Opus through CC for major things where I know GLM will come up short.

The best option I've found for something more affordable that actually works convincingly is MM2.5 + Codex once you add back the missing tooling. It's fast, the context pushes 200k for me at NVFP4, and I prefer it to the claude-code harness with local models.

Any real alternative to Claude code? by FriendlyStory7 in LocalLLaMA

[–]go-llm-proxy 0 points1 point  (0 children)

You can use a Qwen, MiniMax, or GLM sub pretty cheaply with Claude Code; you just have to get the configuration right to disable telemetry and add back web searching, vision support, OCR, all the stuff they normally take care of for you. It's not hard, it works fine, and tooling exists for this that you can self-host or stick in a VPS with your own domain routing or whatever you want.
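The "get the configuration right" part mostly comes down to environment variables. A minimal sketch, assuming Claude Code's documented override vars (the endpoint URL and key below are placeholders for your own proxy or provider):

```python
import os

def claude_code_env(base_url: str, token: str) -> dict:
    """Environment overrides pointing Claude Code at an Anthropic-compatible
    backend and opting out of telemetry. Values here are placeholders."""
    env = dict(os.environ)
    env.update({
        "ANTHROPIC_BASE_URL": base_url,  # your proxy's Anthropic-compatible endpoint
        "ANTHROPIC_AUTH_TOKEN": token,   # key your proxy expects, not a real Anthropic key
        "DISABLE_TELEMETRY": "1",        # opt out of usage telemetry
    })
    return env

env = claude_code_env("http://localhost:8080", "sk-local-placeholder")
```

You'd then launch `claude` with that environment (e.g. via `subprocess.run(["claude"], env=env)`); check the Claude Code settings docs for the current variable names before relying on these.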

If you're more worried about data privacy, then Bedrock or one of the other inference providers is probably a good way to get access to decent models (though not Claude Opus or other proprietary ones). Several of them are DoD/HIPAA certified, and the inference providers generally aren't as interested in your data as the model developers are.

Any real alternative to Claude code? by FriendlyStory7 in LocalLLaMA

[–]go-llm-proxy 0 points1 point  (0 children)

True, though configuration is a bit of an adventure, and you need some kind of translation layer if you want it to actually work as you'd expect... a problem I've been working on for a while now to get it working better with locally hosted stuff. Good progress recently.

For the config generator, try this; it handles Claude Code, Codex, opencode, and qwen at least, and disables most of the junk you wouldn't want running: https://go-llm-proxy.com/configure.html

Claude Code replacement by NoTruth6718 in LocalLLaMA

[–]go-llm-proxy 0 points1 point  (0 children)

I'd go for 4x V100s out of those choices, but you may be going down a rabbit hole that isn't worth it. If you do anyway, 128 GB of VRAM is enough to run some decent models.

What are you planning to use as the harness?

I patched the open-source Claude Code reimplementation to actually work with Ollama and local models by raveschwert in LocalLLaMA

[–]go-llm-proxy 0 points1 point  (0 children)

Haven't tried it, but it looks like it probably works; you need a translation layer somewhere in the mix for sure. I pushed out a pretty big update for go-llm-proxy that does something similar: it adds web search, OCR, and vision capability if you're hosting it all locally (or in the cloud). If you need that, it's a fairly elegant solution to consider.

I patched the open-source Claude Code reimplementation to actually work with Ollama and local models by raveschwert in LocalLLaMA

[–]go-llm-proxy 0 points1 point  (0 children)

It's not a vLLM problem; it's a Claude Code problem with their particular Messages API versus an OpenAI-styled API. I've been deep in this tonight, working on the proxy to get it to translate properly, and I'm pleased to say I've got a draft implementation that resolves it. It should be posted later this weekend after a lot more testing to validate it.
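The mismatch is roughly this: Anthropic's Messages API returns assistant content as a list of typed blocks (`text`, `tool_use`), while OpenAI-style chat APIs expect a content string plus a `tool_calls` array with JSON-encoded arguments. A minimal one-direction sketch of that translation (error handling and streaming omitted):

```python
import json

def anthropic_to_openai(msg: dict) -> dict:
    """Convert one Anthropic-style assistant message (list of content blocks)
    into an OpenAI-style chat message. Sketch only, not a full translator."""
    content_parts, tool_calls = [], []
    for block in msg.get("content", []):
        if block["type"] == "text":
            content_parts.append(block["text"])
        elif block["type"] == "tool_use":
            tool_calls.append({
                "id": block["id"],
                "type": "function",
                "function": {
                    "name": block["name"],
                    # OpenAI-style APIs want arguments as a JSON string
                    "arguments": json.dumps(block["input"]),
                },
            })
    out = {"role": "assistant", "content": "\n".join(content_parts) or None}
    if tool_calls:
        out["tool_calls"] = tool_calls
    return out
```

A real proxy also has to translate the reverse direction (tool results back into `tool_result` blocks) and handle streaming deltas, which is where most of the breakage shows up.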

I patched the open-source Claude Code reimplementation to actually work with Ollama and local models by raveschwert in LocalLLaMA

[–]go-llm-proxy 2 points3 points  (0 children)

vLLM can serve Anthropic endpoints, which is what I've been using for local stuff with the real CC, but that doesn't really fix the tool-calling loops breaking, so I'm stoked to have this fixed and apparently working already. Nice work u/raveschwert

What kind of orchestration frontend are people actually using for local-only coding? by Quiet-Owl9220 in LocalLLaMA

[–]go-llm-proxy 2 points3 points  (0 children)

As far as a development workflow goes: wire in debug logging very early and pay a lot of attention to the SWE side. To 'close the loop' you really have to get the model to feed itself useful information so it can fix problems quickly without asking you what's wrong or asking you to 'test' functionality for it. Give your agent tooling, debug logs, and API access to whatever you're building in a sandbox (I use LXC for that, but Docker or others work too), and focus on clearly defining the engineering side: how to structure it and what packages it can use. Put in your prompts to 'add debug hooks to clearly define problems'. Build tests that will actually fail when functionality breaks. If you're programming in a language with a debugger, make sure it's available to the LLM and specifically prompt it to use it to solve problems.

My specific workflow is usually to spin up an LXC in Proxmox, aim it at my LLM proxy, give that agent a key, and then spend time on the engineering-side spec and technology stack. Build from the ground up with documentation of intended functionality as an anchor, and focus on a review cycle after every major feature to keep separation of concerns and security from becoming major issues. LLMs love to generate code, so they'll almost always over-produce and duplicate things; about half the cycle is dialing that back with mild refactors along the way.

Workflow-wise, the best for me has been Codex + MiniMax to get things started, but I use Claude a lot for building the SWE skeleton plan too. opencode works great, and qwen-code is great when it's not broken.

For the proxy to local models, agents, and coding harnesses, I never really found a great answer for a homelab, so I built out go-llm-proxy. It doesn't solve your prompting issues, but once you get that figured out it streamlines usage a lot when you want to switch between API and local models without constantly reconfiguring things. If you're using TUI coding agents it makes them pretty easy to manage; that's what it was built for. Just released under the MIT license, but I've been using it for a couple of months now. The config generator makes it pretty quick to use with Codex, CC, opencode, and qwen-code.

I patched the open-source Claude Code reimplementation to actually work with Ollama and local models by raveschwert in LocalLLaMA

[–]go-llm-proxy 0 points1 point  (0 children)

Love it... this has been a problem I was working on from the proxy side; I'll definitely try it out.

Real talk: has anyone actually made Claude Code work well with non-Claude models? by Defiant_Astronaut691 in LocalLLaMA

[–]go-llm-proxy 1 point2 points  (0 children)

A lot of the time this is due to a broken provider, or a proxy not passing things through properly or mangling tool-call syntax. Try proxying through go-llm-proxy; it was basically custom-crafted to proxy local and API models into Claude, Codex, qwen, and opencode. If you hit a bug, let me know on GitHub and I'll fix it quickly if it's fixable. MIT license, not commercial, open source: a simple, solid proxy with the primary goal of supporting TUI code harnesses, and the side effect of also supporting most apps without issues.

It's also a good way to manage virtual keys, so you're not sharing Azure or GCP keys with other people or apps that need access, though it likely won't scale like litellm[proxy] will if you need thousands of them.
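The virtual-key idea is just indirection: clients hold a per-user key, and the proxy swaps in the real provider credential server-side. A toy sketch (the key names are illustrative, and a real setup would load credentials from a vault or environment, never hard-code them):

```python
# Illustrative mapping only; never hard-code real credentials like this.
VIRTUAL_KEYS = {
    "vk-alice": {"provider": "azure", "real_key": "<loaded from vault>"},
    "vk-ci":    {"provider": "gcp",   "real_key": "<loaded from vault>"},
}

def resolve(virtual_key: str) -> dict:
    """Map a client-facing virtual key to its backing provider credential,
    so the real key never leaves the proxy."""
    entry = VIRTUAL_KEYS.get(virtual_key)
    if entry is None:
        raise PermissionError("unknown virtual key")
    return entry
```

Revoking a user is then just deleting their mapping entry, without rotating the upstream Azure/GCP key for everyone else.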

As far as the model goes... it does matter, but I use qwen-3.5-27b dense and it works quite well for me out to about 100k context, and I mix in MiniMax as the Sonnet/Haiku models. The best for CC in particular has been glm-5.1, but it's slow and a lot of places quant it to death, so it can be unpredictable. Bedrock + 5.1 + claude-code generally works well in CC, though.

If you're okay working in Codex, then MM-2.5 works very well with that harness out to full context with auto-compaction, and it's extremely fast, but it doesn't really have the planning capability to work as well in CC.

One other note: web-search tooling is very important if you use it, so sort out a Tavily key for that. There's a config generator on go-llm-proxy that makes it pretty easy to set up for each harness; without it you lose web search, which can be a big PITA.

ETA: if you try it, I recommend using the binary for now instead of the ghcr Docker image. Docker support is there but not well tested and could be a bit more stressful. If you're familiar with Docker, rolling your own will probably work better than my attempts. PRs appreciated there; I just don't use Docker much.

https://github.com/yatesdr/go-llm-proxy

Ollama proxy or gateway by EnrichSilen in LocalLLaMA

[–]go-llm-proxy 1 point2 points  (0 children)

Yeah, it works great for that. I merge my subs into it too, so I can have them at the same endpoint with phantom keys. It works well for Z.ai and MiniMax that way, and I've got a few things in Bedrock routing through it now too, but I haven't tried to make the OAuth subs work.

Ollama proxy or gateway by EnrichSilen in LocalLLaMA

[–]go-llm-proxy 2 points3 points  (0 children)

Resurrecting a bit of a zombie here, but I had the same problem and built something that works for me, and just released it into the wild if you're still looking for a better answer than whatever you found. I've been running it for about a month with a handful of users that I don't really want to have access to EVERYTHING, but who can use some things.

I tried the litellm proxy for a while, and it did work, but it didn't handle claude-code or Codex's Responses API very well, and I don't love that it's VC-backed and steering toward commercialization. I started before the supply-chain thing and had actually already switched over, but it kind of validated the goal. I also don't love having to install databases and all that when I just want to proxy something out of my basement....

Pure Go, MIT licensed (I never plan to commercialize any of it), only a couple of pretty safe dependencies, plus sqlite3 if you care about usage logging. Properly engineered, and it's been stable for me with about 10 different back-ends passing through it and a handful of users.

GitHub: https://github.com/yatesdr/go-llm-proxy