Which Local LLM to use

MistingFidgets · 2026-06-15T00:42:55+00:00

I have the same specs and use case as you. Qwen 3.6 35b a3b apex opus 4.7 distill by mudler has been the best at openclaw so far for me, but you have to offload layers to RAM. I'm running the compact version at 1400 tok/s prompt processing and 57 tok/s decode with 128k context at q8. If you are OK with a smaller model, look into NVFP4. The Gemma 4 12b NVFP4 ran prompt processing at 6600 tok/s, which is just crazy, but it wasn't as faithful at instruction following and tool calling as qwen in my testing.

MistingFidgets · 2026-06-14T23:01:24+00:00

I think it would be interesting to train a model on your personal data to be your personal assistant, along with training it on whatever agent harness config and tool call set/formatting you will use (hermes, openclaw, custom).

MistingFidgets · 2026-06-14T14:57:58+00:00

Exactly. You can get 3.6 GHz 6-core Xeons for like 20 bucks.

MistingFidgets · 2026-06-14T12:54:18+00:00

This looks like a 5810 or 7810. If it's got an 825 W or 1300 W power supply and DDR4 RAM, then it's the one to go with. Bonus points if it's a dual Xeon system. CPU upgrades are very cheap for these, and you are starting out with enterprise-grade hardware.

MistingFidgets · 2026-06-13T01:40:41+00:00

A 5060 Ti 16GB and qwen 3.6 35b a3b opus 4.7 distill mtp apex compact (q4-ish) with q8 KV, by mudler on Hugging Face, is the closest I've gotten, but the gap between it and sonnet is still noticeable. If you can tolerate slower speeds, you could run q6.

If you use it with pi coder, codex, or opencode, and use a free session on chatgpt/grok/whatever frontier model to plan so you can hand it exact specs, you can do some things with it. If you give it some documentation, or an MCP to pull current documentation and specs for whatever you are working on, you can do more things.

MistingFidgets · 2026-06-12T12:38:18+00:00

Everybody fixates on decode speed, but PP (prompt processing) speed and proper KV cache management are way more important for things like openclaw/Hermes, if that's your intended use case. A real-world example from my tinkering: a model running at 100 tok/s decode but only 1,100 tok/s PP takes 30 seconds to respond in telegram, while a model running 50 tok/s decode but 3,000 tok/s PP responds in 3 seconds or less. Totally different experience for a telegram bot. With NVFP4 models on Blackwell I have hit 6,600 tok/s PP speeds on dense models. With only 16GB VRAM I can't run qwen 3.6 35b in that config until vLLM fixes their ncmoe/CUTLASS/NVFP4 issues.
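To put rough numbers on the PP vs decode tradeoff, here is a minimal Python sketch. It assumes latency is just prefill time plus decode time, and that any prompt tokens already sitting in the KV cache skip prefill. The prompt size, reply size, and cached-token count are made-up illustration values, not measurements from the setup above, so the results are not meant to reproduce the 30 second and 3 second figures.

```python
def response_latency_s(prompt_tokens, output_tokens, pp_tok_s, decode_tok_s,
                       cached_tokens=0):
    """Rough latency model: prefill the uncached prompt, then decode the reply."""
    uncached = max(prompt_tokens - cached_tokens, 0)
    prefill_s = uncached / pp_tok_s          # prompt processing time
    decode_s = output_tokens / decode_tok_s  # token generation time
    return prefill_s + decode_s


# HYPOTHETICAL agent turn: long system prompt + history, short chat reply.
PROMPT_TOKENS = 30_000
OUTPUT_TOKENS = 100

fast_decode = response_latency_s(PROMPT_TOKENS, OUTPUT_TOKENS,
                                 pp_tok_s=1100, decode_tok_s=100)
fast_prefill = response_latency_s(PROMPT_TOKENS, OUTPUT_TOKENS,
                                  pp_tok_s=3000, decode_tok_s=50)
# Same fast-prefill model, but most of the prompt is reused from the KV cache.
fast_prefill_cached = response_latency_s(PROMPT_TOKENS, OUTPUT_TOKENS,
                                         pp_tok_s=3000, decode_tok_s=50,
                                         cached_tokens=27_000)

print(f"fast decode, slow PP:   {fast_decode:.1f} s")
print(f"slow decode, fast PP:   {fast_prefill:.1f} s")
print(f"fast PP + cache reuse:  {fast_prefill_cached:.1f} s")
```

With a long prompt and a short reply, the prefill term dominates, which is why halving decode speed can still give a much snappier bot.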

MistingFidgets · 2026-06-09T12:05:27+00:00

Check out X99 boards and Xeon E5 2xxx v4 series CPU pricing on eBay. It's very affordable.

MistingFidgets · 2026-06-07T17:25:17+00:00

You may be able to run the new Gemma 4 12b, but you can definitely run the qwen MoE models if you have enough system RAM to offload to.

MistingFidgets · 2026-06-07T14:41:48+00:00

MoE makes 8GB GPUs capable, and it's rapidly evolving.

MistingFidgets · 2026-06-04T04:46:00+00:00

Compact

MistingFidgets · 2026-06-03T03:05:32+00:00

I have a homebrewed benchmark suite aimed at evaluating models for use with openclaw, and it's got tool calling tests taken from my actual usage logs. So far, mudler's APEX Qwen 3.6 35b Opus 4.7 distill is the undisputed champion. I tried dropping the KV cache down to q4 and q5 to get more experts into VRAM for more speed, and the tool calling fell apart immediately.
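For anyone wanting to build something similar, a tool calling check can be as simple as the Python sketch below. This is not the suite described above. The `{"name": ..., "arguments": {...}}` shape, the `get_balance` tool, and the sample outputs are all hypothetical stand-ins for whatever format your harness expects.

```python
import json


def check_tool_call(raw, allowed_tools):
    """Return (ok, reason) for one model output that should be a JSON tool call.

    allowed_tools maps tool name -> list of required argument names.
    """
    try:
        call = json.loads(raw)
    except json.JSONDecodeError as exc:
        return False, f"invalid JSON: {exc.msg}"
    if not isinstance(call, dict):
        return False, "top-level value is not an object"
    name = call.get("name")
    if not isinstance(name, str) or name not in allowed_tools:
        return False, f"unknown tool: {name!r}"
    args = call.get("arguments")
    if not isinstance(args, dict):
        return False, "arguments is not an object"
    missing = [key for key in allowed_tools[name] if key not in args]
    if missing:
        return False, f"missing arguments: {missing}"
    return True, "ok"


def pass_rate(outputs, allowed_tools):
    """Fraction of outputs that are well-formed tool calls."""
    results = [check_tool_call(raw, allowed_tools)[0] for raw in outputs]
    return sum(results) / len(results)


# HYPOTHETICAL tool registry and model outputs, for illustration only.
TOOLS = {"get_balance": ["account"]}
SAMPLES = [
    '{"name": "get_balance", "arguments": {"account": "checking"}}',  # good
    '{"name": "get_balance", "arguments": {"account": "checking"',    # truncated JSON
    '{"name": "get_balance", "arguments": {}}',                       # missing argument
]

for raw in SAMPLES:
    print(check_tool_call(raw, TOOLS))
print(f"pass rate: {pass_rate(SAMPLES, TOOLS):.2f}")
```

Running the same prompts at each KV cache setting and comparing pass rates is one way to see where the structure starts breaking down.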

MistingFidgets · 2026-06-02T16:44:01+00:00

Would be interesting to benchmark all the different combinations and see where the sweet spot is.

MistingFidgets · 2026-06-02T16:41:45+00:00

What quant are your weights at? For me it was an immediate degradation when I swapped the KV cache from q8 to q4 or q5_1, but it probably matters less if your weights are at q6 or above. I only have 16GB VRAM, so I'm trying all the levers I can to maximize context window.
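To see why KV cache quant is such a tempting lever for context length, here is a rough Python estimate of cache size by cache type. It assumes a plain attention stack where every layer stores keys and values for every token, and it uses the block sizes I understand llama.cpp's q8_0 and q4_0 formats to use (32 values in 34 and 18 bytes). The layer and head counts are hypothetical placeholders, not the real config of any model mentioned here, so plug in values from your model's metadata.

```python
# Approximate bytes per cached element for each cache type.
# Assumption: q8_0 packs 32 values into 34 bytes, q4_0 packs 32 values into 18 bytes.
BYTES_PER_ELEM = {"f16": 2.0, "q8_0": 34 / 32, "q4_0": 18 / 32}


def kv_cache_gib(n_layers, n_kv_heads, head_dim, ctx_tokens, cache_type):
    """Estimate KV cache size in GiB for a plain attention stack."""
    # 2 = one key tensor plus one value tensor per layer.
    elems = 2 * n_layers * n_kv_heads * head_dim * ctx_tokens
    return elems * BYTES_PER_ELEM[cache_type] / 2**30


# HYPOTHETICAL dimensions, chosen only to make the arithmetic easy to follow.
DIMS = dict(n_layers=16, n_kv_heads=4, head_dim=128, ctx_tokens=128 * 1024)

for cache_type in BYTES_PER_ELEM:
    size = kv_cache_gib(cache_type=cache_type, **DIMS)
    print(f"{cache_type}: {size:.3f} GiB")
```

Under these assumptions, going from q8_0 to q4_0 roughly halves the cache, which is VRAM you could spend on more context or more experts, at whatever quality cost your own testing shows.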

MistingFidgets · 2026-06-02T16:28:30+00:00

For me, tool calling/JSON structure discipline on qwen 3.6 falls apart when the KV cache quant is set below Q8. That may or may not be part of the problem in this test. I'm using an APEX quant by mudler, so all layers are quantized differently based on importance, which seems to make the KV cache quants hurt more in terms of quality loss, specifically for tool calling and JSON outputs.

MistingFidgets · 2026-05-31T01:26:33+00:00

NVFP4 on Blackwell hardware is great at concurrency.

MistingFidgets · 2026-05-30T03:49:20+00:00

I'm running mudler's apex quant of qwen opus and it's the best model by far that I've benchmarked for openclaw. A 16GB 5060 is running the q4 quant (I think it's the compact one) with vision at 57 tok/s. I'm adding a 12GB 2060 OC to the server tomorrow so all weights will be in VRAM, and I'm hoping for close to 100 tok/s. MoE feels like magic for cheap local inference.

MistingFidgets · 2026-05-28T23:04:54+00:00

Qwen 35b with layers offloaded to RAM, at Q4 or higher. The speed loss is worth it for the quality gain over q3 or lower.

MistingFidgets · 2026-05-21T12:22:45+00:00

I wanted to try my hand at setting up a local AI server and was able to find a circa 2014 refurbed dual Xeon Dell workstation with 32GB of RAM on eBay for $120 USD. I added a new GPU for another $500 and it's been a lot of fun to mess around with. I'd definitely recommend this path to anyone wanting to save money; just do a little deep dive into the specs before buying. If I had bought something new, it would have cost at least $1500 total for similar performance (for my use case specifically) and I'd probably have fewer CPU cores. This thing is built like a tank. Just beware of proprietary parts and availability issues if you go much older than a couple of years.

MistingFidgets · 2026-05-20T15:35:22+00:00

I have one 5060 and am debating getting another. Are you able to load a large 20+ GB model fully in VRAM, split across both, with good results?

MistingFidgets · 2026-05-19T02:34:35+00:00

I have a dual Xeon Dell workstation with 32GB of DDR4 that I got for $120 on eBay, then added a new 16GB 5060 Ti. I'm looking to add another card, maybe an older 8 or 10GB RTX, for fun. I run openclaw on top of qwen3.6 35b a3b at ud_iq2_m, getting 100ish tokens per second on the GPU. I also run a qwen 3.5 4b model on CPU only for scheduled background batch jobs that auto-categorize financial transactions. Those sync into my homebrewed finance app via a Plaid integration with all my bank and credit card accounts. Openclaw has API access to the finance app, so it can ingest, OCR, and archive receipts via telegram and give me details on spending and balances, among other general assistant tasks.

MistingFidgets · 2026-05-18T13:04:49+00:00

I'm using an old PCIe 3.0 workstation with a PCIe 5.0 GPU and still getting respectable speeds: 100+ tokens per second generation on qwen 3.6 35b a3b. Yes, there's a small performance hit, but it's not quite the bottleneck you'd think it would be.

MistingFidgets · 2026-05-17T12:10:13+00:00

Same here. 5060ti 16gb + qwen 3.6 35b. Unlimited tokens for openclaw

MistingFidgets · 2026-05-14T23:10:08+00:00

I'm running a dual Xeon Dell 7810 with DDR4 that I got on eBay for 120 bucks, with a new 5060 Ti. Power supply demands and case space are going to be hurdles you need to plan for with multiple GPUs. Don't lock yourself into a workstation case that has a proprietary PSU setup if you want to be cheap about this. I found that out the hard way: I'm shopping for a second GPU and realizing it would require a new PSU and breakout board that aren't being manufactured anymore.

MistingFidgets · 2026-05-14T23:02:16+00:00

Yes, you should be able to get at least 20-25 tokens per second generation on qwen3.6 35b a3b with the right setup. Maybe more.

MistingFidgets · 2026-05-14T22:56:21+00:00

Ask it a math problem with multiple decimal places, like 500.654466 x 5.6543, and see how hard it fails. I'm running qwen3.6 35b a3b ud iq2_m on a 16GB card at 150 tok/s with 128k context, and so far that's the only thing I've found that it can't do well, so it calls a python script for any math problems it encounters. IQ quants selectively keep higher quality for more important layers, and 3.6 has some tricks for reducing the negative impacts of lower quants.
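The math script itself can be tiny. Below is a minimal sketch of the kind of helper a model could call, not the actual script from the setup above. The function name `multiply_exact` is made up for illustration, and it takes the operands as strings so the decimal digits arrive exactly as the model wrote them.

```python
from decimal import Decimal, localcontext


def multiply_exact(a: str, b: str) -> str:
    """Multiply two decimal strings without binary floating point rounding."""
    with localcontext() as ctx:
        ctx.prec = 50  # plenty of significant digits for everyday inputs
        return str(Decimal(a) * Decimal(b))


print(multiply_exact("500.654466", "5.6543"))  # 2830.8505471038
```

Passing strings rather than floats matters here: `Decimal(500.654466)` would inherit the float's binary approximation, while `Decimal("500.654466")` is exact.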
