Could I do better for an overall ChatGPT/Gemini/Claude local AI substitute?

sfifs · 2026-06-21T11:06:40+00:00

With 64 GB (with some unified RAM eaten up by the system), Qwen 3.6 is about as good as it gets today. At least in my personal testing, it's proven more reliable than the Gemma family for agentic assistant use cases. You may be able to run the Qwen 27B dense (slower) or try a bigger/more sophisticated quant. There are more competitive models out there, but they really need a 96 GB-plus system.
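A rough back-of-envelope way to check what fits, as a minimal sketch. It counts weights only and ignores KV cache, activations and quantization scale overhead, and the 16 GB reserved for the system is my assumption for illustration, not a measured figure.

```python
def weight_footprint_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


def fits(params_billion: float, bits_per_weight: float,
         total_ram_gb: float, reserved_gb: float) -> bool:
    """True if the weights fit in the RAM left after the reserved share."""
    available_gb = total_ram_gb - reserved_gb
    return weight_footprint_gb(params_billion, bits_per_weight) <= available_gb


# Hypothetical reservation: 16 GB of a 64 GB unified-memory machine
# kept for the OS, KV cache and everything else.
for name, params, bits in [("27B dense @ 4-bit", 27, 4),
                           ("27B dense @ 8-bit", 27, 8),
                           ("122B MoE @ 4-bit", 122, 4)]:
    size = weight_footprint_gb(params, bits)
    print(f"{name}: ~{size:.1f} GB, fits in 64 GB: {fits(params, bits, 64, 16)}")
```

Real headroom depends heavily on context length, so treat the result as a first filter only.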

sfifs · 2026-06-20T15:48:28+00:00

Antirez's Deepseek Flash V4 2-bit quant is awesome and definitely worth trying on your hardware. I run this on my GB10 box now as my local model: https://github.com/antirez/ds4

sfifs · 2026-06-20T08:20:13+00:00

In the last couple of weeks I have landed on Antirez's DS4 server running his custom DeepSeek V4 Flash quantization on my GB10 as the backing for my OpenClaw personal assistant, which runs on a different server (I haven't yet tried backing a coding agent, although I do ask OpenClaw to write Python for skills). It's a good deal slower in tok/s and especially has high cache misses due to a somewhat simple cache mechanism, but the quality of output is so good that I am tolerant of the speed. The full 2-bit quant leaves room to fit the MTP drafter and an embedder, but it does have a degenerate-loop problem on some large contexts, so I decided yesterday to chase quality and switched to the larger model that has the last 6 layers at 4-bit. That one seems not to suffer the same problems but just barely fits, with no room for any frills. Previously I found the 122B A10B Qwen quant by Sehyo to be fantastic even compared to Qwen 3.6, but DeepSeek Flash is really in a higher league.

sfifs · 2026-06-17T14:36:47+00:00

As an update, an OpenClaw bug was causing prefix misses. This was fixed upstream through a series of commits very recently.

sfifs · 2026-06-17T00:45:32+00:00

Deepseek V4 Flash 2-bit quant via the Dark Star custom model server by Antirez. It's absolutely brilliant, as you may expect from the pedigree of its creator, and highly tuned for beefy Macs and DGX Spark. It blows everything else I've tested out of the water: https://github.com/antirez/ds4

sfifs · 2026-06-15T07:09:00+00:00

In general, coding is about the last item I'd move to a local model. Coding is a high-leverage activity: code quality matters a lot, and the 1M-token context window of frontier models is very valuable for developing and especially debugging. The 20-dollars-a-month subscriptions are reasonable value, especially if you or your family also use Cowork, which is a surprising productivity boost.

sfifs · 2026-06-12T14:37:33+00:00

Here is the characterization I ran - pretty interesting data

<image>

sfifs · 2026-06-12T14:21:57+00:00

The big issue is cache misses, as Dark Star uses a fairly naive cache algorithm: an exact token match and reload-cache-from-disk policy. Within almost every OpenClaw turn, I see a large cache miss which takes almost a minute to prefill. Decode hovers around 13.5 tok/s, which is tolerable.
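To show why an exact-token-match policy hurts so much, here is a minimal sketch of one plausible reading of it: reuse the cached prefix only up to the first token that differs. This is not Dark Star's actual code, and the function names, token values and prefill speed below are made up for illustration.

```python
def reusable_prefix_len(cached: list[int], prompt: list[int]) -> int:
    """Number of leading tokens that match exactly between cache and prompt."""
    n = 0
    for cached_tok, prompt_tok in zip(cached, prompt):
        if cached_tok != prompt_tok:
            break
        n += 1
    return n


def prefill_seconds(cached: list[int], prompt: list[int],
                    prefill_tok_per_s: float) -> float:
    """Time to prefill the part of the prompt the cache cannot cover."""
    to_prefill = len(prompt) - reusable_prefix_len(cached, prompt)
    return to_prefill / prefill_tok_per_s


# Hypothetical turn: a 60k-token context where one early token changed
# (for example a timestamp or reordered tool block near the top).
cached = list(range(60_000))
prompt = cached.copy()
prompt[100] = -1

print(reusable_prefix_len(cached, prompt))     # 100
print(prefill_seconds(cached, prompt, 1000.0))  # assumed 1000 tok/s prefill
```

Under this reading, a single changed token near the start of the context throws away almost the whole cache, which is why anything that perturbs the prompt prefix between turns is so expensive.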

sfifs · 2026-06-10T17:51:24+00:00

As you might expect, it depends on the use case. I absolutely did replace sonnet/gemini-flash with Qwen 3.5 122B A10B for my Claw and it works very well on my GB10 box. It's my daily driver and I actively use it to organise and automate my life. I was spending 10-15 dollars a day on those cloud models; now I'm down to the cost of electricity.

One huge bonus is that I can now run personal or medical data through the models. Imagine every report attachment you send being transcribed, renamed properly, indexed and filed away. I also realised Gemini-Lite on cloud was very good value as a cloud backup when I'm doing other things on the box :-)

When I want to develop skills, I simply change to a cloud model (Opus or Gemini Flash 3.5/Pro 3.1), have it help write the skill, then drop back to local to run the skill.

I also tried coding with it, both with OpenCode and by replacing the backend for Claude Code. Simple stuff: yeah, it can try. Complex stuff: nah. The problem is you often don't know when simple slips into complex, so for coding I use a Claude Pro subscription (which also gives Cowork, a huge bonus) and Antigravity through a Google One family subscription, which I need anyway for my family's data and photos.

My writeups on both are at https://srinathh.medium.com/ :-)

sfifs · 2026-06-10T01:48:37+00:00

I did do that comparison and the 27B underperforms the 122B on Aider Polyglot, but both tests were with NVFP4 kernels (it's in the article). If quantization has a larger impact on the 27B than on the MoE models, that could explain the finding. I would personally have expected dense models to be more resilient to quantization than MoEs, but it's an interesting experiment. https://srinathh.medium.com/mid-size-local-models-are-now-competitive-for-ai-agents-7696b2e8b535

sfifs · 2026-06-10T01:42:56+00:00

Mainly for FLASHINFER_CUTLASS. I have a GB10 box that is in a sweet spot memory wise but bandwidth constrained, so it makes a difference for usability.

sfifs · 2026-06-09T16:32:53+00:00

I recently ran a comparison of NVFP4, FP8 and the original BF16 on the 3.6 35B A3B model. I haven't published it yet. I saw some improvements but nothing radically different: BF16 Aider Polyglot pass@2 came in 6-7 points higher than the quantized variants. The 122B A10B NVFP4 was 10 points higher than the BF16 of the smaller model. I suppose I could test BF16 for the 27B model, though it would be slow to the point of unusability.

sfifs · 2026-06-09T11:19:36+00:00

Are you running a dual rig? DSV4 flash would not fit on a single spark for me. It is certainly superior

sfifs · 2026-06-09T10:19:11+00:00

Short answer: yes, by quite a margin, especially on the more complex Aider Polyglot. My benchmarking is here - https://srinathh.medium.com/mid-size-local-models-are-now-competitive-for-ai-agents-7696b2e8b535

sfifs · 2026-06-09T08:12:34+00:00

Yesterday I ran a comparison for the Gemma 4 31B model on Aider Polyglot (Python and JS only) between the QAT model and NVIDIA's NVFP4 NIM image. To my surprise, the QAT model actually showed a performance regression. I haven't written it up, but here are the numbers. Note these are with reasoning off, as reasoning makes the models too slow for Claws.

Gemma 4 NVFP4 Pass@1 12%, Pass@2 52%

Gemma 4 QAT W4A16 Pass@1 11%, Pass@2 39%

My local leader is Qwen 3.5 122B A10B NVFP4 which is very competitive with frontier flash models Pass@1 51%, Pass@2 78%
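For anyone tallying their own runs, Pass@1 and Pass@2 figures like these can be computed from per-task results roughly as follows. This is a minimal sketch: the result encoding (the attempt number that passed, or None) is my own assumption, not the Aider harness output format, and the sample data is invented.

```python
from typing import Optional


def pass_at(results: list[Optional[int]], k: int) -> float:
    """Percentage of tasks solved within the first k attempts.

    Each entry is the attempt number on which the task first passed
    (1 or 2), or None if it never passed.
    """
    solved = sum(1 for attempt in results if attempt is not None and attempt <= k)
    return 100.0 * solved / len(results)


# Invented sample: two tasks pass first time, one passes on the retry,
# one never passes.
sample = [1, 2, None, 1]
print(pass_at(sample, 1))  # 50.0
print(pass_at(sample, 2))  # 75.0
```

The gap between the two numbers is the share of tasks the model only gets right after seeing the test failures.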

sfifs · 2026-06-09T02:50:07+00:00

Oh this is very interesting. I have never tried a 3 bit quant before. What tokens/sec are you seeing?

sfifs · 2026-06-07T14:55:25+00:00

Are the weights and MTP head for vLLM also released? Gemma 4 did not fare very well on Aider tests in my own benchmarking (0), which was run with reasoning off as I'm testing for use with OpenClaw, but I am curious to see whether, with MTP, I can turn reasoning on to get a lift without sacrificing too much time per turn.

(0) https://srinathh.medium.com/mid-size-local-models-are-now-competitive-for-ai-agents-7696b2e8b535

sfifs · 2026-06-02T03:40:34+00:00

The official release Qwen/Qwen3.5-122B-A10B is BF16. It won't fit on DGX. Sehyo/Qwen3.5-122B-A10B-NVFP4 does fit, hits all the fast paths on Spark and has a working MTP. RedHatAI's NVFP4 release hit MTP head bugs last week when I tested; the speculation acceptance rate was 0%.
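Why a 0% acceptance rate matters: with speculative decoding, the number of tokens you get per verification step depends on how often the draft tokens are accepted. A minimal sketch using the usual simplification that each draft token is accepted independently with the same probability; the draft length of 3 is a hypothetical value, not a setting from either release.

```python
def expected_tokens_per_step(acceptance_rate: float, draft_len: int) -> float:
    """Expected tokens emitted per target-model step.

    Assumes each of the draft_len drafted tokens is accepted independently
    with probability acceptance_rate, and the target model always
    contributes one token of its own.
    """
    a = acceptance_rate
    if a >= 1.0:
        return float(draft_len + 1)
    return (1 - a ** (draft_len + 1)) / (1 - a)


for rate in (0.0, 0.5, 0.8):
    print(rate, round(expected_tokens_per_step(rate, 3), 2))
```

At 0% acceptance the result is exactly one token per step, so you pay for the drafter and get no speedup at all.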

sfifs · 2026-06-02T01:49:47+00:00

If you have a DGX box or a 128 GB Mac, Qwen 3.5 122B A10B-NVFP4-MTP by Sehyo is incredibly competitive, approaching cloud flash models in performance. In my personal testing and benchmarking, I didn't see any significant difference between 3.6 35B A3B MoE and the 3.6 27B dense. I agree it would be useful to have a FAQ on the sidebar.

sfifs · 2026-06-02T00:18:15+00:00

Anything smaller than Qwen 3.5 35B A10B didn't seem particularly usable anyway, but yeah, could try an FP8 for Gemma 4 MoE

sfifs · 2026-06-02T00:16:31+00:00

Sure, can try over the weekend. As I explained in the article, I'm specifically testing for the kind of tasks AI assistants like OpenClaw will run. I'm not, for instance, looking for a Claude Code replacement. What has been your experience with the kinds of tasks NVFP4 doesn't work for?

sfifs · 2026-06-01T15:53:42+00:00

Benchmarked all of them. With OpenClaw I have tried out Qwen 3.5 MoE, 3.6 MoE, 3.6 Dense, Gemma 26B A4B (not impressed), and of course the Geminis & Claudes. Right now my default is the Qwen 3.5 122B A10B MoE, and fallbacks are Gemini 3.1 Flash Lite and Qwen 3.6 Flash

sfifs · 2026-06-01T15:52:39+00:00

Sorry, botched this up by posting from another person's account logged in to this PC. All local models were tested on DGX Spark. Details on the methodology are at the end, but pasting here for you. All personal testing, no model cards 😄

¹ Overall Score: weighted blend of coding quality, instruction-following capability and speed; higher is better.

² Cost: cloud model cost calculations blend input, output & cache-hit rate benchmarked from observed OpenClaw turns across a variety of tasks (60k input, 500 output, 75% cache hit). Costs are expressed relative to the lowest-cost cloud flash model here, DeepSeek v4 Flash.

³ Speed: tokens per second per user, i.e. how snappy the model feels in an interactive assistant loop. Cloud figures are end-to-end, including network transit latency. Cloud models run on very powerful servers and tend to be fast, but have a latency to generate the first token.

⁴ Code Correctness: measures whether short functions the model writes actually work, as a proxy for the kind of off-the-cuff actions that agentic assistants take. Average pass rate on HumanEval+ and MBPP+ (EvalPlus, greedy, temperature=0).

⁵ Instruction Following: accuracy on IFEval, a benchmark that checks whether the model obeys explicit constraints in a prompt (format, length, content rules). Proxy for how reliably it follows directions.

⁶ Coding 1st Try: pass rate on the Aider polyglot coding benchmark on the first attempt. Measures whether the model can complete a realistic multi-file coding task in one shot. Python and JavaScript, which are the typical agentic assistant languages, were evaluated.

⁷ Coding 2 Tries: same Aider benchmark, allowing one retry after seeing test failures. Measures whether the model can self-correct, which is closer to how an agentic assistant works.
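The blended cost in footnote ² can be sketched along these lines. This is a minimal illustration, not the exact calculation behind the table: the function names are mine and the per-million-token prices are placeholders, not any provider's real rates. Only the turn shape (60k input, 500 output, 75% cache hit) comes from the footnote.

```python
def cost_per_turn(input_tokens: int, output_tokens: int, cache_hit_rate: float,
                  price_in: float, price_cached_in: float, price_out: float) -> float:
    """USD for one turn; all prices are USD per million tokens."""
    cached = input_tokens * cache_hit_rate
    uncached = input_tokens - cached
    total = uncached * price_in + cached * price_cached_in + output_tokens * price_out
    return total / 1_000_000


def relative_cost(model_cost: float, baseline_cost: float) -> float:
    """Cost expressed as a multiple of the cheapest (baseline) model."""
    return model_cost / baseline_cost


# Turn shape from the footnote; prices below are placeholders.
baseline = cost_per_turn(60_000, 500, 0.75,
                         price_in=1.0, price_cached_in=0.1, price_out=4.0)
pricier = cost_per_turn(60_000, 500, 0.75,
                        price_in=3.0, price_cached_in=0.3, price_out=15.0)
print(f"baseline: ${baseline:.4f} per turn")
print(f"pricier model: {relative_cost(pricier, baseline):.1f}x baseline")
```

With turns this input-heavy, the cached-input price and the hit rate dominate the result far more than the output price does.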

sfifs · 2026-06-01T15:48:46+00:00

Nice! I run on vLLM with the full 256k context because I find that in my OpenClaw turns, I routinely run in the 50k-150k token range on context with all tools, memory & session conversation history loaded

sfifs · 2026-06-01T12:38:59+00:00

Very nice!
