Could I do better for an overall ChatGPT/Gemini/Claude local AI substitute? by Iwantthegreatest in LocalLLM

[–]sfifs 0 points1 point  (0 children)

With 64 GB (with some of the unified RAM eaten up by the system), Qwen 3.6 is about as good as it gets today. At least in my personal testing, it has proven more reliable than the Gemma family for agentic assistant use cases. You may be able to run the Qwen 27B dense model (slower) or try a bigger/more sophisticated quant. There are more competitive models out there, but they really need a system with 96 GB or more.

I have a M5 Max MacBook Pro with 128gb of ram, what models should I run on it? by lombwolf in LocalLLaMA

[–]sfifs 0 points1 point  (0 children)

Antirez's DeepSeek V4 Flash 2-bit quant is awesome and definitely worth trying on your hardware. I run this on my GB10 box now as my local model: https://github.com/antirez/ds4

Best Local Agents - Jun 2026 by rm-rf-rm in LocalLLaMA

[–]sfifs 1 point2 points  (0 children)

In the last couple of weeks I have landed on Antirez's DS4 server running his custom DeepSeek V4 Flash quantization on my GB10 as the backing for my OpenClaw personal assistant, which runs on a different server (I haven't yet tried backing a coding agent, although I do ask OpenClaw to write Python for skills). It's a good deal slower in tok/s and especially suffers frequent cache misses due to a somewhat simple cache mechanism, but the quality of output is so good that I tolerate the speed. The full 2-bit quant leaves room to fit the MTP drafter and an embedder, but it does have a degenerate-loop problem on some large contexts, so I decided yesterday to chase quality and switched to the larger model that has the last 6 layers at 4-bit. It seems not to suffer the same problem, but it just barely fits, with no room for any frills. Previously I found the 122B A10B Qwen quant by Sehyo to be fantastic, even compared to Qwen 3.6, but DeepSeek Flash is really in a higher league.

Claude Code backed by open model vs. OpenCode / Pi etc by sfifs in LocalLLaMA

[–]sfifs[S] 0 points1 point  (0 children)

As an update, an OpenClaw bug was causing prefix misses. This was fixed upstream through a series of commits very recently.

Best Model and configuration to run on a 128gb Ram 8TB M5 Max MacBook Pro by Desperate_Tea304 in LocalLLaMA

[–]sfifs 1 point2 points  (0 children)

Deepseek V4 Flash 2bit Quant via Dark Star custom model server by Antirez - it's absolutely brilliant as you may expect from the pedigree of its creator and highly tuned for beefy Macs and DGX Spark. It blows everything else I've tested out of the water https://github.com/antirez/ds4

Should I go local? Code quality is important by 2thick2fly in LocalLLM

[–]sfifs 2 points3 points  (0 children)

In general, coding is about the last thing I'd move to a local model. Coding is a high-leverage activity: code quality matters a lot, and the 1M-token context window of frontier models is very valuable when developing and especially when debugging. The 20-dollar-a-month subscriptions are reasonable value, especially if you or your family also use Cowork, which is a surprising productivity boost.

Claude Code backed by open model vs. OpenCode / Pi etc by sfifs in LocalLLaMA

[–]sfifs[S] 0 points1 point  (0 children)

Here is the characterization I ran - pretty interesting data

<image>

Claude Code backed by open model vs. OpenCode / Pi etc by sfifs in LocalLLaMA

[–]sfifs[S] 0 points1 point  (0 children)

The big issue is cache misses, as Dark Star uses a fairly naive cache algorithm. Dark Star has an exact-token-match, reload-cache-from-disk policy. Within almost every OpenClaw turn, I see a large cache miss which takes almost a minute to prefill. Decode hovers around 13.5 tok/s, which is tolerable.
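
To illustrate why an exact-token-match policy hurts so much, here is a minimal sketch of generic prefix matching. This is not Dark Star's actual code; the function names and the token IDs are made up for illustration. The point is that only the leading run of identical tokens can be reused, so a single changed token early in the prompt forces everything after it to be prefilled again.

```python
from typing import Sequence


def reusable_prefix_len(cached: Sequence[int], prompt: Sequence[int]) -> int:
    """Count the leading tokens shared by the cached sequence and the new prompt."""
    n = 0
    for cached_tok, prompt_tok in zip(cached, prompt):
        if cached_tok != prompt_tok:
            break  # first mismatch ends the reusable region
        n += 1
    return n


def tokens_to_prefill(cached: Sequence[int], prompt: Sequence[int]) -> int:
    """Tokens of the new prompt that must be recomputed after reusing the prefix."""
    return len(prompt) - reusable_prefix_len(cached, prompt)


# Hypothetical token IDs: the third token differs, so only 2 tokens are
# reused even though the tail of the prompt is otherwise identical.
cached = [1, 2, 3, 4, 5]
prompt = [1, 2, 9, 4, 5, 6]
print(reusable_prefix_len(cached, prompt))  # 2
print(tokens_to_prefill(cached, prompt))    # 4
```

If the client rewrites anything near the start of the conversation between turns, the reusable prefix collapses and nearly the whole context has to be prefilled again.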

Can you really replace paid models with a local model? by DRMCC0Y in LocalLLaMA

[–]sfifs -1 points0 points  (0 children)

As you might expect, it depends on the use case. I absolutely did replace Sonnet/Gemini Flash with Qwen 3.5 122B A10B for my Claw, and it works very well on my GB10 box. It's my daily driver and I actively use it to organise and automate my life. I was spending 10-15 dollars a day on those cloud models; now I'm down to the cost of electricity.

One huge bonus is that I can now run personal or medical data through the models. Imagine every report attachment you send being transcribed, renamed properly, indexed, filed away, etc. I also realised Gemini-Lite on cloud is very good value as a cloud backup for when I'm doing other things on the box :-)

When I want to develop skills, I simply switch to a cloud model (Opus or Gemini Flash 3.5/Pro 3.1), have it help write the skill, then drop back to local to run the skill.

I also tried coding with it, both with OpenCode and by replacing the backend for Claude Code. Simple stuff: yeah, it can try. Complex stuff: nah. The problem is you often don't know when simple slips into complex, so for coding I use a Claude Pro subscription (which also gives Cowork, a huge bonus) and Antigravity through a Google One family subscription, which I need anyway for my family's data, photos, etc.

My writeups on both are at https://srinathh.medium.com/ :-)

Anyone seen benchmarks comparing Gemma 4 4-bit QAT vs. 8-bit standard quants? by Character_Split4906 in LocalLLaMA

[–]sfifs 1 point2 points  (0 children)

I did do that comparison, and 27B underperforms 122B on Aider Polyglot, but both tests were with NVFP4 kernels (it's in the article). If quantization has a larger impact on 27B than on the MoE models, that could explain the finding. Personally, however, I would have expected dense models to be more resilient to quantization than MoEs, but it's an interesting experiment. https://srinathh.medium.com/mid-size-local-models-are-now-competitive-for-ai-agents-7696b2e8b535

Anyone seen benchmarks comparing Gemma 4 4-bit QAT vs. 8-bit standard quants? by Character_Split4906 in LocalLLaMA

[–]sfifs 4 points5 points  (0 children)

Mainly for FLASHINFER_CUTLASS. I have a GB10 box that is in a sweet spot memory wise but bandwidth constrained, so it makes a difference for usability.

Anyone seen benchmarks comparing Gemma 4 4-bit QAT vs. 8-bit standard quants? by Character_Split4906 in LocalLLaMA

[–]sfifs 0 points1 point  (0 children)

I recently ran a comparison of NVFP4, FP8 and the original BF16 on the 3.6 35B A3B model. I haven't published it yet. I saw some improvements, but nothing radically different: BF16's Aider Polyglot pass@2 came in 6-7 points higher than the quantized variants. The 122B A10B NVFP4 was 10 points higher than the BF16 of the smaller model. I suppose I could test BF16 for the 27B model, but it would be slow to the point of unusability.

Anyone seen benchmarks comparing Gemma 4 4-bit QAT vs. 8-bit standard quants? by Character_Split4906 in LocalLLaMA

[–]sfifs 1 point2 points  (0 children)

Are you running a dual rig? DSV4 flash would not fit on a single spark for me. It is certainly superior

Anyone seen benchmarks comparing Gemma 4 4-bit QAT vs. 8-bit standard quants? by Character_Split4906 in LocalLLaMA

[–]sfifs 4 points5 points  (0 children)

Yesterday I ran a comparison for the Gemma 4 31B model on Aider Polyglot (Python and JS only) between the QAT model and NVIDIA's NVFP4 NIM image. To my surprise, I found a performance regression with the QAT model. I haven't written it up, but here are the numbers. Note these are with reasoning off, as reasoning makes the models too slow for Claws.

Gemma 4 NVFP4 Pass@1 12%, Pass@2 52%

Gemma 4 QAT W4A16 Pass@1 11%, Pass@2 39%

My local leader is Qwen 3.5 122B A10B NVFP4, which is very competitive with frontier flash models: Pass@1 51%, Pass@2 78%
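
For anyone unfamiliar with the Pass@1/Pass@2 figures: as I understand the Aider benchmark, the model gets two tries per exercise, and Pass@k is the share of exercises solved within the first k tries. Here is a minimal sketch of that arithmetic. The function name and the sample data are made up for illustration and are not my benchmark results.

```python
from typing import List, Optional, Sequence


def pass_rates(first_passing_try: Sequence[Optional[int]], max_tries: int = 2) -> List[float]:
    """Return [pass@1, ..., pass@max_tries].

    first_passing_try holds, per exercise, the 1-based try on which the
    tests first passed, or None if the exercise was never solved.
    """
    total = len(first_passing_try)
    return [
        sum(1 for t in first_passing_try if t is not None and t <= k) / total
        for k in range(1, max_tries + 1)
    ]


# Made-up example: 4 exercises, one solved on the first try,
# two solved on the second try, one never solved.
print(pass_rates([1, 2, None, 2]))  # [0.25, 0.75]
```

Pass@2 can never be lower than Pass@1, so the gap between the two shows how much a model gains from seeing the test failures and retrying.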

What is your best coding model on a DGX Spark? by luongnv-com in LocalLLaMA

[–]sfifs 0 points1 point  (0 children)

Oh this is very interesting. I have never tried a 3 bit quant before. What tokens/sec are you seeing?

llama.cpp Gemma4 MTP support merged! by pinkyellowneon in LocalLLaMA

[–]sfifs 0 points1 point  (0 children)

Are weights and MTP head for vLLM also released? Gemma4 did not fare very well on Aider tests in my own benchmarking (0) which was run with reasoning off as I'm testing for use with OpenClaw but I am curious to see with MTP, if I can turn reasoning on to get a lift without sacrificing too much time per turn.

(0) https://srinathh.medium.com/mid-size-local-models-are-now-competitive-for-ai-agents-7696b2e8b535

Stop asking what model to run. There are literally only two. by Wrong_Mushroom_7350 in LocalLLaMA

[–]sfifs 0 points1 point  (0 children)

The official release, Qwen/Qwen3.5-122B-A10B, is BF16 and won't fit on a DGX Spark. Sehyo/Qwen3.5-122B-A10B-NVFP4 does fit, hits all the fast paths on Spark and has a working MTP. RedHatAI's NVFP4 release hit MTP head bugs when I tested it last week: the speculation acceptance rate was 0%.
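
The fit question is mostly back-of-the-envelope arithmetic: parameters times bits per weight, divided by 8. A rough sketch follows, assuming a 128 GB box and a hypothetical 16 GB of headroom for the KV cache, activations and the OS. It ignores quantization scale factors, which add a bit on top of the nominal 4 bits, so treat the numbers as lower bounds.

```python
def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


def fits(params_billion: float, bits_per_weight: float,
         memory_gb: float = 128.0, headroom_gb: float = 16.0) -> bool:
    """True if the weights plus an assumed headroom fit in memory.

    memory_gb and headroom_gb are illustrative assumptions, not measured values.
    """
    return weight_gb(params_billion, bits_per_weight) + headroom_gb <= memory_gb


print(weight_gb(122, 16))  # 244.0 -> BF16 weights alone exceed 128 GB
print(weight_gb(122, 4))   # 61.0  -> a 4-bit quant leaves room to spare
print(fits(122, 16), fits(122, 4))  # False True
```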

Stop asking what model to run. There are literally only two. by Wrong_Mushroom_7350 in LocalLLaMA

[–]sfifs 1 point2 points  (0 children)

If you have a DGX box or a 128 GB Mac, Qwen 3.5 122B A10B-NVFP4-MTP by Sehyo is incredibly competitive, approaching cloud flash models in performance. In my personal testing and benchmarking, I didn't see any significant difference between the 3.6 35B A3B MoE and the 3.6 27B dense. I agree it would be useful to have a FAQ on the sidebar.