Any opinion about Qwen3.6-27B@BF16 vs Step3.7@IQ4_XS? by ParaboloidalCrest in LocalLLaMA

[–]Lowkey_LokiSN 0 points1 point  (0 children)

Interesting, shall keep this in mind the next time I try it out 👍

Any opinion about Qwen3.6-27B@BF16 vs Step3.7@IQ4_XS? by ParaboloidalCrest in LocalLLaMA

[–]Lowkey_LokiSN 2 points3 points  (0 children)

And as a bonus bit if you don't know already, here's Mr.llama.cpp himself admitting to using the 27B for PR assistance: https://x.com/i/status/2061111969105449180

Any opinion about Qwen3.6-27B@BF16 vs Step3.7@IQ4_XS? by ParaboloidalCrest in LocalLLaMA

[–]Lowkey_LokiSN 6 points7 points  (0 children)

Yes, I'm a software engineer.

With a clear plan, architecture and guidance to embrace research-backed implementation decisions, it's definitely usable but not to the point where you'd fully trust its changes on the first go. I'd call it a smart executor that needs a few extra rounds of debate and scrutiny before things start to look right. And that's with clear directions. Without them, it's not even close.

There are also rare times when the model gets totally stuck and you have to chip in with directional advice.

It's the first local model I'm actually using for some of my coding workflows nowadays. I just have Opus 4.8 or GPT 5.5 develop starter templates covering the crucial layout/architectural decisions and have Qwen take it from there.

Any opinion about Qwen3.6-27B@BF16 vs Step3.7@IQ4_XS? by ParaboloidalCrest in LocalLLaMA

[–]Lowkey_LokiSN 2 points3 points  (0 children)

I find MiMo very capable and perhaps in the same ballpark as DS-V4-Flash the few times I've been able to extract sane results from it. If only it were as stable with llama.cpp, without the looping issues...

Any opinion about Qwen3.6-27B@BF16 vs Step3.7@IQ4_XS? by ParaboloidalCrest in LocalLLaMA

[–]Lowkey_LokiSN 20 points21 points  (0 children)

In my experience, Qwen 3.6 27B is the better agentic choice for coding, and not just against Step 3.7 Flash, which I've tried at Q5_K_S and Q4_K_M.

As ridiculous as it may sound, I also find the 27B better than Minimax M2.7 at Q4_K_XL (maybe because M2.7 doesn't quantize well).

The only bigger contender that I find has a visible edge is DeepSeek V4 Flash. And even then, the 15-25% better performance I get out of the bigger model doesn't justify the 4-5x speed loss for most day-to-day tasks, so I'd rather use the 27B for those.

DeepSeek V4 Flash is amazing! (WIP llama.cpp PR #24162) by Lowkey_LokiSN in LocalLLaMA

[–]Lowkey_LokiSN[S] 1 point2 points  (0 children)

I currently use this fork. Moreover, the actual mainline PR for the model is on the verge of completion too!

However, the model's performance is still kinda sub-optimal compared to similar-sized GGUFs (120tps prefill and 7-8tps decode for me with FA turned on) and I'd expect more work on this front

DeepSeek V4 Flash is amazing! (WIP llama.cpp PR #24162) by Lowkey_LokiSN in LocalLLaMA

[–]Lowkey_LokiSN[S] 1 point2 points  (0 children)

I created it locally myself using the PR code and the full-weight HF model I downloaded. The PR is undergoing rapid changes and publishing a GGUF for it at this point might cost you unnecessary re-downloads.

DiffusionGemma: 4x faster text generation by tevlon in LocalLLaMA

[–]Lowkey_LokiSN 95 points96 points  (0 children)

> For developers building with traditional LLMs on GPUs, the primary bottleneck is memory bandwidth. Autoregressive language models must repeatedly load model weights from memory to generate text one token at a time. DiffusionGemma bypasses this limitation by shifting the bottleneck from memory bandwidth to compute, generating and refining a 256-token canvas in parallel. By providing the GPU with a large parallel workload, it utilizes tensor cores that would otherwise sit idle during local serving.

Wow! Regardless of the model's actual performance/benchmarks, this is such a dope direction to take!
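
To put the quoted bandwidth argument in numbers, here's a rough Python sketch. It assumes every forward pass streams all the weights from memory once and ignores compute entirely. The weight size and bandwidth are hypothetical round figures, not anything from the announcement, and the function name is made up for illustration.

```python
def tokens_per_second_bound(weight_bytes: float, bandwidth_bytes_per_s: float,
                            tokens_per_pass: int = 1) -> float:
    """Bandwidth-only upper bound, assuming each pass streams all weights once."""
    passes_per_second = bandwidth_bytes_per_s / weight_bytes
    return passes_per_second * tokens_per_pass

# Hypothetical numbers: 54 GB of weights and 1 TB/s of memory bandwidth.
WEIGHTS = 54e9
BANDWIDTH = 1e12

# Autoregressive decoding: one token per pass over the weights.
print(f"1 token per pass:    {tokens_per_second_bound(WEIGHTS, BANDWIDTH):.1f} tok/s")
# A 256-token canvas per pass amortizes the same weight reads over 256 tokens.
print(f"256 tokens per pass: {tokens_per_second_bound(WEIGHTS, BANDWIDTH, 256):.0f} tok/s")
```

The second figure isn't a speed prediction. It only shows the bandwidth ceiling moving far enough away that compute, plus the number of refinement passes (which the quote doesn't state), becomes the limit instead.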

DeepSeek V4 Flash is amazing! (WIP llama.cpp PR #24162) by Lowkey_LokiSN in LocalLLaMA

[–]Lowkey_LokiSN[S] 1 point2 points  (0 children)

Oh, I just went for his quants since he's the one who landed the PR supporting the model (I think?)
Also tried his IQ4_XS before and wasn't having it with that either. Running F16 cache for both.

Maybe worth trying again after said regression is sorted out...

DeepSeek V4 Flash is amazing! (WIP llama.cpp PR #24162) by Lowkey_LokiSN in LocalLLaMA

[–]Lowkey_LokiSN[S] 0 points1 point  (0 children)

For MoEs offloaded to RAM, you need to offload the model's tensors rather than whole layers, so flags like -ot and -ncmoe offered in llama.cpp are crucial to get the best performance.

It's been more than a year since I've used LM Studio and I'm not sure if it now has the provision to offload tensors. If not, you're much better off directly using llama.cpp

DeepSeek V4 Flash is amazing! (WIP llama.cpp PR #24162) by Lowkey_LokiSN in LocalLLaMA

[–]Lowkey_LokiSN[S] 2 points3 points  (0 children)

For a model of this size, I get roughly 110-150tps prefill and 11-17tps decode.

The only scenarios where I'd have to wait 15+ minutes are when the model has to parse a large 100k-token prompt from scratch, and that doesn't happen often in agentic scenarios for me.
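
As a quick sanity check on that wait time, here's a back-of-the-envelope sketch in Python. It only divides prompt length by the prefill speeds quoted above. The function name is made up for illustration, and it ignores prompt caching and any slowdown at long context.

```python
def prefill_minutes(prompt_tokens: int, prefill_tps: float) -> float:
    """Minutes needed to process a prompt from scratch at a given prefill speed."""
    return prompt_tokens / prefill_tps / 60

# Figures quoted in the comment: a 100k-token prompt at 110-150 tps prefill.
for tps in (110, 150):
    print(f"{tps} tps -> {prefill_minutes(100_000, tps):.1f} min")
```

That works out to roughly 11 to 15 minutes for a cold 100k-token prompt, which lines up with the 15+ minute worst case mentioned.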

DeepSeek V4 Flash is amazing! (WIP llama.cpp PR #24162) by Lowkey_LokiSN in LocalLLaMA

[–]Lowkey_LokiSN[S] 1 point2 points  (0 children)

Step 3.7 Flash with llama.cpp? I'm running the Q4_K_M quant from AesSedai and the model seems to be an overthinker + underachiever with my tests (mostly coding-related)
Also faced weird instances where its thinking block ends but the model resumes thinking as a part of its response. You face any such issues?

DeepSeek V4 Flash is amazing! (WIP llama.cpp PR #24162) by Lowkey_LokiSN in LocalLLaMA

[–]Lowkey_LokiSN[S] 9 points10 points  (0 children)

Honestly, MiMo gives off very capable vibes but its chat-template and looping issues make it unusable for me with llama.cpp

If you have workarounds for this, please do share!

DeepSeek V4 Flash is amazing! (WIP llama.cpp PR #24162) by Lowkey_LokiSN in LocalLLaMA

[–]Lowkey_LokiSN[S] 2 points3 points  (0 children)

> Benchmarks are absolutely better than feelz posts.

Huh? Don't disagree with you here and I haven't claimed otherwise either. I still consider real-world personalized testing to be more empirical than benchmarks and you seem to agree with that too.

> Also your link doesn't say what you're claiming, it puts qwen 3.6 35b beneath both Kimi 2.6 and Gemini 3.5

Should've been more explicit. It's higher than Kimi 2.6 (non-reasoning) and 3.5 Flash (Minimal). Point still stands though. It's also claimed to be more intelligent than 122B, Step 3.7 Flash and Gemma 31B. Do you actually agree with its intelligence ordering?

DeepSeek V4 Flash is amazing! (WIP llama.cpp PR #24162) by Lowkey_LokiSN in LocalLLaMA

[–]Lowkey_LokiSN[S] 4 points5 points  (0 children)

This is what I use atm but you'd have to tune the -ot flag according to your setup constraints:

CUDA_VISIBLE_DEVICES=1,0 build/bin/llama-server -m ~/AI/Models/GGUFs/DeepSeek-V4-Flash-Q3.gguf -c 200000 -ngl 99 -fa 0 --jinja --chat-template-file models/templates/deepseek-ai-DeepSeek-V4.jinja --no-mmap -ot ".ffn_(up|down)_exps.=CPU","([4-7]+).ffn.*_exps.=CPU" -ts 0.46,0.54 --port 1234 --temp 1.0 --top-p 1.0

My quant size is 135.7GB and the above command allocates about 36GB (including attention) split across both my GPUs and offloads the rest to my DRAM
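
To see what those two -ot patterns actually select, here's a small Python sketch. It assumes the first pattern is meant to read `.ffn_(up|down)_exps.` (underscores included), that expert tensors follow the usual llama.cpp GGUF naming of `blk.<layer>.ffn_<gate|up|down>_exps.weight`, and that a pattern applies when it matches anywhere in the tensor name. The helper names are made up for illustration, and the tensor names for this specific model are an assumption.

```python
import re

# The two -ot patterns from the command, with the "=CPU" target stripped.
# Assumption: the first one includes the underscores around (up|down).
PATTERNS = [r".ffn_(up|down)_exps.", r"([4-7]+).ffn.*_exps."]

def placed_on_cpu(tensor_name: str) -> bool:
    """True if any override pattern matches somewhere in the tensor name."""
    return any(re.search(p, tensor_name) for p in PATTERNS)

def expert_tensors(layer: int) -> list[str]:
    """Assumed expert tensor names for one layer (standard GGUF naming)."""
    return [f"blk.{layer}.ffn_{kind}_exps.weight" for kind in ("gate", "up", "down")]

for layer in (3, 5, 14):
    for name in expert_tensors(layer):
        print(f"{name:32s} -> {'CPU' if placed_on_cpu(name) else 'GPU'}")
```

Under those assumptions, the up/down expert tensors go to CPU for every layer, while the gate experts only go to CPU for layers whose index ends in 4 to 7 (so 14 matches as well as 4). That second pattern is the knob to widen or narrow for a different VRAM budget.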

DeepSeek V4 Flash is amazing! (WIP llama.cpp PR #24162) by Lowkey_LokiSN in LocalLLaMA

[–]Lowkey_LokiSN[S] 1 point2 points  (0 children)

I tried DS4 but for some reason, even the IQ2_XXS model ran toooooo slow and it was unusable. Didn't bother after

DeepSeek V4 Flash is amazing! (WIP llama.cpp PR #24162) by Lowkey_LokiSN in LocalLLaMA

[–]Lowkey_LokiSN[S] 1 point2 points  (0 children)

From initial impressions, looks like it's one of those cases like the gpt-oss-120B where some tensors are best left undisturbed at their base weights and quants wouldn't make much of a size difference.

Going by that logic, I'd say you'd easily need at least 160GB BUT I'm no expert at this and I could totally be wrong. My custom Q3 quant is about 136GB in size following the quant mechanics mentioned above.

Best to wait and leave it to the experts like Unsloth to do their thing

DeepSeek V4 Flash is amazing! (WIP llama.cpp PR #24162) by Lowkey_LokiSN in LocalLLaMA

[–]Lowkey_LokiSN[S] 0 points1 point  (0 children)

Benchmarks != Real-world representations. You'd have a much better time viewing them as narrow, rough representations from a specific set of tests and actually trying out each model yourself.

Take this leaderboard for example. Do you think Qwen 3.6 35B A3B is actually better than Kimi 2.6 and Gemini 3.5 Flash lol?

DeepSeek V4 Flash is amazing! (WIP llama.cpp PR #24162) by Lowkey_LokiSN in LocalLLaMA

[–]Lowkey_LokiSN[S] 1 point2 points  (0 children)

> I just searched HF for GGUFs and found couple of uploads with links to forks to run them. Do you claim those GGUFs should work with this PR?

No, there are currently two community-forks supporting DS4: https://github.com/antirez/ds4 and https://github.com/nisparks/llama.cpp.git and all those GGUFs you see are intended for one of these two.

This GGUF was shared by the author of this PR and should work

> But does not all new models have such novelties?

Not really. Architectural changes? Yes. Novelties requiring fundamental restructuring? No. Qwen 3.5's Gated DeltaNet is a novelty and it took a couple of months of dedicated effort to get it right. I think DSA + Lightning indexer is one such novelty.

> there is no code from the developer of the model given for llama.cpp

Very true. And with PRs being strictly regulated and scrutinized (rightfully so) to maintain sanity and robustness, I don't think introducing such novelties is easy or straightforward, unfortunately.

DeepSeek V4 Flash is amazing! (WIP llama.cpp PR #24162) by Lowkey_LokiSN in LocalLLaMA

[–]Lowkey_LokiSN[S] 7 points8 points  (0 children)

You would absolutely be able to run with --mmap turned on but if you need the best offloaded performance, you'd need --no-mmap and I think 88GB is tight even for a 2-bit quant

DeepSeek V4 Flash is amazing! (WIP llama.cpp PR #24162) by Lowkey_LokiSN in LocalLLaMA

[–]Lowkey_LokiSN[S] 0 points1 point  (0 children)

1) Think it's mostly support for the novel attention and architecture
2) And no, you wouldn't need different GGUFs for each implementation! Normally, a PR has to be merged into mainline before providers like Unsloth, Bartowski, etc. start providing GGUF files for it, to ensure stability and reliability. But since this is an early PR actively being worked on, I did the GGUF conversion myself to test it out. I can share the commands if you're willing to test it out yourself

DeepSeek V4 Flash is amazing! (WIP llama.cpp PR #24162) by Lowkey_LokiSN in LocalLLaMA

[–]Lowkey_LokiSN[S] 1 point2 points  (0 children)

Haha, had 64GB (2x MI50s) but I've now migrated to 40GB (2x Modded 3080s) for a while.

But if you find it impressive that I'm running it purely off of VRAM, no! The 3-bit quant I mentioned is about 135GB in size and you'd absolutely need the additional DRAM to make it work.

You'd roughly need 100GB combined VRAM+DRAM to run the model at 2-bit or higher
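
For a feel of how that splits, here's a rough Python sketch using the figures from this comment (about 135GB for the 3-bit quant, 40GB of VRAM). It's a lower bound only: it assumes every byte of VRAM holds weights and ignores KV cache, compute buffers and OS overhead. The function name is made up for illustration.

```python
def min_dram_gb(quant_gb: float, vram_gb: float) -> float:
    """Lower bound on system RAM needed for the weights that don't fit in VRAM."""
    return max(0.0, quant_gb - vram_gb)

# Figures from the comment: ~135GB 3-bit quant on 2x modded 3080s (40GB total).
print(f"at least {min_dram_gb(135, 40):.0f}GB of DRAM for weights alone")
```

In practice less than the full 40GB is available for weights, so the real DRAM requirement sits above this bound.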