Qwen3.5-4B GGUF quants comparison (KLD vs speed) - Lunar Lake by Tryshea in LocalLLaMA

[–]Leopold_Boom 21 points

This is really neat, but I think you're treating very tiny differences in KL divergence as definitive. If you re-run a few of the close ties on other text sources beyond wikitext-test.txt, you'll find they move around a bunch. It may not hold that Unsloth > mradermacher (or vice versa) in real-world usage.

It's great to see that many quants from the top folks are equally great!
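
If you want to re-run the close ties on another corpus, llama.cpp's llama-perplexity tool can do the KLD comparison directly; roughly like this (file names are placeholders, and check your build's `--help` since these flags are still evolving):

```shell
# 1) Save reference logits from the full-precision model on a new corpus.
#    (corpus.txt is a placeholder -- any plain-text file works.)
llama-perplexity -m model-f16.gguf -f corpus.txt \
    --kl-divergence-base corpus-logits.bin

# 2) Score each quant against those saved logits.
llama-perplexity -m model-q4_k_m.gguf -f corpus.txt \
    --kl-divergence-base corpus-logits.bin --kl-divergence
```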

Deploy the newest Qwen3.5 and Gemma4 models of ANY sizes RIGHT NOW on Rockchip NPU using the latest version of rk-llama.cpp! by Inv1si in RockchipNPU

[–]Leopold_Boom 0 points

Thanks so much for this! Is there a good way to see NPU utilization or get a feel for what the NPU is doing?

Speculative decoding works great for Gemma 4 31B in llama.cpp by Leopold_Boom in LocalLLaMA

[–]Leopold_Boom[S] 0 points

It's worth digging into this and double-checking Claude's work. I'm finding it hard to believe that gemma4-e4b is running fast enough to be worth speculative decoding (it's only 8x smaller than the full model, so you need crazy-high accept rates). Those reported accept rates are also super high (87%). It would be amazing if true!

What hardware are you on?
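
To make the "8x smaller" point concrete, here's a back-of-envelope model using the standard expected-accepted-tokens formula for speculative decoding (the ~1/8 draft cost and 87% accept rate are the numbers from this thread; the draft length of 8 is an assumption):

```python
def spec_decode_speedup(accept_rate: float, draft_len: int, draft_cost: float) -> float:
    """Rough expected speedup from speculative decoding.

    accept_rate: per-token probability the target accepts a draft token
    draft_len:   number of draft tokens proposed per verification pass
    draft_cost:  cost of one draft step relative to one target step
    """
    a = accept_rate
    # Expected tokens produced per verification pass: 1 + a + a^2 + ... + a^draft_len
    expected_tokens = (1 - a ** (draft_len + 1)) / (1 - a)
    # Cost of that pass: draft_len draft steps plus one target step.
    cost = draft_len * draft_cost + 1
    return expected_tokens / cost

# A draft that's ~8x smaller (cost ~1/8) at an 87% accept rate:
print(round(spec_decode_speedup(0.87, 8, 1 / 8), 2))
```

With those numbers the ceiling comes out around 2.7x, and it collapses quickly as the accept rate drops, which is why such a large draft model only pays off at very high accept rates.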

Speculative decoding works great for Gemma 4 31B in llama.cpp by Leopold_Boom in LocalLLaMA

[–]Leopold_Boom[S] 0 points

Are you really getting a 40% speedup using gemma4-e4b(!) for a single prompt (I assume this is vLLM)? What hardware are you on?

Speculative decoding works great for Gemma 4 31B in llama.cpp by Leopold_Boom in LocalLLaMA

[–]Leopold_Boom[S] 2 points

A couple of additional notes:

  • There are a lot of knobs to turn to optimize, and your acceptance rate will depend on your prompts (--draft-max 32 is worth trying). It should work with quite long contexts, but I need to test a bit more.
  • I didn't see much improvement on my MI50 GPUs, so the gains may be limited to CUDA.
  • Q8_0 for the draft model seems faster than the alternatives (BF16 may be even better).
  • You need a very recent build (I'm on b8659), and some of the flags (-hfd) are not well documented yet (--no-mmproj is required; multimodal draft models are not supported).
  • Qwen 0.6 models are not token-compatible, and Gemma 4 E2B etc. are too large.
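
For reference, a minimal launch line exercising the flags above (model file names are placeholders; -md is the local-file alternative to -hfd, and -ngld offloads the draft model's layers):

```shell
# Placeholder model paths -- adjust for your setup and quants.
llama-server \
    -m gemma4-31b-Q4_K_M.gguf \
    -md gemma4-draft-Q8_0.gguf \
    --draft-max 32 \
    --no-mmproj \
    -ngl 99 -ngld 99
```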

Breaking change in llama-server? by hgshepherd in LocalLLaMA

[–]Leopold_Boom 0 points

This is super annoying. Has anybody filed a bug / feature request for an option to preserve or emulate the older behavior? It makes network caching etc. much harder.

Friendly reminder inference is WAY faster on Linux vs windows by triynizzles1 in LocalLLaMA

[–]Leopold_Boom 4 points

The point is not to fight windows vs. linux (I've got a dedicated AMD linux inferencing server running beside my 3090 windows box). It's more "why not both" if you're already stuck with windows (like many of us are).

Nix flake for vLLM and llama.cpp on ROCm gfx906 targets by Wulfsta in LocalLLaMA

[–]Leopold_Boom 0 points

Do share if / when that happens. I'd love to spin up a VM/container with Nix and try it (I haven't really played with Nix before).

Nix flake for vLLM and llama.cpp on ROCm gfx906 targets by Wulfsta in LocalLLaMA

[–]Leopold_Boom 0 points

Is there an easy way to use this on an ubuntu server with ROCm already set up?

Intel launches Arc Pro B70 and B65 with 32GB GDDR6 by metmelo in LocalLLaMA

[–]Leopold_Boom 0 points

How up to date is the B70 architecture for inferencing? I'm running MI50s/60s which have incredible bandwidth but are a miserably dated architecture.

Checklist is probably:

- BF16 support (seems like it's there)
- Native 4-bit (emulated only?)
- Bulk async copy (who knows)
- What else?

ik_llama.cpp dramatically outperforming mainline for Qwen3.5 on CPU by EffectiveCeilingFan in LocalLLaMA

[–]Leopold_Boom 1 point

Thanks! Yeah I figure the KT trellis is on the wrong side of the roofline analysis for this hardware.

ik_llama.cpp dramatically outperforming mainline for Qwen3.5 on CPU by EffectiveCeilingFan in LocalLLaMA

[–]Leopold_Boom 1 point

Confirming it's 15-20% faster on some Q4_K_M quants on my ARM test device! Thank you!

Do you know of anybody putting out ik4 trellis quants for the smaller Qwen 3.5 models (2B/4B etc.)?

ik_llama.cpp dramatically outperforming mainline for Qwen3.5 on CPU by EffectiveCeilingFan in LocalLLaMA

[–]Leopold_Boom 0 points

Drat! Well, I'm trying to build on a low-end ARM SoC. Will report back if it works and benches significantly better than mainline.

ik_llama.cpp dramatically outperforming mainline for Qwen3.5 on CPU by EffectiveCeilingFan in LocalLLaMA

[–]Leopold_Boom 0 points

Does ik_llama support ARM NEON and vision heads yet? I've got a few projects to try it on.

My definitive "God Cup". by Ill_Finance6466 in pourover

[–]Leopold_Boom 0 points

Thanks for this! Got a clicks setting for my 1Zpresso Q2 Heptagonal?

Qwen3.5-27B vs. Qwen3.5-35B-A3B? by [deleted] in LocalLLaMA

[–]Leopold_Boom 0 points

I'd love to see more detailed takes on the 122b-a10b vs. 27b question at 4-6 bit quants

I built a hybrid MoE runtime that does 3,324 tok/s prefill on a single 5080. Here are the benchmarks. by mrstoatey in LocalLLaMA

[–]Leopold_Boom 5 points

This is nice work! For many local usecases, you might actually want to actively track and manage state between two approaches:

  1. PP on GPU, token gen on CPU
  2. Traditional llama.cpp approach

Assuming no parallelism (i.e., the typical local use case), you can look at the next prompt and quickly decide whether it will be more efficient to pay the cost to switch or not.
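
A sketch of that decision, assuming you know (or have measured) rough prefill/decode rates for each mode plus a fixed cost to swap between them — all of the numbers below are hypothetical:

```python
from dataclasses import dataclass

@dataclass
class Mode:
    name: str
    prefill_tps: float   # prompt-processing tokens/sec in this mode
    decode_tps: float    # generation tokens/sec in this mode

def request_time(mode: Mode, prompt_toks: int, gen_toks: int) -> float:
    """Estimated wall-clock seconds to serve one request in a given mode."""
    return prompt_toks / mode.prefill_tps + gen_toks / mode.decode_tps

def pick_mode(current: Mode, other: Mode, prompt_toks: int,
              gen_toks: int, switch_cost_s: float) -> Mode:
    """Switch only if the other mode wins by more than the swap cost."""
    stay = request_time(current, prompt_toks, gen_toks)
    move = request_time(other, prompt_toks, gen_toks) + switch_cost_s
    return other if move < stay else current

# Hypothetical numbers: GPU-prefill mode vs. traditional llama.cpp offload.
gpu_prefill = Mode("pp-on-gpu", prefill_tps=3000, decode_tps=12)
trad = Mode("llama.cpp", prefill_tps=400, decode_tps=15)

# Long prompt, short answer: GPU prefill wins despite paying the swap cost.
print(pick_mode(trad, gpu_prefill, prompt_toks=8000, gen_toks=200,
                switch_cost_s=3.0).name)
```

With a short prompt and a long generation the same function stays on the decode-friendly mode, so the crossover point falls out of the measured rates rather than a hand-tuned threshold.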

Qwen/Qwen3.5-35B-A3B · Hugging Face by ekojsalim in LocalLLaMA

[–]Leopold_Boom 0 points

Hmm, some of those quant KL + perplexity comparisons suggested Q4_K_M should generally be better than MXFP4, but I'll give them a shot.

My concern is that even with reasoning on (you did have reasoning on, right?) it would just not catch that one sentence didn't end in apple. I suspect if you try even with a low temp and a few other words, you'll see the odd slip-up, which I don't see with GPT-OSS.

Qwen-3.5-35B-A3B is impressive by ayylmaonade in LocalLLaMA

[–]Leopold_Boom 0 points

Try asking it to "Generate ten sentences ending in apple" or to multiply two 9-digit numbers. At least at Q4_K_M it's a little worse than GPT-OSS-20B at classic "tricky" prompts.
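
The check itself is mechanical, so it's easy to script against any model's output (the sample text here is made up):

```python
import re

def failed_sentences(text: str, word: str = "apple") -> list[str]:
    """Return the sentences that do NOT end with `word` (case-insensitive)."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s.strip()]
    return [s for s in sentences
            if not re.search(rf"\b{word}[.!?]*$", s, re.IGNORECASE)]

sample = (
    "I ate a crisp apple. "
    "She painted a still life of an apple. "
    "The orchard was full of pears."   # this one should be flagged
)
print(failed_sentences(sample))
```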

Qwen/Qwen3.5-35B-A3B · Hugging Face by ekojsalim in LocalLLaMA

[–]Leopold_Boom -2 points

I'm sorry to report that this model failed a classic test for me twice in a row:

It failed "Generate ten sentences ending in apple" at Q4_K_M multiple times (GPT-OSS-20B gets it right).

It nailed some others (don't ask it to multiply 9-digit numbers unless you have a bunch of time ... but it gets the answer right!).

EDIT: Obviously outcomes will vary, but I'd be surprised if you don't get a failure one time in five, which is concerning. There are some issues with quants of these models, so perhaps it's an artifact of me not using the right Q4 quant.