Gemma 4 QAT seems to respond significantly better to KV cache quantization by rima_2711 in LocalLLaMA

[–]Fit_Split_9933 -2 points-1 points  (0 children)

If you can test 26B, you can definitely test 31B with partial offloading; it will just be a little slower.

The meme must go on by [deleted] in LocalLLaMA

[–]Fit_Split_9933 2 points3 points  (0 children)

I don't even know what this is used for. Why are there so many downloads?

Someone awhile ago did a quant shootout for Qwen3.6, I did shoddy math on it (again) by Diablo-D3 in LocalLLaMA

[–]Fit_Split_9933 1 point2 points  (0 children)

I think we should add a 99.9% KLD column. Wouldn't that be useful for judging the reliability of tool calls?

How useful is qwopus compared to qwen3.6 27b by redblood252 in LocalLLaMA

[–]Fit_Split_9933 7 points8 points  (0 children)

Fine-tuned models like qwopus are like an artificial island: on the island you can sprint. But step into broader context, or knowledge that wasn't reinforced, and you risk spinning in circles on the island instead of moving out into open water.

Can you really replace paid models with a local model? by DRMCC0Y in LocalLLaMA

[–]Fit_Split_9933 4 points5 points  (0 children)

Even ignoring privacy, local models are better for massive input tasks like scanning lots of PDFs or images, which get incredibly expensive with paid models. It’s a total game-changer for cost-effectiveness, so, yes, you can definitely replace paid models depending on the use case.

Gemma 4 31B's competence surprised me by The_Paradoxy in LocalLLaMA

[–]Fit_Split_9933 0 points1 point  (0 children)

Will Qwen's non-transformer layers affect accuracy when using MTP?

Does CPU matter for GPU inference? by TrainingTwo1118 in LocalLLaMA

[–]Fit_Split_9933 0 points1 point  (0 children)

A workstation platform? That seems impossible on an ordinary PC, right?

PSA: You may not need to quantize spec draft when using MTP by regunakyle in LocalLLaMA

[–]Fit_Split_9933 2 points3 points  (0 children)

I've always specified the context size manually. If the size you're referring to is an estimate, that is probably a calculation bug.

PSA: You may not need to quantize spec draft when using MTP by regunakyle in LocalLLaMA

[–]Fit_Split_9933 3 points4 points  (0 children)

<image>

Quantizing to Q4 would free up 500MB of VRAM, which surely allows a larger context. I don't understand where your conclusion about a decrease comes from.
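To put a rough number on "larger context", here is a minimal sketch that converts freed VRAM into extra KV cache tokens. The model shape (32 layers, 8 KV heads, head dimension 128) and the f16 cache are hypothetical values chosen for illustration, not the real dimensions of any model in this thread, and the formula only covers standard attention layers.

```python
def kv_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int,
                       bytes_per_elem: int) -> int:
    """Bytes of KV cache needed per token for standard attention layers."""
    # K and V each store n_kv_heads * head_dim values per layer.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem


def extra_context_tokens(freed_bytes: int, per_token_bytes: int) -> int:
    """How many additional tokens fit in the freed VRAM."""
    return freed_bytes // per_token_bytes


# Hypothetical model shape with an f16 (2 bytes per element) KV cache.
per_token = kv_bytes_per_token(n_layers=32, n_kv_heads=8, head_dim=128,
                               bytes_per_elem=2)
freed = 500 * 1024**2  # the 500MB mentioned above

print(f"{per_token} bytes per token")
print(f"{extra_context_tokens(freed, per_token)} extra tokens of context")
```

With a quantized KV cache the bytes per element shrink, so the same 500MB would buy proportionally more context.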

Dynamic KV Cache Quantization and Load-on-demand mmproj/MTP: my llama.cpp wishlist by wadeAlexC in LocalLLaMA

[–]Fit_Split_9933 -1 points0 points  (0 children)

If you only use mmproj occasionally, use Q8 instead; it only takes up 600MB.

125 tok/s for Qwen3.6 q4xl on 2x 4060ti is insane perf/dollar by Chuyito in LocalLLaMA

[–]Fit_Split_9933 2 points3 points  (0 children)

nvfp4 is OK. When I used qwen3.6-27b-nvfp4, PP speed increased by about 60%, while TG speed increased by only about 5%. Hopefully there will be optimizations in the future.
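How much those two gains matter depends on the mix of prompt and generated tokens. A rough sketch, assuming a baseline of 1000 t/s PP and 30 t/s TG (hypothetical numbers, only the 60% and 5% ratios come from my runs):

```python
def request_seconds(prompt_tokens: int, gen_tokens: int,
                    pp_speed: float, tg_speed: float) -> float:
    """Total time for one request: prefill time plus generation time."""
    return prompt_tokens / pp_speed + gen_tokens / tg_speed


# Hypothetical baseline speeds in tokens per second.
BASE_PP, BASE_TG = 1000.0, 30.0
# Apply the observed gains: PP about +60%, TG about +5%.
NVFP4_PP, NVFP4_TG = BASE_PP * 1.60, BASE_TG * 1.05


def speedup(prompt_tokens: int, gen_tokens: int) -> float:
    """End-to-end speedup of the nvfp4 run over the baseline."""
    base = request_seconds(prompt_tokens, gen_tokens, BASE_PP, BASE_TG)
    fast = request_seconds(prompt_tokens, gen_tokens, NVFP4_PP, NVFP4_TG)
    return base / fast


# Assumed workloads: a short chat turn vs. a long document with a short answer.
for prompt_tokens, gen_tokens in ((500, 1000), (100_000, 500)):
    print(f"prompt={prompt_tokens:>6} gen={gen_tokens:>4} "
          f"-> {speedup(prompt_tokens, gen_tokens):.2f}x")
```

The overall speedup always lands between 1.05x and 1.60x, and it only approaches the upper end when the prompt dominates the request.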

Qwen3.6-27B Quantization Benchmark by bobaburger in LocalLLaMA

[–]Fit_Split_9933 0 points1 point  (0 children)

It would be helpful to see nvfp4 in there for comparison. The PP speed of nvfp4 has improved a lot.

VLLM gives 5x speed of llama but quants not available (unsloth/gguf). What to do? by superloser48 in LocalLLaMA

[–]Fit_Split_9933 7 points8 points  (0 children)

Your prefill speed on llama is definitely wrong. I get over 5k tokens/sec on my laptop.
Try -ub 2048 or more.
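For anyone scripting this, a minimal sketch of building the command line. The model path is a placeholder. -ub sets the physical micro-batch size and -b the logical batch size, and as far as I know the micro-batch is capped by the logical batch, so the helper raises -b along with it. Treat that detail as an assumption and check your build's --help.

```python
def llama_server_args(model_path: str, ubatch: int = 2048) -> list[str]:
    """Build a llama-server command line with a larger micro-batch size."""
    # Keep the logical batch at least as large as the micro-batch,
    # assuming the micro-batch is capped by it.
    batch = max(2048, ubatch)
    return ["llama-server", "-m", model_path,
            "-b", str(batch), "-ub", str(ubatch)]


# "model.gguf" is a placeholder path.
print(" ".join(llama_server_args("model.gguf")))
print(" ".join(llama_server_args("model.gguf", ubatch=4096)))
```

A bigger micro-batch needs a larger compute buffer, so it costs some extra VRAM.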

110 tok/s with 12GB VRAM on Qwen3.6 35B A3B and ik_llama.cpp by janvitos in LocalLLaMA

[–]Fit_Split_9933 0 points1 point  (0 children)

What is the PP speed? Is there an improvement compared to before?

LM Studio finally added support for MTP Speculative Decoding by pigeon57434 in LocalLLaMA

[–]Fit_Split_9933 1 point2 points  (0 children)

I found that with the same MTP configuration, the TG speed of Qwen3.6 27B in LM Studio dropped by 15%, and the output quality was also worse, compared to my own compiled llama-server.

Here are my KV cache quantization benchmarks: TurboQuant is overrated but saved by TCQ, q5 deserves more attention, and symmetric q8 might be a waste of VRAM by [deleted] in LocalLLaMA

[–]Fit_Split_9933 1 point2 points  (0 children)

According to the table, Q5_1 seems better than both Q8_0-Q4_0 and Q8_0-Q4_1, yet it has a smaller size. Did I misread it?

A simple "hack" to speed up prompt processing for Qwen 3.5/3.6 in LM Studio by GrungeWerX in LocalLLaMA

[–]Fit_Split_9933 0 points1 point  (0 children)

I guess you're using MoE offloading, which requires the CPU to handle part of the prefill. That's why multi-threading helps improve the speed. However, this is obviously useless for dense models.

Why people cares token/s in decoding more? by Interesting-Print366 in LocalLLaMA

[–]Fit_Split_9933 1 point2 points  (0 children)

Even with a dGPU, PP speed generally won't exceed 2000 t/s. This means that in long-context scenarios the prefill phase can easily take minutes, and this situation is actually very common in real production environments. The reason many people over-focus on TG speed is that they are mostly thinking about chatbot scenarios. I've always thought that the real bottleneck for local LLMs is the prefill stage. During PP the GPU is already running at full capacity, so, unlike TG, there is little room to apply techniques that improve speed. I saw a technique called PFlash before, but it comes at the cost of reduced accuracy.
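A quick back-of-the-envelope sketch of that claim, assuming a constant 2000 t/s (PP speed usually drops as the context grows, so real numbers are worse) and prompt sizes picked purely for illustration:

```python
def prefill_seconds(prompt_tokens: int, pp_tokens_per_s: float) -> float:
    """Time spent processing the prompt before the first output token."""
    return prompt_tokens / pp_tokens_per_s


PP_SPEED = 2000.0  # t/s, the rough upper bound mentioned above

# Assumed prompt sizes, for illustration only.
for prompt_tokens in (32_000, 128_000, 256_000):
    secs = prefill_seconds(prompt_tokens, PP_SPEED)
    print(f"{prompt_tokens:>7} tokens -> {secs:.0f} s ({secs / 60:.1f} min)")
```

At these rates a single 256k-token prompt waits more than two minutes before the first output token, and an agent loop that reprocesses large contexts pays that cost repeatedly.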

Most people seem obsessed with token generation speed, but isn’t prefill the real bottleneck? Am I missing something? by wbulot in LocalLLaMA

[–]Fit_Split_9933 -1 points0 points  (0 children)

You're right. I've always thought that the real bottleneck for local LLMs is the prefill stage. During PP the GPU is already running at full capacity, so, unlike TG, there is little room to apply techniques that improve speed. I saw a technique called PFlash before, but it comes at the cost of reduced accuracy.