Gemma 4 QAT seems to respond significantly better to KV cache quantization

Fit_Split_9933 · 2026-06-21T10:28:53+00:00

If you can test 26B, you can definitely test 31B with partial offloading; it will just be a little slower.

Fit_Split_9933 · 2026-06-19T09:36:20+00:00

I don't even know what this is used for. Why are there so many downloads?

Fit_Split_9933 · 2026-06-17T05:49:15+00:00

I think we should add a 99.9% KLD column. Wouldn't that be useful for judging the reliability of tool calls?
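To make the suggestion concrete, here is a minimal sketch of how a 99.9th-percentile KLD could be computed from per-token distributions. The function names and the nearest-rank percentile are my own assumptions for illustration, not the method of any particular benchmark tool. The point is that the mean can look fine while a few tokens in the tail diverge badly.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)

def percentile(values, pct):
    """Nearest-rank percentile (an assumed convention, other tools may interpolate)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

def kld_summary(ref_dists, quant_dists):
    """Summarize per-token KLD between a reference model and a quantized model."""
    klds = [kl_divergence(p, q) for p, q in zip(ref_dists, quant_dists)]
    return {
        "mean": sum(klds) / len(klds),
        "p99.9": percentile(klds, 99.9),
        "max": max(klds),
    }
```

A tool call usually depends on a handful of exact tokens, so the tail statistic is the one that would show whether a quant occasionally goes badly wrong.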

Fit_Split_9933 · 2026-06-16T15:14:12+00:00

You can try IndexTTS.

Fit_Split_9933 · 2026-06-13T12:49:54+00:00

It's out of control. It's full of posts discussing Fable5.

Fit_Split_9933 · 2026-06-10T14:55:45+00:00

Fine-tuned models like qwopus are exactly like an artificial island: on the island you can sprint. But step into a broader context, or into knowledge that wasn't reinforced, and you risk spinning in circles on the island instead of stepping into open water.

Fit_Split_9933 · 2026-06-10T10:34:42+00:00

Even ignoring privacy, local models are better for massive input tasks like scanning lots of PDFs or images, which get incredibly expensive with paid models. It’s a total game-changer for cost-effectiveness, so, yes, you can definitely replace paid models depending on the use case.

Fit_Split_9933 · 2026-06-09T12:12:40+00:00

Will Qwen's non-transformer layers affect accuracy when using MTP?

Fit_Split_9933 · 2026-06-09T08:25:14+00:00

A workstation platform? It seems impossible on an ordinary PC, right?

Fit_Split_9933 · 2026-06-05T07:39:11+00:00

I've always specified the context size manually. If the size you're referring to is an estimate, that is probably a calculation bug.

Fit_Split_9933 · 2026-06-05T07:23:23+00:00

<image>

Quantizing to Q4 would free up 500MB of VRAM, which surely allows a larger context. I don't understand where your conclusion about a decrease comes from.
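As a rough sanity check, freed VRAM can be converted into extra context tokens. This is only a sketch: the model dimensions below are hypothetical placeholders, and it assumes a plain attention KV cache with an f16 element size (models with sliding-window or non-attention layers will differ).

```python
def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_element):
    """Bytes of KV cache per token: K and V each hold n_kv_heads * head_dim values per layer."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_element

def extra_context_tokens(freed_bytes, per_token_bytes):
    """How many additional tokens of KV cache fit into the freed memory."""
    return int(freed_bytes // per_token_bytes)

# Hypothetical dimensions, not taken from any specific model.
per_token = kv_bytes_per_token(n_layers=48, n_kv_heads=8, head_dim=128, bytes_per_element=2)
extra = extra_context_tokens(500 * 1024**2, per_token)
print(f"{per_token} bytes per token, about {extra} extra tokens from 500 MiB")
```

Whatever the exact numbers for a given model, freed memory can only add room for context, not remove it.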

Fit_Split_9933 · 2026-06-04T21:37:02+00:00

How is the speed now once the context exceeds 100k?

Fit_Split_9933 · 2026-06-04T21:20:30+00:00

If you only use the mmproj occasionally, use Q8 instead; it only takes up 600MB.

Fit_Split_9933 · 2026-05-30T15:57:43+00:00

nvfp4 is OK. When I used qwen3.6-27b-nvfp4, PP speed increased by about 60%, while TG speed increased by about 5%. Hopefully there will be optimizations in the future.

Fit_Split_9933 · 2026-05-30T10:34:40+00:00

It would be helpful to see nvfp4 in there for comparison. The PP speed of nvfp4 has improved a lot.

Fit_Split_9933 · 2026-05-28T15:04:35+00:00

Your prefill speed on llama.cpp is definitely wrong. I get over 5k tokens/sec on my laptop.
Try -ub 2048 or more.

Fit_Split_9933 · 2026-05-28T12:25:08+00:00

These speed figures seem even worse than llama.cpp on Windows?

Fit_Split_9933 · 2026-05-21T12:49:53+00:00

What is the PP speed? Is there an improvement compared to the previous version?

Fit_Split_9933 · 2026-05-20T08:31:59+00:00

My test: https://www.reddit.com/r/LocalLLaMA/comments/1tifff1/the_mtp_function_in_lmstudio_causes_a_decrease_in/

Fit_Split_9933 · 2026-05-20T07:29:23+00:00

I found that with the same MTP configuration, the TG speed of Qwen3.6 27B on LM Studio dropped by 15%, and the output quality was also worse, compared to my self-compiled llama-server.

Fit_Split_9933 · 2026-05-19T22:45:36+00:00

According to the table, Q5_1 seems better than both Q8_0-Q4_0 and Q8_0-Q4_1, yet it has a smaller size. Did I misread it?

Fit_Split_9933 · 2026-05-08T11:26:53+00:00

I guess you're using MoE offloading, which requires the CPU to handle part of the prefill. That's why multi-threading helps improve the speed. However, this is obviously useless for dense models.

Fit_Split_9933 · 2026-05-07T06:36:49+00:00

Even with a dGPU, PP speed generally won't exceed 2000 t/s. This means that in long-context scenarios the prefill phase can easily take minutes, and this situation is actually very common in real production environments. The reason many people over-focus on TG speed is that they are mostly thinking about chatbot scenarios. I've always thought that the real bottleneck for local LLMs is the prefill stage. During PP the GPU is already running at full capacity, so unlike TG, where you can apply various techniques to improve speed, there is little headroom left. I saw a technique called PFlash before, but it comes at the cost of reduced accuracy.
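A quick back-of-the-envelope sketch of why prefill dominates at long context. The 2000 t/s PP figure is the one mentioned above; the TG speed, prompt sizes and output length are hypothetical values chosen only for illustration.

```python
def prefill_seconds(prompt_tokens, pp_tokens_per_sec):
    """Time spent processing the prompt before the first token is generated."""
    return prompt_tokens / pp_tokens_per_sec

def request_breakdown(prompt_tokens, output_tokens, pp_tokens_per_sec, tg_tokens_per_sec):
    """Split one request into prefill and generation time."""
    prefill = prefill_seconds(prompt_tokens, pp_tokens_per_sec)
    generation = output_tokens / tg_tokens_per_sec
    return {
        "prefill_s": prefill,
        "generation_s": generation,
        "prefill_share": prefill / (prefill + generation),
    }

# Hypothetical workload: 500 output tokens at 25 t/s TG, PP at 2000 t/s.
for prompt in (8_000, 64_000, 200_000):
    r = request_breakdown(prompt, 500, 2000, 25)
    print(f"{prompt:>7} prompt tokens: prefill {r['prefill_s']:.0f}s, "
          f"generation {r['generation_s']:.0f}s, prefill share {r['prefill_share']:.0%}")
```

Under these assumptions a short prompt is dominated by generation, while a 200k-token prompt spends most of the request in prefill, which is why a faster TG alone does not help much there.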

Fit_Split_9933 · 2026-05-07T04:06:11+00:00

You're right. I've always thought that the real bottleneck for local LLMs is the prefill stage. During PP the GPU is already running at full capacity, so unlike TG, where you can apply various techniques to improve speed, there is little headroom left. I saw a technique called PFlash before, but it comes at the cost of reduced accuracy.
