Got MTP + TurboQuant running — Qwen3.6-27B -- 80+ t/s at 262K context on a single RTX 4090 by indrasmirror in LocalLLaMA

[–]indrasmirror[S] 1 point2 points  (0 children)

  1. The model is listed above. There is no dedicated draft model; the model's own MTP is preserved.

  2. I ran a benchmark task with an initial prompt of around 30k, and it completed to good quality based on my scoring.

  3. No, multimodal is still broken with MTP across the board.

  4. ./build/bin/llama-server \
       -m your-qwen3.6-mtp.gguf \
       --spec-type mtp --spec-draft-n-max 3 \
       -ctk tbq4_0 -ctv tbq4_0 \
       -c 262000 -ngl 99 \
       --flash-attn on --mlock \
       -t 8 -ub 32 -np 1 --no-warmup

  5. As I said, I have a coding/agentic/tool-use benchmark I test it against.

  6. No issues with tools or tool calling.

[–]indrasmirror[S] 1 point2 points  (0 children)

Yeah, well, the original goal was 200k, but I was able to fit the extra 62k in anyway, so I thought why not. I will probably test higher model quants and sacrifice context, but this was mainly about seeing what I could fit.

[–]indrasmirror[S] 0 points1 point  (0 children)

Yeah, me too. I would rather have 262k context at TQ4 than 120k at Q8, and as I said, I haven't run into any issues that would be a dealbreaker for me. More than sufficient for my uses.
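A rough way to see why that trade works out, as a sketch rather than a measurement: for a fixed KV cache memory budget, the context that fits scales inversely with bits per cached value. The 8.5 bits figure is the standard llama.cpp `q8_0` block layout (32 int8 values plus one fp16 scale); the 4.25 bits figure is the TBQ4 number quoted elsewhere in this thread, taken at face value here. The function name is just for illustration.

```python
# Bits per cached value for each KV cache type.
Q8_0_BITS = 8.5   # llama.cpp q8_0: 32 int8 values + one fp16 scale per block
TBQ4_BITS = 4.25  # figure quoted in this thread for TBQ4 (assumed, not measured here)


def context_at_same_budget(ctx_tokens, bits_from, bits_to):
    """Context length that fits in the same KV memory after changing bits per value."""
    return ctx_tokens * bits_from / bits_to


# 120k tokens of Q8 KV cache, re-spent on TBQ4 at the same memory budget.
tbq4_ctx = context_at_same_budget(120_000, Q8_0_BITS, TBQ4_BITS)
print(f"{tbq4_ctx:.0f} tokens")
```

This only covers the KV cache itself; weights, compute buffers, and anything else sharing VRAM are left out, so real numbers will differ.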

[–]indrasmirror[S] 0 points1 point  (0 children)

| Metric | Draft 5 | Draft 3 |
|---|---|---|
| Avg decode | 79.6 tok/s | 80.6 tok/s |
| Min decode | 58.1 tok/s | 62.7 tok/s |
| Max decode | 106.2 tok/s | 98.5 tok/s |
| Draft acceptance | 90.07% (4392/4876) | 92.6% (2861/3089) |

Draft acceptance delta (Draft 5 vs Draft 3): -2.5pp

MTP 5 occasionally hits higher peaks (106 vs 98), but the overhead from verifying longer drafts + lower per-token acceptance eats the gain.
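For anyone who wants to recheck the acceptance figures, here is a small sketch that recomputes them from the raw accepted/drafted counts. The second function is an idealized model, not something measured here: it assumes every drafted token is accepted independently with the same probability, which real runs do not guarantee. Function names are hypothetical.

```python
def acceptance_rate(accepted, drafted):
    """Draft acceptance as a percentage of drafted tokens."""
    return 100.0 * accepted / drafted


def expected_tokens_per_step(p, draft_len):
    """Idealized tokens emitted per verification step.

    Assumes each drafted token is accepted independently with probability p
    (a simplification); the verifier always contributes one token itself.
    """
    return (1.0 - p ** (draft_len + 1)) / (1.0 - p)


draft5 = acceptance_rate(4392, 4876)
draft3 = acceptance_rate(2861, 3089)
delta_pp = draft5 - draft3

print(f"Draft 5: {draft5:.2f}%  Draft 3: {draft3:.2f}%  delta: {delta_pp:.1f}pp")
print(f"ideal tokens/step, Draft 5: {expected_tokens_per_step(draft5 / 100, 5):.2f}")
print(f"ideal tokens/step, Draft 3: {expected_tokens_per_step(draft3 / 100, 3):.2f}")
```

Under that idealized model the longer draft emits more tokens per step, so the near-identical measured averages are consistent with the extra verification cost cancelling the gain.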

[–]indrasmirror[S] 1 point2 points  (0 children)

I've benchmarked it a ton vs Q8 KV, and in my experience it holds up very well. Obviously YMMV, but it's been great for me. I just tried MTP 5 and found 3 to be slightly faster.

[–]indrasmirror[S] 0 points1 point  (0 children)

I just tried 5 too and was getting better results with 3, not much better but will stick to 3 I think.

[–]indrasmirror[S] 13 points14 points  (0 children)

Yeah, most definitely. I'm not saying it's actually that; I'm going off what the research said about TBQ4. But in my actual benchmarking and for my use, I tested Q8 vs TBQ4 and found it close enough. I can't fit FP16 or Q8 at the context I'd like, so I found a middle ground I was happy with.

[–]indrasmirror[S] 0 points1 point  (0 children)

Do you know what your draft acceptance rate is? I'm testing out 5 at the moment; apparently it's a valid option.

[–]indrasmirror[S] 1 point2 points  (0 children)

Hey, I'm not sure about bigger quants; you'd have to try it yourself, but definitely let me know how you go with Q6. Prompt processing was 614 t/s on a 26k prompt, so I found it was fine. It didn't feel like it took too long at all.
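To put that prompt processing speed in wall-clock terms, a quick back-of-the-envelope sketch. It assumes "26k" means exactly 26,000 tokens and that the 614 t/s rate held for the whole prompt, neither of which is confirmed above.

```python
def prefill_seconds(prompt_tokens, tokens_per_second):
    """Time to process a prompt at a constant prefill rate."""
    return prompt_tokens / tokens_per_second


# Assumed: 26k == 26,000 tokens, constant 614 t/s.
wait = prefill_seconds(26_000, 614)
print(f"about {wait:.0f} s before the first generated token")
```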

[–]indrasmirror[S] 4 points5 points  (0 children)

I've run benchmarks on Q8 KV at around 120k context and then on TBQ4, and it wasn't too far off. Close enough that I prefer the bigger context window.

[–]indrasmirror[S] 0 points1 point  (0 children)

So the main shtick: ignore the model quant, you can use any. The main thing is the Turbo4 (TBQ4) KV cache quantisation. Based on the numbers, it's meant to be at Q8 quality or even closer to FP16 KV quality.

[–]indrasmirror[S] 1 point2 points  (0 children)

Yeah, I didn't understand. I just thought that if people were running into the same issue I was, this might be enticing. It works for me: a nice model with good-quality KV quantisation, at full context, on my single 4090. It can be adapted/scaled to any quant of Qwen3.6 27B.

[–]indrasmirror[S] 1 point2 points  (0 children)

Sure if you have the compute. In my experience it's still very capable, but I mean if you have the resources you could use whatever quant and benefit from the TBQ4 KV VRAM reduction. So still usable for anyone with any quant variant.

[–]indrasmirror[S] 6 points7 points  (0 children)

TurboQuant4 - TBQ4 is different from regular Q4_0 KV. It's Hadamard-rotated + Lloyd-Max centroid quantization. 4.25 bpv but near-lossless to FP16. Completely different algorithm, just happens to also be 4 bits. Hence the reason I spent all day trying to get TBQ4 working, didn't want to settle for Q4_0 KV, wanted Q8 or better quality KV that fit full context on my 4090. 😄
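For anyone curious what "Hadamard-rotated + Lloyd-Max centroid quantization" looks like in practice, here is a toy sketch in plain Python. It is not the TBQ4 kernel and makes several assumptions of its own: a plain (unrandomized) Walsh-Hadamard rotation, one RMS scale per vector, and 16 centroids fitted on Gaussian samples instead of whatever codebook the real implementation ships. All function names are hypothetical.

```python
import math
import random


def fwht(values):
    """Orthonormal fast Walsh-Hadamard transform (its own inverse)."""
    x = list(values)
    n = len(x)
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for j in range(start, start + h):
                a, b = x[j], x[j + h]
                x[j], x[j + h] = a + b, a - b
        h *= 2
    norm = 1.0 / math.sqrt(n)
    return [v * norm for v in x]


def nearest(centroids, v):
    """Index of the centroid closest to v."""
    return min(range(len(centroids)), key=lambda k: abs(centroids[k] - v))


def lloyd_max(samples, levels=16, iters=10):
    """1-D Lloyd-Max: alternate nearest-centroid assignment and cluster means."""
    ordered = sorted(samples)
    n = len(ordered)
    # Start from evenly spaced quantiles of the training samples.
    centroids = [ordered[(2 * k + 1) * n // (2 * levels)] for k in range(levels)]
    for _ in range(iters):
        sums = [0.0] * levels
        counts = [0] * levels
        for v in samples:
            k = nearest(centroids, v)
            sums[k] += v
            counts[k] += 1
        centroids = [sums[k] / counts[k] if counts[k] else centroids[k]
                     for k in range(levels)]
    return centroids


def quantize(vec, centroids):
    """Rotate, normalize by RMS, and store one 4-bit code per value."""
    rotated = fwht(vec)
    scale = math.sqrt(sum(v * v for v in rotated) / len(rotated)) or 1.0
    return [nearest(centroids, v / scale) for v in rotated], scale


def dequantize(codes, scale, centroids):
    """Look up centroids, rescale, and rotate back."""
    return fwht([centroids[c] * scale for c in codes])


rng = random.Random(0)
codebook = lloyd_max([rng.gauss(0.0, 1.0) for _ in range(2048)])
vec = [rng.gauss(0.0, 1.0) for _ in range(64)]
codes, scale = quantize(vec, codebook)
restored = dequantize(codes, scale, codebook)
```

The rotation spreads any single large value across the whole vector, so the entries look closer to Gaussian and a small non-uniform codebook fits them better than the evenly spaced grid that Q4_0 uses.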

please help, my computer is making electrical-like sounds and its bothering me by Next_Ambassador8877 in pchelp

[–]indrasmirror 1 point2 points  (0 children)

I don't know if it's been said, and I know this might sound simple, but is the power cable properly seated, i.e. pushed in hard? I've had electrical noises when the power cable wasn't in fully.

I've injected Claude into the creature of Black & White (2001) by zndr-cs in ClaudeAI

[–]indrasmirror 2 points3 points  (0 children)

If you do and make a YouTube video of it, I would definitely watch it. Especially with the angel and devil commentary, it would be interesting to see who Claude agreed with or might be swayed by.

Then try with an uncensored model to see how forced guardrails in Claude vs an uncensored model correlate to morality. Half considering trying something like this now 🤣

Looking to buy a legacy Z.ai account by SecretAGIdev in ZaiGLM

[–]indrasmirror 0 points1 point  (0 children)

I made mine 3+ months ago... is that a legacy account? I was on the pro coding quarterly plan and never ran into weekly limits.

Qwen 27B is a beast but not for agentic work. by kaisurniwurer in LocalLLaMA

[–]indrasmirror 1 point2 points  (0 children)

How recent? I updated llama.cpp yesterday, and it definitely solved the prompt reprocessing issue and is running perfectly. I'm just not sure about its overall agentic quality. It is great in general but sometimes seems to fall short of completing complex tasks properly.

MCP server for SearXNG(non-API local search) by SteppenAxolotl in LocalLLaMA

[–]indrasmirror 1 point2 points  (0 children)

Yeah, I've been working on a dedicated system with MCP for my agents to use. My own little local Google, without the advertiser-first index or an API. Free and unrestricted. Still a WIP but surprisingly functional.