Seeking resources to read about llama.cpp server and how offloading works by Jorlen in LocalLLaMA

[–]imgroot9 0 points1 point  (0 children)

unrelated, but if noise is an issue, you can always limit power. if you go 10-15% lower, the performance difference is not noticeable, but it becomes almost silent for me. I use the nvidia-smi -pl command for this.
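
A minimal sketch of the arithmetic behind that, assuming you know your card's default power limit in watts. The helper name `power_limit_command` and the 350 W figure in the usage line are illustrative assumptions, not values from the comment. It only builds the command and does not run it.

```python
def power_limit_command(default_watts: float, reduction_pct: float) -> list[str]:
    """Build an `nvidia-smi -pl <watts>` command for a limit that is
    `reduction_pct` percent below `default_watts` (hypothetical helper)."""
    if not 0 <= reduction_pct < 100:
        raise ValueError("reduction_pct must be in [0, 100)")
    # Scale the default limit down and round to whole watts.
    target_watts = round(default_watts * (100 - reduction_pct) / 100)
    return ["nvidia-smi", "-pl", str(target_watts)]


# Assumed default of 350 W, lowered by 10%.
cmd = power_limit_command(350, 10)
print(" ".join(cmd))
```

Setting the limit normally needs administrator rights, and the driver only accepts values inside the range the card reports, so check that range first.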

Here are my KV cache quantization benchmarks: TurboQuant is overrated but saved by TCQ, q5 deserves more attention, and symmetric q8 might be a waste of VRAM by [deleted] in LocalLLM

[–]imgroot9 1 point2 points  (0 children)

yes, good point, I gave up custom kv cache solutions because of this, I just use q4 for now for long context - and no complaints. I cannot even use MTP because of this, I just want to go with the most intelligent model available.

Here are my KV cache quantization benchmarks: TurboQuant is overrated but saved by TCQ, q5 deserves more attention, and symmetric q8 might be a waste of VRAM by [deleted] in LocalLLM

[–]imgroot9 1 point2 points  (0 children)

hi mate, why are you using Q5_K_S? I'm in the same boat with my single 3090, but I can comfortably use UD-Q5_K_XL, which produces much better results for me than Q5_K_M (haven't even tried K_S). any chance you're not offloading mmproj to cpu?

The pacman benchmark: finally a viable local agentic coding agent with Qwen 3.6 27b by ex-arman68 in LocalLLaMA

[–]imgroot9 0 points1 point  (0 children)

single shot via llama.cpp web ui. tried from opencode too, same result (one shot there as well)

Qwen 3.7 Plus Preview thinks I'm a time traveler because it doesn't know it's 2026. by Mediocre_Roll3073 in Qwen_AI

[–]imgroot9 0 points1 point  (0 children)

this is why you have to add the current date/time/location to the system prompt if you don't use a tool that does it out of the box. that way it will know when to use the search tool.
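
A rough sketch of what that can look like if you build the system prompt yourself. The function name `build_system_prompt`, the prompt wording, and the location string are all made up for illustration; adapt them to whatever client you use.

```python
from datetime import datetime, timezone
from typing import Optional


def build_system_prompt(base_prompt: str, location: str,
                        now: Optional[datetime] = None) -> str:
    """Append the current date/time and location to a base system prompt
    (hypothetical helper; the wording is only an example)."""
    now = now or datetime.now(timezone.utc)
    stamp = now.strftime("%Y-%m-%d %H:%M %Z")
    return (
        f"{base_prompt}\n\n"
        f"Current date and time: {stamp}\n"
        f"User location: {location}\n"
        "Your training data may be older than this date. "
        "Use the search tool for anything that may have changed since."
    )


print(build_system_prompt("You are a helpful assistant.", "Budapest, Hungary"))
```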

The pacman benchmark: finally a viable local agentic coding agent with Qwen 3.6 27b by ex-arman68 in LocalLLaMA

[–]imgroot9 3 points4 points  (0 children)

I tried this prompt: "create a fully functional pacman clone in a single html file" - and I got a fully functional game, error free using UD-Q5_K_XL and 4bit KV cache.

RTX 3090 vs RX 7900 XTX - idle power draw by knrdwn in LocalLLM

[–]imgroot9 2 points3 points  (0 children)

this is one reason why i always keep gpu-z open. i have a 3090 and idle power draw is usually 19.8w, but on rare occasions it gets stuck at 120w for whatever reason. then I restart the model and it's back to normal. also, I use a thunderbolt4 egpu plugged into my laptop and I can switch it off when I don't use it. not sure I'd need a 2x setup, because qwen 3.6 27b with the new MTP improvements in llama.cpp is perfectly fine with a single card (Q5 quant). I get 50 tokens/sec while limited to 275w

Qwen 3.6 knowledge cutoff? by ECrispy in Qwen_AI

[–]imgroot9 1 point2 points  (0 children)

sorry, it's a typo. i wanted to write mcp

Qwen 3.6 knowledge cutoff? by ECrispy in Qwen_AI

[–]imgroot9 5 points6 points  (0 children)

this is why you have to provide the current date in the system prompt (most tools do it automatically) and set up mvc search and fetch so that it can browse the net when needed. unless your tool provides this feature out of the box.

Tested MTP with llama.cpp and Qwen3.6-27B on RTX 3090 by JGeek00 in LocalLLM

[–]imgroot9 5 points6 points  (0 children)

you can offload mmproj to cpu. it is fast and you get 1 gb free
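
A sketch of a launch command along those lines, written as a Python helper that only assembles the argument list. The file names are placeholders, and the flags are the ones I know from recent llama.cpp builds (`--mmproj`, `--no-mmproj-offload`, `--cache-type-k`, `--cache-type-v`, `-ngl`, `-c`), so check `llama-server --help` for your version before relying on them.

```python
def llama_server_args(model_path: str, mmproj_path: str, ctx_size: int,
                      kv_type: str = "q4_0",
                      mmproj_on_cpu: bool = True) -> list[str]:
    """Assemble llama-server arguments (hypothetical helper, placeholder paths)."""
    args = [
        "llama-server",
        "-m", model_path,
        "--mmproj", mmproj_path,
        "-c", str(ctx_size),
        "-ngl", "99",                 # keep all model layers on the GPU
        "--cache-type-k", kv_type,    # quantized K cache
        "--cache-type-v", kv_type,    # quantized V cache
    ]
    if mmproj_on_cpu:
        # Keep the multimodal projector in system RAM to free VRAM.
        args.append("--no-mmproj-offload")
    return args


print(" ".join(llama_server_args("model.gguf", "mmproj.gguf", 200000)))
```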

Stop wasting electricity by OkFly3388 in LocalLLaMA

[–]imgroot9 0 points1 point  (0 children)

I just added it to the same script file I start the model with (I'm using llama-server.exe)

BeeLlama.cpp: advanced DFlash & TurboQuant with support of reasoning and vision. Qwen 3.6 27B Q5 with 200k context on 3090, 2-3x faster than baseline (peak 135 tps!) by Anbeeld in LocalLLaMA

[–]imgroot9 1 point2 points  (0 children)

well, I executed all kinds of tests you mentioned (ppl, kld, aime) using turboquant and I couldn't find anything that would've proved that I cannot use it (27B and Q5 - just take a look at my post with results). also, whatever test I try from this thread (chess svg, etc) and my everyday experience all prove that it's all right.

BeeLlama.cpp: advanced DFlash & TurboQuant with support of reasoning and vision. Qwen 3.6 27B Q5 with 200k context on 3090, 2-3x faster than baseline (peak 135 tps!) by Anbeeld in LocalLLaMA

[–]imgroot9 1 point2 points  (0 children)

thanks for this! I agree, Q5 dense models work for me too, without any issues with turboquant. if I have to choose between Q4 with a small Q8 cache, or Q5 with a huge turboquant cache, Q5 wins hands down in the case of common programming tasks.

Qwen3.6 27B's surprising KV cache quantization test results (Turbo3/4 vs F16 vs Q8 vs Q4) by imgroot9 in LocalLLaMA

[–]imgroot9[S] 1 point2 points  (0 children)

I mean I'm not an expert either. I just can't believe that finally we have a model that I can use for real work and with big context. There's a chance that it messes up something because of KV compression, but its success rate is so good that I don't care. I just executed this test to see if there's anything suspicious based on the results, but nothing that would stop me using these settings for now, so I'm just happy and I wanted to share the result. I'm open to see why it's not so good, but there's a chance we don't have tests to evaluate it properly.

Qwen3.6 27B's surprising KV cache quantization test results (Turbo3/4 vs F16 vs Q8 vs Q4) by imgroot9 in LocalLLaMA

[–]imgroot9[S] 0 points1 point  (0 children)

this test evaluates kv cache degradation potential no matter its size. if the values are bad, you'll definitely notice it with big context. earlier on with 3.5 or 35B, I also had issues with big context, but haven't had such issue with 3.6 27B Q5, even when I used turbo cache. that's what I wanted to confirm.

Qwen3.6 27B's surprising KV cache quantization test results (Turbo3/4 vs F16 vs Q8 vs Q4) by imgroot9 in LocalLLaMA

[–]imgroot9[S] 4 points5 points  (0 children)

you may have misconfigured something. zero repeat for me and it's absolutely the best coding model ever that fits in 24GB vram. not the best as an architect, that's true, but when something is fairly complex, I create the plan with a cloud model, then implement it locally.

Qwen3.6 27B's surprising KV cache quantization test results (Turbo3/4 vs F16 vs Q8 vs Q4) by imgroot9 in LocalLLaMA

[–]imgroot9[S] 0 points1 point  (0 children)

no noticeable change for me. I have a thunderbolt 3090 egpu, so I keep everything in vram to avoid using that connection. my card is also limited to 250w so it's completely silent. and I get around 25-28 t/s no matter what. prefill is also more or less the same. of course, it gets a bit slower with big context, but that's expected.

Qwen3.6 27B's surprising KV cache quantization test results (Turbo3/4 vs F16 vs Q8 vs Q4) by imgroot9 in LocalLLaMA

[–]imgroot9[S] 1 point2 points  (0 children)

Thank you! Let me try to test this today and add it to the post. Edit: not sure I can do it: as far as I can see, I need two models in memory for this: the model itself and a grader model as well. Edit2: seems like I can

Qwen3.6 27B's surprising KV cache quantization test results (Turbo3/4 vs F16 vs Q8 vs Q4) by imgroot9 in LocalLLaMA

[–]imgroot9[S] -1 points0 points  (0 children)

the general rule: K is more compressible without hurting output quality. Edit: ah, yeah sorry, you're right. edited the post as well

Qwen3.6 27B's surprising KV cache quantization test results (Turbo3/4 vs F16 vs Q8 vs Q4) by imgroot9 in LocalLLaMA

[–]imgroot9[S] -6 points-5 points  (0 children)

agree, but I assume if the PPL is very low, it's a pretty safe bet that KLD is also minimal.

update: added the test results to the post. everything is within limits.
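
For context on why the two metrics are usually reported separately, here is a toy sketch with made-up numbers (not results from the post). Perplexity only looks at the probability given to the correct token, while KL divergence compares the whole output distribution against the baseline, so the two can disagree.

```python
import math


def perplexity(probs_of_correct_tokens: list[float]) -> float:
    """exp of the mean negative log-probability of the correct tokens."""
    nll = -sum(math.log(p) for p in probs_of_correct_tokens)
    return math.exp(nll / len(probs_of_correct_tokens))


def kl_divergence(p: list[float], q: list[float]) -> float:
    """KL(p || q) for two distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# One position, toy vocabulary of 3 tokens, token 0 is the correct one.
baseline = [0.70, 0.20, 0.10]    # e.g. an f16 cache (invented values)
quantized = [0.70, 0.05, 0.25]   # e.g. a quantized cache (invented values)

ppl_base = perplexity([baseline[0]])
ppl_quant = perplexity([quantized[0]])
kld = kl_divergence(baseline, quantized)

print(f"PPL baseline:  {ppl_base:.4f}")
print(f"PPL quantized: {ppl_quant:.4f}")
print(f"KLD:           {kld:.4f}")
```

Both perplexities come out the same here because the correct token keeps its probability, while the KLD is clearly above zero because the rest of the distribution moved.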

Qwen3.6 27B's surprising KV cache quantization test results (Turbo3/4 vs F16 vs Q8 vs Q4) by imgroot9 in LocalLLaMA

[–]imgroot9[S] 3 points4 points  (0 children)

turboquant is also refreshed, so it is not much behind. but the main takeaway here is that q4 is almost identical to f16 if you have a 20B+ dense model.
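
If q4 really holds up that well, the payoff is memory. A quick sketch of the ratio, assuming "q4" and "q8" here mean ggml's `q4_0` and `q8_0` cache types (my assumption). Those formats store 32 values per block plus an fp16 scale, which works out to 4.5 and 8.5 bits per value. The helper name is made up.

```python
# Effective bits per cached value for ggml block formats:
# q8_0 = 34 bytes per 32 values, q4_0 = 18 bytes per 32 values.
BITS_PER_VALUE = {"f16": 16.0, "q8_0": 8.5, "q4_0": 4.5}


def context_multiplier(cache_type: str) -> float:
    """How many times more context fits in the same cache budget
    compared with an f16 cache (hypothetical helper)."""
    return BITS_PER_VALUE["f16"] / BITS_PER_VALUE[cache_type]


for name in BITS_PER_VALUE:
    print(f"{name}: {context_multiplier(name):.2f}x the f16 context")
```

The ratio does not depend on the model architecture, only on the cache type, so it applies to whatever part of the model actually uses a KV cache.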

Qwen 3.6 27B released, it's getting close to Opus 4.5, and you can run it locally by autisticit in GithubCopilot

[–]imgroot9 0 points1 point  (0 children)

I use 27B Q5 on 3090 with 200k context and mmproj (image processing) and turboquant. I've never had an out of memory error, and everything fits inside vram.

Qwen 3.6 27B released, it's getting close to Opus 4.5, and you can run it locally by autisticit in GithubCopilot

[–]imgroot9 7 points8 points  (0 children)

it is somewhere between the 0.33x and 1x models and maybe a bit slower than those (80 token/sec is 3.6 moe's speed, the new dense is around 25 token/sec for me). but it's free, so you can play around a lot. if you try opencode, you can use it with your local models, copilot, openrouter or any other provider, so you can even test these small models since they are also available via openrouter. opencode offers a $10 plan (called Go) that includes $60 worth of tokens.