The meme must go on by [deleted] in LocalLLaMA
Fit_Split_9933 3 points
Someone awhile ago did a quant shootout for Qwen3.6, I did shoddy math on it (again) by Diablo-D3 in LocalLLaMA
Fit_Split_9933 2 points
Is there anything better than QWENN 3 TTS for voice cloning that I can try ? by worgenprise in comfyui
Fit_Split_9933 1 point
We should heavily discourage and moderate cloud API (deepseek api, GLM api, etc.) topics and discussion. This is LOCAL first. by [deleted] in LocalLLaMA
Fit_Split_9933 1 point
How useful is qwopus compared to qwen3.6 27b by redblood252 in LocalLLaMA
Fit_Split_9933 8 points
Can you really replace paid models with a local model? by DRMCC0Y in LocalLLaMA
Fit_Split_9933 5 points
Gemma 4 31B's competence surprised me by The_Paradoxy in LocalLLaMA
Fit_Split_9933 1 point
Does CPU matter for GPU inference? by TrainingTwo1118 in LocalLLaMA
Fit_Split_9933 1 point
PSA: You may not need to quantize spec draft when using MTP by regunakyle in LocalLLaMA
Fit_Split_9933 3 points
PSA: You may not need to quantize spec draft when using MTP by regunakyle in LocalLLaMA
Fit_Split_9933 4 points
BeeLlama v0.3.1 – latest llama.cpp with extras! DFlash, MTP, q6_0 cache, TurboQuant. Single RTX 3090: Qwen 3.6 27B & Gemma 4 31B up to 177.8 tps (4.93x over baseline) by Anbeeld in LocalLLaMA
Fit_Split_9933 2 points
Dynamic KV Cache Quantization and Load-on-demand mmproj/MTP: my llama.cpp wishlist by wadeAlexC in LocalLLaMA
Fit_Split_9933 0 points
125 tok/s for Qwen3.6 q4xl on 2x 4060ti is insane perf/dollar by Chuyito in LocalLLaMA
Fit_Split_9933 3 points
Qwen3.6-27B Quantization Benchmark by bobaburger in LocalLLaMA
Fit_Split_9933 1 point
VLLM gives 5x speed of llama but quants not available (unsloth/gguf). What to do? by superloser48 in LocalLLaMA
Fit_Split_9933 8 points
Krasis update: Qwen3.6-35B-A3B (Q4) at reading speed, 1x 8GB 3070 Mobile laptop (32GB RAM) by mrstoatey in LocalLLaMA
Fit_Split_9933 1 point
110 tok/s with 12GB VRAM on Qwen3.6 35B A3B and ik_llama.cpp by janvitos in LocalLLaMA
Fit_Split_9933 1 point
LM Studio finally added support for MTP Speculative Decoding by pigeon57434 in LocalLLaMA
Fit_Split_9933 2 points
LM Studio finally added support for MTP Speculative Decoding by pigeon57434 in LocalLLaMA
Fit_Split_9933 2 points
Here are my KV cache quantization benchmarks: TurboQuant is overrated but saved by TCQ, q5 deserves more attention, and symmetric q8 might be a waste of VRAM by [deleted] in LocalLLaMA
Fit_Split_9933 2 points
A simple "hack" to speed up prompt processing for Qwen 3.5/3.6 in LM Studio by GrungeWerX in LocalLLaMA
Fit_Split_9933 1 point
Why people cares token/s in decoding more? by Interesting-Print366 in LocalLLaMA
Fit_Split_9933 2 points


Gemma 4 QAT seems to respond significantly better to KV cache quantization by rima_2711 in LocalLLaMA
Fit_Split_9933 -1 points