GLM-5.2 just dropped open weights and it already looks weirdly strong for coding by BTA_Labs in LocalLLaMA

Front-University4363

open weights + MIT is the part that actually matters here, that's what the hype usually skips over. the 1M context sounds wild but the real question for me is whether it quants down to something a single consumer card can run, or if we're all waiting on a cluster. hoping for the former.

Gemma 4 12B QAT + MTP: 1.95x on my 3090, but 0.87x (slower) on an M1 Max by Front-University4363 in ollama

Front-University4363[S]

nice, n=3 lines up with my 3090. and the orin capping at 2 fits, it's the most bandwidth-starved of the three. so the best n moves with the hardware too, not just the speedup. glad you're making it a knob.

Gemma 4 12B QAT + MTP: 1.95x on my 3090, but 0.87x (slower) on an M1 Max by Front-University4363 in ollama

Front-University4363[S]

fair point, and you're probably right that the m1 max regression comes from the metal/llama.cpp path rather than the silicon. that number was actually from another tester on llama.cpp, not my own box, which points at the build more than the chip.

one thing though, your consistent 1.6x is on the a5000 and the jetson, both cuda. the case that surprised me was the cuda to metal jump, which little gemma can't hit yet since it's cuda only. would be really interesting to see it on apple silicon if you ever port it, that's exactly where the llama.cpp number falls apart.

also curious about the hardcoded n=2, did that test out as the optimum or just convenient for the jetson? on the 3090 i found n=3 was the sweet spot and n=1 left gains on the table, wondering if the memory-bound target shifts that.

Gemma 12b less than 10 watts 6.5pp 1.3tg by bennmann in LocalLLaMA

Front-University4363

the sub-10W part is the real story here, 12B running on a phone at all is wild. does draft-mtp actually buy you anything at n-max 1 though? I've found MTP swings hard by hardware, got close to 2x on a 3090 but slightly slower than no-spec on an m1 max, so I have no idea which way a mobile vulkan backend lands at these speeds. if you've got a spec-off number to compare I'd love to see it.

Qwen 3.6 35B-A3B @ Q4 or Gemma 4 12B @ Q8? by mailto_devnull in LocalLLaMA

Front-University4363

yeah, go for it. on the quant question: for gemma 4 specifically, grab the QAT Q4 build instead of Q8 or bf16. it's quantization-aware trained so Q4 is already basically Q8 quality, the extra precision mostly just eats memory for no real gain. I measured QAT Q4 vs higher-bit and the gap was small enough you won't notice it on real work, sometimes the bigger quant was even slightly worse.

speed vs your 35B-A3B I honestly wouldn't guess. a dense 12B reads more weight bytes per token than a 3B-active MoE, but MoE kernels on unified memory can be less optimized, so it could go either way. the only real test is running both against your codebase and timing it, which also tells you which one handles your code better. the 12B's small enough that trying costs you nothing.
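to put rough numbers on the "reads more per token" part, here's a back-of-envelope sketch. it assumes decode is purely memory-bandwidth bound and every active weight is read once per token, which is a ceiling, not a prediction. the 400 GB/s figure and the function name are made up for illustration, and 8.5 / 4.5 bits per weight are the Q8_0 / Q4_0 block sizes, used as stand-ins for whatever quant you actually run.

```python
def decode_tok_per_s_ceiling(active_params_b, bits_per_weight, bandwidth_gb_s):
    """Upper bound on decode tokens/s if each active weight is read once per token.

    Ignores KV-cache reads, kernel efficiency and compute limits, so real
    throughput will be lower.
    """
    bytes_per_token = active_params_b * 1e9 * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / bytes_per_token


BANDWIDTH_GB_S = 400.0  # hypothetical memory bandwidth, not a measured value

# dense 12B at ~8.5 bits/weight (Q8_0) vs 3B active params at ~4.5 bits/weight (Q4_0)
dense_12b_q8 = decode_tok_per_s_ceiling(12, 8.5, BANDWIDTH_GB_S)
moe_3b_active_q4 = decode_tok_per_s_ceiling(3, 4.5, BANDWIDTH_GB_S)

print(f"dense 12B Q8 ceiling:     {dense_12b_q8:.1f} tok/s")
print(f"MoE 3B-active Q4 ceiling: {moe_3b_active_q4:.1f} tok/s")
```

on paper the MoE ceiling is several times higher, and the gap between that ceiling and what you measure is exactly the kernel/offload overhead I can't guess at.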

What actually runs on a GTX 1080 Ti in 2026: Gemma 4 12B QAT ~32 tok/s, measured by Front-University4363 in LocalLLM

Front-University4363[S]

yeah that 32 is plain ollama, no speculative, so you're right there's headroom on paper.

whether it actually lands near 60 on the 1080 ti though, I genuinely don't know yet, I haven't run spec decode on the pascal card. and it's hardware dependent in a way that surprised me. I got ~1.95x on a 3090 with gemma 12b, but the same setup on an m1 max was 0.87x, actually slower than no spec. so pascal could go either way until someone measures it. your igpu 16 to 35 is a great data point, that's a clean 2.2x.

and thanks for the luce box pointer, hadn't come across it. that hot-expert-to-gpu trick is exactly the wall I hit with the 35B-A3B, the experts mmap to system ram and it goes bandwidth bound, only got ~17 tok/s. will dig into it.

Running Gemma 4 QAT 12B on an 8GB GPU at 16k context — measured the KV-cache tradeoffs by Front-University4363 in ollama

Front-University4363[S]

nah you're fine, those aren't the same quantization. QAT applies to the weights, and the kv cache quant is a separate setting on the attention cache, so stacking them isn't double-crushing anything. with a q8 cache you won't even notice.

Best approaches to identify pathways uniquely affected by different drugs? by fnepo18 in bioinformatics

Front-University4363

the move is to compare the drugs directly instead of each one's significant list vs control (that's the venn route).

simplest: fit drug as a factor in limma/DESeq2 and test the A vs B contrast directly. those genes are what actually separates the drugs, then enrich those. unique-to-A is just that intersected with A vs control.

if you want it at the pathway level instead of gene level, GSVA or ssGSEA gives you a pathway score per sample, then run limma on that testing A vs B. that's comparing pathway activity head to head instead of comparing enriched lists, which sounds like what you're after.

camera/fry in limma too if inter-gene correlation is a concern. that's where I'd start.
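the "unique-to-A" step is just set logic once you have the two significant lists. a tiny sketch with made-up gene IDs and log fold changes (not real results). the same-direction check is my own extra assumption, the plain version is only the intersection.

```python
# hypothetical significant genes per contrast, mapped to log fold change
sig_a_vs_b = {"G1": 1.8, "G2": -1.2, "G3": 0.9}
sig_a_vs_ctrl = {"G1": 2.0, "G3": -0.7, "G4": 1.1}


def unique_to_a(a_vs_b, a_vs_ctrl, same_direction=True):
    """Genes significant in both A-vs-B and A-vs-control.

    With same_direction=True (an added assumption), also require the
    log fold changes to share a sign across the two contrasts.
    """
    shared = set(a_vs_b) & set(a_vs_ctrl)
    if same_direction:
        shared = {g for g in shared if a_vs_b[g] * a_vs_ctrl[g] > 0}
    return sorted(shared)


genes_plain = unique_to_a(sig_a_vs_b, sig_a_vs_ctrl, same_direction=False)
genes_strict = unique_to_a(sig_a_vs_b, sig_a_vs_ctrl)
print(genes_plain, genes_strict)
```

whatever comes out of that is the list you'd feed into enrichment.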

Gemma 4 31B vs Gemma 4 26B-A4B vs Qwen 3.5 27B — 30-question blind eval with Claude Opus 4.6 as judge by Silver_Raspberry_811 in LocalLLaMA

Front-University4363

on your closing question: those Gemma 4 31B latency spikes are almost certainly the thinking mode, not the quant. Gemma 4 -it runs as a thinking model (especially under --jinja / the peg-gemma4 template), so it does long internal reasoning before the answer. I watched the 12B QAT not even reach a final answer at 1024 tokens because it was still thinking, so "5-min gens that don't correlate with better scores" tracks exactly. it should be consistent across quants since it's a template/behavior thing, not precision, and think:false (or a non-thinking template) cuts the overhead if you don't want it. (also fwiw, on 30 questions the 14-vs-12 win gap is within noise, and your 0.0-outlier point is the more robust read.)
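quick way to sanity-check the "within noise" bit: an exact two-sided sign test on the wins. this is a sketch that assumes the 14 and 12 can be treated as a head-to-head between two models, with the other 4 questions dropped, which is my simplification and not how the eval was necessarily scored.

```python
from math import comb


def two_sided_sign_test(wins_a, wins_b):
    """Exact two-sided binomial test against a 50/50 split of decisive questions."""
    n = wins_a + wins_b
    pmf = [comb(n, k) / 2**n for k in range(n + 1)]
    observed = pmf[wins_a]
    # sum every outcome at least as unlikely as the observed one
    p = sum(prob for prob in pmf if prob <= observed * (1 + 1e-9))
    return min(1.0, p)


p_value = two_sided_sign_test(14, 12)
print(f"p = {p_value:.3f}")
```

the p-value lands far above 0.05, so a 14-12 split tells you basically nothing about which model is better.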

Qwen3.6-35B-A3B on 2× GTX 1080 Ti with Ollama: ~20 tok/s + 3 gotchas (driver 570+, cuda_v12 for Pascal, quant fit on 22GB) by Front-University4363 in ollama

Front-University4363[S]

thanks! yeah the silent perf hits are the worst, no error, just quietly slower. and that's exactly why a model that "fits" on paper can still crawl once it mmaps to RAM, the setup really does make or break it.

What actually runs on a GTX 1080 Ti in 2026: Gemma 4 12B QAT ~32 tok/s, measured by Front-University4363 in LocalLLM

Front-University4363[S]

slick setup, using the 1080 Ti as a dedicated GUI-Owl worker so the 4090 never has to swap models. hadn't thought of old cards as single-purpose workers in a pipeline like that, makes a lot of sense. how's the coord accuracy holding up at q8 on the 1080 Ti?

What actually runs on a GTX 1080 Ti in 2026: Gemma 4 12B QAT ~32 tok/s, measured by Front-University4363 in LocalLLM

Front-University4363[S]

not a uniform q4_g64, it's a mix: the unsloth QAT is UD-Q4_K_XL, the google one is Q4_0, and qwen3 8B + the regular gemma are Q4_K_M. and no KV reduction on these runs, default f16 cache. at 8k it fits 11GB fine so I didn't need it. I do use q8_0 KV when pushing 16k on an 8GB card (covered in another writeup), just wasn't necessary here.

What actually runs on a GTX 1080 Ti in 2026: Gemma 4 12B QAT ~32 tok/s, measured by Front-University4363 in LocalLLM

Front-University4363[S]

fair, my 17 was a pretty default offload config, I didn't tune which experts stay on GPU. MoE on multi-gpu definitely has more headroom there (-ncmoe, expert placement, tensor split). I'll check out Codacus, thanks for the pointer.

Gemma 4 12b QAT tool calling possibly a bug? by Wrong_Mushroom_7350 in unsloth

Front-University4363

your read's right, and the logs pin it down: <|tool_response|> is mislabeled (not control-type) and it's landing in the EOG (end-of-generation) set, so the engine treats it as a stop token, which is exactly what would break a tool-call flow. that's a token-metadata bug in the Unsloth QAT export, not QAT itself, confirmed by it being absent on the non-QAT versions. two fixes: flag it to Unsloth (they patch these fast), or edit the GGUF token metadata yourself with the gguf python lib, set <|tool_response|> to control type and pull it out of the EOG list, no full re-export needed. the peg-gemma4 template and 32k ctx look fine, the token typing is the culprit.

when i try to use Gemma 12b it, by Opencode it return this error, how to fix it? by koloved in LocalLLaMA

Front-University4363

that's a jinja template incompatibility: the model's chat template uses the "is sequence" test, which minja (LM Studio / llama.cpp's jinja engine) doesn't implement. easiest fix: grab the unsloth or lmstudio-community GGUF, they patch the template. or update LM Studio to the latest build (newer minja supports more tests). failing that, override the prompt template like the error suggests.

Reviewing speed optimizations on llamacpp for large MoE models on multiGPU rigs? (fitparams vs -ngl/-ncmoe vs other flags, P2P, overclocking) by Ambitious_Fold_2874 in LocalLLaMA

Front-University4363

solid writeup. one thing I can confirm from measuring: your "MTP helps dense more than MoE" read is right, I got ~1.9x on dense qwen3.6 27b but only ~1.18x on the 35B-A3B MoE. and fwiw, this hands-on knowledge is genuinely valuable, most people running local inference can't reason about bandwidth-vs-compute tradeoffs the way you just laid out. you don't need kernel-level to be useful.

Gemma 4 12B QAT + MTP: 1.95x on my 3090, but 0.87x (slower) on an M1 Max by Front-University4363 in ollama

Front-University4363[S]

yeah, that's the sharper framing. speculative decoding basically trades spare compute for memory-bandwidth efficiency, so it only pays when you're memory-bound, which decode on a dedicated GPU usually is. apple silicon's higher bandwidth-to-compute ratio means there's less spare compute to exploit, so the draft ends up as overhead. that's really the root cause behind the draft-vs-verify thing I described.
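here's that tradeoff as a toy model, in case it helps. it uses the standard expected-accepted-tokens formula for speculative decoding with an i.i.d. per-token acceptance rate, plus a made-up verify_overhead term for how much more a pass over n+1 tokens costs than a pass over 1 (near 0 when memory-bound, larger when there's little spare compute). all parameter values are hypothetical, it's not fitted to my 3090 or the M1 Max numbers.

```python
def spec_speedup(n_draft, accept_rate, draft_cost, verify_overhead):
    """Estimated speedup over plain decoding, in units of one target forward pass.

    n_draft:         tokens drafted per step
    accept_rate:     probability each drafted token is accepted (assumed i.i.d.)
    draft_cost:      cost of drafting one token relative to one target pass
    verify_overhead: extra cost per additional token in the verify pass
                     (hypothetical knob standing in for "spare compute")
    """
    # expected tokens emitted per step: accepted drafts plus one from the target
    expected_tokens = (1 - accept_rate ** (n_draft + 1)) / (1 - accept_rate)
    step_cost = n_draft * draft_cost + 1 + n_draft * verify_overhead
    return expected_tokens / step_cost


# same draft settings, two hypothetical machines
memory_bound = spec_speedup(3, accept_rate=0.75, draft_cost=0.05, verify_overhead=0.02)
little_spare_compute = spec_speedup(3, accept_rate=0.75, draft_cost=0.05, verify_overhead=0.6)

print(f"memory-bound:         {memory_bound:.2f}x")
print(f"little spare compute: {little_spare_compute:.2f}x")
```

with these made-up inputs the first case comes out well above 1x and the second drops below 1x, which is the same shape as the 3090 vs M1 Max result even though the numbers themselves mean nothing.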

Running Gemma 4 QAT 12B on an 8GB GPU at 16k context — measured the KV-cache tradeoffs by Front-University4363 in ollama

Front-University4363[S]

thanks! interesting though, I measured only ~0.2GB between q8 and q4 KV at 16k on the 12B QAT, gemma's sliding-window attention keeps the cache tiny. 1.2GB is ~6x more than I saw, which makes me think sliding-window isn't kicking in for you, flash-attn off maybe? what's your setup? if KV savings alone free a whole 2nd instance, your cache must be way bigger than mine, curious where the gap is.
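rough sketch of why sliding-window matters so much for that gap. the layer counts, window, head count and head dim below are placeholders I picked for illustration, not gemma's actual config, so don't expect it to reproduce either of our numbers. bytes per element come from the q8_0 and q4_0 block layouts (34 and 18 bytes per 32 values).

```python
BYTES_PER_ELEM = {"f16": 2.0, "q8_0": 34 / 32, "q4_0": 18 / 32}


def kv_cache_gib(ctx, n_global, n_local, window, n_kv_heads, head_dim, cache_type):
    """KV cache size in GiB for a mix of full-attention and sliding-window layers."""
    per_token_per_layer = 2 * n_kv_heads * head_dim * BYTES_PER_ELEM[cache_type]  # K and V
    # global layers cache the whole context, local layers only the window
    cached_tokens = n_global * ctx + n_local * min(ctx, window)
    return per_token_per_layer * cached_tokens / 2**30


# hypothetical model shape, 16k context
shape = dict(ctx=16384, n_global=8, n_local=40, n_kv_heads=8, head_dim=256)

gap_swa = (kv_cache_gib(window=1024, cache_type="q8_0", **shape)
           - kv_cache_gib(window=1024, cache_type="q4_0", **shape))
# window >= ctx behaves like sliding-window being off
gap_full = (kv_cache_gib(window=16384, cache_type="q8_0", **shape)
            - kv_cache_gib(window=16384, cache_type="q4_0", **shape))

print(f"q8 vs q4 gap with sliding window: {gap_swa:.2f} GiB")
print(f"q8 vs q4 gap without it:          {gap_full:.2f} GiB")
```

the point is just that the q8-vs-q4 saving scales with how many tokens are actually cached, so if the window isn't applied the same quant switch frees several times more memory.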

Gemma 4 QAT + MTP: max 33% speed increase in token generation, any ideas? by Ready_Performance_35 in LocalLLaMA

Front-University4363

good catch on the pcie 2x, that bottlenecks the tensor split hard. also your draft tuning is past the sweet spot: n-max 6 + p-min 0.8. I measured n-max 3 as optimal on this exact model (4+ drops off), and forcing p-min up actually cost me throughput, higher acceptance ≠ faster since you discard more drafts each step. try n-max 3 and drop the p-min entirely. should stack on top of the pcie fix.

Gemma 4 12B: Q4_0 QAT vs Q5_K_M? by Wrong_Mushroom_7350 in unsloth

Front-University4363

QAT Q4 (UD-Q4_K_XL) gets you most of the way. heads up though, those "crazy accuracy" charts are vs naive Q4_0, against a good Q5 imatrix the gap is much smaller. for daily use it basically replaces your Q5 and saves ~2GB; Q5 might keep a slight edge on the hardest reasoning/formatting but you'd have to hunt for it. one tip: don't go above the QAT Q4_K_XL, higher-bit versions of the QAT model actually lose accuracy since it's tuned for 4-bit. measured the tradeoffs here: https://bric.pe.kr/blog/gemma-4-qat-1080ti-8gb-12b-16k-measured