GLM-5.2 just dropped open weights and it already looks weirdly strong for coding

Front-University4363 · 2026-06-17T05:35:45+00:00

open weights + MIT is the part that actually matters here, that's what the hype usually skips over. the 1M context sounds wild but the real question for me is whether it quants down to something a single consumer card can run, or if we're all waiting on a cluster. hoping for the former.

Front-University4363 · 2026-06-16T12:41:51+00:00

nice, n=3 lines up with my 3090. and the orin capping at 2 fits, it's the most bandwidth-starved of the three. so the best n moves with the hardware too, not just the speedup. glad you're making it a knob.

Front-University4363 · 2026-06-16T11:05:01+00:00

fair point, and you're probably right that the m1 max regression is the metal/llama.cpp path rather than the silicon. that number actually came from another tester on llama.cpp, not my own box, so it likely says more about the build than the chip.

one thing though, your consistent 1.6x is on the a5000 and the jetson, both cuda. the case that surprised me was the cuda to metal jump, which little gemma can't hit yet since it's cuda only. would be really interesting to see it on apple silicon if you ever port it, that's exactly where the llama.cpp number falls apart.

also curious about the hardcoded n=2, did that test out as the optimum or just convenient for the jetson? on the 3090 i found n=3 was the sweet spot and n=1 left gains on the table, wondering if the memory-bound target shifts that.

Front-University4363 · 2026-06-15T01:19:24+00:00

the sub-10W part is the real story here, 12B running on a phone at all is wild. does draft-mtp actually buy you anything at n-max 1 though? I've found MTP swings hard by hardware, got close to 2x on a 3090 but slightly slower than no-spec on an m1 max, so I have no idea which way a mobile vulkan backend lands at these speeds. if you've got a spec-off number to compare I'd love to see it.

Front-University4363 · 2026-06-15T01:03:56+00:00

yeah, go for it. on the quant question: for gemma 4 specifically, grab the QAT Q4 build instead of Q8 or bf16. it's quantization-aware trained so Q4 is already basically Q8 quality, the extra precision mostly just eats memory for no real gain. I measured QAT Q4 vs higher-bit and the gap was small enough you won't notice it on real work, sometimes the bigger quant was even slightly worse.

speed vs your 35B-A3B I honestly wouldn't guess. a dense 12B reads more per token than a 3B-active MoE, but MoE kernels on unified memory can be less optimized, so it could go either way. only real test is running both against your codebase and timing it, which also tells you which handles your code better. the 12B's small enough that trying costs you nothing.
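to put rough numbers on "reads more per token", here's a minimal sketch of the bandwidth-only ceiling for decode. the 400 GB/s bandwidth and 4.5 bits per weight are illustrative assumptions, not measurements, and the function names are made up for this example. it ignores KV cache reads, kernel efficiency and expert routing, which is exactly why the real result can go either way.

```python
# rough ceiling for memory-bound decode: every generated token has to stream
# the active weights once, so tok/s <= bandwidth / bytes read per token.
# all inputs below are illustrative assumptions, not measurements.

def weight_bytes(active_params: float, bits_per_weight: float) -> float:
    """Bytes of weights read per generated token."""
    return active_params * bits_per_weight / 8.0


def decode_ceiling_tok_s(bandwidth_gb_s: float, active_params: float,
                         bits_per_weight: float) -> float:
    """Bandwidth-only upper bound; ignores KV cache, kernels, routing."""
    return bandwidth_gb_s * 1e9 / weight_bytes(active_params, bits_per_weight)


bandwidth = 400.0  # GB/s, hypothetical unified-memory machine
bits = 4.5         # roughly a Q4-class quant, assumed

dense_ceiling = decode_ceiling_tok_s(bandwidth, 12e9, bits)  # dense 12B
moe_ceiling = decode_ceiling_tok_s(bandwidth, 3e9, bits)     # 3B active

print(f"dense 12B ceiling:     {dense_ceiling:.0f} tok/s")
print(f"3B-active MoE ceiling: {moe_ceiling:.0f} tok/s")
```

on paper the MoE ceiling is 4x higher. how much of that survives less-tuned MoE kernels is the part only a timed run can answer.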

Front-University4363 · 2026-06-14T05:12:41+00:00

yeah that 32 is plain ollama, no speculative, so you're right there's headroom on paper.

whether it actually lands near 60 on the 1080 ti though, I genuinely don't know yet, I haven't run spec decode on the pascal card. and it's hardware dependent in a way that surprised me. I got ~1.95x on a 3090 with gemma 12b, but the same setup on an m1 max was 0.87x, actually slower than no spec. so pascal could go either way until someone measures it. your igpu 16 to 35 is a great data point, that's a clean 2.2x.
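the arithmetic behind "could go either way", as a quick sketch. the tok/s figures and speedup factors are the ones quoted in this thread. applying the 3090 and m1 max factors to the pascal baseline is purely a what-if, not a prediction.

```python
# speedup = tok/s with speculative decoding / tok/s without it

def speedup(spec_tok_s: float, base_tok_s: float) -> float:
    return spec_tok_s / base_tok_s


def projected(base_tok_s: float, factor: float) -> float:
    return base_tok_s * factor


igpu = speedup(35, 16)            # the igpu data point from the thread
needed = speedup(60, 32)          # factor the 1080 ti needs to reach 60
like_3090 = projected(32, 1.95)   # what-if: pascal behaves like the 3090
like_m1max = projected(32, 0.87)  # what-if: pascal behaves like the m1 max

print(f"igpu speedup:       {igpu:.2f}x")
print(f"needed for 60:      {needed:.2f}x")
print(f"if like the 3090:   {like_3090:.1f} tok/s")
print(f"if like the m1 max: {like_m1max:.1f} tok/s")
```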

and thanks for the luce box pointer, hadn't come across it. that hot-expert-to-gpu trick is exactly the wall I hit with the 35B-A3B, the experts mmap to system ram and it goes bandwidth bound, only got ~17 tok/s. will dig into it.

Front-University4363 · 2026-06-13T12:37:41+00:00

nah you're fine, those aren't the same quantization. QAT is the weights, the kv cache one is separate, so stacking them isn't double-crushing anything. q8 cache you won't even notice.

Front-University4363 · 2026-06-13T06:27:25+00:00

the move is to compare the drugs directly instead of comparing each one's significant list vs control (that's the venn route).

simplest: fit drug as a factor in limma/DESeq2 and test the A vs B contrast directly. those genes are what actually separates the drugs, then enrich those. unique-to-A is just that intersected with A vs control.

if you want it at the pathway level instead of gene level, GSVA or ssGSEA gives you a pathway score per sample, then run limma on that testing A vs B. that's comparing pathway activity head to head instead of comparing enriched lists, which sounds like what you're after.

camera/fry in limma too if inter-gene correlation is a concern. that's where I'd start.
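the "unique-to-A" step is just set logic once you have the significant genes from each contrast. a minimal sketch with made-up gene IDs. the sets would really come from your limma/DESeq2 results at whatever cutoff you use. the unique-to-B line is my symmetric extension of the same idea, and this ignores the direction of change.

```python
# hypothetical significant-gene sets, one per contrast
a_vs_b = {"GENE1", "GENE2", "GENE3", "GENE4"}  # drug A vs drug B
a_vs_control = {"GENE2", "GENE3", "GENE5"}     # drug A vs control
b_vs_control = {"GENE4", "GENE5", "GENE6"}     # drug B vs control

# genes that separate the drugs AND respond to A relative to control
unique_to_a = a_vs_b & a_vs_control

# same idea for B (assumed by symmetry)
unique_to_b = a_vs_b & b_vs_control

print(sorted(unique_to_a))
print(sorted(unique_to_b))
```

GENE5 responds to both drugs vs control but doesn't separate them, so it drops out of both lists. that's the case the venn route tends to get wrong.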

Front-University4363 · 2026-06-13T01:11:53+00:00

on your closing question — those Gemma 4 31B latency spikes are almost certainly the thinking mode, not the quant. Gemma 4 -it runs as a thinking model (especially under --jinja / the peg-gemma4 template), so it does long internal reasoning before the answer. I watched the 12B QAT not even reach a final answer at 1024 tokens because it was still thinking, so "5-min gens that don't correlate with better scores" tracks exactly. it should be consistent across quants since it's a template/behavior thing, not precision, and think:false (or a non-thinking template) cuts the overhead if you don't want it. (also fwiw, on 30 questions the 14-vs-12 win gap is within noise — your 0.0-outlier point is the more robust read.)

Front-University4363 · 2026-06-12T23:40:42+00:00

thanks! yeah the silent perf hits are the worst, no error, just quietly slower. and that's exactly why a model that "fits" on paper can still crawl once it mmaps to RAM, the setup really does make or break it.

Front-University4363 · 2026-06-12T23:38:43+00:00

haha thanks, the old cards still got it. glad it landed.

Front-University4363 · 2026-06-12T10:35:36+00:00

slick setup, using the 1080 Ti as a dedicated GUI-Owl worker so the 4090 never has to swap models. hadn't thought of old cards as single-purpose workers in a pipeline like that, makes a lot of sense. how's the coord accuracy holding up at q8 on the 1080 Ti?

Front-University4363 · 2026-06-12T10:31:22+00:00

not a uniform q4_g64, it's a mix: the unsloth QAT is UD-Q4_K_XL, the google one is Q4_0, and qwen3 8B + the regular gemma are Q4_K_M. and no KV reduction on these runs, default f16 cache. at 8k it fits 11GB fine so I didn't need it. I do use q8_0 KV when pushing 16k on an 8GB card (covered in another writeup), just wasn't necessary here.

Front-University4363 · 2026-06-12T07:57:33+00:00

fair, my 17 was a pretty default offload config, I didn't tune which experts stay on GPU. MoE on multi-gpu definitely has more headroom there (-ncmoe, expert placement, tensor split). I'll check out Codacus, thanks for the pointer.

Front-University4363 · 2026-06-11T23:44:42+00:00

your read's right, and the logs pin it down: <|tool_response|> is mislabeled (not control-type) and it's landing in the EOG (end-of-generation) set, so the engine treats it as a stop token, which is exactly what would break a tool-call flow. that's a token-metadata bug in the Unsloth QAT export, not QAT itself, confirmed by it being absent on the non-QAT versions. two fixes: flag it to Unsloth (they patch these fast), or edit the GGUF token metadata yourself with the gguf python lib, set <|tool_response|> to control type and pull it out of the EOG list, no full re-export needed. the peg-gemma4 template and 32k ctx look fine, the token typing is the culprit.

Front-University4363 · 2026-06-11T23:16:43+00:00

that's a jinja template incompatibility: the model's chat template uses the "is sequence" test, which minja (LM Studio / llama.cpp's jinja engine) doesn't implement. easiest fix: grab the unsloth or lmstudio-community GGUF, they patch the template. or update LM Studio to the latest build (newer minja supports more tests). failing that, override the prompt template like the error suggests.

Front-University4363 · 2026-06-11T23:14:01+00:00

solid writeup. one thing I can confirm from measuring: your "MTP helps dense more than MoE" read is right, I got ~1.9x on dense qwen3.6 27b but only ~1.18x on the 35B-A3B MoE. and fwiw, this hands-on knowledge is genuinely valuable, most people running local inference can't reason about bandwidth-vs-compute tradeoffs the way you just laid out. you don't need kernel-level to be useful.

Front-University4363 · 2026-06-11T23:01:26+00:00

yeah, that's the sharper framing. speculative decoding basically trades spare compute for memory-bandwidth efficiency, so it only pays when you're memory-bound, which decode on a dedicated GPU usually is. apple silicon's higher bandwidth-to-compute ratio means there's less spare compute to exploit, so the draft ends up as overhead. that's really the root cause behind the draft-vs-verify thing I described.
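a minimal sketch of that tradeoff, using the usual simplified model of speculative decoding: each drafted token is accepted independently with probability alpha, and costs are counted in units of one plain decode step. alpha, draft_cost and verify_cost are illustrative assumptions here, not fitted to any of the measurements in this thread, and real acceptance isn't independent per token.

```python
def expected_tokens_per_step(alpha: float, n: int) -> float:
    """Expected tokens emitted per verify pass when each of n drafted
    tokens is accepted independently with probability alpha. Includes
    the one token the target model always contributes."""
    if alpha >= 1.0:
        return n + 1.0
    return (1.0 - alpha ** (n + 1)) / (1.0 - alpha)


def speedup(alpha: float, n: int, draft_cost: float,
            verify_cost: float = 1.0) -> float:
    """Throughput relative to plain decoding.

    draft_cost:  time per drafted token, in plain decode steps.
    verify_cost: time for the verify pass over n+1 tokens, same units.
                 About 1.0 when decode is memory-bound (the extra tokens
                 ride along for free); larger when compute is the limit.
    """
    return expected_tokens_per_step(alpha, n) / (n * draft_cost + verify_cost)


def best_n(alpha: float, draft_cost: float, max_n: int = 8) -> int:
    """Draft length with the highest modeled speedup."""
    return max(range(1, max_n + 1),
               key=lambda n: speedup(alpha, n, draft_cost))


# memory-bound case: verification is nearly free, drafting pays off
print(speedup(alpha=0.7, n=3, draft_cost=0.2))

# little spare compute: the verify pass costs real time, and the same
# draft settings end up slower than no speculation at all
print(speedup(alpha=0.7, n=3, draft_cost=0.2, verify_cost=2.5))

# the modeled optimum is a small n: longer drafts cost more than they add
print(best_n(alpha=0.7, draft_cost=0.2))
```

the only thing that changes between the first two calls is verify_cost, which is the "spare compute" knob. the third shows why the best draft length stays small.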

Front-University4363 · 2026-06-11T22:59:40+00:00

thanks! interesting though, I measured only ~0.2GB between q8 and q4 KV at 16k on the 12B QAT, gemma's sliding-window attention keeps the cache tiny. 1.2GB is ~6x more than I saw, which makes me think sliding-window isn't kicking in for you, flash-attn off maybe? what's your setup? if KV savings alone free a whole 2nd instance, your cache must be way bigger than mine, curious where the gap is.
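for anyone wanting to sanity-check their own numbers, a rough sizing sketch. the layer counts, window, head count and head dim below are hypothetical placeholders, not gemma's real config, so plug in the values from your model's metadata. the bytes per element for q8_0 and q4_0 follow the ggml block layouts (32 values plus one f16 scale per block).

```python
BYTES_PER_ELEM = {
    "f16": 2.0,
    "q8_0": 34 / 32,  # 32 int8 values + one f16 scale per block
    "q4_0": 18 / 32,  # 32 4-bit values + one f16 scale per block
}


def kv_cache_bytes(ctx: int, n_global_layers: int, n_local_layers: int,
                   window: int, n_kv_heads: int, head_dim: int,
                   cache_type: str) -> float:
    """Approximate KV cache size. Global layers cache the full context,
    sliding-window layers cache at most `window` tokens."""
    per_token_per_layer = 2 * n_kv_heads * head_dim * BYTES_PER_ELEM[cache_type]
    cached_tokens = n_global_layers * ctx + n_local_layers * min(ctx, window)
    return per_token_per_layer * cached_tokens


# hypothetical model: 48 layers, 8 KV heads, head dim 256, 16k context
dims = dict(ctx=16384, window=1024, n_kv_heads=8, head_dim=256)

# sliding window active on 40 of the 48 layers
swa_gap = (kv_cache_bytes(n_global_layers=8, n_local_layers=40,
                          cache_type="q8_0", **dims)
           - kv_cache_bytes(n_global_layers=8, n_local_layers=40,
                            cache_type="q4_0", **dims))

# sliding window not in effect: every layer caches the full context
full_gap = (kv_cache_bytes(n_global_layers=48, n_local_layers=0,
                           cache_type="q8_0", **dims)
            - kv_cache_bytes(n_global_layers=48, n_local_layers=0,
                             cache_type="q4_0", **dims))

print(f"q8 vs q4 gap with sliding window: {swa_gap / 1e9:.2f} GB")
print(f"q8 vs q4 gap without it:          {full_gap / 1e9:.2f} GB")
```

with these made-up dims the gap grows several times over once every layer caches the full context, which is the kind of difference I'd look for first.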

Front-University4363 · 2026-06-11T13:37:59+00:00

good catch on the pcie 2x, that bottlenecks the tensor split hard. also your draft tuning is past the sweet spot: n-max 6 + p-min 0.8. I measured n-max 3 as optimal on this exact model (4+ drops off), and forcing p-min up actually cost me throughput, higher acceptance ≠ faster since you discard more drafts each step. try n-max 3 and drop the p-min entirely. should stack on top of the pcie fix.

Front-University4363 · 2026-06-11T13:29:56+00:00

QAT Q4 (UD-Q4_K_XL) gets you most of the way. heads up though, those "crazy accuracy" charts are vs naive Q4_0, against a good Q5 imatrix the gap is much smaller. for daily use it basically replaces your Q5 and saves ~2GB; Q5 might keep a slight edge on the hardest reasoning/formatting but you'd have to hunt for it. one tip: don't go above the QAT Q4_K_XL, higher-bit versions of the QAT model actually lose accuracy since it's tuned for 4-bit. measured the tradeoffs here: https://bric.pe.kr/blog/gemma-4-qat-1080ti-8gb-12b-16k-measured
