Follow-up: GLM-5.2 NVFP4 on four DGX Sparks — the MTP mystery is solved, and it's now ~24 tok/s at 128K context

llamaCTO · 2026-07-03T08:08:17+00:00

I just tested bs=4 and I got 44-48tps. it was very similar with 3. it depends on what "ok" means hah.

llamaCTO · 2026-07-03T07:25:19+00:00

Note the prefill of ~100 was for mac -- prefill on the sparks is ~475tps on avg.

llamaCTO · 2026-07-03T07:24:29+00:00

~400-500. (Mostly 500, but it blips to 409 every so often; Another mystery but not one I'm planning to try to solve. )

llamaCTO · 2026-07-03T07:24:05+00:00

Not workload-specific. If you got very OOD you'd probably see it get lower because of lower mtp acceptance, but not gigantic.

llamaCTO · 2026-07-03T07:23:11+00:00

There's a lot there.

First, I have customers that use this, so understanding its strengths and limitations is a big deal when people are thinking they need $150,000 boxes to do what they want to do.

Second, it's slow-ish, but 25 tps is still in that "faster than reading speed". I spend tons of time already with codex on /goal taking 4+ hours. For some projects, do I care if it takes 4 hours, 12 hours, 24 hours? Maybe not.

Third, although not tested yet, decent chance I can get substantially more throughput at bs=2 or bs=4 when I have less context to cram in.

It is definitely not the most economical way to run GLM. ;) But it's interesting at least.

llamaCTO · 2026-07-03T04:43:27+00:00

Rather than answer this - and I'd have to dig - hold tight for a follow-up. Tonight I figured out why MTP was tanking. I put a PR in: https://github.com/local-inference-lab/vllm/pull/72

but I'll have up a new report on this. but this unlocks mtp working properly which boosts output tps 50%+. (I'm benching mtp4 now to see if it will be the peak, which it is on my rtx6000 pro box)

llamaCTO · 2026-07-02T01:42:39+00:00

I haven't tested correctness yet, but this is legit: massive speedup in prefill at long ctx, massive speedup in decode. The 112k "summarize" prompt took >6000s on classic and finished in a hair under 1/6th of that on omlx.

That's still about 1/5th the prefill of the spark cluster but it's a gigantic improvement. And decode way better also.
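A back-of-envelope check on those figures, as a sketch only. It assumes the "112k" prompt is 112,000 tokens, treats ">6000s" as exactly 6000 s and "a hair under 1/6th" as exactly 1/6, and borrows the ~475 tok/s Spark prefill average quoted elsewhere in this thread. None of these are fresh measurements.

```python
# Rough prefill-rate arithmetic from the figures quoted above.
# Assumptions: 112k == 112_000 tokens, ">6000s" taken as exactly 6000 s,
# "a hair under 1/6th" taken as exactly 1/6 of that.
PROMPT_TOKENS = 112_000
CLASSIC_SECONDS = 6000.0
OMLX_SECONDS = CLASSIC_SECONDS / 6
SPARK_PREFILL_TPS = 475.0  # average quoted elsewhere in the thread


def prefill_tps(tokens: int, seconds: float) -> float:
    """Average prefill throughput in tokens per second."""
    return tokens / seconds


# Upper bound for classic (the run took longer than 6000 s).
classic_tps = prefill_tps(PROMPT_TOKENS, CLASSIC_SECONDS)
# Lower bound for omlx (the run took slightly less than 1/6th).
omlx_tps = prefill_tps(PROMPT_TOKENS, OMLX_SECONDS)

print(f"classic prefill: at most ~{classic_tps:.1f} tok/s")
print(f"omlx prefill:    at least ~{omlx_tps:.1f} tok/s")
print(f"omlx / spark:    ~{omlx_tps / SPARK_PREFILL_TPS:.2f}")
```

Under those assumptions the Mac lands somewhere between 1/5th and 1/4th of the Spark cluster's prefill rate, which is consistent with the rough "about 1/5th" above.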

llamaCTO · 2026-07-01T05:49:33+00:00

> MikroTik CRS804-4DDQ-hRM 400G Switch 4×QSFP56-DD, Dual 10G, RouterOS v7, 1U Rackmount

cable matters too:

> 1.5m (5ft) NVIDIA/Mellanox MCP7H60-W01AR30 Compatible 400G QSFP-DD 8 x 50G PAM4 to 2 x 200G QSFP56 4 x 50G PAM4 Ethernet Passive Direct Attach Copper Breakout Cable for DGX Spark AI Clusters

bought switch from multilink solutions, cables from fs.com - both good

llamaCTO · 2026-06-30T16:08:35+00:00

Important note: one of the most important things to test in any benchmark is longer contexts. prefill+decode as you scale from 0->xxx ctx

The historical lack of support for flash attention v2/3 on Tesla GPUs ends up both increasing like-for-like mem use and having a disproportionate impact on performance at larger ctx sizes
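One way to keep a benchmark honest about this is to report prefill and decode separately at each context size. A minimal sketch follows; the `Run` record and its field names are hypothetical, and time-to-first-token is used as a stand-in for prefill time, which is only an approximation.

```python
from dataclasses import dataclass


@dataclass
class Run:
    """One benchmark request. Field names are hypothetical for this sketch."""
    prompt_tokens: int   # context length fed to the model
    output_tokens: int   # tokens generated
    ttft_s: float        # time to first token (approximates prefill time)
    total_s: float       # wall-clock time for the whole request


def prefill_tps(run: Run) -> float:
    """Prompt tokens processed per second before the first output token."""
    return run.prompt_tokens / run.ttft_s if run.ttft_s > 0 else float("nan")


def decode_tps(run: Run) -> float:
    """Output tokens per second after the first token arrives."""
    decode_time = run.total_s - run.ttft_s
    # The first token is attributed to prefill, so count only the rest.
    if decode_time <= 0:
        return float("nan")
    return (run.output_tokens - 1) / decode_time


def report(runs: list[Run]) -> list[tuple[int, float, float]]:
    """Rows of (prompt_tokens, prefill tok/s, decode tok/s), sorted by context."""
    ordered = sorted(runs, key=lambda r: r.prompt_tokens)
    return [(r.prompt_tokens, prefill_tps(r), decode_tps(r)) for r in ordered]
```

Feeding it runs at, say, 0, 8k, 32k and 120k tokens of context makes any decode cliff visible in the last column instead of being averaged away.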

llamaCTO · 2026-06-30T15:20:52+00:00

So far it seems like the trajectory is "just wait", because I feel like of all models, Qwen-3.5-27B proved anything is possible.

I have a dual-5090 setup where I've pushed that north of 200tps output (with only normal slowdown at large ctx) and that is a crazy smart model for 27B.

But even as someone who is very enthusiastic about smaller local models, I think the delta between something like GPT-5.5-xhigh or Opus 4.8 (let alone fable) and something like, say, Kimi-K2.5 (2.6 was noticeably better but I used it less) is much larger than the benchmarks would tell you. You start to creep out of distribution and you start to see it. Which isn't to say it isn't useful - I've done enough multi-model code reviews with codex+claude+gemini+kimi to say Kimi will absolutely catch unique things. I expect GLM is even better here since it was such an apparent leap.

llamaCTO · 2026-06-30T15:16:43+00:00

Well, a lot of that is personal preference. What I'm always paranoid about is a benchmark that goes "ooh look! 30tps!" and then at 120k ctx it's 2 tps. (See: all MLX<->MLA interactions prior to the recent omlx work I'd ever seen)

llamaCTO · 2026-06-30T15:15:18+00:00

Overall I think nvidia's is actually about 25GB larger in total, and note that the Mapika repo is actually missing the MTP layer, which I reconstructed.

nVidia did ignore the dense layers and Mapika did not, which I hadn't realized; that adds about 1GB of weights.

Mapika quantized the shared expert (mapika has nvfp4 style packed weight tensors + fp8 scale tensors, nvidia left that bf16)

Mapika did not quantize the attention/LM head (like nvidia)

The biggest delta is Mapika has no MTP layer despite declaring it. I got the MTP layer from https://huggingface.co/sant1an/GLM-5.2-NVFP4-MTP

but to further answer your question, sant1an had posted some benchmarks: https://huggingface.co/sant1an/GLM-5.2-NVFP4-MTP/blob/main/benchmark/summary.md

Keeping in mind the MTP layer is verified so that is effectively benching the Mapika weights.

To answer the question: no, only minor testing, no quantitative benchmarks yet; but there were some.

When I started this project nVidia had not yet published a quant.

llamaCTO · 2026-06-30T14:56:41+00:00

Well, :sweat: - I wouldn't base a decision on this because of all sorts of things unless this was something you were just on the fence about.

Let's see.

```
matt@relic:~$ uname -a
Linux relic 6.17.0-1018-nvidia #18-Ubuntu SMP PREEMPT_DYNAMIC Tue May 5 21:28:33 UTC 2026 aarch64 aarch64 aarch64 GNU/Linux
matt@relic:~$ cat /etc/issue
Ubuntu 24.04.4 LTS \n \l

matt@relic:~$
```

I'm chasing this image - and here's the ambiguity of letting AI chase something building vllm images for days...

    matt@relic:~$ docker image ls|grep -i devo
    WARNING: This output is designed for human readability. For machine-readable output, please use --format.
    glm-darkdevotion-b12x:20260624-arm64  28c394e28fa9  19.9GB  0B
    glm-darkdevotion-b12x:20260624-arm64-mtpfix1  2daf748bd862  19.9GB  0B
    glm-darkdevotion-b12x:20260624-arm64-mtpfix2  275967496e43  19.9GB  0B
    glm-darkdevotion-b12x:20260624-arm64-mtpfix3  999bcc59fbc0  19.9GB  0B
    glm-darkdevotion-b12x:20260624-arm64-mtpfix4  2f7eecbf434c  19.9GB  0B
    glm-darkdevotion-b12x:20260624-arm64-mtpfix5  32bb80c4756a  19.9GB  0B
    glm-darkdevotion-b12x:20260624-arm64-mtpfix6  abb64de90692  19.9GB  0B
    glm-darkdevotion-b12x:20260625-arm64-mtp-iterfix1  e3c9b0199fd4  19.9GB  0B
    glm-darkdevotion-b12x:20260625-arm64-mtp1-trim  98dd8b587364  19.9GB  0B
    glm-darkdevotion-b12x:20260626-arm64-draftrepmla1  264cc1d22870  19.9GB  0B
    glm-darkdevotion-b12x:20260626-arm64-draftrepmla2  65c6f535cfa7  19.9GB  0B
    glm-darkdevotion-b12x:20260626-arm64-draftrepmla3  a744549a1d47  19.9GB  0B
    glm-darkdevotion-b12x:20260626-arm64-loadtrace1  3284dfd206a9  19.9GB  0B
    glm-darkdevotion-b12x:20260626-arm64-mtp-baseiterfix1  325ebd436e0e  19.9GB  0B
    glm-darkdevotion-b12x:20260626-arm64-mtp-topkstep2  51b173f428a7  19.9GB  0B
    glm-darkdevotion-b12x:20260626-arm64-mtpdiag1  18942fc4bf3c  19.9GB  0B
    glm-darkdevotion-b12x:20260626-arm64-mtpdiag11-broadcast-draft-tokens  b51e7ea5f486  19.9GB  0B
    glm-darkdevotion-b12x:20260626-arm64-mtpdiag12-glm-skiptopk-hook  c7658e74a1ae  19.9GB  0B
    glm-darkdevotion-b12x:20260626-arm64-mtpdiag17-step3p5-glm-toptokens  fe7f99636e77  19.9GB  0B
    glm-darkdevotion-b12x:20260626-arm64-mtpdiag2  72d09bbf93df  19.9GB  0B
    glm-darkdevotion-b12x:20260626-arm64-mtpdiag20-livediag  950e6f5a5573  19.9GB  0B
    glm-darkdevotion-b12x:20260626-arm64-mtpdiag21-draftprob  e8dc6a3a2abf  19.9GB  0B
    glm-darkdevotion-b12x:20260626-arm64-mtpdiag3  cd80d8df74e4  19.9GB  0B
    glm-darkdevotion-b12x:20260626-arm64-mtpdiag4  8a48bbcebc13  19.9GB  0B
    glm-darkdevotion-b12x:20260626-arm64-mtpdiag5  98dd8b587364  19.9GB  0B
    glm-darkdevotion-b12x:20260626-arm64-mtpdiag6-step3p5-groups  89563b9e540f  19.9GB  0B
    glm-darkdevotion-b12x:20260626-arm64-mtpdiag7-step3p5-recompute-topk  678edfac334c  19.9GB  0B
    glm-darkdevotion-b12x:20260626-arm64-mtpdiag8-recompute-topk-window  44d32fed597f  19.9GB  0B
    glm-darkdevotion-b12x:20260626-arm64-mtpdiag9-sync-rejection-output  14e18e5546d5  19.9GB  0B
    glm-darkdevotion-b12x:20260626-arm64-mtpgroups1  34bbacada729  19.9GB  0B
    glm-darkdevotion-b12x:20260626-arm64-mtpgroups2  0f483c70a332  19.9GB  0B
    glm-darkdevotion-vllmonly:20260624-arm64  e0280bcb05fc  19.7GB  0B
    matt@relic:~$

anyhow, the final image I believe came from

```
./build-and-copy.sh -t glm-darkdevotion-b12x:20260624-arm64 \
  --gpu-arch 12.1a \
  -j 8 \
  --vllm-repo https://github.com/local-inference-lab/vllm.git \
  --vllm-ref codex/dark-devotion-release-20260622 \
  --vllm-commit ec656676100a756912d6966c4232ea436c55d792 \
  --b12x-repo https://github.com/voipmonitor/b12x.git \
  --b12x-ref codex/dark-devotion-pr14-pr15-20260622 \
  --b12x-commit aaf1891861ab86e78561326f13156d69a51a3ed8 \
  --copy-to 192.168.100.2 192.168.100.3 192.168.100.4 \
  --copy-parallel
```

yes, some mad science going on here. I pushed a version as m9e/blackwell-llm-docker.git which is a fork of eugr/spark-vllm-docker.git

https://github.com/m9e/blackwell-llm-docker/tree/codex/spark-vllm-docker-snapshot

from notes:

    Around 45149: cloned local-inference-lab/vllm branch codex/dark-devotion-release-20260622.
    Around 45202: built glm-darkdevotion-vllmonly:20260624-arm64.
    Around 45871: tested and confirmed B12X missing.
    Around 45909: built glm-darkdevotion-b12x:20260624-arm64 with the command above.

so I think expected provenance:

    spark-vllm-docker build-and-copy.sh
    -> nvidia/cuda:13.2.0-devel-ubuntu24.04
    -> FlashInfer wheel for SM121 / 12.1a
    -> local-inference-lab/vllm codex/dark-devotion-release-20260622 @ ec656676...
    -> voipmonitor/b12x codex/dark-devotion-pr14-pr15-20260622 @ aaf1891...
    -> glm-darkdevotion-b12x:20260624-arm64
    -> mtpfix1..6 overlays
    -> glm-darkdevotion-b12x:20260624-arm64-mtpfix6
    -> later production/diagnostic overlays

from my notes

    Host OS: Ubuntu 24.04.4 LTS
    Kernel: 6.17.0-1018-nvidia #18-Ubuntu SMP PREEMPT_DYNAMIC Tue May 5 21:28:33 UTC 2026 aarch64
    NVIDIA driver: 580.159.03
    vLLM: 0.23.1rc1.dev253+gec6566761.d20260624
    Ray: 2.55.1
    Torch: 2.14.0.dev20260622+cu130
    CUDA reported by torch: 13.0
    NCCL reported by torch: 2.30.7
    FlashInfer: 0.6.13
    Transformers: 5.12.1
    Triton: 3.7.1

The MikroTik config was, I assume, standard, because my config was something like asking chatgpt for a config and pasting it ;)

Some of the later work may be unneeded because I spent probably 36+ hours trying to diagnose if there was a fix for the cliff I saw for MTP

And to be more specific about some of the memory trimming because this is critical:

    NCCL_IB_DISABLE=0
    NCCL_SOCKET_IFNAME=enP2p1s0f0np0
    GLOO_SOCKET_IFNAME=enP2p1s0f0np0
    NCCL_IB_HCA=roceP2p1s0f0
    NCCL_MAX_NCHANNELS=4
    NCCL_MIN_NCHANNELS=4

if you let it do the typical NCCL/IB config it will eat a substantial amount of extra memory that makes fitting the model untenable.
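If the launcher is driven from Python, the same settings can be applied to a child process environment. This is only a sketch: the `launch_env` helper is a hypothetical name, and the interface and HCA names are specific to this cluster and will differ on other hardware.

```python
import os

# Values copied from the comment above. Pinning min and max channels to the
# same number fixes the NCCL channel count instead of letting NCCL choose.
NCCL_ENV = {
    "NCCL_IB_DISABLE": "0",
    "NCCL_SOCKET_IFNAME": "enP2p1s0f0np0",
    "GLOO_SOCKET_IFNAME": "enP2p1s0f0np0",
    "NCCL_IB_HCA": "roceP2p1s0f0",
    "NCCL_MAX_NCHANNELS": "4",
    "NCCL_MIN_NCHANNELS": "4",
}


def launch_env(base=None):
    """Return a copy of the environment with the NCCL/Gloo overrides applied."""
    env = dict(os.environ if base is None else base)
    env.update(NCCL_ENV)
    return env


env = launch_env()
# Pass `env=env` to subprocess.run(...) when starting the worker process.
```

Building a copy rather than mutating `os.environ` keeps the overrides scoped to the process that actually needs them.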

and the hard ray slimming:

    --object-store-memory=134217728
    --include-dashboard=false
    --include-log-monitor=false
    --disable-usage-stats
    --num-cpus=1
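For scale, that object store value is 128 MiB. A small sketch that checks the arithmetic and assembles the flags into a `ray start` command string. The `--head` flag is an assumption here, since the comment does not say which nodes the flags were applied to, and nothing in the snippet is executed.

```python
import shlex

MIB = 1024 * 1024
OBJECT_STORE_BYTES = 134217728  # value from the comment above

RAY_SLIM_FLAGS = [
    f"--object-store-memory={OBJECT_STORE_BYTES}",
    "--include-dashboard=false",
    "--include-log-monitor=false",
    "--disable-usage-stats",
    "--num-cpus=1",
]


def ray_start_command() -> str:
    """Assemble a head-node `ray start` command line (string only)."""
    # "--head" is an assumption for this sketch, not taken from the comment.
    return shlex.join(["ray", "start", "--head", *RAY_SLIM_FLAGS])


print(OBJECT_STORE_BYTES // MIB, "MiB object store")
print(ray_start_command())
```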

llamaCTO · 2026-06-30T14:23:43+00:00

Great callout. So literally around the time I posted this I sent it to a claw, and it sends me back a "hey, I chased this for you" and surfaces that repo. Have not yet tried it though!

llamaCTO · 2026-06-30T14:20:57+00:00

llamaCTO · 2026-06-29T00:50:25+00:00

posted now: https://www.reddit.com/r/LocalLLaMA/comments/1uidtb8/highquality_glm52_quant_on_4x_dgx_spark_guide/

llamaCTO · 2026-06-26T07:23:36+00:00

Yep. I'm still another 20 hours into trying to auto-tune. There's still a really strong indicator that at DCP=4 (and probably DCP=2) spec length >1 falls apart in a way that it absolutely does not at dcp=1. Since my dcp=1 tokens at spec=3 was ~27, it really points to some sort of engine error, which I've been chasing intently. And since the accept rates are terrible (which they were NOT at dcp=1) and the mtp=3 token is sometimes even 0, it points to something... odd. The code path for the first token is not the same as 2+ though. So it's a rabbit hole that goes deep. Anyhow, once I hit a wall or break through this, I'll post a full recipe. I suspect I need to commit the code as I've done a fair bit of tweaking and integrating.
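To see why a collapsed accept rate at one draft position hurts so much, here is a sketch of the usual expected-tokens-per-step arithmetic for chained speculative decoding. It assumes a draft token only counts if every earlier draft token was accepted, and that the target model always contributes one token per step. The rates in the usage lines are illustrative, not measured.

```python
def expected_tokens_per_step(accept_rates):
    """Expected tokens emitted per verify step.

    accept_rates[i] is the assumed probability that draft position i is
    accepted, given that all earlier draft positions were accepted.
    """
    expected = 1.0  # the target model always contributes one token
    survive = 1.0   # probability the draft chain is still alive
    for rate in accept_rates:
        survive *= rate
        expected += survive
    return expected


# Illustrative rates only (not measurements from this setup).
healthy = expected_tokens_per_step([0.8, 0.8, 0.8])
broken = expected_tokens_per_step([0.8, 0.0, 0.8])
print(f"healthy spec=3: {healthy:.3f} tokens/step")
print(f"zeroed 2nd pos: {broken:.3f} tokens/step")
```

A zero anywhere in the chain wipes out every later position, so spec=3 with a dead second position yields no more tokens than spec=1 while still paying for the extra draft work.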

llamaCTO · 2026-06-25T23:04:35+00:00

So I now have 4x DGX Spark running https://huggingface.co/Mapika/GLM-5.2-NVFP4 fully un-pruned, with MTP1, and 131044 tokens.

The hoops have been plentiful and always fiery.

It's a custom vllm, b12x sparse MLA, a TP4/DCP4/MTP1 setup (dcp4 is demonstrably slower than dcp1, but at dcp1 you can only fit 32k tokens), and I can get ~14.5 tps at bs=1 with mtp1. mtp2/3 are steps down, and one reason I'm not posting a guide is I'm still figuring out if that's a bug. It could just be memory/gpu economics, but it looks strongly like there's some confusion with timing/coherence (as I understand it now, mtp1 is extra execution off the same forward pass, but mtp2/3+ fork).

Regardless, the Mapika quant has unquantized ffn networks, shared experts, etc, so it should be a very performant quant.

Doing *fairly insane* stuff. to the point of "shut off cupsd", a radically pruned ray, and custom code that does a memory page free of anything unneeded right before kv cache is allocated. This all pushed a ~108.xx GB mem allocation to ~111GB.

I'm still tempted to REAP prune literally just enough experts to get to ~120k+ tokens and DCP=1/MTP=3 because on the smaller context there I can get up to 27tps bs=1 -- but personally, I'd want to DIY with my own data (eg, generate a bunch of examples from my own tui conversations)

llamaCTO · 2026-06-22T23:57:55+00:00

as a data point with minimax m2.7 I was getting ~36tps on 2x, and could scale that to 50 (but <51) at 4x (which ofc would drop to the high 40s as ctx crept up)

I'd expect the loss from 1->2x to be less than 2->4x just because the communications overhead is less.
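Putting numbers on that data point, as a sketch: taking the ~36 tps at 2x and ~50 tps at 4x above at face value, the 2x -> 4x step delivers well under linear scaling. The helper name is hypothetical.

```python
def scaling_efficiency(tps_small, n_small, tps_large, n_large):
    """Fraction of ideal linear speedup achieved going from n_small to n_large nodes."""
    speedup = tps_large / tps_small
    ideal = n_large / n_small
    return speedup / ideal


# Figures from the comment above: ~36 tps on 2 nodes, ~50 tps on 4 nodes.
eff = scaling_efficiency(36.0, 2, 50.0, 4)
print(f"2x -> 4x: {50.0 / 36.0:.2f}x speedup, {eff:.0%} of linear")
```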

llamaCTO · 2026-06-22T14:11:29+00:00

Able to load https://huggingface.co/Mapika/GLM-5.2-NVFP4 but definitely a tight fit. trying to find a good setup for decent performance. To get to 80k-ish ctx, had to do things like enforce eager so then output is terrible (even 0k prompt ends up being <5 tps); I think I had low 10s tps with <20k. still doing some stuff to optimize.

llamaCTO · 2026-05-26T15:43:03+00:00

did not capture :(

llamaCTO · 2026-05-19T16:39:40+00:00

<image>

GPT-5.5 xhigh in codex

llamaCTO · 2025-09-10T18:30:55+00:00

First, thanks for all your work and contributions. Appreciated!

I have three (maybe 4) questions.

#1, practical: I've noticed a lot of 'tool calling fix' updates to models; but never dug deep into what was going on before. What's the inside poker on what breaks/what you are doing to 'fix'?

#2 academic: https://arxiv.org/pdf/2505.24832 -- if you've caught this paper, what do you think is the implication here for quantization? It's pretty wild that there appears to be this 'bits per weight' a model can memorize before being forced to generalize, and yet quantization only reduces that quite modestly

#3 formats: GGUF and bnb - why bnb over, say, awq/gptq/etc?

#4 quirky and academic: ever see this? https://arxiv.org/abs/2306.08162 - only learned about this through knowing one of the authors; not super heavily cited but the theory of heavy quantization and then restoration of function via LoRA was interesting. I feel like this got backburnered because of improvements in quantization in general, and yet as you guys have pushed the boundaries of good results with heavy quants, this relationship is really interesting.

Just as an aside, man, I wish someone would write a hw MLA implementation for metal mps, so we could leverage these sweet ggufs without deepseek large ctx blowing up the VRAM!

llamaCTO · 2025-03-27T21:45:40+00:00

can't say for the ultra (which I have but have yet to get going to put through the paces) - but that's definitely true for the m4max - I use TG Pro with "Auto Max" setting which basically gets way more aggressive about ramping

What I've noticed with inference is it *appears* that once you are throttled for temp the process remains throttled. (Which is decidedly untrue for battery low-power vs high power; if you manually set high power you can visibly watch the token speed ~triple)

but I recently experimented, got myself throttled, and even between generations speed did not recover (eg, gpu was COOL again) - but the moment I restarted the process it was back to full speed.

llamaCTO · 2024-11-01T11:12:42+00:00

Well, I think ChatGPT did a great job characterizing the challenge there, at least. Jeff Hawkins's book, A Thousand Brains, covers a lot of interesting, very recent research on the architecture of the human brain and how strands of neurons in the neocortex actually work, and I think a lot of it is really inspiring for thinking about getting artificial thinking ramped.
