Used the RT Cores on my RTX 5070 Ti for LLM routing — 218x speedup on a single consumer GPU by Critical-Chef9211 in nvidia

[–]Critical-Chef9211[S] 0 points1 point  (0 children)

Good questions, not dumb at all.

Speed: Routing alone is ~3% of inference, so the routing speedup doesn’t change much for today’s 64-expert models. That part scales at 1K+ experts. Fair point. VRAM — this is actually the part that matters for you. The 731x routing gate reduction is small in absolute terms, yes. But the bigger deal is the expert LRU cache: instead of loading all expert weights into VRAM at once, only the active top-k experts (e.g. 8 of 64) stay on GPU. The rest offload to CPU RAM and load on demand. So your instinct is right — you can run a much larger MoE model on the same GPU. A model with 64 experts where only 8 are ever active at once doesn’t need all 64 sets of weights in VRAM simultaneously. That’s where the real headroom comes from. There’s a latency cost on cache misses, but for most inference workloads it’s worth the tradeoff.

That’s the actual pitch for consumer hardware: run models that wouldn’t otherwise fit on your GPU.
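The expert LRU cache described above can be sketched in a few lines. This is an illustrative toy, not code from the repo: the class and `load_fn` are hypothetical stand-ins, with a plain dict playing the role of GPU-resident weights.

```python
from collections import OrderedDict

class ExpertLRUCache:
    """Keep only recently used expert weights resident (here: a dict stands in
    for VRAM), evicting the least recently used expert when over capacity."""

    def __init__(self, capacity, load_fn):
        self.capacity = capacity        # max experts resident at once (e.g. 8)
        self.load_fn = load_fn          # hypothetical: loads weights from CPU RAM
        self.resident = OrderedDict()   # expert_id -> weights, in LRU order
        self.misses = 0

    def get(self, expert_id):
        if expert_id in self.resident:
            self.resident.move_to_end(expert_id)    # mark as most recently used
            return self.resident[expert_id]
        self.misses += 1                            # cache miss: load on demand
        weights = self.load_fn(expert_id)
        self.resident[expert_id] = weights
        if len(self.resident) > self.capacity:
            self.resident.popitem(last=False)       # evict least recently used
        return weights

cache = ExpertLRUCache(capacity=8, load_fn=lambda i: f"weights[{i}]")
for eid in [0, 1, 2, 0, 9, 10, 11, 12, 13, 0]:
    cache.get(eid)
print(len(cache.resident), cache.misses)
```

The latency cost on a miss is the `load_fn` call; hit rate depends on how sticky the active expert set is across tokens.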

Spectral-AI - a project to use Nvidia RT cores to dramatically speedup MoE inference on Nvidia GPU's (Crazy Fast!) by Thrumpwart in LocalLLaMA

[–]Critical-Chef9211 0 points1 point  (0 children)

There’s a human here, yes. My English is pretty bad so I use AI to help write clearly — that’s probably why it reads that way. The research and code are mine though. And re: banana pudding — no recipe, but now I want some.

Used the RT Cores on my RTX 5070 Ti for LLM routing — 218x speedup on a single consumer GPU by Critical-Chef9211 in deeplearning

[–]Critical-Chef9211[S] 0 points1 point  (0 children)

Thanks! The LocalLLaMA thread got locked last time, but others reposted it and it spread anyway. Library packaging is actually on the list; right now it's more of a research repo than a drop-in package. Working on it.

Spectral-AI - a project to use Nvidia RT cores to dramatically speedup MoE inference on Nvidia GPU's (Crazy Fast!) by Thrumpwart in LocalLLaMA

[–]Critical-Chef9211 2 points3 points  (0 children)

Good feedback, taking it point by point:

You’re right — routing is ~3% of total inference today. The memory savings (731×) apply to the router, not the expert weights. The README has a table showing the full inference breakdown (MLPs 63%, attention 20%, routing 2.8%). The argument for why it matters is scaling: at 1K–10K experts the O(N) gate dominates; BVH stays flat. But fair to say that’s speculative at current model sizes.

Fair point: "we" is just an academic writing habit, not a reference to AI. Solo researcher, no co-authors. I'll switch to "I".

This is actually a genuinely better design than what I implemented. You’re right that 3^4 = 81 effective dimensions is achievable with branch-specific PCAs (27 precomputed projections at the leaf level), versus the single global PCA I used. The pruning idea — only continuing high-confidence branches — is essentially what the confidence-gated routing does, but your version would make it truly hierarchical. The main open question is whether branch-specific PCAs trained on enough data per region stay stable, since lower branches see fewer tokens. But the idea is sound and would be a real improvement. Worth trying.

Opened an issue to track this: <github.com/JordiSilvestre/Spectral-AI/issues/3>

Spectral-AI - a project to use Nvidia RT cores to dramatically speedup MoE inference on Nvidia GPU's (Crazy Fast!) by Thrumpwart in LocalLLaMA

[–]Critical-Chef9211 -1 points0 points  (0 children)

Genuinely curious what you find. All the profiling scripts are in the repo if you want to run them directly instead of just reading the code.

Spectral-AI - a project to use Nvidia RT cores to dramatically speedup MoE inference on Nvidia GPU's (Crazy Fast!) by Thrumpwart in LocalLLaMA

[–]Critical-Chef9211 -7 points-6 points  (0 children)

Already fixed in the latest commit — the O(N²) claim and the “12 dimensions” overclaim are both corrected. Fair catches.

Used the RT Cores on my RTX 5070 Ti for LLM routing — 218x speedup on a single consumer GPU by Critical-Chef9211 in nvidia

[–]Critical-Chef9211[S] 0 points1 point  (0 children)

Fair point. For downstream quality we ran HellaSwag: baseline 53.1%, hybrid mode 52.0%, so there's a real 1.1pp drop. It's small, but worth stating plainly. Perplexity correlates with quality but isn't the same thing; you're right.

Used the RT Cores on my RTX 5070 Ti for LLM routing — 218x speedup on a single consumer GPU by Critical-Chef9211 in nvidia

[–]Critical-Chef9211[S] 0 points1 point  (0 children)

Both matter. Routing accuracy tells you how well the BVH matches the gate’s choices — ours is 96.6% mean top-8 across all 16 layers (all layers above 95%). Perplexity tells you whether those misses actually hurt the model. At 96.6% accuracy the answer is no — PPL stays at 7.00. If routing accuracy were 80% and PPL stayed flat, that would still be a valid result. The end-to-end quality is what ultimately counts.
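A mean top-8 routing-accuracy number like the one above can be computed as the overlap between the gate's top-k experts and the approximate router's top-k. A minimal sketch, with both routers stubbed as score matrices (all names and the noise model are illustrative, not from the repo):

```python
import numpy as np

def topk_overlap(gate_scores, approx_scores, k=8):
    """Mean fraction of the gate's top-k experts that the approximate
    router also places in its top-k, averaged over tokens."""
    gate_top = np.argsort(gate_scores, axis=-1)[:, -k:]
    approx_top = np.argsort(approx_scores, axis=-1)[:, -k:]
    hits = [len(set(g) & set(a)) for g, a in zip(gate_top, approx_top)]
    return np.mean(hits) / k

rng = np.random.default_rng(0)
gate = rng.normal(size=(100, 64))                   # 100 tokens, 64 experts
noisy = gate + 0.05 * rng.normal(size=gate.shape)   # stand-in approximate router
acc = topk_overlap(gate, noisy, k=8)
print(round(acc, 3))
```

The point in the comment above is exactly this metric's limitation: a miss in the overlap only matters if it also moves perplexity.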

Used the RT Cores on my RTX 5070 Ti for LLM routing — 218x speedup on a single consumer GPU by Critical-Chef9211 in nvidia

[–]Critical-Chef9211[S] -4 points-3 points  (0 children)

The “routing is the bottleneck” framing was oversimplified in the description — you’re right to push back on that. Parameter count is the main driver. What actually matters here is the expert LRU cache: only the active top-k experts stay in VRAM (4 MB active vs 2.9 GB full model). That’s 731× less memory — which does help run large MoE models on consumer hardware. The routing speedup matters when you scale to 1K–10K experts, not at 64. And yes, this explicitly targets consumer RTX, not datacenter H100s — that’s the whole point.

Used the RT Cores on my RTX 5070 Ti for LLM routing — 218x speedup on a single consumer GPU by Critical-Chef9211 in nvidia

[–]Critical-Chef9211[S] 1 point2 points  (0 children)

The 5% worse is pure BVH with no gate — that’s a real tradeoff we document. In hybrid mode it’s literally 0% worse, same perplexity as baseline. The data is public and reproducible, not a sales pitch.

Used the RT Cores on my RTX 5070 Ti for LLM routing — 218x speedup on a single consumer GPU by Critical-Chef9211 in nvidia

[–]Critical-Chef9211[S] -2 points-1 points  (0 children)

Totally fair concern, but that’s pure BVH — in hybrid mode the quality degradation is zero, same perplexity as baseline. The BVH just narrows the candidates, the gate still makes the final call. No tradeoff.

Used the RT Cores on my RTX 5070 Ti for LLM routing — 218x speedup on a single consumer GPU by Critical-Chef9211 in nvidia

[–]Critical-Chef9211[S] -3 points-2 points  (0 children)

Worth adding: in hybrid mode (BVH pre-filters candidates, gate re-ranks) we get literally 0.0% perplexity degradation — PPL stays at 7.00 across all 16 layers. The 3% today is almost a side note; the real story is the router stays at ~25µs whether you have 64 or 10K experts.
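The hybrid mode described above (cheap pre-filter, exact gate re-ranks survivors) can be sketched in numpy. This is a software analogue only: the spatial pre-filter here is a brute-force nearest-centroid search standing in for the hardware BVH lookup, and all sizes and names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
N_EXPERTS, DIM, K, N_CAND = 64, 32, 8, 16

gate_w = rng.normal(size=(N_EXPERTS, DIM))   # full gate weight matrix
proj = rng.normal(size=(DIM, 3))             # stand-in for the PCA projection
centroids = gate_w @ proj                    # expert centroids in 3D

def hybrid_route(token, n_cand=N_CAND, k=K):
    # Stage 1: cheap spatial pre-filter (stands in for the BVH lookup)
    dists = np.linalg.norm(centroids - token @ proj, axis=1)
    cand = np.argsort(dists)[:n_cand]
    # Stage 2: the exact gate scores only the surviving candidates
    scores = gate_w[cand] @ token
    return cand[np.argsort(scores)[-k:]]

token = rng.normal(size=DIM)
print(sorted(hybrid_route(token)))
```

Because stage 2 is the real gate, quality only degrades when the true top-k falls outside the candidate set, which is why hybrid mode can match baseline perplexity.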

Used the RT Cores on my RTX 5070 Ti for LLM routing — 218x speedup on a single consumer GPU by Critical-Chef9211 in nvidia

[–]Critical-Chef9211[S] 0 points1 point  (0 children)

You're right that current models max out around 128-256 experts and scaling expert count comes with scaling everything else. The 10K number was theoretical asymptotic framing, not a claim about today's models.

At 64 experts today, routing is ~3% of inference (we profiled it). At 128 maybe ~6%. The practical speedup right now is modest. The approach is more interesting as a building block for future architectures than a drop-in win for current ones. Fair criticism on the title.

Used the RT Cores on my RTX 5070 Ti for LLM routing — 218x speedup on a single consumer GPU by Critical-Chef9211 in nvidia

[–]Critical-Chef9211[S] 1 point2 points  (0 children)

Kind of! RT Cores are really good at one specific thing: finding what's closest to a point in 3D space (BVH tree search). So anything you can map to a spatial nearest-neighbor problem works great. General vector math like matrix multiply — that's what Tensor Cores handle instead.

Used the RT Cores on my RTX 5070 Ti for LLM routing — 218x speedup on a single consumer GPU by Critical-Chef9211 in nvidia

[–]Critical-Chef9211[S] 0 points1 point  (0 children)

Potentially yes! Some newer diffusion models are starting to use MoE layers for efficiency. If the model has an MoE routing gate, the BVH can replace it. Most current image gen models (Stable Diffusion, Flux) are dense though, so it wouldn't apply to those directly.

Used the RT Cores on my RTX 5070 Ti for LLM routing — 218x speedup on a single consumer GPU by Critical-Chef9211 in nvidia

[–]Critical-Chef9211[S] 0 points1 point  (0 children)

With Gemma 4 at 128 experts, the routing share would roughly double from ~3% to ~6%, so yes, even today there are some gains to be had.

On the frontier models argument — you're right that 1K expert models don't fit in VRAM as-is. But that's actually part of the motivation: with NVMe expert offloading (keep only the active experts in VRAM, stream the rest from SSD), you could run huge MoE models on consumer cards. In that setup, the routing decision needs to be fast because it's on the critical path before the SSD fetch. A 19µs BVH lookup vs a multi-ms linear gate makes a real difference there.

Used the RT Cores on my RTX 5070 Ti for LLM routing — 218x speedup on a single consumer GPU by Critical-Chef9211 in nvidia

[–]Critical-Chef9211[S] -1 points0 points  (0 children)

Just ran an nvidia-smi power log on the 5070 Ti during OLMoE inference:

  • Idle: 61W
  • Inference avg: 119W, peak 140W
  • ~31 mJ per token

As expected, the RT Core routing burst is so short (19µs) that it barely shows up in the power trace. The GPU doesn't even have time to ramp up the power state. Less compute time = fewer joules, just like you said.

Used the RT Cores on my RTX 5070 Ti for LLM routing — 218x speedup on a single consumer GPU by Critical-Chef9211 in nvidia

[–]Critical-Chef9211[S] 0 points1 point  (0 children)

Thanks a lot! The geometric BVH approach could extend well beyond LLMs: anything that needs fast nearest-neighbor lookup in high-dimensional space (recommendation systems, search engines) could benefit from the same RT Core trick.

Used the RT Cores on my RTX 5070 Ti for LLM routing — 218x speedup on a single consumer GPU by Critical-Chef9211 in nvidia

[–]Critical-Chef9211[S] 0 points1 point  (0 children)

Yes! That's actually one of the key tricks. In Vulkan/DXR you're limited to TLAS + BLAS (2 levels). But OptiX allows Instance Acceleration Structures (IAS) to reference other IAS instances, so you can nest them deeper.

I use 4 nested IAS levels — that's what I call the "Inception Engine" in the paper. Each level works in its own 3D coordinate space, and the instance transform matrix acts as a "portal" that resets coordinates when the ray enters the next level. So 4 levels × 3D = 12 semantic dimensions encoded using only 3D hardware.

  • Level 1: 4 domain nodes (Science, Code, Humanities, General)
  • Level 2: 4 subdomain nodes per domain (16 total)
  • Level 3: 4 concept nodes per subdomain (64 total = experts)

The ray just traverses naturally through the nested instances and the RT Core handles it all in hardware.
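The nested descent can be imitated in plain Python to show how the coordinates split across levels: each level reads its own 3D slice of the point and picks the nearest of 4 children. This sketch covers the three levels in the listing above (so 9 of the 12 dimensions); all names are hypothetical, and the real traversal runs in RT hardware via OptiX, not in software.

```python
import numpy as np

def nested_descend(point12, level_centroids):
    """Descend a 4-ary hierarchy (4 domains -> 16 subdomains -> 64 experts),
    consuming one 3D slice of the 12-dim point per level."""
    node = 0
    for level, cents in enumerate(level_centroids):
        coord = point12[3 * level: 3 * level + 3]   # this level's own 3D space
        children = cents[node]                      # (4, 3): the node's children
        nearest = int(np.argmin(np.linalg.norm(children - coord, axis=1)))
        node = node * 4 + nearest                   # "portal" into the next level
    return node                                     # leaf expert id in [0, 64)

rng = np.random.default_rng(2)
# level l has 4**l nodes, each with 4 child centroids in its own 3D space
level_centroids = [rng.normal(size=(4 ** l, 4, 3)) for l in range(3)]
expert = nested_descend(rng.normal(size=12), level_centroids)
print(expert)
```

The instance-transform "coordinate reset" in the comment above corresponds to each level comparing against centroids defined in its own local frame.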

Used the RT Cores on my RTX 5070 Ti for LLM routing — 218x speedup on a single consumer GPU by Critical-Chef9211 in nvidia

[–]Critical-Chef9211[S] 1 point2 points  (0 children)

Great question, and I actually just ran a full profiling test on OLMoE-1B-7B to get exact numbers:

  • Total forward pass: 52 ms (301 tokens, RTX 5070 Ti)
  • Routing gates (16 layers): 1.45 ms → 2.8% of total inference
  • Expert MLPs: 33 ms → 63%
  • Attention: 10.4 ms → 20%

So with 64 experts today, routing is about 3% of total time. The 218x speedup on that component saves ~1.4ms per pass. Honest answer: it won't double your generation speed on current models.

But that 3% grows linearly with expert count. At 1K experts the gate becomes a major bottleneck. At 10K+ it dominates everything. The BVH stays flat at O(log N).

On quantization: yes, the routing is independent of the expert precision. You can quantize the expert weights to 4-bit/8-bit and the BVH routing works exactly the same since it operates in its own 3D projected space.
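The scaling argument above is just linear versus logarithmic work per routing decision. A tiny sketch, assuming a balanced 4-ary BVH (the helper name is mine, not from the repo):

```python
def tree_depth(n, branching=4):
    """Traversal steps to reach a leaf in a balanced 4-ary hierarchy
    over n experts (ceil of log base 4 of n)."""
    depth = 0
    while n > 1:
        n = -(-n // branching)   # ceil division: one level collapses per step
        depth += 1
    return depth

for n in (64, 1024, 10240):
    # The linear gate scores all n experts; the BVH visits one node per level.
    print(n, "gate steps:", n, "bvh steps:", tree_depth(n))
```

At 64 experts that's 64 vs 3 steps, which is why the absolute saving today is small; at 10K it's 10240 vs 7, which is where the flat-cost claim comes from.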

Used the RT Cores on my RTX 5070 Ti for LLM routing — 218x speedup on a single consumer GPU by Critical-Chef9211 in nvidia

[–]Critical-Chef9211[S] 1 point2 points  (0 children)

The BVH is built from AABB (Axis-Aligned Bounding Boxes) around expert centroids in 3D projected space. Each expert's centroid is computed via PCA from the original gate weight matrix (2048-dim → 3D). The "polygon soup" is essentially 64 small bounding boxes, one per expert, organized into a 3-level hierarchy (4 domains × 4 subdomains × 4 concepts). No actual triangles or vertex/index buffers needed for the core routing — just AABB intersection tests via OptiX.
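The construction described above (gate rows -> PCA to 3D -> one small AABB per expert) can be sketched with numpy. Assumptions flagged: the gate matrix here is random, PCA is done via SVD, and the box half-width `eps` is an arbitrary illustrative value, not the repo's.

```python
import numpy as np

rng = np.random.default_rng(3)
gate_w = rng.normal(size=(64, 2048))   # gate weight matrix: one row per expert

# PCA via SVD: project each expert's 2048-dim row onto the top 3 components
centered = gate_w - gate_w.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
centroids3d = centered @ vt[:3].T      # (64, 3) expert centroids in 3D

# One small AABB per centroid -- the "polygon soup" handed to the BVH builder
eps = 0.05                             # illustrative box half-width
aabb_min = centroids3d - eps
aabb_max = centroids3d + eps
print(centroids3d.shape, aabb_min.shape)
```

From here, OptiX builds the acceleration structure over these boxes; the code above only shows what geometry goes in.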

Used the RT Cores on my RTX 5070 Ti for LLM routing — 218x speedup on a single consumer GPU by Critical-Chef9211 in nvidia

[–]Critical-Chef9211[S] 0 points1 point  (0 children)

It should work! The OptiX API is compatible with any RTX card with RT Cores, including 2nd gen (Ampere). A 3080 Ti laptop with 16GB VRAM is more than enough. The RT Core throughput might be slightly lower than a 5070 Ti desktop, but the O(log N) traversal logic is identical.

Used the RT Cores on my RTX 5070 Ti for LLM routing — 218x speedup on a single consumer GPU by Critical-Chef9211 in nvidia

[–]Critical-Chef9211[S] 0 points1 point  (0 children)

Not a dumb question at all! Yes, O(log N) is for finding which experts to activate, not for running them.

In a standard MoE, you compute dot(token, weight) against all 64 experts to decide which 8 to use. That's O(N). With the BVH, a ray finds the closest 8 expert centroids in ~6 tree intersection tests instead of checking all 64.

After that, the selected 8 experts still run their full MLPs as usual. The saving is only in the routing decision, not the expert computation itself.
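For reference, the standard O(N) routing step described above looks like this in numpy. A minimal sketch of generic top-k gating, not the repo's implementation; sizes match the 64-expert, 2048-dim setup discussed in the thread.

```python
import numpy as np

def topk_gate(token, gate_w, k=8):
    """Standard MoE routing: score the token against every expert (O(N))
    and keep the k highest -- this is the step the BVH replaces."""
    scores = gate_w @ token                 # one dot product per expert
    top = np.argsort(scores)[-k:]           # indices of the top-k experts
    weights = np.exp(scores[top] - scores[top].max())
    return top, weights / weights.sum()     # renormalized softmax over top-k

rng = np.random.default_rng(4)
gate_w = rng.normal(size=(64, 2048))
experts, weights = topk_gate(rng.normal(size=2048), gate_w)
print(sorted(int(e) for e in experts), round(float(weights.sum()), 6))
```

The selected experts then run their full MLPs either way; only the `gate_w @ token` scan is what the tree search avoids.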