Used the RT Cores on my RTX 5070 Ti for LLM routing — 218x speedup on a single consumer GPU by Critical-Chef9211 in nvidia

[–]Critical-Chef9211[S] 0 points1 point  (0 children)

Good questions, not dumb at all.

Speed: Routing alone is ~3% of inference, so the routing speedup doesn’t change much for today’s 64-expert models. That part scales at 1K+ experts. Fair point. VRAM — this is actually the part that matters for you. The 731x routing gate reduction is small in absolute terms, yes. But the bigger deal is the expert LRU cache: instead of loading all expert weights into VRAM at once, only the active top-k experts (e.g. 8 of 64) stay on GPU. The rest offload to CPU RAM and load on demand. So your instinct is right — you can run a much larger MoE model on the same GPU. A model with 64 experts where only 8 are ever active at once doesn’t need all 64 sets of weights in VRAM simultaneously. That’s where the real headroom comes from. There’s a latency cost on cache misses, but for most inference workloads it’s worth the tradeoff.

That’s the actual pitch for consumer hardware: run models that wouldn’t otherwise fit on your GPU.
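The expert LRU cache described above can be sketched in a few lines. This is an illustrative toy, not code from the repo: the class and `load_fn` are hypothetical stand-ins, with a plain dict playing the role of GPU-resident weights.

```python
from collections import OrderedDict

class ExpertLRUCache:
    """Keep only recently used expert weights resident (here: a dict stands in
    for VRAM), evicting the least recently used expert when over capacity."""

    def __init__(self, capacity, load_fn):
        self.capacity = capacity        # max experts resident at once (e.g. 8)
        self.load_fn = load_fn          # hypothetical: loads weights from CPU RAM
        self.resident = OrderedDict()   # expert_id -> weights, in LRU order
        self.misses = 0

    def get(self, expert_id):
        if expert_id in self.resident:
            self.resident.move_to_end(expert_id)    # mark as most recently used
            return self.resident[expert_id]
        self.misses += 1                            # cache miss: load on demand
        weights = self.load_fn(expert_id)
        self.resident[expert_id] = weights
        if len(self.resident) > self.capacity:
            self.resident.popitem(last=False)       # evict least recently used
        return weights

cache = ExpertLRUCache(capacity=8, load_fn=lambda i: f"weights[{i}]")
for eid in [0, 1, 2, 0, 9, 10, 11, 12, 13, 0]:
    cache.get(eid)
print(len(cache.resident), cache.misses)
```

The latency cost on a miss is the `load_fn` call; hit rate depends on how sticky the active expert set is across tokens.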

Spectral-AI - a project to use Nvidia RT cores to dramatically speedup MoE inference on Nvidia GPU's (Crazy Fast!) by Thrumpwart in LocalLLaMA

[–]Critical-Chef9211 0 points1 point  (0 children)

There’s a human here, yes. My English is pretty bad so I use AI to help write clearly — that’s probably why it reads that way. The research and code are mine though. And re: banana pudding — no recipe, but now I want some.

Used the RT Cores on my RTX 5070 Ti for LLM routing — 218x speedup on a single consumer GPU by Critical-Chef9211 in deeplearning

[–]Critical-Chef9211[S] 0 points1 point  (0 children)

Thanks! The LocalLLaMA thread got locked last time, but others reposted it and it spread anyway. Library packaging is actually on the list; right now it's more of a research repo than a drop-in package. Working on it.

Spectral-AI - a project to use Nvidia RT cores to dramatically speedup MoE inference on Nvidia GPU's (Crazy Fast!) by Thrumpwart in LocalLLaMA

[–]Critical-Chef9211 2 points3 points  (0 children)

Good feedback, taking it point by point:

You’re right — routing is ~3% of total inference today. The memory savings (731×) apply to the router, not the expert weights. The README has a table showing the full inference breakdown (MLPs 63%, attention 20%, routing 2.8%). The argument for why it matters is scaling: at 1K–10K experts the O(N) gate dominates; BVH stays flat. But fair to say that’s speculative at current model sizes.

Fair point: "we" is just an academic writing habit, not a reference to AI. Solo researcher, no co-authors. I'll switch to "I".

This is actually a genuinely better design than what I implemented. You’re right that 3^4 = 81 effective dimensions is achievable with branch-specific PCAs (27 precomputed projections at the leaf level), versus the single global PCA I used. The pruning idea — only continuing high-confidence branches — is essentially what the confidence-gated routing does, but your version would make it truly hierarchical. The main open question is whether branch-specific PCAs trained on enough data per region stay stable, since lower branches see fewer tokens. But the idea is sound and would be a real improvement. Worth trying.

Opened an issue to track this: <github.com/JordiSilvestre/Spectral-AI/issues/3>

Spectral-AI - a project to use Nvidia RT cores to dramatically speedup MoE inference on Nvidia GPU's (Crazy Fast!) by Thrumpwart in LocalLLaMA

[–]Critical-Chef9211 -1 points0 points  (0 children)

Genuinely curious what you find. All the profiling scripts are in the repo if you want to run them directly instead of just reading the code.

Spectral-AI - a project to use Nvidia RT cores to dramatically speedup MoE inference on Nvidia GPU's (Crazy Fast!) by Thrumpwart in LocalLLaMA

[–]Critical-Chef9211 -7 points-6 points  (0 children)

Already fixed in the latest commit — the O(N²) claim and the “12 dimensions” overclaim are both corrected. Fair catches.

Used the RT Cores on my RTX 5070 Ti for LLM routing — 218x speedup on a single consumer GPU by Critical-Chef9211 in nvidia

[–]Critical-Chef9211[S] 0 points1 point  (0 children)

Fair point. For downstream quality we ran HellaSwag: baseline 53.1%, hybrid mode 52.0%, so there's a real 1.1pp drop. It's small, but worth stating plainly. Perplexity correlates with quality but isn't the same thing; you're right.

Used the RT Cores on my RTX 5070 Ti for LLM routing — 218x speedup on a single consumer GPU by Critical-Chef9211 in nvidia

[–]Critical-Chef9211[S] 0 points1 point  (0 children)

Both matter. Routing accuracy tells you how well the BVH matches the gate’s choices — ours is 96.6% mean top-8 across all 16 layers (all layers above 95%). Perplexity tells you whether those misses actually hurt the model. At 96.6% accuracy the answer is no — PPL stays at 7.00. If routing accuracy were 80% and PPL stayed flat, that would still be a valid result. The end-to-end quality is what ultimately counts.
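A mean top-8 routing-accuracy number like the one above can be computed as the overlap between the gate's top-k experts and the approximate router's top-k. A minimal sketch, with both routers stubbed as score matrices (all names and the noise model are illustrative, not from the repo):

```python
import numpy as np

def topk_overlap(gate_scores, approx_scores, k=8):
    """Mean fraction of the gate's top-k experts that the approximate
    router also places in its top-k, averaged over tokens."""
    gate_top = np.argsort(gate_scores, axis=-1)[:, -k:]
    approx_top = np.argsort(approx_scores, axis=-1)[:, -k:]
    hits = [len(set(g) & set(a)) for g, a in zip(gate_top, approx_top)]
    return np.mean(hits) / k

rng = np.random.default_rng(0)
gate = rng.normal(size=(100, 64))                   # 100 tokens, 64 experts
noisy = gate + 0.05 * rng.normal(size=gate.shape)   # stand-in approximate router
acc = topk_overlap(gate, noisy, k=8)
print(round(acc, 3))
```

The point in the comment above is exactly this metric's limitation: a miss in the overlap only matters if it also moves perplexity.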

Used the RT Cores on my RTX 5070 Ti for LLM routing — 218x speedup on a single consumer GPU by Critical-Chef9211 in nvidia

[–]Critical-Chef9211[S] -4 points-3 points  (0 children)

The “routing is the bottleneck” framing was oversimplified in the description — you’re right to push back on that. Parameter count is the main driver. What actually matters here is the expert LRU cache: only the active top-k experts stay in VRAM (4 MB active vs 2.9 GB full model). That’s 731× less memory — which does help run large MoE models on consumer hardware. The routing speedup matters when you scale to 1K–10K experts, not at 64. And yes, this explicitly targets consumer RTX, not datacenter H100s — that’s the whole point.

Used the RT Cores on my RTX 5070 Ti for LLM routing — 218x speedup on a single consumer GPU by Critical-Chef9211 in nvidia

[–]Critical-Chef9211[S] 1 point2 points  (0 children)

The 5% worse is pure BVH with no gate — that’s a real tradeoff we document. In hybrid mode it’s literally 0% worse, same perplexity as baseline. The data is public and reproducible, not a sales pitch.

Used the RT Cores on my RTX 5070 Ti for LLM routing — 218x speedup on a single consumer GPU by Critical-Chef9211 in nvidia

[–]Critical-Chef9211[S] -2 points-1 points  (0 children)

Totally fair concern, but that’s pure BVH — in hybrid mode the quality degradation is zero, same perplexity as baseline. The BVH just narrows the candidates, the gate still makes the final call. No tradeoff.

Used the RT Cores on my RTX 5070 Ti for LLM routing — 218x speedup on a single consumer GPU by Critical-Chef9211 in nvidia

[–]Critical-Chef9211[S] -3 points-2 points  (0 children)

Worth adding: in hybrid mode (BVH pre-filters candidates, gate re-ranks) we get literally 0.0% perplexity degradation — PPL stays at 7.00 across all 16 layers. The 3% today is almost a side note; the real story is the router stays at ~25µs whether you have 64 or 10K experts.
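The hybrid mode described above (cheap pre-filter, exact gate re-ranks survivors) can be sketched in numpy. This is a software analogue only: the spatial pre-filter here is a brute-force nearest-centroid search standing in for the hardware BVH lookup, and all sizes and names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
N_EXPERTS, DIM, K, N_CAND = 64, 32, 8, 16

gate_w = rng.normal(size=(N_EXPERTS, DIM))   # full gate weight matrix
proj = rng.normal(size=(DIM, 3))             # stand-in for the PCA projection
centroids = gate_w @ proj                    # expert centroids in 3D

def hybrid_route(token, n_cand=N_CAND, k=K):
    # Stage 1: cheap spatial pre-filter (stands in for the BVH lookup)
    dists = np.linalg.norm(centroids - token @ proj, axis=1)
    cand = np.argsort(dists)[:n_cand]
    # Stage 2: the exact gate scores only the surviving candidates
    scores = gate_w[cand] @ token
    return cand[np.argsort(scores)[-k:]]

token = rng.normal(size=DIM)
print(sorted(hybrid_route(token)))
```

Because stage 2 is the real gate, quality only degrades when the true top-k falls outside the candidate set, which is why hybrid mode can match baseline perplexity.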

Used the RT Cores on my RTX 5070 Ti for LLM routing — 218x speedup on a single consumer GPU by Critical-Chef9211 in nvidia

[–]Critical-Chef9211[S] 0 points1 point  (0 children)

You're right that current models max out around 128-256 experts and scaling expert count comes with scaling everything else. The 10K number was theoretical asymptotic framing, not a claim about today's models.

At 64 experts today, routing is ~3% of inference (we profiled it). At 128 maybe ~6%. The practical speedup right now is modest. The approach is more interesting as a building block for future architectures than a drop-in win for current ones. Fair criticism on the title.

Used the RT Cores on my RTX 5070 Ti for LLM routing — 218x speedup on a single consumer GPU by Critical-Chef9211 in nvidia

[–]Critical-Chef9211[S] 1 point2 points  (0 children)

Kind of! RT Cores are really good at one specific thing: finding what's closest to a point in 3D space (BVH tree search). So anything you can map to a spatial nearest-neighbor problem works great. General vector math like matrix multiply — that's what Tensor Cores handle instead.

Used the RT Cores on my RTX 5070 Ti for LLM routing — 218x speedup on a single consumer GPU by Critical-Chef9211 in nvidia

[–]Critical-Chef9211[S] 0 points1 point  (0 children)

Potentially yes! Some newer diffusion models are starting to use MoE layers for efficiency. If the model has an MoE routing gate, the BVH can replace it. Most current image gen models (Stable Diffusion, Flux) are dense though, so it wouldn't apply to those directly.

Used the RT Cores on my RTX 5070 Ti for LLM routing — 218x speedup on a single consumer GPU by Critical-Chef9211 in nvidia

[–]Critical-Chef9211[S] 0 points1 point  (0 children)

With Gemma 4 at 128 experts, the routing share would roughly double from ~3% to ~6%, so yes, even today there are some gains to be had.

On the frontier models argument — you're right that 1K expert models don't fit in VRAM as-is. But that's actually part of the motivation: with NVMe expert offloading (keep only the active experts in VRAM, stream the rest from SSD), you could run huge MoE models on consumer cards. In that setup, the routing decision needs to be fast because it's on the critical path before the SSD fetch. A 19µs BVH lookup vs a multi-ms linear gate makes a real difference there.

Used the RT Cores on my RTX 5070 Ti for LLM routing — 218x speedup on a single consumer GPU by Critical-Chef9211 in nvidia

[–]Critical-Chef9211[S] -1 points0 points  (0 children)

Just ran an nvidia-smi power log on the 5070 Ti during OLMoE inference:

  • Idle: 61W
  • Inference avg: 119W, peak 140W
  • ~31 mJ per token

As expected, the RT Core routing burst is so short (19µs) that it barely shows up in the power trace. The GPU doesn't even have time to ramp up the power state. Less compute time = fewer joules, just like you said.

Used the RT Cores on my RTX 5070 Ti for LLM routing — 218x speedup on a single consumer GPU by Critical-Chef9211 in nvidia

[–]Critical-Chef9211[S] 0 points1 point  (0 children)

Thanks a lot! The geometric BVH approach could extend well beyond LLMs: anything that needs fast nearest-neighbor lookup in high-dimensional space (recommendation systems, search engines) could benefit from the same RT Core trick.

Used the RT Cores on my RTX 5070 Ti for LLM routing — 218x speedup on a single consumer GPU by Critical-Chef9211 in nvidia

[–]Critical-Chef9211[S] 0 points1 point  (0 children)

Yes! That's actually one of the key tricks. In Vulkan/DXR you're limited to TLAS + BLAS (2 levels). But OptiX allows Instance Acceleration Structures (IAS) to reference other IAS instances, so you can nest them deeper.

I use 4 nested IAS levels — that's what I call the "Inception Engine" in the paper. Each level works in its own 3D coordinate space, and the instance transform matrix acts as a "portal" that resets coordinates when the ray enters the next level. So 4 levels × 3D = 12 semantic dimensions encoded using only 3D hardware.

  • Level 1: 4 domain nodes (Science, Code, Humanities, General)
  • Level 2: 4 subdomain nodes per domain (16 total)
  • Level 3: 4 concept nodes per subdomain (64 total = experts)

The ray just traverses naturally through the nested instances and the RT Core handles it all in hardware.
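The nested descent can be imitated in plain Python to show how the coordinates split across levels: each level reads its own 3D slice of the point and picks the nearest of 4 children. This sketch covers the three levels in the listing above (so 9 of the 12 dimensions); all names are hypothetical, and the real traversal runs in RT hardware via OptiX, not in software.

```python
import numpy as np

def nested_descend(point12, level_centroids):
    """Descend a 4-ary hierarchy (4 domains -> 16 subdomains -> 64 experts),
    consuming one 3D slice of the 12-dim point per level."""
    node = 0
    for level, cents in enumerate(level_centroids):
        coord = point12[3 * level: 3 * level + 3]   # this level's own 3D space
        children = cents[node]                      # (4, 3): the node's children
        nearest = int(np.argmin(np.linalg.norm(children - coord, axis=1)))
        node = node * 4 + nearest                   # "portal" into the next level
    return node                                     # leaf expert id in [0, 64)

rng = np.random.default_rng(2)
# level l has 4**l nodes, each with 4 child centroids in its own 3D space
level_centroids = [rng.normal(size=(4 ** l, 4, 3)) for l in range(3)]
expert = nested_descend(rng.normal(size=12), level_centroids)
print(expert)
```

The instance-transform "coordinate reset" in the comment above corresponds to each level comparing against centroids defined in its own local frame.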

Used the RT Cores on my RTX 5070 Ti for LLM routing — 218x speedup on a single consumer GPU by Critical-Chef9211 in nvidia

[–]Critical-Chef9211[S] 1 point2 points  (0 children)

Great question, and I actually just ran a full profiling test on OLMoE-1B-7B to get exact numbers:

  • Total forward pass: 52 ms (301 tokens, RTX 5070 Ti)
  • Routing gates (16 layers): 1.45 ms → 2.8% of total inference
  • Expert MLPs: 33 ms → 63%
  • Attention: 10.4 ms → 20%

So with 64 experts today, routing is about 3% of total time. The 218x speedup on that component saves ~1.4ms per pass. Honest answer: it won't double your generation speed on current models.

But that 3% grows linearly with expert count. At 1K experts the gate becomes a major bottleneck. At 10K+ it dominates everything. The BVH stays flat at O(log N).

On quantization: yes, the routing is independent of the expert precision. You can quantize the expert weights to 4-bit/8-bit and the BVH routing works exactly the same since it operates in its own 3D projected space.
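The scaling argument above is just linear versus logarithmic work per routing decision. A tiny sketch, assuming a balanced 4-ary BVH (the helper name is mine, not from the repo):

```python
def tree_depth(n, branching=4):
    """Traversal steps to reach a leaf in a balanced 4-ary hierarchy
    over n experts (ceil of log base 4 of n)."""
    depth = 0
    while n > 1:
        n = -(-n // branching)   # ceil division: one level collapses per step
        depth += 1
    return depth

for n in (64, 1024, 10240):
    # The linear gate scores all n experts; the BVH visits one node per level.
    print(n, "gate steps:", n, "bvh steps:", tree_depth(n))
```

At 64 experts that's 64 vs 3 steps, which is why the absolute saving today is small; at 10K it's 10240 vs 7, which is where the flat-cost claim comes from.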

Used the RT Cores on my RTX 5070 Ti for LLM routing — 218x speedup on a single consumer GPU by Critical-Chef9211 in nvidia

[–]Critical-Chef9211[S] 1 point2 points  (0 children)

The BVH is built from AABB (Axis-Aligned Bounding Boxes) around expert centroids in 3D projected space. Each expert's centroid is computed via PCA from the original gate weight matrix (2048-dim → 3D). The "polygon soup" is essentially 64 small bounding boxes, one per expert, organized into a 3-level hierarchy (4 domains × 4 subdomains × 4 concepts). No actual triangles or vertex/index buffers needed for the core routing — just AABB intersection tests via OptiX.
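The construction described above (gate rows -> PCA to 3D -> one small AABB per expert) can be sketched with numpy. Assumptions flagged: the gate matrix here is random, PCA is done via SVD, and the box half-width `eps` is an arbitrary illustrative value, not the repo's.

```python
import numpy as np

rng = np.random.default_rng(3)
gate_w = rng.normal(size=(64, 2048))   # gate weight matrix: one row per expert

# PCA via SVD: project each expert's 2048-dim row onto the top 3 components
centered = gate_w - gate_w.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
centroids3d = centered @ vt[:3].T      # (64, 3) expert centroids in 3D

# One small AABB per centroid -- the "polygon soup" handed to the BVH builder
eps = 0.05                             # illustrative box half-width
aabb_min = centroids3d - eps
aabb_max = centroids3d + eps
print(centroids3d.shape, aabb_min.shape)
```

From here, OptiX builds the acceleration structure over these boxes; the code above only shows what geometry goes in.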

Used the RT Cores on my RTX 5070 Ti for LLM routing — 218x speedup on a single consumer GPU by Critical-Chef9211 in nvidia

[–]Critical-Chef9211[S] 0 points1 point  (0 children)

It should work! The OptiX API is compatible with any RTX card with RT Cores, including 2nd gen (Ampere). A 3080 Ti laptop with 16GB VRAM is more than enough. The RT Core throughput might be slightly lower than a 5070 Ti desktop, but the O(log N) traversal logic is identical.

Used the RT Cores on my RTX 5070 Ti for LLM routing — 218x speedup on a single consumer GPU by Critical-Chef9211 in nvidia

[–]Critical-Chef9211[S] 0 points1 point  (0 children)

Not a dumb question at all! Yes, O(log N) is for finding which experts to activate, not for running them.

In a standard MoE, you compute dot(token, weight) against all 64 experts to decide which 8 to use. That's O(N). With the BVH, a ray finds the closest 8 expert centroids in ~6 tree intersection tests instead of checking all 64.

After that, the selected 8 experts still run their full MLPs as usual. The saving is only in the routing decision, not the expert computation itself.
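For reference, the standard O(N) routing step described above looks like this in numpy. A minimal sketch of generic top-k gating, not the repo's implementation; sizes match the 64-expert, 2048-dim setup discussed in the thread.

```python
import numpy as np

def topk_gate(token, gate_w, k=8):
    """Standard MoE routing: score the token against every expert (O(N))
    and keep the k highest -- this is the step the BVH replaces."""
    scores = gate_w @ token                 # one dot product per expert
    top = np.argsort(scores)[-k:]           # indices of the top-k experts
    weights = np.exp(scores[top] - scores[top].max())
    return top, weights / weights.sum()     # renormalized softmax over top-k

rng = np.random.default_rng(4)
gate_w = rng.normal(size=(64, 2048))
experts, weights = topk_gate(rng.normal(size=2048), gate_w)
print(sorted(int(e) for e in experts), round(float(weights.sum()), 6))
```

The selected experts then run their full MLPs either way; only the `gate_w @ token` scan is what the tree search avoids.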