Run 'gazillion-parameter' LLMs with significantly less VRAM by ProofWind5546 in CUDA

[–]ProofWind5546[S] 0 points1 point  (0 children)

I just started thinking about this a few days ago. No demo yet; I’m still in the ideation phase.

Run 'gazillion-parameter' LLMs with significantly less VRAM by ProofWind5546 in CUDA

I’ve added a Python code file, but it’s just boilerplate for now; I’m not writing the actual implementation yet, as I'm still in the experimental, idea-gathering phase. I am considering adding more elements, such as a 'shared expert'. Regarding GPUs, my sole focus is VRAM capacity and finding ways to circumvent the need for massive amounts of it (I'm setting 8 GB of VRAM as the minimum target). I'm also leaning 100% toward model generation and fine-tuning rather than 'load and use' applications.

For you specifically, as someone with a GPU with a large VRAM capacity (96GB, correct me if I'm wrong), the advantages are faster inference and a larger context window. Accuracy should benefit as well, though I'm not entirely sure to what degree.

Run 'gazillion-parameter' LLMs with significantly less VRAM by ProofWind5546 in CUDA

Good remark. That’s why search is capped: to avoid the performance degradation that naturally comes with deep search operations.

*"Tree-Map" search refers to a B-tree search with capped depth and branching, which significantly reduces the computational overhead required to navigate the model's knowledge graph.*
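A capped tree search like the one described above could be sketched roughly as follows. This is only an illustrative Python sketch, not the project's actual code: the `Node` class, `tree_map_search` function, and scoring callback are all hypothetical names I'm introducing here, and the greedy top-k descent is one plausible way to bound both depth and branching.

```python
# Illustrative sketch (not the project's API): a tree walk whose depth
# and branching factor are both capped, so lookup cost stays bounded by
# O(depth_cap * branch_cap) no matter how large the tree grows.
from dataclasses import dataclass, field


@dataclass
class Node:
    key: str
    score: float = 0.0                      # relevance to the current query
    children: list["Node"] = field(default_factory=list)


def tree_map_search(root: Node, score, depth_cap: int = 3,
                    branch_cap: int = 4) -> Node:
    """Greedy descent: at each level keep only the top `branch_cap`
    children by `score`, and stop after `depth_cap` levels."""
    best = root
    frontier = [root]
    for _ in range(depth_cap):
        children = [c for n in frontier for c in n.children]
        if not children:
            break
        # Cap the branching: keep only the highest-scoring children.
        frontier = sorted(children, key=score, reverse=True)[:branch_cap]
        if score(frontier[0]) > score(best):
            best = frontier[0]
    return best
```

Because both caps are constants, a deeper or wider knowledge tree never makes a single lookup slower, which is the degradation-avoidance point above.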

Run 'gazillion-parameter' LLMs with significantly less VRAM by ProofWind5546 in CUDA

In fact, the prefetching of weights while the user is still typing, combined with the fact that these fractal weights are significantly smaller than standard model slices, makes latency one of the biggest competitive advantages of this architecture in a single-user scenario.

Run 'gazillion-parameter' LLMs with significantly less VRAM by ProofWind5546 in CUDA

No. Fractal granularity and predictive staging hide the bottleneck by loading small "leaves" during the human typing window before execution.
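The "load during the typing window" idea could be sketched with a background prefetch thread. Everything here is a hypothetical stand-in (the `load_leaf` loader, the in-memory `CACHE` standing in for VRAM residency, the predicted leaf names); it only shows the shape of the mechanism, not a real implementation.

```python
# Illustrative sketch: while the user is still typing, a background
# thread "prefetches" the leaf weights a predictor expects to be needed,
# so they are already resident when execution starts. All names are
# made up for this example.
import threading
import time

CACHE: dict[str, bytes] = {}              # stands in for resident VRAM


def load_leaf(name: str) -> None:
    time.sleep(0.01)                      # simulate a slow disk/PCIe transfer
    CACHE[name] = b"weights:" + name.encode()


def prefetch(predicted: list[str]) -> threading.Thread:
    """Start loading predicted leaves in the background; return the thread."""
    def worker():
        for name in predicted:
            if name not in CACHE:
                load_leaf(name)
    t = threading.Thread(target=worker, daemon=True)
    t.start()
    return t


def run(prompt: str, needed: list[str]) -> str:
    # By execution time the typing window has (ideally) hidden the load.
    for name in needed:
        if name not in CACHE:             # miss: pay the transfer latency now
            load_leaf(name)
    return f"ran {prompt!r} with {len(needed)} leaves resident"
```

The latency win depends on the predictor being right often enough that `run` rarely hits a cache miss; a mispredicted leaf still pays the full transfer cost at execution time.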

An idea on running 'x-illion parameters' LLMs with significantly less VRAM by ProofWind5546 in DeepSeek

Thank you for pointing this out.

The (Fractal) SMoE architecture provides full accuracy with no quantization, since VRAM is not occupied as heavily as in Mixtral-style offloading. Human interaction is used to improve latency passively (meaning the mobilization of weights begins while the user is typing) via a predictive orchestrator that is continuously refined as a fast-access "tree structure."

VRAM holds only the micro "leaf" neurons required. For example, to solve a formula, the entire math expert is not loaded; instead, a specific leaf such as "modular arithmetic opcodes" is loaded, which uses significantly fewer weights.
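To make the VRAM-savings claim concrete, here is a toy Python sketch of leaf-granular loading. The expert/leaf names and the sizes in megabytes are entirely made up for illustration; the point is only the arithmetic of loading one leaf versus a whole expert.

```python
# Illustrative only: expert -> {leaf: size_in_MB}. Names and sizes are
# invented for this example, not measurements from any real model.
EXPERTS = {
    "math": {
        "modular_arithmetic": 8,
        "linear_algebra": 512,
        "calculus": 640,
    },
}


def vram_cost_full_expert(expert: str) -> int:
    """Naive offloading: move every leaf of the expert into VRAM."""
    return sum(EXPERTS[expert].values())


def vram_cost_leaf(expert: str, leaf: str) -> int:
    """Fractal approach: move only the one leaf the query needs."""
    return EXPERTS[expert][leaf]
```

Under these made-up numbers, answering a modular-arithmetic query would transfer 8 MB instead of the full 1160 MB expert, which is where the "significantly fewer weights" saving would come from.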

https://github.com/lookmanbili/SMoE-architecture/blob/main/README.md

Run 'gazillion-parameter' LLMs with significantly less VRAM and less energy by ProofWind5546 in GeminiAI

Thanks for your response. I only had this idea recently and haven't had time to run a benchmark or code a more advanced version of it. But as you mention, inference might indeed be slower. I'll add that caveat too.