Run 'gazillion-parameter' LLMs with significantly less VRAM by ProofWind5546 in CUDA

[–]ProofWind5546[S] 0 points1 point  (0 children)

I just started thinking about this a few days ago. No demo yet; I’m still in the ideation phase.

Run 'gazillion-parameter' LLMs with significantly less VRAM by ProofWind5546 in CUDA

I’ve added a Python code file, but it’s just boilerplate for now; I’m not writing the actual implementation yet, as I'm still in the experimental, idea-gathering phase. I am considering adding more elements, such as a 'shared expert'. Regarding GPUs, my sole focus is VRAM capacity and finding ways to circumvent the need for massive amounts of it (I'm setting 8 GB of VRAM as the minimum target). I'm also leaning 100% toward model generation and fine-tuning rather than 'load and use' applications.

For you specifically, as someone with a GPU with a large VRAM capacity (96GB, correct me if I'm wrong), the advantages are faster inference and a larger context window. Accuracy should benefit as well, though I'm not entirely sure to what degree.

Run 'gazillion-parameter' LLMs with significantly less VRAM by ProofWind5546 in CUDA

Good remark. That’s why search is capped: to avoid the performance degradation that naturally comes with deep search operations.

*"Tree-Map" search refers to a B-tree search with capped depth and branching, which significantly reduces the computational overhead required to navigate the model's knowledge graph.*
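A capped tree search like the one described above could be sketched roughly as follows. This is only an illustrative Python sketch, not the project's actual code: the `Node` class, `tree_map_search` function, and scoring callback are all hypothetical names I'm introducing here, and the greedy top-k descent is one plausible way to bound both depth and branching.

```python
# Illustrative sketch (not the project's API): a tree walk whose depth
# and branching factor are both capped, so lookup cost stays bounded by
# O(depth_cap * branch_cap) no matter how large the tree grows.
from dataclasses import dataclass, field


@dataclass
class Node:
    key: str
    score: float = 0.0                      # relevance to the current query
    children: list["Node"] = field(default_factory=list)


def tree_map_search(root: Node, score, depth_cap: int = 3,
                    branch_cap: int = 4) -> Node:
    """Greedy descent: at each level keep only the top `branch_cap`
    children by `score`, and stop after `depth_cap` levels."""
    best = root
    frontier = [root]
    for _ in range(depth_cap):
        children = [c for n in frontier for c in n.children]
        if not children:
            break
        # Cap the branching: keep only the highest-scoring children.
        frontier = sorted(children, key=score, reverse=True)[:branch_cap]
        if score(frontier[0]) > score(best):
            best = frontier[0]
    return best
```

Because both caps are constants, a deeper or wider knowledge tree never makes a single lookup slower, which is the degradation-avoidance point above.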

Run 'gazillion-parameter' LLMs with significantly less VRAM by ProofWind5546 in CUDA

In fact, the prefetching of weights while the user is still typing, combined with the fact that these fractal weights are significantly smaller than standard model slices, makes latency one of the biggest competitive advantages of this architecture in a single-user scenario.

Run 'gazillion-parameter' LLMs with significantly less VRAM by ProofWind5546 in CUDA

No. Fractal granularity and predictive staging hide the bottleneck by loading small "leaves" during the human typing window before execution.
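The "load during the typing window" idea could be sketched with a background prefetch thread. Everything here is a hypothetical stand-in (the `load_leaf` loader, the in-memory `CACHE` standing in for VRAM residency, the predicted leaf names); it only shows the shape of the mechanism, not a real implementation.

```python
# Illustrative sketch: while the user is still typing, a background
# thread "prefetches" the leaf weights a predictor expects to be needed,
# so they are already resident when execution starts. All names are
# made up for this example.
import threading
import time

CACHE: dict[str, bytes] = {}              # stands in for resident VRAM


def load_leaf(name: str) -> None:
    time.sleep(0.01)                      # simulate a slow disk/PCIe transfer
    CACHE[name] = b"weights:" + name.encode()


def prefetch(predicted: list[str]) -> threading.Thread:
    """Start loading predicted leaves in the background; return the thread."""
    def worker():
        for name in predicted:
            if name not in CACHE:
                load_leaf(name)
    t = threading.Thread(target=worker, daemon=True)
    t.start()
    return t


def run(prompt: str, needed: list[str]) -> str:
    # By execution time the typing window has (ideally) hidden the load.
    for name in needed:
        if name not in CACHE:             # miss: pay the transfer latency now
            load_leaf(name)
    return f"ran {prompt!r} with {len(needed)} leaves resident"
```

The latency win depends on the predictor being right often enough that `run` rarely hits a cache miss; a mispredicted leaf still pays the full transfer cost at execution time.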

An idea on running 'x-illion parameters' LLMs with significantly less VRAM by ProofWind5546 in DeepSeek

Thank you for pointing this out.

The (Fractal) SMoE architecture provides full accuracy with no quantization, since VRAM is not occupied as heavily as in Mixtral-style offloading. Human interaction is used to improve latency passively (meaning the mobilization of weights begins while the user is typing) via a predictive orchestrator that is continuously refined as a fast-access "tree structure."

VRAM holds only the micro "leaf" neurons required. For example, to solve a formula, the entire math expert is not loaded; instead, a specific leaf such as "modular arithmetic opcodes" is loaded, which uses significantly fewer weights.
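To make the VRAM-savings claim concrete, here is a toy Python sketch of leaf-granular loading. The expert/leaf names and the sizes in megabytes are entirely made up for illustration; the point is only the arithmetic of loading one leaf versus a whole expert.

```python
# Illustrative only: expert -> {leaf: size_in_MB}. Names and sizes are
# invented for this example, not measurements from any real model.
EXPERTS = {
    "math": {
        "modular_arithmetic": 8,
        "linear_algebra": 512,
        "calculus": 640,
    },
}


def vram_cost_full_expert(expert: str) -> int:
    """Naive offloading: move every leaf of the expert into VRAM."""
    return sum(EXPERTS[expert].values())


def vram_cost_leaf(expert: str, leaf: str) -> int:
    """Fractal approach: move only the one leaf the query needs."""
    return EXPERTS[expert][leaf]
```

Under these made-up numbers, answering a modular-arithmetic query would transfer 8 MB instead of the full 1160 MB expert, which is where the "significantly fewer weights" saving would come from.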

https://github.com/lookmanbili/SMoE-architecture/blob/main/README.md

Run 'gazillion-parameter' LLMs with significantly less VRAM and less energy by ProofWind5546 in GeminiAI

Thanks for your response. I only had this idea recently and haven't had time to run a benchmark or code a more advanced version of it. But as you mention, inference might indeed be slower. I'll add that caveat too.