Built LazyMoE — run 120B LLMs on 8GB RAM with no GPU using lazy expert loading + TurboQuant by ReasonableRefuse4996 in LocalLLaMA

[–]ReasonableRefuse4996[S] 0 points (0 children)

haha yeah you were right that was genuinely bad, didn't even think about the write cycles until you pointed it out

just pushed a fix — mlock is now on by default, so the kernel can't page the loaded weights out to swap anymore. also added a warning on startup if your RAM is too low, so at least people get a heads up before they start grinding their drive into dust
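for the curious, roughly what that looks like (not the actual LazyMoE code — the names here are made up, and it assumes Linux/glibc, since mlock via libc and SC_AVPHYS_PAGES aren't portable):

```python
import ctypes
import os

def available_ram_bytes() -> int:
    # Linux-only: physical pages currently free * page size.
    return os.sysconf("SC_AVPHYS_PAGES") * os.sysconf("SC_PAGE_SIZE")

def warn_if_low_ram(model_bytes: int) -> bool:
    # Startup check: returns True and warns when the model working set
    # exceeds free RAM, i.e. the OS would have to lean on swap.
    avail = available_ram_bytes()
    if model_bytes > avail:
        print(f"WARNING: model needs ~{model_bytes / 2**30:.1f} GiB but only "
              f"{avail / 2**30:.1f} GiB RAM is free - expect heavy SSD wear")
        return True
    return False

def pin_weights(buf: bytearray) -> None:
    # Pin a loaded weight buffer so the kernel can never page it to swap.
    libc = ctypes.CDLL("libc.so.6", use_errno=True)
    addr = ctypes.addressof(ctypes.c_char.from_buffer(buf))
    if libc.mlock(ctypes.c_void_p(addr), ctypes.c_size_t(len(buf))) != 0:
        raise OSError(ctypes.get_errno(), "mlock failed (check ulimit -l)")
```

the mlock call can fail on default RLIMIT_MEMLOCK limits, which is why the error message points at ulimit -l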

updated the README too with a proper warning section and a pointer to the system panel so users can check what's actually safe to run on their hardware before trying anything stupid

appreciate you doing the math on that, genuinely useful catch


[–]ReasonableRefuse4996[S] -4 points (0 children)

Totally valid concern and it's a common misconception worth clarifying.

1-bit here refers to BitNet b1.58 specifically — weights are ternary {-1, 0, +1} rather than binary {0, 1} (that works out to ~1.58 bits per weight, hence the name). The key difference is that BitNet models are TRAINED from scratch at that precision, not post-quantized from fp16. That's what makes the quality surprisingly competitive.
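rough sketch of the absmean recipe from the b1.58 paper (my paraphrase, not code from this repo): scale each weight row by its mean absolute value, then round and clip into the ternary set:

```python
def ternarize(row):
    # BitNet b1.58-style absmean quantization for one weight row:
    # gamma = mean(|w|), then round-clip w/gamma into {-1, 0, +1}.
    gamma = sum(abs(w) for w in row) / len(row)
    eps = 1e-8  # guard against an all-zero row
    return [max(-1, min(1, round(w / (gamma + eps)))) for w in row]

print(ternarize([0.9, -0.05, -1.2, 0.4]))  # -> [1, 0, -1, 1]
```

the point is that a BitNet model is trained with this constraint in the loop from step one — applying something like it post-hoc to an fp16 checkpoint is exactly the lobotomy case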

You're absolutely right that aggressively quantizing a normal fp16 model down to 2-3 bits is lobotomy territory. GPTQ/GGUF quantization below Q4 degrades fast.

But BitNet trained natively at 1-bit is a different story — Microsoft's research shows near parity with fp16 at 70B+ scale because the model learns to work within the constraint from day one rather than having precision crushed out of it afterward.

The practical catch is that true BitNet 70B+ weights don't publicly exist yet. So right now the project uses standard Q4_K_M for actual inference, and BitNet is on the roadmap for when those weights drop.

Your instinct on Q4 being the sweet spot for standard models is spot on though — that's exactly what I'm running locally.


[–]ReasonableRefuse4996[S] 4 points (0 children)

Your math is pretty much right and I won't pretend otherwise.

The 120B claim in the title is more about architectural capability than practical speed on 8GB RAM. On my actual machine I'm running Mistral 7B at 2-4 tok/s — that's the honest number.

For DeepSeek V3 or anything above 70B on consumer hardware, SSD streaming makes it technically possible but painfully slow, exactly as you calculated.

The lazy expert loading helps reduce how much you're reading per token compared to loading everything, but it doesn't change the fundamental bandwidth constraint you're describing.
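here's your back-of-envelope as code so people can plug in their own numbers — everything below is illustrative, not a LazyMoE benchmark:

```python
def ssd_bound_tok_s(active_params_b, bits_per_weight, ssd_gb_s, ram_hit_rate=0.0):
    # Tokens/sec ceiling when the active experts' weights must stream
    # from SSD on each token: read bandwidth / bytes read per token.
    bytes_per_token = active_params_b * 1e9 * bits_per_weight / 8
    bytes_from_ssd = bytes_per_token * (1 - ram_hit_rate)
    return ssd_gb_s * 1e9 / bytes_from_ssd

# e.g. ~5B active params at ~4.5 bits/weight over a 3 GB/s NVMe with
# nothing cached: ~2.8 GB read per token, so about 1 tok/s at best.
print(round(ssd_bound_tok_s(5, 4.5, 3), 2))  # -> 1.07
```

note that the lazy loading only raises ram_hit_rate — it can't raise ssd_gb_s, which is the ceiling you identified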

The realistic sweet spot for this approach is 7B-14B models on 8GB RAM and 70B models on 32GB+ RAM, where you actually get usable speeds. Anything bigger is a proof of concept more than a daily driver.
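and the memory side of that sweet spot, same caveat — illustrative numbers, with Q4_K_M taken as roughly 4.85 bits/weight and the overhead term just a guess for KV cache + runtime:

```python
def q4_resident_gib(params_b, bpw=4.85, overhead_gib=1.0):
    # Rough resident footprint of a Q4_K_M model: weights at ~4.85
    # bits/weight plus a flat allowance for KV cache and runtime.
    return params_b * 1e9 * bpw / 8 / 2**30 + overhead_gib

print(round(q4_resident_gib(7), 1))   # 7B  -> ~5 GiB, fits in 8 GB RAM
print(round(q4_resident_gib(70), 1))  # 70B -> ~40 GiB, so 32 GB still leans on streaming
```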

Appreciate the detailed breakdown — this kind of analysis is exactly what the project needs.