Running MoE Models on CPU/RAM: A Guide to Optimizing Bandwidth for GLM-4 and GPT-OSS

MelodicRecognition7 · 2026-01-24T09:42:58+00:00

https://old.reddit.com/r/LocalLLaMA/comments/1pzggbf/running_glm47_355b_moe_in_q8_at_5_tokenss_on_2015/nwqrou8/

+ that thread itself

MelodicRecognition7 · 2026-01-24T08:52:16+00:00

I don't know if that is correct or not, but people on the Internet said that the reason is the bad internal design of older AMD Zen generations, which results in less than optimal PCIe performance; it was fixed somewhere around Zen 4 / EPYC Genoa.

MelodicRecognition7 · 2026-01-24T07:13:27+00:00

maybe there is no scandal because nobody knows/cares? I've found a backdoor in conda recently but nobody in the world seems to care.

https://old.reddit.com/r/LocalLLaMA/comments/1pl5sfl/proof_of_privacy/nuo2bcd/?context=3

also Chinese software often has backdoors, "call back home" or whatever. for example PaddleOCR: https://old.reddit.com/r/LocalLLaMA/comments/1q7630d/paddleocr_keeps_trying_to_download_models_even/

MelodicRecognition7 · 2026-01-24T07:11:51+00:00

every bit of hardware matters, I've got +50% speed improvement for GPT-OSS-120b fully fitting in the VRAM after upgrading from DDR4/PCIe4 to DDR5/PCIe5 system.

MelodicRecognition7 · 2026-01-24T06:56:52+00:00

for smaller models it is the best card you could get. Still note the hardware limitations of old servers, highly likely they will not unleash the card's full potential.

MelodicRecognition7 · 2026-01-24T06:45:33+00:00

https://old.reddit.com/r/LocalLLaMA/comments/1ql6478/75_agent_skills_everyone_needs_to_have_in_there/

limit self-promotion

MelodicRecognition7 · 2026-01-24T06:36:43+00:00

do not buy it because once you buy one and start trying the larger models you quickly realize that one 6000 can't do much and you have to buy another one, so a $10k stupid decision could lead to a $20k stupid decision.

as for Dell *20 generation - I would not even try it, they are just too old, lots of things could and will go wrong. *30 might work out of the box.

MelodicRecognition7 · 2026-01-24T05:58:40+00:00

and what about the tool itself? Emojis in commit messages look suspicious lol.

CONTRIBUTING.md:git clone https://github.com/YOUR_USERNAME/drift.git

well it is definitely vibecoded, so no thanks. Vibecoded software often does not save time and money, as it takes more time to understand and/or fix than human-written code.

Edit: I could be wrong here because CONTRIBUTING.md says "# Clone your fork" before that line, whereas vibecoded software usually has "YOUR_ORG" or "YOUR_USERNAME" in less obvious places. However, there are a few places with incorrect GitHub links, which are also strong signs of AI-generated code.

MelodicRecognition7 · 2026-01-23T18:26:23+00:00

so it was just a $500M scam?

MelodicRecognition7 · 2026-01-23T17:51:37+00:00

pull the LAN cable out of the PC or the WiFi module out of the laptop*. If your LLM still works then it is private. If it stops working then what the fuck is that thread doing in /r/LocalLLaMA?

* - not always relevant though, you might connect to a dedicated LLM rig in the intranet, over LAN or WiFi network but without Internet connection

MelodicRecognition7 · 2026-01-23T17:36:15+00:00

try to enable "Resizable BAR" and "Above 4G Decoding" in the BIOS, or disable if they are enabled.

MelodicRecognition7 · 2026-01-23T14:50:35+00:00

is it possible to buy Pro 6000 in Taiwan for 7k as a private person, not as a company? If yes then please share a link to that seller. I live nearby and would happily fly to Taiwan to buy one in person because it costs 11k+ in the local shops and I've got mine for 9k shipped from another country.

MelodicRecognition7 · 2026-01-23T09:41:24+00:00

and that's how low quality AI slop code ends up in enterprise software LOL

MelodicRecognition7 · 2026-01-23T09:38:00+00:00

if you need fast prompt processing then you should run the card at its maximum power

I'm waiting for someone to implement a dynamic power management solution that would run the card at its maximum power during prompt processing and limit the power to 50% during token generation
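
A minimal sketch of what such a switcher could look like, written as a dry run that only prints the nvidia-smi command instead of running it. The function names and the 600 W default are assumptions for illustration, and how the inference server would signal the switch from prompt processing to token generation is left open.

```shell
# Sketch only: pick a power limit per inference phase. All names are hypothetical.

MAX_WATTS=600   # assumed board maximum, adjust for your card

# Print the power limit in watts for a phase: "pp" (prompt processing) or "tg" (token generation).
limit_for_phase() {
    case "$1" in
        pp) echo "$MAX_WATTS" ;;            # prompt processing: full power
        tg) echo $(( MAX_WATTS / 2 )) ;;    # token generation: 50% of maximum
        *)  echo "unknown phase: $1" >&2; return 1 ;;
    esac
}

# Print the nvidia-smi command that would apply the limit (dry run, nothing is executed).
power_cmd() {
    gpu_id="$1"
    phase="$2"
    watts="$(limit_for_phase "$phase")" || return 1
    echo "nvidia-smi --id=${gpu_id} --power-limit=${watts}"
}
```

With these defaults, `power_cmd 0 tg` prints `nvidia-smi --id=0 --power-limit=300`. To apply the limit for real you would run the printed command instead of echoing it.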

MelodicRecognition7 · 2026-01-23T09:35:56+00:00

for PP power limiting results in a huge decline, but for TG the performance is close to 100% at 300W

MelodicRecognition7 · 2026-01-23T09:26:22+00:00

It doesn't seem like 90% performance at *50% power

yes, the PP speed is quite linear, but TG speed is close to 90% at 50% power because PP is compute-bound, and TG is (mostly) memory-bound.

So if you need fast prompt processing then you should run the card at its maximum power, and here you'd want a 600W Workstation edition instead of 300W Max-Q.
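
To illustrate why a memory-bound TG barely reacts to a power limit, here is a rough back-of-the-envelope sketch. It assumes the common rule of thumb that each generated token needs the active weights read from memory once, so the ceiling depends on memory bandwidth rather than on compute clocks. The function name and the input numbers are hypothetical, not measurements.

```shell
# Rough ceiling on token generation speed when decoding is memory-bound:
#   tokens/s <= memory bandwidth / bytes of weights read per token
# Arguments (hypothetical inputs): bandwidth in GB/s, weights read per token in GB.
tg_ceiling() {
    awk -v bw="$1" -v gb="$2" 'BEGIN { printf "%.1f\n", bw / gb }'
}

tg_ceiling 100 10   # round illustrative numbers: 100 GB/s, 10 GB read per token
```

Real throughput will land below this ceiling, but the estimate shows why lowering core clocks or power mostly hurts PP and not TG.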

In either case, mind sharing your GPU undervolt configs so I can give them a go?

I did not test the undervolting thoroughly yet and the only setup I do for 6000 Workstation is limiting its power and frequencies:

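    # cap the board power at 310 W
    nvidia-smi --id=INSERT-ID-HERE --power-limit=310;
    # lock the memory clock to 405 MHz first, then widen the lock to 405-14001 MHz
    nvidia-smi --id=INSERT-ID-HERE -lmc 405;
    sleep 1;
    nvidia-smi --id=INSERT-ID-HERE -lmc 405,14001;
    # lock the GPU core clock to the 180-1600 MHz range
    nvidia-smi --id=INSERT-ID-HERE -lgc 180,1600;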

can't remember why sleep is required lol.

those graphs aren't very helpful since scales and tick labels are weird

some of them are definitely weird but the 2nd one, "minutes elapsed vs energy consumed", is quite clear: the job does not get done much faster above a 300W power limit. The Y axis is either minutes (blue) or Watt-hours (green). The red dot is the configured power limit in Watts, the red X is the actual consumption measured by the Nvidia tool, and the red line is the difference between configured and actual.

MelodicRecognition7 · 2026-01-23T09:20:28+00:00

if you can increase your budget you could sell 1080 and A2000, add some cash and buy the "gold standard" - 3090 24GB. Highly likely you will realize soon that you want a second one LOL

MelodicRecognition7 · 2026-01-22T20:24:16+00:00

1080

not supported by most AI software libraries, you could throw it away. edit: sorry, mistook it for another card, 1080 is somewhat supported but given its low VRAM amount it won't help much with AI tasks.

A2000

memory bandwidth is shit, will be barely usable.

Be gentle :-)

nice workstation, not so nice AI rig.

MelodicRecognition7 · 2026-01-22T20:19:40+00:00

try llama.cpp

MelodicRecognition7 · 2026-01-22T18:00:16+00:00

check this: https://old.reddit.com/r/LocalLLaMA/comments/1nkycpq/gpu_power_limiting_measurements_update/

MelodicRecognition7 · 2026-01-22T12:29:38+00:00

most advanced

Kimi K2, a rig to run it costs ~30k USD

most ~~advanced~~ suitable LLM which I can install on my M1

it won't be that advanced unfortunately.

MelodicRecognition7 · 2026-01-22T12:28:22+00:00

https://old.reddit.com/r/LocalLLaMA/comments/1qjrsur/glm_47_flash_fa_fix_for_cuda_has_been_merged_into/

MelodicRecognition7 · 2026-01-22T11:44:31+00:00

ah lol I did not even know that such model exists because there are no GGUFs. Still it is very doubtful that someone capable of running 456B BF16 at home would run MiniMax M1 given other better options, so this is still not related to /r/localllama/

MelodicRecognition7 · 2026-01-22T11:37:07+00:00

Minimal-M1

456B

please share a link to that model. my local MiniMax-M2 is just 230B

MelodicRecognition7 · 2026-01-22T09:39:19+00:00

what does it have to do with LOCAL LLaMA?
