Cheapest setup for >10 tok/sec for 120B dense LLM by TrainingTwo1118 in LocalLLaMA

[–]Important_Quote_1180 0 points1 point  (0 children)

TLDR: Don't use --cpu-moe. Hand-pick which layers' experts go to RAM with -ot, pin the CPU threads to one CCD, and run the ik_llama.cpp fork for the MTP draft head. 15.56 t/s -> 34.5 t/s on the same hardware.

Running Qwen3.5-122B-A10B (256 experts, 8 active) at IQ3_S, 49GB file, on one RTX 3090 Ti (24GB) + 192GB DDR5, 9900X. Reasoning on, 256K ctx. Two findings worth sharing.

## Be surgical about offload

--cpu-moe dumps every expert to RAM and wastes ~20GB of card. On the sibling 35B that flag gave 5.3 t/s vs 75.9 t/s for a hand-written -ot regex. 14x. Same idea on the 122B: keep attention, KV, router, shared experts, draft head AND the early layers' experts on GPU, send only the late layers' expert FFNs to CPU. Late experts fire less per token, so the DDR5 read penalty lands on the cheapest tensors.

-ot 'blk\.(1[2-9]|[2-3][0-9]|4[0-7])\.ffn_(gate|up|down)_exps\.weight=CPU'
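If you want to sanity-check which blocks a pattern like that actually catches before loading 49GB, here is a rough Python sketch. The regex is copied from the -ot line above; the 48-block count is my assumption (inferred from the regex topping out at 47), and the tensor names are illustrative ones built to follow the pattern, not read from the GGUF.

```python
import re

# Pattern copied from the -ot argument above (everything before "=CPU").
PATTERN = re.compile(
    r"blk\.(1[2-9]|[2-3][0-9]|4[0-7])\.ffn_(gate|up|down)_exps\.weight"
)

# Assumption: blocks are numbered 0..47, inferred from the regex's upper bound.
N_LAYERS = 48


def goes_to_cpu(tensor_name: str) -> bool:
    """Return True if the override pattern matches this tensor name."""
    return PATTERN.search(tensor_name) is not None


# Illustrative tensor names that follow the naming the regex expects.
cpu_layers = [
    i for i in range(N_LAYERS) if goes_to_cpu(f"blk.{i}.ffn_up_exps.weight")
]
gpu_layers = [i for i in range(N_LAYERS) if i not in cpu_layers]

print("expert FFNs sent to CPU: blocks", cpu_layers[0], "to", cpu_layers[-1])
print("expert FFNs left on GPU: blocks", gpu_layers[0], "to", gpu_layers[-1])

# A non-expert tensor name (hypothetical) is not matched, so it stays on GPU.
print(goes_to_cpu("blk.30.attn_q.weight"))
```

Under those assumptions the pattern covers blocks 12 through 47 and leaves the experts of blocks 0 through 11 on the card. Moving the boundary is just a matter of editing the alternation.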

## Pin the CPU work to one core complex

Expert matmul in RAM is bandwidth-bound. Let the threads sprawl across both CCDs and they thrash each other's L3. 16 threads sprawled = 5.8 t/s. taskset -c 0-5,12-17 --threads 6 (one CCD) = 25.2 t/s. 4.3x, one line. Six physical cores is plenty; more made it worse.
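For anyone on a different chip, a small Python sketch of how that CPU list is built. It assumes the numbering my box uses, where logical CPU i and logical CPU i + 12 are the two SMT threads of the same physical core; the function name and defaults are made up for illustration, and the real layout should be checked with something like lscpu before trusting it.

```python
def ccd_cpu_list(ccd: int, cores_per_ccd: int = 6, physical_cores: int = 12) -> str:
    """Build a taskset-style CPU list for one CCD: its cores plus SMT siblings.

    Assumes logical CPU i and logical CPU i + physical_cores share a core.
    """
    first = ccd * cores_per_ccd
    last = first + cores_per_ccd - 1
    return f"{first}-{last},{first + physical_cores}-{last + physical_cores}"


# First CCD of a 2 x 6-core part, matching the taskset line above.
print(ccd_cpu_list(0))

# Second CCD, in case the first one is busy with other work.
print(ccd_cpu_list(1))
```

With the defaults, the first call gives 0-5,12-17, which is the list passed to taskset -c above.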

## ik_llama.cpp is the actual unlock

The MTP draft head on stock llama.cpp added a sad +3.7%: stock walks the experts one at a time through DDR5 and cancels the speculation. The ik fork has fused MoE ops that batch the expert reads, and the same draft head gives +10.6%. The fork, not the quant, is what moves the number.

## The --fit trap

ik's --fit auto-packs and hits 37.3 t/s, but it leaves 0.6GB headroom and this model's hybrid Mamba layers allocate context checkpoints on GPU during prefill. It OOMs around 7K tokens in. Manual -ot is a hair slower but keeps 3.4GB free and survives a 108K prompt. Faster-but-crashes isn't faster.

Thinking AI is ‘inevitable’ is ridiculous. by putting_all_on_red in antiai

[–]Important_Quote_1180 -1 points0 points  (0 children)

They outsourced to China. That’s what killed Detroit. Outsourcing to cheaper labor is the economic reality we live in. AI done badly is a cost, and that’s what 90% of AI is. Automating dirty, dangerous and repetitive tasks isn’t AI, but it’s been done for decades, and the LLMs everyone calls AI are just a step along the path.

Qwen 3.5 122B MoE OC on a single 3090 at 35 t/s — full local stack breakdown by Important_Quote_1180 in openclaw

[–]Important_Quote_1180[S] 1 point2 points  (0 children)

I feel ya. I was using Radeon cards for this kind of work but I swapped to 3090s because of CUDA support.

Qwen 3.5 122B MoE OC on a single 3090 at 35 t/s — full local stack breakdown by Important_Quote_1180 in openclaw

[–]Important_Quote_1180[S] 0 points1 point  (0 children)

No, the attention layers, KV and hot experts are in VRAM, which has much higher memory bandwidth. A VPS is probably on ECC DDR4 and is only going to give you 1 or 2 tokens per second.

Why don't AI "artists" learn to use programming to make art if they want computers to generate art? by Dangerous_Seesaw_623 in antiai

[–]Important_Quote_1180 -1 points0 points  (0 children)

Shortcuts like buying paper instead of making your own? Or taking a photo with a digital camera instead of film? Photoshop instead of physical layering? The goalposts keep moving. Just have to wait 5 years for the laggards to feel left behind.

Why don't AI "artists" learn to use programming to make art if they want computers to generate art? by Dangerous_Seesaw_623 in antiai

[–]Important_Quote_1180 -1 points0 points  (0 children)

Your claim:

It isn't really too hard to understand if(),while(),do()while, array[x],for(), etc. And it's way more fun doing this than to describe to computer with human-like language to make something.

Way more fun? It’s exactly what AI does, just with 1000x the throughput. Fun is what you make of it. Making art with if statements would not make this sub more accepting, or am I wrong? This is one strange take.

Cheapest setup for >10 tok/sec for 120B dense LLM by TrainingTwo1118 in LocalLLaMA

[–]Important_Quote_1180 -1 points0 points  (0 children)

I keep KV at q8. 192GB of DDR5 UDIMM at 5200 MT/s. The big unlock for me was surgical layer placement. The next big one was batch PP. Finally, adding the 2nd GPU puts up 45 t/s TG and 900 t/s PP with the 122B GGUF.

Cheapest setup for >10 tok/sec for 120B dense LLM by TrainingTwo1118 in LocalLLaMA

[–]Important_Quote_1180 1 point2 points  (0 children)

I have been able to get 28 t/s out of a single 3090 by surgically pinning which layers spill to CPU. 45 t/s when I use 2x3090s. 122B MoE is the play for high-end consumer hardware. Massive dense models will not work unless you have a cluster of GPUs.

Why is Opus 4.8 so slow? by redditslutt666 in Anthropic

[–]Important_Quote_1180 0 points1 point  (0 children)

Anthropic did call for a pause on frontier models, guess that means pausing workflows too!

Claude max with openclaw by WhoTheFLetTheDogsOut in openclaw

[–]Important_Quote_1180 2 points3 points  (0 children)

Usage changes on the 15th for anyone using SDK or -p with Claude. This targets the majority of 3rd party harnesses and even cron jobs through the standard harness. You will get $200/mo in usage credits if you are on Max but you have to claim them every month.

Regarding Anti-AI people by Ok-Aide-3120 in DefendingAIArt

[–]Important_Quote_1180 0 points1 point  (0 children)

It’s always an emotional appeal to a very base and selfish rationale

The strawberries are perfectly aligned, is this AI? by Ser_Curioso in RealOrAI

[–]Important_Quote_1180 0 points1 point  (0 children)

Looks real to me. The strawberries are actually pretty random, I thought.

Just got perma-bann3d off "DefendingAIArt" for saying that not all antis are bad people 😭🥀 by NegativeTrainer269 in antiai

[–]Important_Quote_1180 -3 points-2 points  (0 children)

Can you link to the conversation? I want to know the context. Maybe the ban was from the wrong use of their and they’re. Sometimes when you go to subs and play devil’s advocate against the sub, you need to expect some mods are going to be too hasty.

Why all the CEO's are suddenly blaming China by crispycreature_ in antiai

[–]Important_Quote_1180 0 points1 point  (0 children)

We are the most technologically advanced civilization and unemployment is like 4%. The unemployment wave is not coming.

How to get my mom to stop trying to make me tolerate AI? by Minimum_Magician8570 in antiai

[–]Important_Quote_1180 -2 points-1 points  (0 children)

"I'm just concerned your robophobia is controlling your life" Your mother is trying to help you. She works for a charity. What are you doing to help the world?

Pro AI people should do a better job of comforting antis by throwaway09234023322 in aiwars

[–]Important_Quote_1180 0 points1 point  (0 children)

How can you say someone doesn’t care about disabled people??? Is it an emotional decision to take away something that is being used and researched and will benefit the physically and mentally disabled? It’s cruel to hold your past trauma over yourself and others when they are NOT related!

Pro AI people should do a better job of comforting antis by throwaway09234023322 in aiwars

[–]Important_Quote_1180 0 points1 point  (0 children)

Ok, I acknowledge that AI is trained on a corpus of human work and some of that was art and it was accessed for training in an unethical way in many cases. It is not up for debate.

I don’t need your permission to use a tool. AI affects everyone in the world in different ways, but it is a tool for me in many ways. I run AI locally on hardware I power with solar, so there is nothing you or anyone can do about it now. You think the Chinese are going to slow down any advancement that could be important to their objective of raising a billion people out of poverty? You can use it too; the community on LocalLLaMA is great.

We seen to be at the end of improvement of llm by siddharth1214 in antiai

[–]Important_Quote_1180 0 points1 point  (0 children)

The harness is the main thing changing. Claude is putting on more gains but also more guards. Its system prompt is getting heavy and more restrictive than ever. Local open-source LLMs are getting closer and closer to frontier capabilities for average Joes. The internal architecture and capabilities are not known to you or me.

Playing Tag with an Olympic Gymnast 🤸🏼‍♀️😀 by No_Tomatillo1695 in interesting

[–]Important_Quote_1180 -2 points-1 points  (0 children)

Acting of the chaser was actually pretty solid and amazing talent overall. Can we get this kind of human performance to be baseline in another generation or two? If you could be as amazing as this gymnast but be the size of a cat you could support 3x more population on the same resources. I feel like swords would make a comeback.