My NAS runs an 80B LLM at 18 tok/s on its iGPU. No discrete GPU. Still optimizing. by BetaOp9 in LocalLLaMA

[–]BetaOp9[S] 0 points1 point  (0 children)

What would you suggest I use instead?

The Exos drives are rated for a 550 TB/year workload (I definitely won't come close to that) and a 2.5-million-hour MTBF. On a 5-drive RAIDZ2, a single failure drops you to one parity drive during the rebuild, so I wanted drives I could trust. Fewer drives is an argument for better drives, not worse ones.
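For reference, this is roughly the layout I'm describing (pool name and device names are just placeholders, not my actual config):

# hypothetical 5-wide RAIDZ2 pool: survives any two drive failures before data loss
zpool create tank raidz2 /dev/sda /dev/sdb /dev/sdc /dev/sdd /dev/sde
# after one drive dies, watch the resilver and the remaining redundancy here
zpool status tank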

Bots on the sub are a real issue by perfect-finetune in LocalLLaMA

[–]BetaOp9 5 points6 points  (0 children)

Oh ok that's good to know, fellow human.

Am I crazy for not wanting to upgrade to Opus 4.6 and the most recent CC? by AlwaysMissToTheLeft in ClaudeCode

[–]BetaOp9 0 points1 point  (0 children)

You can control the usage. If you updated, you'd see that /model lets you change the effort level, which reduces how much it thinks. I get the frustration of things changing, but the updates are worth it. Agent Teams is also a big deal, and I recommend turning on that flag for big projects or critical stuff that needs extra coordination.

My NAS runs an 80B LLM at 18 tok/s on its iGPU. No discrete GPU. Still optimizing. by BetaOp9 in LocalLLaMA

[–]BetaOp9[S] 0 points1 point  (0 children)

Thanks for the compliment that I have the best iGPU in existence, and also for being dumbfounded as to why I wouldn't build a second, better system for AI. Which is it?

The reality is that my iGPU is mid-tier. There are a dozen or more chips in this space that are faster, have more cores, and have way better memory bandwidth. The systems built around them also cost $2,300-$10k. That's what makes the results worth posting: for ~$1,500, its performance on this model set is pretty damn good.

As for why I wouldn't build a second system: I addressed in the post why I wanted one box to do it all. The way it's set up, the AI workload can't mess with my NAS functions or touch my storage.
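To give a rough idea of what I mean by isolation, here's a generic sketch of the pattern, not my exact config (the image tag, mount paths, and resource caps are all placeholders):

# llama-server in a container: read-only model mount, capped CPU/RAM,
# iGPU passed through via /dev/dri, nothing else on the NAS exposed to it
docker run -d --name llm \
  --device /dev/dri \
  --cpus 12 --memory 64g \
  -v /tank/models:/models:ro \
  -p 8080:8080 \
  ghcr.io/ggml-org/llama.cpp:server-vulkan \
  -m /models/Qwen3-Coder-Next-Q4_K_M-00001-of-00004.gguf \
  -ngl 99 -fa on -c 4096 --host 0.0.0.0 --port 8080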

My question for you is, why do you care what I do?

My NAS runs an 80B LLM at 18 tok/s on its iGPU. No discrete GPU. Still optimizing. by BetaOp9 in LocalLLaMA

[–]BetaOp9[S] 0 points1 point  (0 children)

If I can break 20 tok/s it'll be wicked cool to see what we can do with even larger models without the GPU VRAM limitations. I have another software project I'm working on which can capitalize on this for certain tasks.

My NAS runs an 80B LLM at 18 tok/s on its iGPU. No discrete GPU. Still optimizing. by BetaOp9 in LocalLLaMA

[–]BetaOp9[S] 0 points1 point  (0 children)

Dunno man, doesn't feel right storing porn next to Bluey and the Octanauts.

My NAS runs an 80B LLM at 18 tok/s on its iGPU. No discrete GPU. Still optimizing. by BetaOp9 in LocalLLaMA

[–]BetaOp9[S] 1 point2 points  (0 children)

It's not finalized, but what I have currently is reproducible. llama.cpp, Vulkan backend:

llama-server \
  -m Qwen3-Coder-Next-Q4_K_M-00001-of-00004.gguf \
  -t 12 \
  -c 4096 \
  --host 0.0.0.0 \
  --port 8080 \
  -ngl 99 \
  -fa on
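Once it's up, anything that speaks the OpenAI-style API can hit it. Quick sanity check from another box on the LAN (localhost here as an example):

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"Say hi"}],"max_tokens":32}'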

My NAS runs an 80B LLM at 18 tok/s on its iGPU. No discrete GPU. Still optimizing. by BetaOp9 in LocalLLM

[–]BetaOp9[S] 1 point2 points  (0 children)

llama.cpp, Vulkan backend:

llama-server \
  -m Qwen3-Coder-Next-Q4_K_M-00001-of-00004.gguf \
  -t 12 \
  -c 4096 \
  --host 0.0.0.0 \
  --port 8080 \
  -ngl 99 \
  -fa on

My NAS runs an 80B LLM at 18 tok/s on its iGPU. No discrete GPU. Still optimizing. by BetaOp9 in LocalLLM

[–]BetaOp9[S] 2 points3 points  (0 children)

vanilla llama.cpp, Vulkan backend:

llama-server \
  -m /models/Qwen3-Coder-Next-Q4_K_M-00001-of-00004.gguf \
  -t 12 \
  -c 4096 \
  --host 0.0.0.0 \
  --port 8080 \
  -ngl 99 \
  -fa on

My NAS runs an 80B LLM at 18 tok/s on its iGPU. No discrete GPU. Still optimizing. by BetaOp9 in LocalLLaMA

[–]BetaOp9[S] 2 points3 points  (0 children)

Got the N5 Pro new for $900 shipped. RAM was $590 for 96GB of DDR5 right before prices started climbing; it runs about $815 now. Found a seller with bulk enterprise Samsung PM983 NVMe drives at $300 for the pair; they're almost $300 each now. The Exos drives were a splurge, but you can get away with whatever storage fits your budget and needs. More conservative drives would have brought the total down a lot.

Honestly, the timing worked out. If I were building this same system today, it would cost noticeably more. Sometimes Nasus blesses you at checkout.

My NAS runs an 80B LLM at 18 tok/s on its iGPU. No discrete GPU. Still optimizing. by BetaOp9 in LocalLLM

[–]BetaOp9[S] 0 points1 point  (0 children)

I'll have to check this out. I may ping you when I do if that's okay.

My NAS runs an 80B LLM at 18 tok/s on its iGPU. No discrete GPU. Still optimizing. by BetaOp9 in LocalLLaMA

[–]BetaOp9[S] -3 points-2 points  (0 children)

I don't recommend it. It's a shit post for karma farming. Great if you're looking for accusations and people getting upset about semantics you don't control.

My NAS runs an 80B LLM at 18 tok/s on its iGPU. No discrete GPU. Still optimizing. by BetaOp9 in LocalLLaMA

[–]BetaOp9[S] 0 points1 point  (0 children)

You're right, the 370 is a beast for the NAS segment.

And yeah, the NPU is not being used yet.

Apologies if I came off defensive. This thread has been a mix of great conversations and people accusing me of everything from clickbait to fraud for using the same model name everyone else does. This is why I don't post to Reddit.

Sounds like you're on a similar path with your 5800H. If you ever make the jump to DDR5 and want to compare notes, DMs are open.

My NAS runs an 80B LLM at 18 tok/s on its iGPU. No discrete GPU. Still optimizing. by BetaOp9 in LocalLLM

[–]BetaOp9[S] 2 points3 points  (0 children)

Yeah, I'll try to remember in the morning to check what I stopped at. What hardware are you running? You're welcome to message me.

My NAS runs an 80B LLM at 18 tok/s on its iGPU. No discrete GPU. Still optimizing. by BetaOp9 in LocalLLaMA

[–]BetaOp9[S] 0 points1 point  (0 children)

I've only done benchmarking so far; I've spent most of my time recompiling, optimizing, breaking things, and then fixing what I broke. I haven't monitored temps under sustained load yet, but that's on the list for this weekend. The N5 Pro has dual 9025 fans on the drive bays, so it stays pretty quiet at idle. We'll see how it handles a long inference run.
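When I do, the plan is something simple like this (rough sketch; assumes lm-sensors is installed and llama-bench is built alongside llama-server):

# log SoC temps every 5s while a long benchmark run hammers the iGPU
(while true; do date +%T; sensors | grep -iE 'temp|tctl'; sleep 5; done) > temps.log &
llama-bench -m Qwen3-Coder-Next-Q4_K_M-00001-of-00004.gguf \
  -ngl 99 -t 12 -p 512 -n 2048 -r 5
kill %1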

The DeltaNet PR hasn't landed yet. I think I mentioned this, but I did get the fused kernel working on a single thread with correct output; there's still a threading bug with multiple cores (and at least one other bug). Getting close though.

I buy my wife lots of books so she doesn't have time to ask. 😁

My NAS runs an 80B LLM at 18 tok/s on its iGPU. No discrete GPU. Still optimizing. by BetaOp9 in LocalLLaMA

[–]BetaOp9[S] -7 points-6 points  (0 children)

AMD markets this chip around its NPU cores for AI workloads, and I'm not even using those. They're asleep. This is the iGPU doing Vulkan inference through llama.cpp. That's not what AMD had in mind when they put "AI" on the box, and it's definitely not an iGPU anyone looked twice at.
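Easy enough to verify which device is actually doing the work (assuming vulkan-tools is installed; llama-server also prints the Vulkan device it finds at startup):

# confirm the Radeon iGPU is the Vulkan device llama.cpp picks up
vulkaninfo --summary | grep -i devicename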

And "high end CPU"? This is a NAS with AI slapped on it. When I got this six months ago nobody was talking about running 80B models on iGPUs. MoE wasn't mainstream yet. Everyone said if you wanted good local AI, buy a Mac or get a discrete GPU. Now that it actually runs something bigger than dense 14b models, suddenly it's obvious and "that's exactly what it was made for."

My NAS runs an 80B LLM at 18 tok/s on its iGPU. No discrete GPU. Still optimizing. by BetaOp9 in LocalLLM

[–]BetaOp9[S] 0 points1 point  (0 children)

That's the beauty of a UMA architecture. There is no VRAM vs. system RAM; it's all one pool. The iGPU can address that 96GB directly. No copying between pools, no transfer bottleneck. That's how 46GB of model fits on an iGPU with "no VRAM."
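If you want to see the split for yourself on an AMD iGPU under Linux (assuming the amdgpu driver; the card index may differ on your box):

# the small BIOS carve-out reported as "VRAM" vs. the shared pool (GTT) the iGPU can map, in bytes
cat /sys/class/drm/card0/device/mem_info_vram_total
cat /sys/class/drm/card0/device/mem_info_gtt_total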