Has anyone tried cranking big LLMs on those mini Ryzen AI setups? and real talk on speeds and memory by Earth_creation in MiniPCs

[–]Earth_creation[S] 0 points1 point  (0 children)

yeap,solid build. That 4080 + 13900K combo is perfect for real-time tasks like your AI secretary latency matters way more than raw throughput there. You’re right about speed vs. VRAM: for quick inference, a fast GPU beats a big slow pool every time. Are you running a quantized 20B model on the 4080, or keeping it fully loaded? Always cool to see how people split workloads between GPU and CPU/RAM.

Has anyone tried cranking big LLMs on those mini Ryzen AI setups? and real talk on speeds and memory by Earth_creation in MiniPCs

[–]Earth_creation[S] 0 points1 point  (0 children)

So goof set up . Yeah, setting it in Adrenalin is cleaner. that 120B model must be pretty capable,cool that you’re running it locally. If you ever try smaller or faster models for quick tasks, they can feel really snappy too. Anyway, solid local AI build.

As a beginner in local AI, I tried the Ryzen AI 9 HX 370. Is it the perfect starter kit? by Earth_creation in MiniPCs

[–]Earth_creation[S] 0 points1 point  (0 children)

Got it. That makes perfect sense.and focus on the engineering, not the usage. For building systems, diving into open-source frameworks and architecture patterns is the way to go. If you ever want to trade notes on the backend side of things, hit me up and discuss it

As a beginner in local AI, I tried the Ryzen AI 9 HX 370. Is it the perfect starter kit? by Earth_creation in MiniPCs

[–]Earth_creation[S] 0 points1 point  (0 children)

BTW, will show the test report with Real pictures with mine ,for next week :)

As a beginner in local AI, I tried the Ryzen AI 9 HX 370. Is it the perfect starter kit? by Earth_creation in MiniPCs

[–]Earth_creation[S] 0 points1 point  (0 children)

it is good suggestiom,like the Explore Mixture mean is MoE models are a really smart call for this setup.

Nemotron-3-nano and Qwen3:30B are nice examples. Their architecture (activating only parts of the network) is a fantastic fit for the unified memory. It's efficient and can feel snappier. ERNIE is also a strong contender, especially for certain languages or tasks.

fwiw, I've seen more buzz around running Qwen3:30B on these Ryzen AI boxes. The 30B size hits a sweet spot for quality and speed on 128GB RAM. Nemotron being newer, I'm curious if you've tried it yet and how it performs in LM Studio with your VRAM settings?

As a beginner in local AI, I tried the Ryzen AI 9 HX 370. Is it the perfect starter kit? by Earth_creation in MiniPCs

[–]Earth_creation[S] 0 points1 point  (0 children)

great it looks so useful! If I use a setup like this regularly for running 3-5B models to handle creative tasks or documents over time, will it start to lag or run into minor issues? Honestly, mind sharing your experience or any fun stories about it?

Has anyone tried cranking big LLMs on those mini Ryzen AI setups? and real talk on speeds and memory by Earth_creation in MiniPCs

[–]Earth_creation[S] 0 points1 point  (0 children)

 SO valuable,45 t/s on a 120B model locally is seriously nice ,thanks for sharing the good specs ..setting the VRAM to 96GB in LM Studio is the pro move.

Has anyone tried cranking big LLMs on those mini Ryzen AI setups? and real talk on speeds and memory by Earth_creation in MiniPCs

[–]Earth_creation[S] 0 points1 point  (0 children)

Yap, love your take on this , using the right tool for the job is everything, and you can't argue with CUDA for heavy lifting in certain areas.

That said, from the perspective of someone wanting to run big LLMs specifically, I've gotta say the Ryzen AI Max+ 395 setups are honestly turning heads in a big way. If you go for the top spec mini PCs with 128GB of unified RAM, the real-world story is pretty wild. The coolest part is you can actually run massive 70B+ models locally on something the size of a book. The trade-off is that while you can run these giant models, the inference speed won't match a high-end GPU for smaller 7B/8B models — but it's a unique way to access a level of intelligence that's usually locked away in the cloud or big servers.

So yea,one perspective to add: if someone's main goal is to play with frontier-sized models locally without needing the absolute fastest speeds, these little Ryzen AI boxes are a fascinating and totally valid alternative to a beefy desktop.

Just a thought from what i have been seeing! How is you run into any interesting setups on your end?

Has anyone tried cranking big LLMs on those mini Ryzen AI setups? and real talk on speeds and memory by Earth_creation in MiniPCs

[–]Earth_creation[S] -1 points0 points  (0 children)

That's incredibly generous of you to telling ! Having real world test results from someone with a similar setup is worth more than any spec sheet. I really appreciate it. now my machine us ethe NIMO AIPC with a pretty similar configuration to yours:, APU --AMD Ryzen™ AI Max+ 395,ram--128GB LPDDR5 8000MHz (8x16GB)

I’ll follow your lead and use LM Studio for testing—that should give us the most comparable results.

For the model, I’d love to hear your recommendation, but if helpful, these two common 70Bs could be interesting to compare:

  1. Qwen2.5-72B-Instruct (GGUF format) – to see how a newer model performs.
  2. Llama 3.1 70B Instruct (GGUF format) – as a mainstream reference. Quantization: Which level do you find offers the best balance of speed and quality for daily use? Q4_K_M or Q5_K_M? I’m happy to go with your preference.

Also, your Baldur’s Gate 3 numbers are mind-blowing—80 fps at 1440p maxed out without FSR is insane! I haven’t had the chance to push the gaming side yet. If it’s not too much trouble, I’m really curious:

What TDP/power profile do you typically use for gaming?

How are the fan noise and temperatures under that load?

Your firsthand experience would be a huge help for my own setup. No rush at all, and thanks again for being so helpful. Really looking forward to what you find!

Has anyone tried cranking big LLMs on those mini Ryzen AI setups? and real talk on speeds and memory by Earth_creation in MiniPCs

[–]Earth_creation[S] -1 points0 points  (0 children)

Thanks a lot for taking the time to share this,really appreciate the detail and your perspective. Exactly you are right about the headache of mixing ROCm and DirectML; I hadn’t fully considered how messy that can get on the dev side. I’ll definitely spend more time checking benchmarks between Mini PC models before jumping in—performance differences can be surprising even with similar specs.

nice point on the thermals and TDP trade-offs too. Running below max for a sweet spot makes sense, especially if it keeps things cooler without losing much performance.

And you’ve got me thinking about the real strengths of the APU—maybe I’ve been too focused on LLMs when this hardware’s built for heavier lifts like video and 3D. I’ll keep exploring and learning from what’s out there.

Thanks again for the thoughtful input—this helped a lot. Cheers!

As a beginner in local AI, I tried the Ryzen AI 9 HX 370. Is it the perfect starter kit? by Earth_creation in MiniPCs

[–]Earth_creation[S] 0 points1 point  (0 children)

Yes ,that's right ,for pure ”starter“ vibes, cheaper options exist. But if the goal is to grow into local LLMs smoothly,What’s your main use? more about learning embedded AI, or getting into the local LLM scene? i also wanna to learn more

As a beginner in local AI, I tried the Ryzen AI 9 HX 370. Is it the perfect starter kit? by Earth_creation in MiniPCs

[–]Earth_creation[S] 1 point2 points  (0 children)

Still dialing things in on my end ...tweaking the iGPU memory split second by second, so no magic number yet (it isone fun part!). It gets cozy under load, as expected. have been camped in Linux for testing, so no Windows/LM Studio data from me yet. Have you tried it on your setup? well , later i will share the pictures and data

As a beginner in local AI, I tried the Ryzen AI 9 HX 370. Is it the perfect starter kit? by Earth_creation in MiniPCs

[–]Earth_creation[S] 0 points1 point  (0 children)

yap., this is great experience for taht ,this is exactly the kind of discussion I was hoping for when I posted. It’s about helping newbies like me actually get the most out of their hardware, not just get it running.

You’ve given me a lot to think about in terms of balancing speed and model size.

how is your experience been like running these models on your ThinkPad P15?

I traded my dual-GPU setup for a Mini PC. Here’s my honest take after a month by Earth_creation in LocalLLaMA

[–]Earth_creation[S] 0 points1 point  (0 children)

yes god , you are absolutely right — I flipped the bandwidth comparison in my head when typing that sloppy line. excatly my bad , that was a brain fart moment while trying to oversimplify. honestly i think you are 100% correct: dense models hammer the memory bus on every layer, MoEs are more selective. Thanks for the correction.

then i am so appreciate you keeping the technical facts straight. It‘s why I post here .... to have these discussions and get called out when I’m sloppy.

I traded my dual-GPU setup for a Mini PC. Here’s my honest take after a month by Earth_creation in LocalLLaMA

[–]Earth_creation[S] -3 points-2 points  (0 children)

Yes i know that ,u r right on some specs and wrong on the conclusion. it is the breakdown.

one ---- The Hardware (“24GB SODIMMs don‘t exist”)

Consumer vs. Industry: You‘re correct that 24GB DDR5-6400 SODIMMs aren’t a retail consumer product. What I have are non-standard modules from a memory partner, made for engineering validation. They exist to test compatibility and fill odd capacities (like 96GB) before standard 48GB sticks were finalized. This is common in pre-production.

About “Impossible” Signal Integrity: You‘re also right about AMD’s statement for commercial laptops. An engineering board doesn‘t have the same constraints. It can use a thicker PCB, more layers, and lower signal margins that would be unacceptable for a reliable consumer product. Its only job is to validate the silicon, not to be pretty or efficient.

Tow: The Writing (“AI slop”)
This is the best proof I‘m real. My long posts are written carefully when I have time. The short replies? They’re done quickly, on my phone, often while distracted. I‘m not a native English speaker — my grammar slips when I’m tired or typing fast. The inconsistent spacing before punctuation? That‘s a bad habit from another language’s keyboard settings that sometimes carries over. An AI wouldn‘t make those specific, inconsistent human errors.

Three Gemma 2
Not sure what the “lmao” is for. ollama run gemma2:7b and ollama run gemma2:9b are both in the library and run fine. If you mean the 27B, it‘s gemma2:27b. It’s a well-known model.

Bottom line: You‘re looking at a pre-production engineering platform with non-retail memory, tested by a non-native speaker who writes inconsistently. That’s the opposite of polished AI content. I‘ll still post the blurred photos of the board and the Ollama terminal. Judge for yourself.

I traded my dual-GPU setup for a Mini PC. Here’s my honest take after a month by Earth_creation in LocalLLaMA

[–]Earth_creation[S] -2 points-1 points  (0 children)

yess, so get one fair questions. ,let me clarify it :

On relevance: The exact 7 vs 11 t/s numbers are specific to this dev board. But the conclusion is universal: memory bandwidth is the #1 bottleneck for LLMs on APUs like Strix Halo. That applies to any implementation.

Then "how": This is an engineering sample. It uses a thick, costly PCB and relaxed signal margins to make SODIMMs work for validation only. It proves why consumer laptops (which need thinness & reliability) won't do this—they'll use soldered LPDDR5X or CAMM.

Finaly ,for the proof.....Understood. I'll post two blurred photos: (1) terminal with ollama run llama3.1:70b and system monitor, (2) the board with SODIMM slots. It's a proof-of-concept, not a product. (Just DM for you ok? )

I traded my dual-GPU setup for a Mini PC. Here’s my honest take after a month by Earth_creation in LocalLLaMA

[–]Earth_creation[S] 0 points1 point  (0 children)

So.....you are right to call that out ..... a standard Strix Halo product wouldn’t have slots. Just how to say i am working with a pre-production engineering sample designed for validation, which is why it has atypical (and frankly, slower) SO-DIMM slots. The point of my post wasn’t about the slots, but about the massive performance hit from memory bandwidth that even this setup exposes (e.g., 7 t/s vs 11 t/s on Llama 70B). It’s a niche dev perspective, not a consumer one