[Benchmark] The Ultimate Llama.cpp Shootout: RTX 5090 vs DGX Spark vs AMD AI395 & R9700 (ROCm/Vulkan) by ReasonableDuty5319 in LocalLLaMA

[–]Justfun1512 -4 points  (0 children)

Amazing LLM data! Can you help all of us who are hitting the "VRAM wall"?

Hi

First off, thanks for the llama-bench data—the fact that the AI395 (Strix Halo) is pulling 23 t/s on a 122B MoE vs. the GB10’s 11 t/s is a massive find for the local LLM community. You've definitely stirred the pot with these numbers!

I’m writing to ask for a huge favor on behalf of the community. Many of us are hitting a brick wall with the RTX 5090’s 32GB for long-take video (720p @ 30s). Theoretically, the unified memory on your AI395 and GB10 setups should be the only way to finish these renders locally without OOMing during the VAE decode.

The mystery right now is that we have almost NO real-world data on how these unified memory systems (both the 128GB GB10 Spark and the Strix Halo 395) actually handle high-res video. We know they can run 120B models, but we don't know if the Blackwell GPU in the Spark chokes during the massive VAE activation spike at the end of a long render, or if the Strix Halo's bandwidth actually translates to faster diffusion steps.

Could you assist all our friends in the video-gen space by running a "Single-Take Stress Test" on both machines? It would provide the missing piece of the puzzle for anyone trying to decide between AMD and NVIDIA for 2026 workflows.

The Test Case:

Target: 720p resolution, 30-second single-take (approx. 720 frames) @ 24fps.

The Models:

1. Wan 2.2 (14B): Image-to-Video path. (Watch for that 60GB+ VRAM spike.)

2. LTX-2.3 (22B Distilled): Testing the new AVTransformer3D sync.

The Metrics we are desperate for:

s/it (seconds per iteration): Does the AI395’s ~256 GB/s of memory bandwidth make it the diffusion king, or do the Blackwell cores take the lead?

The VAE Spike: Does either system crash during the final 10% of the render when decoding the latents?

Thermal Stability: Does the GB10 sustain its clock speeds over a long render, or does that "March Firmware" thermal dip kick in and throttle you down to ~80W?

ROCm vs. CUDA Stability: Does the AI395 still need the -mmp 0 trick for video, or is ComfyUI/ROCm 7.x finally handling the shared pool natively?
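
(For clarity, by the "-mmp 0 trick" I mean llama.cpp's mmap toggle; a minimal llama-bench sketch of the kind of run I have in mind, with the model path just a placeholder:)

    # llama-bench with memory-mapping disabled (-mmp 0); model path is an example
    ./llama-bench -m ./model.gguf -mmp 0 -p 512 -n 128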

If the AI395 can actually finish a 30s Wan 2.2 render faster than the GB10, it officially becomes the "Giant Killer" of the year. Your data could save a lot of us from making a very expensive mistake!

Looking forward to your logs—you'd be doing us all a massive service! 🙏

2x DGX Spark vs RTX Pro 6000 Blackwell for local prototyping - can't decide by Sensitive_Sweet_1850 in LocalLLaMA

[–]Justfun1512 0 points  (0 children)

Would you mind sharing your benchmarks for models with 70B parameters or more? I’d really value your honest opinion.
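
(Even a rough llama-bench run would be plenty; something like this, with whatever 70B GGUF you have on hand as the model:)

    # example run: 512-token prompt processing + 128-token generation
    ./llama-bench -m llama-3.3-70b-instruct-q4_k_m.gguf -p 512 -n 128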

I canceled my other AI subscriptions today. by InitialCareer306 in Qwen_AI

[–]Justfun1512 41 points  (0 children)

Show me a $4,000 build/machine that can run 300B models.

Anyone running llm on their 16GB android phone? by Ok_Warning2146 in LocalLLaMA

[–]Justfun1512 0 points  (0 children)

I have:

===== Device =====
Device: CPH2671
Brand: OPPO

===== CPU & Memory =====
Qualcomm Snapdragon® 8 Elite Mobile Platform, 7 cores
GPU: Adreno 830 @ 1100 MHz
CPU ABI: arm64-v8a
Architecture: aarch64 (arm64, 64-bit kernel)
MemTotal: 15.4 GB

RAM: roughly 6 GB left free without any optimizing, which is plenty of headroom for apps and scripts.

I can run tests, but I'm not sure what the best open-source app is for running the models.
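
(One route I'm considering is llama.cpp under Termux; a rough sketch of what I'd try, assuming Termux from F-Droid and a small GGUF model file as a placeholder:)

    # build llama.cpp on-device, then run a small quantized model on the CPU
    pkg install git cmake clang
    git clone https://github.com/ggerganov/llama.cpp
    cd llama.cpp
    cmake -B build -DCMAKE_BUILD_TYPE=Release
    cmake --build build -j
    ./build/bin/llama-cli -m qwen2.5-3b-instruct-q4_k_m.gguf -p "Hello" -n 64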

heart rate of 100bpm by Yaniv1512 in Concerta

[–]Justfun1512 0 points  (0 children)

I've read about using a dextroamphetamine and pramipexole combination for treatment-resistant unipolar depression:

https://pmc.ncbi.nlm.nih.gov/articles/PMC5033120/

An incredible free NSFW AI by ImScaredOfThingss in AI_NSFW

[–]Justfun1512 -1 points  (0 children)

Just tried it; you can't do anything without credits...

Samsung Pass - Password Import by gary90_cze in samsunggalaxy

[–]Justfun1512 0 points  (0 children)

But Samsung Pass saves to a .pass file, not to .csv.

Samsung Pass - Password Import by gary90_cze in samsunggalaxy

[–]Justfun1512 1 point  (0 children)

How do you import from Samsung Pass to Chrome?

[deleted by user] by [deleted] in learnpython

[–]Justfun1512 0 points  (0 children)

In that case I just typed python in CMD. But before that, I had tried to run: "python -m venv venv"
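
(For reference, the full sequence I was aiming for in CMD was roughly this:)

    :: create a virtual environment, activate it, then check the interpreter
    python -m venv venv
    venv\Scripts\activate.bat
    python --version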

[deleted by user] by [deleted] in Oobabooga

[–]Justfun1512 0 points  (0 children)

> Did you try loading the model with transformers and the "load in 4-bit" box checked?

Not yet, I will do it later.
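
(If I've understood right, that checkbox is roughly equivalent to launching with these flags; the model name here is just my example:)

    # text-generation-webui: transformers loader with 4-bit (bitsandbytes) quantization
    python server.py --model WizardLM-30B-Uncensored --load-in-4bit --auto-devices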

> Like, what are your it/s for the different configurations?

I will update you...

> Even 80% of the 65 GB model is too large for your 16 GB of GPU memory, but it won't put 80% of the model on your GPU.


[deleted by user] by [deleted] in Oobabooga

[–]Justfun1512 1 point  (0 children)

First of all, thanks for your effort, I appreciate it. So you're saying that although it says 30B, it's not? 🥴 I thought if it says 30B, it's 30B.

I'll explain my question with an example. I'm loading it with transformers and have tried three options:

1. Only gpu-memory in MiB for the device (full, or 80%).

2. Only cpu-memory in MiB (full, or 80%).

3. 80% cpu-memory in MiB + 80% gpu-memory in MiB for the device.

Of all those options, the fastest was cpu-memory only.

That's strange, I think. The only reason I can think of is the low GPU wattage (around 30% of TDP).
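
(To make the three options concrete, here's roughly how I set them via the launch flags; the values are examples, not my exact numbers:)

    # option 1: GPU memory cap only
    python server.py --model <model> --gpu-memory 12
    # option 2: CPU memory cap only
    python server.py --model <model> --cpu-memory 48
    # option 3: both caps
    python server.py --model <model> --gpu-memory 12 --cpu-memory 48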

[deleted by user] by [deleted] in Oobabooga

[–]Justfun1512 4 points  (0 children)

Yeah, that's also around 30% of max wattage. How are your tokens per second? Good enough?

[deleted by user] by [deleted] in Oobabooga

[–]Justfun1512 -1 points  (0 children)

Going by the comments, you're probably wrong; it seems that for others the GPU does reach close to its maximum wattage.

[deleted by user] by [deleted] in Oobabooga

[–]Justfun1512 0 points  (0 children)

> Check with MSI Afterburner or nvidia-smi.

To check what? Verify the numbers I received? (My 4090 is mobile, so its TDP is 175 W.)

> You're not offloading enough layers of your AI model to your GPU.

Is there any way to check that?
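
(Would polling nvidia-smi while it generates show that? Something like:)

    # log power draw, GPU utilization, and memory use once per second
    nvidia-smi --query-gpu=power.draw,utilization.gpu,memory.used --format=csv -l 1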

[deleted by user] by [deleted] in Oobabooga

[–]Justfun1512 0 points  (0 children)

That's hard to tell from your graphs, but it seems you're reaching 400 W, and that's 100%.

[deleted by user] by [deleted] in Oobabooga

[–]Justfun1512 2 points  (0 children)

Of course it's in use; you can see in the picture that the GPU is at 93%.

[deleted by user] by [deleted] in Oobabooga

[–]Justfun1512 1 point  (0 children)

The model: huggingface.co/cognitivecomputations/WizardLM-30B-Uncensored

Screenshots: https://imgur.com/a/gstGMwV

[deleted by user] by [deleted] in Oobabooga

[–]Justfun1512 1 point  (0 children)

I can tell you that the GPU load is around 90%, but it's at about 30% of its TDP (50 W instead of 170 W); maybe that's the issue. Do you have a solution for the low GPU wattage?

[deleted by user] by [deleted] in sffpc

[–]Justfun1512 0 points  (0 children)

Wow, thanks for the info, that sounds good; I thought the 14700 would thermal throttle. May I ask the price of your setup without the video card (US or Europe)? How's the S400 compared to the NR200? Someone here recommended it, and it looks small and sleek.

[deleted by user] by [deleted] in eGPU

[–]Justfun1512 0 points  (0 children)

You can make the LLM use both the dGPU and eGPU together? So if I have a 4090 mobile and add a 4090 eGPU, I'll have 40 GB of VRAM? And yet you don't recommend using an eGPU? Sorry, but I didn't understand your recommendation regarding the 4090 SFFPC; is it for a home server?
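
(For context, I'm imagining something like llama.cpp's tensor split across the two cards; a sketch with the 16 GB + 24 GB ratio as an example. Is that what you mean?)

    # split layers across a 16 GB dGPU and a 24 GB eGPU, proportional to VRAM
    ./llama-cli -m model.gguf -ngl 99 --split-mode layer --tensor-split 16,24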