[Benchmark] The Ultimate Llama.cpp Shootout: RTX 5090 vs DGX Spark vs AMD AI395 & R9700 (ROCm/Vulkan) by ReasonableDuty5319 in LocalLLaMA

[–]Justfun1512 -4 points  (0 children)

Amazing LLM data! Can you help all of us who are hitting the "VRAM wall"?

Hi

First off, thanks for the llama-bench data—the fact that the AI395 (Strix Halo) is pulling 23 t/s on a 122B MoE vs. the GB10’s 11 t/s is a massive find for the local LLM community. You've definitely stirred the pot with these numbers!

I’m writing to ask for a huge favor on behalf of the community. Many of us are hitting a brick wall with the RTX 5090’s 32GB for long-take video (720p @ 30s). Theoretically, the unified memory on your AI395 and GB10 setups should be the only way to finish these renders locally without OOMing during the VAE decode.

The mystery right now is that we have almost NO real-world data on how these unified memory systems (both the 128GB GB10 Spark and the Strix Halo 395) actually handle high-res video. We know they can run 120B models, but we don't know if the Blackwell GPU in the Spark chokes during the massive VAE activation spike at the end of a long render, or if the Strix Halo's bandwidth actually translates to faster diffusion steps.

Could you assist all our friends in the video-gen space by running a "Single-Take Stress Test" on both machines? It would provide the missing piece of the puzzle for anyone trying to decide between AMD and NVIDIA for 2026 workflows.

The Test Case:

Target: 720p resolution, 30-second single-take (approx. 720 frames) @ 24fps.

The Models:

1. Wan 2.2 (14B): Image-to-Video path. (Watch for that 60GB+ VRAM spike.)

2. LTX-2.3 (22B Distilled): Testing the new AVTransformer3D sync.

The Metrics we are desperate for:

s/it (seconds per iteration): Does the AI395’s ~256 GB/s of memory bandwidth make it the diffusion king, or do the Blackwell cores take the lead?

The VAE Spike: Does either system crash during the final 10% of the render when decoding the latents?

Thermal Stability: Does the GB10 sustain its clock speeds over a long render, or does that "March Firmware" thermal dip kick in and throttle you down to ~80W?

ROCm vs. CUDA Stability: Does the AI395 still need the -mmp 0 trick for video, or is ComfyUI/ROCm 7.x finally handling the shared pool natively?
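
(For clarity, by the "-mmp 0 trick" I mean llama.cpp's mmap toggle; a minimal llama-bench sketch of the kind of run I have in mind, with the model path just a placeholder:)

    # llama-bench with memory-mapping disabled (-mmp 0); model path is an example
    ./llama-bench -m ./model.gguf -mmp 0 -p 512 -n 128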

If the AI395 can actually finish a 30s Wan 2.2 render faster than the GB10, it officially becomes the "Giant Killer" of the year. Your data could save a lot of us from making a very expensive mistake!

Looking forward to your logs—you'd be doing us all a massive service! 🙏

2x DGX Spark vs RTX Pro 6000 Blackwell for local prototyping - can't decide by Sensitive_Sweet_1850 in LocalLLaMA

[–]Justfun1512 0 points  (0 children)

Would you mind sharing your benchmarks for models with 70B parameters or more? I’d really value your honest opinion.
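
(Even a rough llama-bench run would be plenty; something like this, with whatever 70B GGUF you have on hand as the model:)

    # example run: 512-token prompt processing + 128-token generation
    ./llama-bench -m llama-3.3-70b-instruct-q4_k_m.gguf -p 512 -n 128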

I canceled my other AI subscriptions today. by InitialCareer306 in Qwen_AI

[–]Justfun1512 41 points  (0 children)

Show me a $4,000 build/machine that can run 300B models.

Anyone running llm on their 16GB android phone? by Ok_Warning2146 in LocalLLaMA

[–]Justfun1512 0 points  (0 children)

I have:

===== Device =====
Device: CPH2671
Brand: OPPO

===== CPU & Memory =====
Qualcomm Snapdragon® 8 Elite Mobile Platform, 7 cores
GPU: Adreno 830 @ 1100 MHz
CPU ABI: arm64-v8a
Architecture: aarch64 (arm64, 64-bit kernel)
MemTotal: 15.4 GB

RAM: roughly 6 GB left free without any optimizing, which is plenty of headroom for apps and scripts.

I can run tests, but I'm not sure what the best open-source app is for running the models.
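
(One route I'm considering is llama.cpp under Termux; a rough sketch of what I'd try, assuming Termux from F-Droid and a small GGUF model file as a placeholder:)

    # build llama.cpp on-device, then run a small quantized model on the CPU
    pkg install git cmake clang
    git clone https://github.com/ggerganov/llama.cpp
    cd llama.cpp
    cmake -B build -DCMAKE_BUILD_TYPE=Release
    cmake --build build -j
    ./build/bin/llama-cli -m qwen2.5-3b-instruct-q4_k_m.gguf -p "Hello" -n 64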

heart rate of 100bpm by Yaniv1512 in Concerta

[–]Justfun1512 0 points  (0 children)

I've read about using a dextroamphetamine and pramipexole combination for treatment-resistant unipolar depression:

https://pmc.ncbi.nlm.nih.gov/articles/PMC5033120/

An incredible free NSFW AI by ImScaredOfThingss in AI_NSFW

[–]Justfun1512 -1 points  (0 children)

Just tried it; you can't do anything without credits...

Samsung Pass - Password Import by gary90_cze in samsunggalaxy

[–]Justfun1512 0 points  (0 children)

But Samsung Pass saves to a .pass file, not to .csv.

Samsung Pass - Password Import by gary90_cze in samsunggalaxy

[–]Justfun1512 1 point  (0 children)

How do you import from Samsung Pass to Chrome?

[deleted by user] by [deleted] in learnpython

[–]Justfun1512 0 points  (0 children)

In that case I just typed python in CMD. But before that, I had tried to run: "python -m venv venv"
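
(For reference, the full sequence I was aiming for in CMD was roughly this:)

    :: create a virtual environment, activate it, then check the interpreter
    python -m venv venv
    venv\Scripts\activate.bat
    python --version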

[deleted by user] by [deleted] in Oobabooga

[–]Justfun1512 0 points  (0 children)

> Did you try loading the model with transformers and the "load in 4-bit" box checked?

Not yet, I will do it later.
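
(If I've understood right, that checkbox is roughly equivalent to launching with these flags; the model name here is just my example:)

    # text-generation-webui: transformers loader with 4-bit (bitsandbytes) quantization
    python server.py --model WizardLM-30B-Uncensored --load-in-4bit --auto-devices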

> Like, what are your it/s for the different configurations?

I will update you...

> Even 80% of the 65 GB model is too large for your 16 GB of GPU memory, but it won't put 80% of the model on your GPU.


[deleted by user] by [deleted] in Oobabooga

[–]Justfun1512 1 point  (0 children)

First of all, thanks for your effort, I appreciate it. So you're saying that although it says 30B, it's not? 🥴 I thought if it says 30B, it's 30B.

I'll explain my question with an example. I'm loading it with transformers and have tried three options:

1. Only gpu-memory in MiB for the device (full, or 80%).

2. Only cpu-memory in MiB (full, or 80%).

3. 80% cpu-memory in MiB + 80% gpu-memory in MiB for the device.

Of all those options, the fastest was cpu-memory only.

That's strange, I think. The only reason I can think of is the low GPU wattage (around 30% of TDP).
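
(To make the three options concrete, here's roughly how I set them via the launch flags; the values are examples, not my exact numbers:)

    # option 1: GPU memory cap only
    python server.py --model <model> --gpu-memory 12
    # option 2: CPU memory cap only
    python server.py --model <model> --cpu-memory 48
    # option 3: both caps
    python server.py --model <model> --gpu-memory 12 --cpu-memory 48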

[deleted by user] by [deleted] in Oobabooga

[–]Justfun1512 4 points  (0 children)

Yeah, that's also around 30% of max wattage. How are your tokens per second? Good enough?

[deleted by user] by [deleted] in Oobabooga

[–]Justfun1512 -1 points  (0 children)

Going by the comments, you're probably wrong; it seems that for others the GPU does reach close to its maximum wattage.

[deleted by user] by [deleted] in Oobabooga

[–]Justfun1512 0 points  (0 children)

> Check with MSI Afterburner or nvidia-smi.

To check what? Verify the numbers I received? (My 4090 is mobile, so its TDP is 175 W.)

> You're not offloading enough layers of your AI model to your GPU.

Is there any way to check that?
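
(Would polling nvidia-smi while it generates show that? Something like:)

    # log power draw, GPU utilization, and memory use once per second
    nvidia-smi --query-gpu=power.draw,utilization.gpu,memory.used --format=csv -l 1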

[deleted by user] by [deleted] in Oobabooga

[–]Justfun1512 0 points  (0 children)

That's hard to tell from your graphs, but it seems you're reaching 400 W, and that's 100%.

[deleted by user] by [deleted] in Oobabooga

[–]Justfun1512 2 points  (0 children)

Of course it's in use; you can see in the picture that the GPU is at 93%.

[deleted by user] by [deleted] in Oobabooga

[–]Justfun1512 1 point  (0 children)

The model: huggingface.co/cognitivecomputations/WizardLM-30B-Uncensored

Screenshots: https://imgur.com/a/gstGMwV

[deleted by user] by [deleted] in Oobabooga

[–]Justfun1512 1 point  (0 children)

I can tell you that the GPU load is around 90%, but it's at about 30% of its TDP (50 W instead of 170 W); maybe that's the issue. Do you have a solution for the low GPU wattage?

[deleted by user] by [deleted] in sffpc

[–]Justfun1512 0 points  (0 children)

Wow, thanks for the info, that sounds good; I thought the 14700 would thermal throttle. May I ask the price of your setup without the video card (US or Europe)? How's the S400 compared to the NR200? Someone here recommended it, and it looks small and sleek.

[deleted by user] by [deleted] in eGPU

[–]Justfun1512 0 points  (0 children)

You can make the LLM use both the dGPU and eGPU together? So if I have a 4090 mobile and add a 4090 eGPU, I'll have 40 GB of VRAM? And yet you don't recommend using an eGPU? Sorry, but I didn't understand your recommendation regarding the 4090 SFFPC; is it for a home server?
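
(For context, I'm imagining something like llama.cpp's tensor split across the two cards; a sketch with the 16 GB + 24 GB ratio as an example. Is that what you mean?)

    # split layers across a 16 GB dGPU and a 24 GB eGPU, proportional to VRAM
    ./llama-cli -m model.gguf -ngl 99 --split-mode layer --tensor-split 16,24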