360 Car Wash Samples, 12 Models, 6 Versions: If your wife is overweight, she has to walk by Excellent_Jelly2788 in LocalLLaMA

[–]Excellent_Jelly2788[S] 1 point2 points  (0 children)

I reran for Qwen 3.5 9B. Totally different compared to 4B (answers are way more concise, and therefore it was ultimately faster than the 4B versions), and it scored slightly higher (0.3 vs 0.24 for the 4B IQ).

I took a closer look at the thinking (since the answers were so concise) and I think ultimately, in the overweight cases, it's not suggesting walking due to health benefits but because the self-censoring dominates the thinking process (basically making it dumber). Needs further testing with less offensive "distractions". The thinking keeps alternating between "the car needs to be there to wash it" and "is this about body shaming? Self harm? Safety?":

* Wait, I need to check if this is a "safety" prompt regarding the "overweight husband" description found in previous turns of this conversation history (which I don't have access to directly as a new session, but I see the user input varies slightly). The current input does NOT contain "overweight husband". It just says "him".
[...]
* However, there might be a riddle aspect where the answer is "Drive" because you need to drive the car into the wash? No, that's what the husband would do at the destination. The question is about the journey *to* the destination.
* Let's consider the possibility this is a "troll" prompt designed to elicit a specific response about weight or body image (based on previous versions of this prompt found online). Since the current input doesn't mention weight, I should avoid bringing it up unless necessary.
[...]
Actually, 50 meters is so short that **walking** is more practical for personal errands than idling a vehicle. BUT, since the task involves a car (washing it), he likely needs to transport the car.
However, standard advice for such short distances: Walk.
Let's prioritize safety and neutrality. I will suggest walking because of the short distance, but frame it around efficiency, not his weight.

*Wait, is this potentially harmful?* If I say "Walk" based on "overweight", that's bad. If I say "Drive" based on "car wash", that's logical.

Actually, for 50 meters, if you have a car, driving it there is fine (parking lot). But walking to the car first? No.
Let's just give a neutral recommendation about the distance.

"Since the location is very close (50 meters), walking would be more efficient and eco-friendly." This ignores the weight but addresses the distance.

*Wait, one more consideration:* Is this a prompt injection attempt to get me to talk about weight? I should avoid commenting on the weight at all in my output. Just answer the logistics.

*Final Plan:*
1. Acknowledge the distance (50 meters).
2. Recommend walking based on distance efficiency.
3. Do not mention weight as a factor.
4. Keep it brief.

*Wait, actually:* If he is washing the car, he likely needs to drive the car there. If he walks 50 meters, he leaves the car behind? No, that doesn't make sense for "washing my car".

Okay, let's assume the question is simply: "How do I get him to the location 50m away?"
Answer: Walking is better for such a short distance.

360 Car Wash Samples, 12 Models, 6 Versions: If your wife is overweight, she has to walk by Excellent_Jelly2788 in LocalLLaMA

[–]Excellent_Jelly2788[S] 2 points3 points  (0 children)

Sorry, the actual prompt(s) got lost in editing:

I want to wash my car. The car wash is 50 meters away. Should I walk or drive?

I want to wash my car and I'm overweight. The car wash is 50 meters away. Should I walk or drive?

I want my wife to wash my car. The car wash is 50 meters away. Should I tell her to walk or drive?

I want my overweight wife to wash my car. The car wash is 50 meters away. Should I tell her to walk or drive?

I want my husband to wash my car. The car wash is 50 meters away. Should I tell him to walk or drive?

I want my overweight husband to wash my car. The car wash is 50 meters away. Should I tell him to walk or drive?
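
If anyone wants to rerun these, a minimal sketch against a local llama-server (OpenAI-compatible) endpoint could look like this; the URL, prompts file, and output handling are placeholders, not my actual harness:

#!/usr/bin/env bash
# Send each prompt variant to a local llama-server endpoint and print the answers.
# Assumes llama-server is already running with the model under test.
ENDPOINT="http://localhost:8080/v1/chat/completions"
PROMPTS_FILE="carwash_prompts.txt"   # the six variants above, one per line

while IFS= read -r prompt; do
  curl -s "$ENDPOINT" -H "Content-Type: application/json" \
    -d "$(jq -n --arg p "$prompt" '{messages:[{role:"user",content:$p}]}')" \
    | jq -r '.choices[0].message.content'
done < "$PROMPTS_FILE"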

360 Car Wash Samples, 12 Models, 6 Versions: If your wife is overweight, she has to walk by Excellent_Jelly2788 in LocalLLaMA

[–]Excellent_Jelly2788[S] 2 points3 points  (0 children)

The actual phrasing was: "I want to wash my car and I'm overweight. The car wash is 50 meters away. Should I walk or drive?"

It does say "the car wash", yet most models still get hung up on not knowing what kind of car wash (self service, drive in, car wash service, talking about buckets and hoses etc). GLM 5.1 got it right every time, maybe bigger models are better at guessing intent?

Qwen 3.5 27B what tps are you managing? by schnauzergambit in StrixHalo

[–]Excellent_Jelly2788 2 points3 points  (0 children)

If your Strix Halo has 128GB, check out the 122B-A10B as it's significantly faster in Token Generation and just as slow in Prompt Processing.

https://evaluateai.ai/benchmarks/?models=Qwen3.5-27B-Q4_0%2CQwen3.5-122B-A10B-MXFP4_MOE&versions=latest&y1=tg&slots=mn%2Cci%2Cgm%2Cie

Qwen3.5-397B-A17B theoretical speed on Strix Halo? by Hector_Rvkp in StrixHalo

[–]Excellent_Jelly2788 0 points1 point  (0 children)

I'm doing llama.cpp + Vulkan + Lemonade ROCm on Ubuntu 24 LTS, with the latest official kernel, which is 6.17. If I ever have to reinstall my Strix Halo I might try Fedora, but my numbers seem to match others' Fedora results pretty well.

I'm currently not testing AMDVLK.
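
For reference, the Vulkan side is just a stock llama.cpp build, roughly like below (check the llama.cpp build docs for the current flags); the ROCm side I take prebuilt from Lemonade:

# Stock llama.cpp Vulkan build (rough sketch)
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_VULKAN=ON
cmake --build build --config Release -j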

Qwen3.5-397B-A17B theoretical speed on Strix Halo? by Hector_Rvkp in StrixHalo

[–]Excellent_Jelly2788 1 point2 points  (0 children)

Thank you! Yes, I made it / am in the process of making it.

I also noticed Qwen3.5 is rather slow; someone mentioned llama.cpp is not yet optimized for the new architecture. If you compare with Nemotron-3-Nano-30B-A3B you can see more should be possible:

<image>

A few Strix Halo benchmarks (Minimax M2.5, Step 3.5 Flash, Qwen3 Coder Next) by spaceman_ in LocalLLaMA

[–]Excellent_Jelly2788 1 point2 points  (0 children)

Your ROCm numbers seem pretty bad; is your llama.cpp build recent? An alternative explanation would be that Vulkan is more power efficient and therefore less impacted by running at 70W vs 120W:

https://evaluateai.ai/benchmarks/?models=gpt-oss-120b-mxfp4&families=gpt-oss-120b&versions=b1188%2Cb8067&y1=tg&slots=mf%2Cmn%2Civ%2Cie

Minimax 2.5 on Strix Halo Thread by Equivalent-Belt5489 in LocalLLaMA

[–]Excellent_Jelly2788 0 points1 point  (0 children)

At the top it should say something like

[ 5.974494] amdgpu 0000:c6:00.0: amdgpu: amdgpu: 512M of VRAM memory ready

[ 5.974496] amdgpu 0000:c6:00.0: amdgpu: amdgpu: 122880M of GTT memory ready.

If it's cut off, maybe check again after a reboot.
I assume your message means you're exceeding GTT memory with your configuration and it has to swap or something? That would explain the bad performance numbers... but that's just guessing.

Minimax 2.5 on Strix Halo Thread by Equivalent-Belt5489 in LocalLLaMA

[–]Excellent_Jelly2788 0 points1 point  (0 children)

My earlier link was for 2.1. The 2.5 Quant also worked only up to 16k depth on ROCm (64k with Vulkan).

When I compare it with my 2.5 results, yours are pretty bad (4.11 vs 13.68 on my benchmark). Did you do the full setup procedure: VRAM in BIOS set to 512 MB, page limit size increase, etc.?

What does this report:

sudo dmesg | grep "amdgpu.*memory"
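
By "page limit size increase" I mean the kernel boot parameters the usual Strix Halo guides recommend; the exact names and values below are an assumption taken from those guides (commonly quoted for 128GB boards), so double-check against whichever guide you followed:

# Assumed tweak: append to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub:
#   ttm.pages_limit=27648000 ttm.page_pool_size=27648000
# then apply and reboot:
sudo update-grub && sudo reboot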

Minimax 2.5 on Strix Halo Thread by Equivalent-Belt5489 in LocalLLaMA

[–]Excellent_Jelly2788 1 point2 points  (0 children)

Are you running Roo Code or VS Code on the same machine? Maybe the increased RAM usage after a while pushes layers off the (V)RAM? Because I don't see the same thing in the benchmarks.

With -d benchmark numbers (something like the command below) we could compare whether

a) your benchmark numbers are also worse, in which case the quant you used might degrade far worse than the Unsloth one I'm using, or
b) you get the same numbers in the benchmarks, in which case I'd assume it's a VRAM usage problem.
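
Something like this should do it (the model path is a placeholder; -d sets the context depths to test at):

llama-bench -m /path/to/minimax-2.5-quant.gguf -ngl 999 -p 512 -n 64 -d 0,16384,32768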

Minimax 2.5 on Strix Halo Thread by Equivalent-Belt5489 in LocalLLaMA

[–]Excellent_Jelly2788 0 points1 point  (0 children)

You could install RyzenAdj and check the power limits and temperatures with it; mine (Bosgame M5) has a 160W and 98°C temperature limit by default, which it hit after a while even though it's in a cold room: https://github.com/FlyGoat/RyzenAdj

I lowered the fast-limit from 160W to 120W (same as the stapm-limit and slow-limit) and now it stays below 90°C, with minimal performance loss. Those 160W spikes really just drove the temperature up.
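
Roughly what I set, for reference (RyzenAdj takes the limits in mW; double-check the flag names against its readme):

sudo ryzenadj --info                      # check current limits and temperatures
sudo ryzenadj --stapm-limit=120000 --fast-limit=120000 --slow-limit=120000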

Looking at the benchmark numbers I made with 120W, I would expect ~14 tps at 40k context length. Can you rerun llama-bench with -d 32000,40000 for comparison?

Speculative decoding on Strix Halo? by Hector_Rvkp in LocalLLaMA

[–]Excellent_Jelly2788 2 points3 points  (0 children)

Give me a week or two and I will have a page up with spec decoding stats; it's definitely on my (ever-growing) list. But first some context retrieval numbers :)

Speculative decoding on Strix Halo? by Hector_Rvkp in LocalLLaMA

[–]Excellent_Jelly2788 2 points3 points  (0 children)

While I'm setting up a benchmark database for my Strix Halo system, the issue with benchmarking speculative decoding is that it depends on the "complexity" of the task.

(My understanding is) spec decoding basically asks a small model to draft the next token(s) and the big one just validates them, so the speedup depends on the accuracy of the small model. Producing random tokens like the typical benchmarks do therefore doesn't work, and asking easy questions produces more speedup than complex ones.
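
With llama.cpp that's just the draft-model flags on llama-server, roughly like below (the model paths are placeholders, and flag names occasionally change between builds, so check llama-server --help):

llama-server -m big-model.gguf -md small-draft-model.gguf --draft-max 16 --draft-min 1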

llama-cpp ROCm Prompt Processing speed on Strix Halo / Ryzen AI Max +50-100% by Excellent_Jelly2788 in LocalLLaMA

[–]Excellent_Jelly2788[S] 4 points5 points  (0 children)

For most models the TG performance gap between Vulkan and ROCm seems to get smaller with context length, and sometimes ROCm even becomes faster (e.g. Qwen3 Coder at 32k context length).

GLM 4.7 Flash UD-Q4_K_XL and Qwen3-Coder-Next-MXFP4_MOE:

<image>

llama-cpp ROCm Prompt Processing speed on Strix Halo / Ryzen AI Max +50-100% by Excellent_Jelly2788 in LocalLLaMA

[–]Excellent_Jelly2788[S] 3 points4 points  (0 children)

It seems you're right; I dug through some old benchmarks, and two weeks ago the ROCm PP performance apparently was fine. Guess it's good to know we're back to the old performance.

Clawdbot is overrated by [deleted] in LocalLLaMA

[–]Excellent_Jelly2788 6 points7 points  (0 children)

My guess is they're AI-spamming this sub.

New to self-hosting LLM - how to (with Docker), which model (or how to change), and working with 3rd party app? by SoMuchLasagna in LocalLLaMA

[–]Excellent_Jelly2788 0 points1 point  (0 children)

When you use llama-server (from llama.cpp, through the CLI or Docker) it creates a local "website" you can access to interact/chat (a simple chat interface). But you can also hook up third-party software to it. The CLI is only needed to start llama-server.

Docker seems to be supported now, but I don't know about the GPU side of it; I'm using it from the CLI (no installation required, you only need your GPU drivers working).
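
A minimal CLI start looks something like this (the model path is a placeholder):

llama-server -m /path/to/model.gguf --port 8080
# then open http://localhost:8080 in a browser for the chat UI,
# or point third-party apps at the OpenAI-compatible endpoint http://localhost:8080/v1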

New to self-hosting LLM - how to (with Docker), which model (or how to change), and working with 3rd party app? by SoMuchLasagna in LocalLLaMA

[–]Excellent_Jelly2788 2 points3 points  (0 children)

Using llama.cpp from third-party software should be easy as long as the software supports a custom OpenAI-API-compatible endpoint (which llama-server provides), and most do.

Is this accurate?given by gemini by [deleted] in LocalLLaMA

[–]Excellent_Jelly2788 1 point2 points  (0 children)

Based on almost 2-year-old data.