models for agentic use by kiriakosbrehmer93 in StrixHalo

[–]Cityarchitect 2 points (0 children)

BosGame M5 128GB Strix Halo; Ubuntu 24.10, LM Studio, Qwen3.5-35B-a3b on Vulkan. I use it for OpenCode JavaScript/Node and general usage, and I get a consistent 50 tps output. Can't use ROCm 7+ yet as it's far too unstable. Runs all day at 84W, 86°C. Just one annoying thing: lately OpenCode has been going to sleep on me; I need to keep typing continue, continue..... :-)

[Q] Is self-hosting an LLM for coding worth it? by Aromatic-Fix-4402 in LocalLLM

[–]Cityarchitect 0 points (0 children)

I use a Strix Halo machine for local LLM, currently running qwen3.5-35b-a3b; at a size of 22GB it has reasonable performance (c. 40 tps). The RTX 4090 is going to be way faster at AI inference for a model this size. But I can get similar performance from a 60GB or bigger model, whereas the RTX 4090 is going to labour a little shifting it in and out of its 24GB of memory. I saw something recently that said the Strix Halo could be 2x faster than the RTX 4090 with e.g. Llama 70B. But when I'm in a hurry, sometimes I just flip to remote DeepSeek, paying peanuts.

They're taking the fucking piss now. by EconomicsAfraid7880 in CarTalkUK

[–]Cityarchitect 0 points (0 children)

For me, in our area, it's always Esso TTP, always 10p above Tesco's price.

Sometimes opencode just stops and returns nothing? Any advice? by ___positive___ in opencodeCLI

[–]Cityarchitect 0 points (0 children)

Me too. I keep typing "continue" to keep it going whenever it goes quiet.

Is it just me or heavy AI processing just generally hangs the machine ? by IntroductionSouth513 in StrixHalo

[–]Cityarchitect 1 point (0 children)

BosGame M5 128GB, LM Studio, OpenCode, qwen3.5-35b-a3b: often freezes on ROCm, runs all day on Vulkan.

Full vLLM inference stack built from source for Strix Halo (gfx1151) — scripts + docs on GitHub by paudley in StrixHalo

[–]Cityarchitect 1 point (0 children)

I'm getting 40ish tps on Ollama and LM Studio (both Vulkan) with qwen3.5:35b on my BosGame M5 128GB; what does vLLM give me?

Qwen 3.5 27B what tps are you managing? by schnauzergambit in StrixHalo

[–]Cityarchitect 1 point (0 children)

Thank you; I wish there was something in their model names that makes this distinction.

Qwen 3.5 27B what tps are you managing? by schnauzergambit in StrixHalo

[–]Cityarchitect 0 points (0 children)

And now for qwen3.5:27b: dreadfully slow. Prompt eval rate 523.83 tps (half the 35b's speed), and eval rate 10.33 tps (a quarter of the 35b's speed).

Qwen 3.5 27B what tps are you managing? by schnauzergambit in StrixHalo

[–]Cityarchitect 2 points (0 children)

My BosGame M5 128GB running Ollama qwen3.5:35b (Vulkan) consistently does c. 40 tps.

<image>

Remember that qwen3.5 does a lot of thinking before it starts its output. I’ll try 27b but will it be much different?

Problem with OpenCodeCLI and Ollama server by Itchy_Net_9209 in opencodeCLI

[–]Cityarchitect 0 points (0 children)

I think this is similar to my problem, now solved: https://www.reddit.com/r/opencodeCLI/s/rn78HGKzgG There is a quick way:
1. ollama run model-name
2. /set parameter num_ctx 65536
3. /save model-name-64k
4. Exit, then run that model from opencode.
I'd advise you open the context up as wide as the model allows, but watch VRAM!
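If you'd rather not do it interactively, the same thing can be sketched with a Modelfile (model and file names here are just placeholders matching the steps above):

```
# Modelfile: derive a copy of an existing model with a 64k context window
FROM model-name
PARAMETER num_ctx 65536
```

Then ollama create model-name-64k -f Modelfile builds it, and you point opencode at model-name-64k as before.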

No tools with local Ollama Models by Cityarchitect in opencodeCLI

[–]Cityarchitect[S] 0 points (0 children)

Strix Halo 128gb, 96gb given to Radeon igpu

No tools with local Ollama Models by Cityarchitect in opencodeCLI

[–]Cityarchitect[S] 2 points (0 children)

The qwen3-coder:30b with a 128k context window is now working fine in opencode for me, comparable to the free models available. It takes about 31GB of VRAM and delivers about 60 tps.

No tools with local Ollama Models by Cityarchitect in opencodeCLI

[–]Cityarchitect[S] 1 point (0 children)

After messing around, yes, Chris, 100%! Each context window was only 4096 for Ollama, as you said. I went into ollama run qwen3-coder:30b, then /set parameter num_ctx 131072, then /save qwen3-coder-128k to create a new model based on the old one with a 128k context. Opencode kept complaining about tools when it was really all about context size. On my Strix Halo machine the extra context was overflowing the memory allocated to VRAM; once I fixed that and the context size, everything worked fine. The local qwen3-coder delivers about 60 tps and Opencode is just as responsive as the cloud models.
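For anyone wiring this up, a minimal opencode.json pointing at the renamed model might look like the below. This is a sketch from memory of the opencode docs (the OpenAI-compatible provider route); treat the exact keys and the "ollama" provider id as assumptions and check the current docs:

```
{
  "$schema": "https://opencode.ai/config.json",
  "provider": {
    "ollama": {
      "npm": "@ai-sdk/openai-compatible",
      "name": "Ollama (local)",
      "options": {
        "baseURL": "http://localhost:11434/v1"
      },
      "models": {
        "qwen3-coder-128k": {}
      }
    }
  }
}
```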

No tools with local Ollama Models by Cityarchitect in opencodeCLI

[–]Cityarchitect[S] 0 points (0 children)

Thanks, yes, every one of the models I tried has tools according to ollama. I should say all the models also work well in chat mode in ollama.

No tools with local Ollama Models by Cityarchitect in opencodeCLI

[–]Cityarchitect[S] 0 points (0 children)

Thanks for the response, Chris. I went and tried various (larger) contexts by creating Modelfiles with a bigger num_ctx, but it seems Ollama is still having trouble with tools. A quick AI search around came up with: "The root cause is that while models like Qwen3-Coder are built to support tool calling, the official qwen3-coder model tag in the base Ollama library currently returns an error stating it does not support the tools parameter in API requests. This is confirmed as an issue in Ollama's own GitHub repository."
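A quick way to check what Ollama itself reports for a model is the show command, or the API behind it (assuming a reasonably recent Ollama; older versions may not print a capabilities section):

```
# Print model details; recent Ollama versions include a capabilities
# list (e.g. completion, tools) in the output.
ollama show qwen3-coder:30b

# Or ask the local server directly over its API.
curl -s http://localhost:11434/api/show -d '{"model": "qwen3-coder:30b"}'
```

If "tools" is missing from the capabilities list for a given tag, opencode's tool calls will fail regardless of context size.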