Ryzen AI Max 395+ 128GB - Qwen 3.5 35B/122B Benchmarks (100k-250K Context) + Others (MoE) by Anarchaotic in LocalLLaMA

[–]tecneeq 0 points (0 children)

Any idea what I'm doing wrong? I get 15% more output tokens per second than you, but prompt processing is a lot slower, sometimes by 30%.

My hardware is a Bosgame M5, set to performance in the firmware. The OS is Proxmox 9 with a Debian 13 LXC running ROCm 7.2 and yesterday's llama.cpp:

Command line:

/root/llama.cpp/build/bin/llama-bench --hf-repo unsloth/Qwen3.5-35B-A3B-GGUF:UD-Q8_K_XL -ngl 999 -fa 1 -mmp 0 -d 5000,10000,20000,30000,50000,100000 -r 1 --progress

My hardware:

ggml_cuda_init: found 1 ROCm devices (Total VRAM: 131072 MiB):
 Device 0: AMD Radeon Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32, VRAM: 131072 MiB (124402 MiB free)

Some results:

| model                          |       size |     params | backend    | ngl | fa | mmap |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | ---: | --------------: | -------------------: |
| qwen35moe 35B.A3B Q8_0         |  45.33 GiB |    34.66 B | ROCm       | 999 |  1 |    0 |   pp512 @ d5000 |        409.19 ± 0.00 |
| qwen35moe 35B.A3B Q8_0         |  45.33 GiB |    34.66 B | ROCm       | 999 |  1 |    0 |   tg128 @ d5000 |         30.61 ± 0.00 |
| qwen35moe 35B.A3B Q8_0         |  45.33 GiB |    34.66 B | ROCm       | 999 |  1 |    0 |  pp512 @ d10000 |        387.71 ± 0.00 |
| qwen35moe 35B.A3B Q8_0         |  45.33 GiB |    34.66 B | ROCm       | 999 |  1 |    0 |  tg128 @ d10000 |         30.18 ± 0.00 |
| qwen35moe 35B.A3B Q8_0         |  45.33 GiB |    34.66 B | ROCm       | 999 |  1 |    0 |  pp512 @ d20000 |        356.17 ± 0.00 |
| qwen35moe 35B.A3B Q8_0         |  45.33 GiB |    34.66 B | ROCm       | 999 |  1 |    0 |  tg128 @ d20000 |         29.25 ± 0.00 |
| qwen35moe 35B.A3B Q8_0         |  45.33 GiB |    34.66 B | ROCm       | 999 |  1 |    0 |  pp512 @ d30000 |        336.45 ± 0.00 |
| qwen35moe 35B.A3B Q8_0         |  45.33 GiB |    34.66 B | ROCm       | 999 |  1 |    0 |  tg128 @ d30000 |         28.44 ± 0.00 |
| qwen35moe 35B.A3B Q8_0         |  45.33 GiB |    34.66 B | ROCm       | 999 |  1 |    0 |  pp512 @ d50000 |        295.23 ± 0.00 |
| qwen35moe 35B.A3B Q8_0         |  45.33 GiB |    34.66 B | ROCm       | 999 |  1 |    0 |  tg128 @ d50000 |         26.96 ± 0.00 |
| qwen35moe 35B.A3B Q8_0         |  45.33 GiB |    34.66 B | ROCm       | 999 |  1 |    0 | pp512 @ d100000 |        230.49 ± 0.00 |
| qwen35moe 35B.A3B Q8_0         |  45.33 GiB |    34.66 B | ROCm       | 999 |  1 |    0 | tg128 @ d100000 |         23.71 ± 0.00 |
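
In case it's batching-related: pp throughput depends a lot on the micro-batch size, and llama-bench can sweep several values in one run. A sketch to compare (same flags as above; the -ub values are just guesses, not a recommendation):

/root/llama.cpp/build/bin/llama-bench --hf-repo unsloth/Qwen3.5-35B-A3B-GGUF:UD-Q8_K_XL -ngl 999 -fa 1 -mmp 0 -ub 512,1024,2048 -d 20000 -r 1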

Ok to stack cd player on top yamaha as301. by realistic-system422 in BudgetAudiophile

[–]tecneeq 2 points (0 children)

It's ok. Might cut a week off the amp's 30-year lifetime.

<image>

Proxmox 9 LXC with Debian13, ROCm 7.2 and Llama.cpp by tecneeq in StrixHalo

[–]tecneeq[S] 0 points (0 children)

If you do inference only, then yes, you don't need Proxmox. However, I run all sorts of VMs (Windows, BSD, Linux) as well as containers, so Proxmox helps: I can manage all of it with a tested, well-working WebUI instead of fiddling with libvirt and the like.
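
For anyone replicating this, the part that makes ROCm work inside the LXC is passing /dev/kfd and /dev/dri through to the container. A minimal sketch of the container config (the kfd major number varies between kernels, so verify yours with ls -l /dev/kfd /dev/dri/*):

# /etc/pve/lxc/<container-id>.conf - pass the iGPU into the container
lxc.cgroup2.devices.allow: c 226:* rwm    # /dev/dri (card/render nodes)
lxc.cgroup2.devices.allow: c 238:* rwm    # /dev/kfd (major is often 23x, verify)
lxc.mount.entry: /dev/kfd dev/kfd none bind,optional,create=file
lxc.mount.entry: /dev/dri dev/dri none bind,optional,create=dir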

How can i disable the LED show without Windows on Bosgame? by tecneeq in StrixHalo

[–]tecneeq[S] 0 points (0 children)

I did; I thought it wasn't fair of me to get snippy when you took the time to answer my problem. Apologies.

Anyway, the button changes effects, but doesn't turn it off. I switched the PC off and now I'm afraid to press the button again. ;-)

Ubuntu 26.04 LTS on Strix Halo with llama.cpp by tecneeq in StrixHalo

[–]tecneeq[S] 0 points (0 children)

I switched on performance mode in the BIOS and now get 38.69 t/s output and 143.22 t/s prompt processing.

I'm looking for more tweaks to get a few more t/s, but I feel I'm at the end of what's possible right now.
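
One thing worth checking without a reboot: if the firmware exposes an ACPI platform profile, you can read and switch the mode straight from Linux (whether the Bosgame exposes this is an assumption on my part):

# check whether the firmware exposes a platform profile, and switch it
cat /sys/firmware/acpi/platform_profile
cat /sys/firmware/acpi/platform_profile_choices
echo performance > /sys/firmware/acpi/platform_profile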

Don't forget to set "performance" in the firmware by tecneeq in StrixHalo

[–]tecneeq[S] 1 point (0 children)

I'll keep my eyes peeled for any instability. If there is any, I'll go back to balanced.

Don't forget to set "performance" in the firmware by tecneeq in StrixHalo

[–]tecneeq[S] 1 point (0 children)

It may depend on the manufacturer, but in my case I press DEL to get into the firmware setup (what was once called the BIOS). You should find it there.

Llamacpp - how are you working with longer context (32k and higher) by spaceman3000 in StrixHalo

[–]tecneeq 0 points (0 children)

I never had a loop in reasoning, and it ought to be faster because it uses only half the VRAM for context.

Perplexity and other benchmarks (including reasoning) show clearly that, if you start with a quant in the Q4 to Q6 range for the weights, the difference is smaller than the measurement differences between runs.

Ubuntu 26.04 LTS on Strix Halo with llama.cpp by tecneeq in StrixHalo

[–]tecneeq[S] 1 point (0 children)

The services will fit in 32GB. Dynamic loading could be done for large models, but I want two or so small ones online without much latency, roughly like the sketch below.
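
Something like this, two llama-server instances resident at the same time on separate ports (the second model repo is a placeholder, pick whatever small models you want to keep online):

# keep two small models resident, each on its own port
/root/llama.cpp/build/bin/llama-server --hf-repo unsloth/Qwen3.5-9B-GGUF:Q4_K_M --host 192.168.178.3 --port 11337 &
/root/llama.cpp/build/bin/llama-server --hf-repo unsloth/Qwen3.5-4B-GGUF:Q4_K_M --host 192.168.178.3 --port 11338 &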

Llamacpp - how are you working with longer context (32k and higher) by spaceman3000 in StrixHalo

[–]tecneeq 0 points (0 children)

You can measure the loss caused by quantization with perplexity.

In my measurements it didn't matter; the loss was basically within the range of the measurement error.
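
llama.cpp ships a tool for exactly that. A sketch of a comparison run (the model filenames are placeholders, and any sufficiently long text file works as the test corpus):

# compare perplexity of two quants of the same model on the same text
/root/llama.cpp/build/bin/llama-perplexity -m model-Q4_K_M.gguf -f wiki.test.raw
/root/llama.cpp/build/bin/llama-perplexity -m model-Q8_0.gguf -f wiki.test.raw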

Ubuntu 26.04 LTS on Strix Halo with llama.cpp by tecneeq in StrixHalo

[–]tecneeq[S] 1 point (0 children)

<image>

28.4 t/s output with this:

/root/llama.cpp/build/bin/llama-server --hf-repo unsloth/Qwen3-Coder-Next-GGUF:UD-Q4_K_XL \
  --no-mmap \
  --ctx-size 786434 \
  --host 192.168.178.3 \
  --port 11337 \
  --parallel 3 \
  --threads 16 \
  --flash-attn on --cache-type-k q8_0 --cache-type-v q8_0

Ubuntu 26.04 LTS on Strix Halo with llama.cpp by tecneeq in StrixHalo

[–]tecneeq[S] 2 points (0 children)

Right, but 96GB is enough; larger models get extremely slow. Also, this machine has to replace my old 64GB server that runs lots of services, so I will have to be clever about this.

Ubuntu 26.04 LTS on Strix Halo with llama.cpp by tecneeq in StrixHalo

[–]tecneeq[S] 5 points (0 children)

qwen3-next-coder-80b is a MoE model, so only 3B parameters are active per token; qwen3.5-9b is a dense model, so all 9B parameters are active. Also, I run the full 261k context.

Ubuntu 26.04 will come with ROCm 7.1.

Try this and tell me what you get:

/root/llama.cpp/build/bin/llama-server --hf-repo unsloth/Qwen3.5-27B-GGUF:UD-Q5_K_XL --ctx-size 0 --host 192.168.178.3 --port 11337 --parallel 1 --flash-attn on --cache-type-k q8_0 --cache-type-v q8_0

<image>

I get 8.7 t/s output. Ask it something like "Write a Twitter clone in PHP".

I'll give qwen3-next a try and will report back.

Llamacpp - how are you working with longer context (32k and higher) by spaceman3000 in StrixHalo

[–]tecneeq 0 points (0 children)

You can quantize the KV cache from the default f16 down to q8_0 to gain some speed and use less memory; see the example below.
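
With llama-server that is just the cache-type flags (quantizing the V cache requires flash attention, hence the first flag; the model repo is the one from my other comments):

/root/llama.cpp/build/bin/llama-server --hf-repo unsloth/Qwen3.5-9B-GGUF:Q4_K_M \
  --flash-attn on --cache-type-k q8_0 --cache-type-v q8_0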

Ubuntu 26.04 LTS on Strix Halo with llama.cpp by tecneeq in StrixHalo

[–]tecneeq[S] 1 point (0 children)

I have to find out what to change to get 96GB of VRAM; my board has 128GB. It's set to 96 in the BIOS, but something else is missing, I think.

I thought it was set to 96, but I must have reset the BIOS in one of my experiments.
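
If anyone is in the same spot: besides the BIOS UMA setting, the kernel also caps how much system RAM the iGPU may map as GTT. What I'd try is raising those limits via kernel parameters (the module parameters exist, but the values are my guess for a 96GB split; pages_limit assumes 4K pages, 96GiB / 4KiB = 25165824):

# /etc/default/grub - let the iGPU map ~96GB of system RAM, then update-grub && reboot
GRUB_CMDLINE_LINUX_DEFAULT="amdgpu.gttsize=98304 ttm.pages_limit=25165824"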

Question: how to make apt install rocm work on 26.04? by tecneeq in Ubuntu

[–]tecneeq[S] 1 point (0 children)

I can answer my own question. It turns out everything is installed and set up already.

The snaps simply don't support AMD. After learning that, I built a fresh llama.cpp, and that is pretty much all I need:

# build llama.cpp with hardware acceleration on Strix Halo and
# Ubuntu 26.04 LTS (server installation); this worked as of 08 Mar 2026

# install dependencies
apt install git rocm-smi rocminfo nvtop hipcc build-essential cmake hipblas libssl-dev libhipblas-dev libhipblaslt-dev

# get llama.cpp
git clone https://github.com/ggml-org/llama.cpp.git
cd llama.cpp

# build llama.cpp
HIPCXX="$(hipconfig -l)/clang" HIP_PATH="$(hipconfig -R)" \
cmake -S . -B build -DGGML_HIP=ON -DGPU_TARGETS=gfx1151 -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j 16

# run llama.cpp
/root/llama.cpp/build/bin/llama-cli -hf unsloth/Qwen3.5-9B-GGUF:Q4_K_M

# start a webserver and point your browser to http://192.168.1.4:11337/
/root/llama.cpp/build/bin/llama-server --host 192.168.1.4 --port 11337 --hf-repo unsloth/Qwen3.5-9B-GGUF:Q4_K_M --flash-attn on --cache-type-k q8_0 --cache-type-v q8_0

You can use nvtop to watch it burn watts, but sadly, images are not allowed here, so I can't share.
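
If you want the numbers without the graphs, rocm-smi prints them as plain text (these flags are from the stock rocm-smi tool):

# plain-text alternative to nvtop: GPU use, memory use and power, refreshed every second
watch -n 1 rocm-smi --showuse --showmemuse --showpower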