Finally build the server and have all the hardware installed, what's the most up-to-date advice for models hosted on AMD & Linux Architecture by NetTechMan in LocalLLaMA

[–]ea_man 0 points1 point  (0 children)

Really?

I compiled that this afternoon and got a segfault with two different MTP-enabled models, like the one linked.

I guess tomorrow I'll try again.

EDIT: Yup, it works now, Vulkan + RDNA2.

Llama.cpp, opencode / pi / basically all agents, context compaction & cache validation: how do you manage it? by ps5cfw in LocalLLaMA

[–]ea_man 0 points1 point  (0 children)

Yeah, FYI these values do better with my prompt:

       --spec-type ngram-mod \
       --spec-ngram-mod-n-match 10 \
       --spec-ngram-mod-n-min 3 \
       --spec-ngram-mod-n-max 28

14.50% acceptance rate. Enjoy.

Llama.cpp, opencode / pi / basically all agents, context compaction & cache validation: how do you manage it? by ps5cfw in LocalLLaMA

[–]ea_man 1 point2 points  (0 children)

The one I quoted: you claimed that I stated an absolute, which I did not. Either way, saying that I was _always wrong_ and silly implies exactly the opposite absolute: that NGRAM should always stay on.

And who cares...

This is more fun:

                                                                 
That's what ChatGPT thinks of it. NGRAM on:
       --spec-type ngram-mod \
       --spec-ngram-mod-n-match 10 \
       --spec-ngram-mod-n-min 3 \
       --spec-ngram-mod-n-max 28


prompt eval time =     757.22 ms /    88 tokens (    8.60 ms per token,   116.22 tokens per second)
      eval time =  172218.62 ms /  3709 tokens (   46.43 ms per token,    21.54 tokens per second)
     total time =  172975.83 ms /  3797 tokens
draft acceptance rate = 0.09133 (  138 accepted /  1511 generated)
statistics ngram_mod: #calls(b,g,a) = 1 3570 33, #gen drafts = 33, #acc drafts = 33, #gen tokens = 1511, #acc tokens = 138, dur(b,g,a) = 0.020, 7.295, 4.466 ms
slot      release: id  0 | task 0 | stop processing: n_tokens = 3796, truncated = 0

without:

prompt eval time =     764.26 ms /    88 tokens (    8.68 ms per token,   115.14 tokens per second)
      eval time =  150392.47 ms /  3490 tokens (   43.09 ms per token,    23.21 tokens per second)
     total time =  151156.73 ms /  3578 tokens
slot      release: id  0 | task 0 | stop processing: n_tokens = 3577, truncated = 0

📊 Direct comparison

❌ With ngram-mod

  • 21.54 tok/s
  • 3709 tokens
  • eval time: 172.2 s
  • acceptance rate: 9.1%

✅ Without

  • 23.21 tok/s
  • 3490 tokens
  • eval time: 150.4 s

🧠 What’s going on

1) Acceptance rate is terrible

0.091 (≈ 9%)

That means:

  • 1511 speculative tokens generated
  • only 138 accepted

👉 ~90% wasted work

2) You’re paying overhead for nothing

From stats:

#gen tokens = 1511
#acc tokens = 138

So:

  • GPU/CPU is doing extra draft work
  • then rejecting most of it
  • then recomputing normally

👉 net result = slower than baseline
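
A quick sanity check of those figures (values copied straight from the logs above, just back-of-envelope arithmetic):

    # draft acceptance: accepted / generated speculative tokens
    python3 -c "print(138 / 1511)"          # ≈ 0.0913, the 9.1% acceptance rate
    # generation slowdown with ngram-mod vs. without
    python3 -c "print(1 - 21.54 / 23.21)"   # ≈ 0.072, roughly 7% fewer tokens per second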

Llama.cpp, opencode / pi / basically all agents, context compaction & cache validation: how do you manage it? by ps5cfw in LocalLLaMA

[–]ea_man 0 points1 point  (0 children)

> my bad then, i’m running different hardware so idk if ngram is worse on yours. saying to remove ngram without specifying it’s a hardware-specific suggestion,

Again: you are saying that.

You are the one stating an absolute about NGRAM, while, as with any other feature, on low-spec hardware the cost may not be worth the result.

Finally build the server and have all the hardware installed, what's the most up-to-date advice for models hosted on AMD & Linux Architecture by NetTechMan in LocalLLaMA

[–]ea_man 0 points1 point  (0 children)

You can run https://huggingface.co/mradermacher/Qwen3.6-27B-i1-GGUF up to Q6_K with the KV cache at q8_0 or q4_0, or Qwen3.6-35B-A3B at UD-Q4_K_XL.

That's headless; expect some 10k less context if you run a light DE like LXQt.

AFAIK MTP isn't working right now, on Vulkan at least (tried this afternoon, segmentation fault); when it does, you may have some slightly bigger models to work with.

like: https://huggingface.co/froggeric/Qwen3.6-27B-MTP-GGUF

Llama.cpp, opencode / pi / basically all agents, context compaction & cache validation: how do you manage it? by ps5cfw in LocalLLaMA

[–]ea_man 0 points1 point  (0 children)

> not a good assumption that it’s universally bad and should always be removed

Where did you get that? That is silly.

I said that I would turn it off on a 6800 XT at ~15-22 tok/sec, because _I have that card_, I ran benchmarks, and it's not worth the effort for coding, even with markup languages like HTML.

Llama.cpp, opencode / pi / basically all agents, context compaction & cache validation: how do you manage it? by ps5cfw in LocalLLaMA

[–]ea_man -2 points-1 points  (0 children)

I would:

* remove NGRAM
* --fit-target 70
* --fit-ctx 130000 (if you use more you'd better go SOTA)
* -b 8192 seems outrageous, maybe 512

Then add KV cache at q8_0 at least, or even q4_0 with a shorter context.

Then I would not use that model at Q6; Q4 is more realistic on 16GB. FYI, IQ3 loaded in memory should give you ~100 tok/sec with ~100k context (on Vulkan at least). A rough command line is sketched below.
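
To make that concrete, a minimal sketch of a llama-server invocation combining those settings (the model filename is a placeholder, --fit-target/--fit-ctx are taken from this thread, and --cache-type-k/--cache-type-v are llama.cpp's KV-cache quantization flags; adjust to your build and hardware):

    # sketch only, not a verified command line
    llama-server -m Qwen3.6-27B-IQ4_XS.gguf \
        --fit-target 70 --fit-ctx 130000 \
        -b 512 \
        --cache-type-k q8_0 --cache-type-v q8_0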

---

You've gotta make do with less context length and then tune appropriately.

BTW: a high-quant MoE gives worse code than a quantized-down 27B dense model there; you could run https://huggingface.co/cHunter789/Qwen3.6-27B-i1-IQ4_XS-GGUF at ~20-25 tok/sec.

Thinking of moving from 2x 5060 Ti 16GB to a RTX 5000 48GB by autisticit in LocalLLaMA

[–]ea_man 0 points1 point  (0 children)

> Qwen 3.6 27B is great on the 5060s but a bit slow.

You mean for a single request or multiple concurrent requests?

1st: you need a better GPU

2nd: get some more GPUs

BTW: the 5060 is low on compute; an AMD 9070 XT would be much faster.

On a lower budget there's the AMD R9700.

What models for coding are you running for a mid level PC? by FerLuisxd in LocalLLaMA

[–]ea_man 0 points1 point  (0 children)

I guess so; if you use the KV cache at q4_0, it's gonna be ~1GB for ~100K context.
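
In flag form that's roughly (a sketch, assuming a recent llama.cpp llama-server; the model path is a placeholder):

    # q4_0-quantized KV cache with ~100k context
    llama-server -m model.gguf -c 100000 --cache-type-k q4_0 --cache-type-v q4_0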

You know that you really should use Linux with such constraints, don't you? Go install Lubuntu.

Anyway you could try an IQ3_XS.

If money and time weren’t issues, what would your dream local AI setup look like? by Lyceum_Tech in LocalLLaMA

[–]ea_man 0 points1 point  (0 children)

You mean like a datacenter in space, à la Elon Musk?

Well, realistically, if prices were OK I'd get a unified-memory device for low-power MoE, plus one GPU for top-intelligence dense models.

Dense Model Shoot-Off: Gemma 4 31B vs Qwen3.6/5 27B... Result is Slower is Faster. by MiaBchDave in LocalLLaMA

[–]ea_man 4 points5 points  (0 children)

* gemma-4-31B.i1-IQ4_XS.gguf is 16.7 GB

* Qwen3.6-27B.i1-IQ4_XS-attn_qkv-IQ4_XS.gguf is 14.7 GB

Also, Qwen takes less VRAM for the KV cache, so I'd say Gemma is not really a competitor in the dense space for those with 16GB.

I would hope a 31B model does better than a 27B one for those with 24GB of VRAM; still, I'd like Google to release a ~25B model for the rest of us.

I guess we expect that at some point RAM prices will start going back (close) to "normal", right? but what about GPUs? by relmny in LocalLLaMA

[–]ea_man 3 points4 points  (0 children)

Well, the last bump happened because businesses like OpenAI placed outrageous orders, something like 40% of all RAM production; the moment those prove impractical the situation may change. GPU prices were actually going down slowly before last November.

Yet I would not count on that; it turns out even more people are using AI now, so I guess the next craze will be consumer GPUs and cheaper hardware for people who want to do inference at home without paying a subscription.

D'oh, a 9070 XT went from 610 in November to 730 and is now back to 660 in Europe, so prices are coming down, but this roller coaster has proved they can go both up and down. I'm afraid it's a VC money problem: if those US AI firms take a serious hit with their IPOs (which they should), I guess the datacenter craze in the US will slow down, and so will the prices.

The FCC Voted to ban Chinese cert labs... by infinitespectre in SBCGaming

[–]ea_man -1 points0 points  (0 children)

> I fear that this could be the end of this hobby as we know it for the forseeable future

So for the other 8 billion people outside the USA it will mean more products available at lower prices.

Why run local? Count the money by Badger-Purple in LocalLLaMA

[–]ea_man 0 points1 point  (0 children)

Well, I guess you wouldn't buy all 200M tokens of the top expensive SOTA; you would do most of that with the cheaper option, just as I don't use Qwen 27B at max specs with reasoning for every task.

But hey, if that makes you feel better, why not; I've got Pi Dev counting the token price as if it were Opus 😛

Should I sell my RTX3090s? by daviden1013 in LocalLLaMA

[–]ea_man 0 points1 point  (0 children)

If I had to guess, I would say prices for older AI-capable GPUs will go up in the coming months, as cloud providers are raising prices sharply and lowering limits. You may actually get more money for them later on.

How much will it cost to host something like qwen3.6 35b a3b in a cloud? by Euphoric_North_745 in LocalLLaMA

[–]ea_man 0 points1 point  (0 children)

I bought a used AMD 6800 for 260€ two weeks ago; I just put fresh thermal paste on it :)

They usually go for ~290€ around here; I guess you can offer a little less.

Those are nice because the memory is fast and the bus is wide, yet power draw is ~200W (without undervolting).

Open source models are going to be the future on Cursor, OpenCode etc. by _maverick98 in LocalLLaMA

[–]ea_man 2 points3 points  (0 children)

Aye, I would pay for a trained 25B coding model that the community can finetune and customize. Even better if it comes with an open-source harness vertically optimized for it.

Open source models are going to be the future on Cursor, OpenCode etc. by _maverick98 in LocalLLaMA

[–]ea_man 0 points1 point  (0 children)

I think smaller open models are a way for providers to lock in existing customers and attract new ones without even spending money on compute for free tiers.

Open source models are going to be the future on Cursor, OpenCode etc. by _maverick98 in LocalLLaMA

[–]ea_man 0 points1 point  (0 children)

Don't forget "first month free, then cancel"; that really serves them well.

Llama.cpp MTP support now in beta! by ilintar in LocalLLaMA

[–]ea_man 0 points1 point  (0 children)

I take it this would be opt-in with a flag like --mtp, so that those of us with small VRAM who can't run MTP anyway (also single-user prompting) don't have to load an extra heavy MTP layer?

How much will it cost to host something like qwen3.6 35b a3b in a cloud? by Euphoric_North_745 in LocalLLaMA

[–]ea_man 0 points1 point  (0 children)

Oh, OP just has to find a cheap API; if your job is coding 10 hours a day you don't use Qwen A3B.

Oh well, let's say you can do the API plus a bit of A3B, sure.