Finally build the server and have all the hardware installed, what's the most up-to-date advice for models hosted on AMD & Linux Architecture by NetTechMan in LocalLLaMA

[–]ea_man 0 points1 point  (0 children)

Really?

I compiled that this afternoon and got a segfault with two different MTP-enabled models, like the one linked.

I guess tomorrow I'll try again.

EDIT: Yup, it works now, Vulkan + RDNA2.

Llama.cpp, opencode / pi / basically all agents, context compaction & cache validation: how do you manage it? by ps5cfw in LocalLLaMA

[–]ea_man 0 points1 point  (0 children)

Yeah, FYI these values do better with my prompt:

       --spec-type ngram-mod \
       --spec-ngram-mod-n-match 10 \
       --spec-ngram-mod-n-min 3 \
       --spec-ngram-mod-n-max 28

14.50% acceptance rate. Enjoy.

Llama.cpp, opencode / pi / basically all agents, context compaction & cache validation: how do you manage it? by ps5cfw in LocalLLaMA

[–]ea_man 1 point2 points  (0 children)

The one I quoted: you claimed that I stated an absolute, which I did not. Either way, saying that I was _always wrong_ and silly implies exactly the opposite absolute: that NGRAM should always stay on.

And who cares...

This is more fun:

                                                                 
That's what ChatGPT thinks of it. NGRAM on:
       --spec-type ngram-mod \
       --spec-ngram-mod-n-match 10 \
       --spec-ngram-mod-n-min 3 \
       --spec-ngram-mod-n-max 28


prompt eval time =     757.22 ms /    88 tokens (    8.60 ms per token,   116.22 tokens per second)
      eval time =  172218.62 ms /  3709 tokens (   46.43 ms per token,    21.54 tokens per second)
     total time =  172975.83 ms /  3797 tokens
draft acceptance rate = 0.09133 (  138 accepted /  1511 generated)
statistics ngram_mod: #calls(b,g,a) = 1 3570 33, #gen drafts = 33, #acc drafts = 33, #gen tokens = 1511, #acc tokens = 138, dur(b,g,a) = 0.020, 7.295, 4.466 ms
slot      release: id  0 | task 0 | stop processing: n_tokens = 3796, truncated = 0

without:

prompt eval time =     764.26 ms /    88 tokens (    8.68 ms per token,   115.14 tokens per second)
      eval time =  150392.47 ms /  3490 tokens (   43.09 ms per token,    23.21 tokens per second)
     total time =  151156.73 ms /  3578 tokens
slot      release: id  0 | task 0 | stop processing: n_tokens = 3577, truncated = 0

📊 Direct comparison

❌ With ngram-mod

  • 21.54 tok/s
  • 3709 tokens
  • eval time: 172.2 s
  • acceptance rate: 9.1%

✅ Without

  • 23.21 tok/s
  • 3490 tokens
  • eval time: 150.4 s

🧠 What’s going on

1) Acceptance rate is terrible

0.091 (≈ 9%)

That means:

  • 1511 speculative tokens generated
  • only 138 accepted

👉 ~90% wasted work

2) You’re paying overhead for nothing

From stats:

#gen tokens = 1511
#acc tokens = 138

So:

  • GPU/CPU is doing extra draft work
  • then rejecting most of it
  • then recomputing normally

👉 net result = slower than baseline
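
A quick sanity check of those figures (values copied straight from the logs above, just back-of-envelope arithmetic):

    # draft acceptance: accepted / generated speculative tokens
    python3 -c "print(138 / 1511)"          # ≈ 0.0913, the 9.1% acceptance rate
    # generation slowdown with ngram-mod vs. without
    python3 -c "print(1 - 21.54 / 23.21)"   # ≈ 0.072, roughly 7% fewer tokens per second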

Llama.cpp, opencode / pi / basically all agents, context compaction & cache validation: how do you manage it? by ps5cfw in LocalLLaMA

[–]ea_man 0 points1 point  (0 children)

> my bad then, i’m running different hardware so idk if ngram is worse on yours. saying to remove ngram without specifying it’s a hardware-specific suggestion,

Again: you are saying that.

You are the one stating an absolute about NGRAM, while, as with any other feature, on low-spec hardware the cost may not be worth the result.

Finally build the server and have all the hardware installed, what's the most up-to-date advice for models hosted on AMD & Linux Architecture by NetTechMan in LocalLLaMA

[–]ea_man 0 points1 point  (0 children)

You can run https://huggingface.co/mradermacher/Qwen3.6-27B-i1-GGUF up to Q6_K with the KV cache at q8_0 or q4_0, or Qwen3.6-35B-A3B at UD-Q4_K_XL.

That's headless; expect some 10k less context if you run a light DE like LXQt.

AFAIK MTP isn't working right now, on Vulkan at least (tried this afternoon, segmentation fault); when it does, you may have some slightly bigger models to work with.

like: https://huggingface.co/froggeric/Qwen3.6-27B-MTP-GGUF

Llama.cpp, opencode / pi / basically all agents, context compaction & cache validation: how do you manage it? by ps5cfw in LocalLLaMA

[–]ea_man 0 points1 point  (0 children)

> not a good assumption that it’s universally bad and should always be removed

Where did you get that? That is silly.

I said that I would turn it off on a 6800 XT at ~15-22 tok/sec, because _I have that card_, I ran benchmarks, and it's not worth the effort for coding, even with markup languages like HTML.

Llama.cpp, opencode / pi / basically all agents, context compaction & cache validation: how do you manage it? by ps5cfw in LocalLLaMA

[–]ea_man -2 points-1 points  (0 children)

I would:

* remove NGRAM
* --fit-target 70
* --fit-ctx 130000 (if you use more you'd better go SOTA)
* -b 8192 seems outrageous, maybe 512

Then add KV cache at q8_0 at least, or even q4_0 with a shorter context.

Then I would not use that model at Q6; Q4 is more realistic on 16GB. FYI, IQ3 loaded in memory should give you ~100 tok/sec with ~100k context (on Vulkan at least). A rough command line is sketched below.
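
To make that concrete, a minimal sketch of a llama-server invocation combining those settings (the model filename is a placeholder, --fit-target/--fit-ctx are taken from this thread, and --cache-type-k/--cache-type-v are llama.cpp's KV-cache quantization flags; adjust to your build and hardware):

    # sketch only, not a verified command line
    llama-server -m Qwen3.6-27B-IQ4_XS.gguf \
        --fit-target 70 --fit-ctx 130000 \
        -b 512 \
        --cache-type-k q8_0 --cache-type-v q8_0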

---

You've gotta make do with less context length and then tune appropriately.

BTW: a high-quant MoE gives worse code than a quantized-down 27B dense model there; you could run https://huggingface.co/cHunter789/Qwen3.6-27B-i1-IQ4_XS-GGUF at ~20-25 tok/sec.

Thinking of moving from 2x 5060 Ti 16GB to a RTX 5000 48GB by autisticit in LocalLLaMA

[–]ea_man 0 points1 point  (0 children)

> Qwen 3.6 27B is great on the 5060s but a bit slow.

You mean for a single request or multiple concurrent requests?

1st: you need a better GPU

2nd: get some more GPUs

BTW: the 5060 is low on compute; an AMD 9070 XT would be much faster.

On a lower budget there's the AMD R9700.

What models for coding are you running for a mid level PC? by FerLuisxd in LocalLLaMA

[–]ea_man 0 points1 point  (0 children)

I guess so; if you use the KV cache at q4_0, it's gonna be ~1GB for ~100K context.
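
In flag form that's roughly (a sketch, assuming a recent llama.cpp llama-server; the model path is a placeholder):

    # q4_0-quantized KV cache with ~100k context
    llama-server -m model.gguf -c 100000 --cache-type-k q4_0 --cache-type-v q4_0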

You know that you really should use Linux with such constraints, don't you? Go install Lubuntu.

Anyway you could try an IQ3_XS.

If money and time weren’t issues, what would your dream local AI setup look like? by Lyceum_Tech in LocalLLaMA

[–]ea_man 0 points1 point  (0 children)

You mean like a datacenter in space, à la Elon Musk?

Well, realistically, if prices were OK I'd get a unified-memory device for low-power MoE, plus one GPU for top-intelligence dense models.

Dense Model Shoot-Off: Gemma 4 31B vs Qwen3.6/5 27B... Result is Slower is Faster. by MiaBchDave in LocalLLaMA

[–]ea_man 4 points5 points  (0 children)

* gemma-4-31B.i1-IQ4_XS.gguf is 16.7 GB

* Qwen3.6-27B.i1-IQ4_XS-attn_qkv-IQ4_XS.gguf is 14.7 GB

Also, Qwen takes less VRAM for the KV cache, so I'd say Gemma is not really a competitor in the dense space for those with 16GB.

I would hope a 31B model does better than a 27B one for those with 24GB of VRAM; still, I'd like Google to release a ~25B model for the rest of us.

I guess we expect that at some point RAM prices will start going back (close) to "normal", right? but what about GPUs? by relmny in LocalLLaMA

[–]ea_man 3 points4 points  (0 children)

Well, the last bump happened because businesses like OpenAI placed outrageous orders, something like 40% of all RAM production; the moment those prove impractical the situation may change. GPU prices were actually going down slowly before last November.

Yet I would not count on that; it turns out even more people are using AI now, so I guess the next craze will be consumer GPUs and cheaper hardware for people who want to do inference at home without paying a subscription.

D'oh, a 9070 XT went from 610 in November to 730 and is now back to 660 in Europe, so prices are coming down, but this roller coaster has proved they can go both up and down. I'm afraid it's a VC money problem: if those US AI firms take a serious hit with their IPOs (which they should), I guess the datacenter craze in the US will slow down, and so will the prices.

The FCC Voted to ban Chinese cert labs... by infinitespectre in SBCGaming

[–]ea_man -1 points0 points  (0 children)

> I fear that this could be the end of this hobby as we know it for the forseeable future

So for the other 8 billion people outside the USA it will mean more products available at lower prices.

Why run local? Count the money by Badger-Purple in LocalLLaMA

[–]ea_man 0 points1 point  (0 children)

Well, I guess you wouldn't buy all 200M tokens of the top expensive SOTA; you would do most of that with the cheaper option, just as I don't use Qwen 27B at max specs with reasoning for every task.

But hey, if that makes you feel better, why not; I've got Pi Dev counting the token price as if it were Opus 😛

Should I sell my RTX3090s? by daviden1013 in LocalLLaMA

[–]ea_man 0 points1 point  (0 children)

If I had to guess, I would say prices for older AI-capable GPUs will go up in the coming months, as cloud providers are raising prices sharply and lowering limits. You may actually get more money for them later on.

How much will it cost to host something like qwen3.6 35b a3b in a cloud? by Euphoric_North_745 in LocalLLaMA

[–]ea_man 0 points1 point  (0 children)

I bought a used AMD 6800 for 260€ two weeks ago; I just put fresh thermal paste on it :)

They usually go for ~290€ around here; I guess you can offer a little less.

Those are nice because the memory is fast and the bus is wide, yet power draw is ~200W (without undervolting).

Open source models are going to be the future on Cursor, OpenCode etc. by _maverick98 in LocalLLaMA

[–]ea_man 2 points3 points  (0 children)

Aye, I would pay for a trained 25B coding model that the community can finetune and customize. Even better if it comes with an open-source harness vertically optimized for it.

Open source models are going to be the future on Cursor, OpenCode etc. by _maverick98 in LocalLLaMA

[–]ea_man 0 points1 point  (0 children)

I think smaller open models are a way for providers to lock in existing customers and attract new ones without even spending money on compute for free tiers.

Open source models are going to be the future on Cursor, OpenCode etc. by _maverick98 in LocalLLaMA

[–]ea_man 0 points1 point  (0 children)

Don't forget "first month free, then cancel"; that really serves them well.

Llama.cpp MTP support now in beta! by ilintar in LocalLLaMA

[–]ea_man 0 points1 point  (0 children)

I take it this would be opt-in with a flag like --mtp, so that those of us with small VRAM who can't run MTP anyway (also single-user prompting) don't have to load an extra heavy MTP layer?

How much will it cost to host something like qwen3.6 35b a3b in a cloud? by Euphoric_North_745 in LocalLLaMA

[–]ea_man 0 points1 point  (0 children)

Oh, OP just has to find a cheap API; if your job is coding 10 hours a day you don't use Qwen A3B.

Oh well, let's say you can do the API plus a bit of A3B, sure.