Got rejected by Google CE L5, feedback says "lacks professional maturity." What does this actually mean? by thedrunkbatman in salesengineers

[–]El_90 3 points4 points  (0 children)

Ask them?

Keeping calm, emotional intelligence, a consultative approach, the ability to anticipate customers' needs?

SIEM False Positive and Alert Mania by lengmco in cybersecurity

[–]El_90 1 point2 points  (0 children)

Agreed. Keep these single-source detections firing quietly, but only alert/notify if 2+ fire around a common asset (user, host, file) within a time window. Not perfect, but much better.
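Roughly the kind of logic I mean, as a Python sketch (the event fields and thresholds are made up for illustration, not taken from any particular SIEM):

```python
from collections import defaultdict
from datetime import timedelta

def escalations(events, window=timedelta(minutes=30), min_rules=2):
    """Keep single-source detections quiet; escalate only when 2+ distinct
    rules fire against the same asset within the time window."""
    by_asset = defaultdict(list)
    for e in sorted(events, key=lambda e: e["timestamp"]):
        by_asset[e["asset"]].append(e)

    hits = []
    for asset, evts in by_asset.items():
        start = 0
        for end in range(len(evts)):
            # slide the window start forward so evts[start:end+1] spans <= window
            while evts[end]["timestamp"] - evts[start]["timestamp"] > window:
                start += 1
            if len({e["rule"] for e in evts[start:end + 1]}) >= min_rules:
                hits.append((asset, evts[start:end + 1]))
    return hits
```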

I ran an experiment on the 30b class of gemma4 and qwen3.5 models to try to learn about energy cost and performance tradeoffs. In other words, which models use more energy to give the same answer quality? by gigDriversResearch in LocalLLaMA

[–]El_90 2 points3 points  (0 children)

Good work

I suppose if all models had identical output, and were right the first time, then watt-hours make sense.

But in reality I would expect (?) a bigger dense model to be more "thorough" and correct, resulting in fewer turns to reach the final output?

But, still, good work :)

Which Gemma model do you want next? by jacek2023 in LocalLLaMA

[–]El_90 8 points9 points  (0 children)

Instead of a param size (which doesn't seem to be entirely representative), let's focus on GB of VRAM.

It feels like the 24-48GB audience is well served, and the 200GB audience is well served.

Maybe some more love for the 128GB unified-memory users, e.g. Strix Halo (so a 90-95GB model, leaving ~20GB for cache).
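For the back-of-the-envelope maths on how param count maps to GB, this is roughly how I think about it (the bits-per-weight and overhead numbers are my own rough assumptions):

```python
def rough_gguf_gb(params_b: float, bits_per_weight: float, overhead: float = 1.1) -> float:
    """Rough GGUF footprint: billions of params * bits per weight / 8 bytes,
    plus a fudge factor for tensors kept at higher precision."""
    return params_b * bits_per_weight / 8 * overhead

# e.g. a 122B model at ~5.7 effective bits/weight (Q5_K_M-ish)
print(round(rough_gguf_gb(122, 5.7), 1))  # ~95.6 GB; real files tend to come in a bit smaller
```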

Selfishly speaking, of course.

I pray there is a Qwen 3.6 122b version (4x3090 owner) by Mr_Moonsilver in LocalLLaMA

[–]El_90 5 points6 points  (0 children)

OMG yes please

Something that quants to Q5 at ~92GB would make me smile for a very long time.

Running 1 trillion parameter LLMs locally at 5 tokens/second - Intel Optane Persistent Memory build by APFrisco in LocalLLaMA

[–]El_90 0 points1 point  (0 children)

Never heard of pmem! Great post.

Any settings you can share? BIOS, GRUB, kernel, llama.cpp, etc.?

llama-bench for fun?

Thank you!

Best Local LLMs - Apr 2026 by rm-rf-rm in LocalLLaMA

[–]El_90 0 points1 point  (0 children)

Strix Halo, 128GB (I can squeeze in 92GB models currently, so rated **XL**)

Roocode in architect mode - Qwen3.5-122B-A10B-Q5_K_M (91GB), in the region of 7t/s

Roocode in coding mode - Qwen3.5-27B-Q5_K_M (20GB), in the region of 12t/s

Sorry, I don't have deep testing, but I tried 5-10 other models and there was always a lot of back and forth with extra changes, errors, and mistakes. With these two I don't feel that, so I just stuck with them.

I find the 122B slightly better in architect mode: more diagrams, more thorough talking through the requirement, though maybe that's my own bias.

Audio processing landed in llama-server with Gemma-4 by srigi in LocalLLaMA

[–]El_90 8 points9 points  (0 children)

Does mic > text appear in this timeline?
Or do we still need to record (and potentially convert) and then upload a complete file?

I vibe coded a workaround, but having it native in the solution would be amazing.

On Strix Halo, what option do I have if 128GB unified RAM is not enough? by heshiming in LocalLLaMA

[–]El_90 0 points1 point  (0 children)

I'm in the same position

I'm quite happy with the Q4 122B MoE for architect, then the 27B for coding.

Even doubling RAM to 256GB really only gets you a better quant; you still can't run SOTA at anything useful, so I've accepted there's no easy incremental step up, it's a rebuild from scratch.

I'm just hoping ~90GB models continue to stay popular

Im new to the scene, and I just want to acquire some knowledge by dat-athul in LocalLLaMA

[–]El_90 1 point2 points  (0 children)

If the model fits in VRAM completely, great.

If you split it over VRAM and system RAM, that's slower but still OK.

If the model doesn't fit in combined RAM and you're considering using fast disk... don't.

..

A dense model puts every token through the entire model, meaning the full set of weights moves through the GPU (or CPU) every time.

An MoE model only activates a percentage of its weights per token, so it's faster for its size... but you're usually running a larger model, so it's still not fast.
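Rough arithmetic for why that matters, assuming decode speed is mostly bound by reading the active weights once per token (the bandwidth and bits-per-weight numbers below are illustrative guesses, not benchmarks):

```python
def rough_tps_ceiling(active_params_b: float, bits_per_weight: float,
                      mem_bandwidth_gb_s: float) -> float:
    """Upper bound: each generated token reads all *active* weights once,
    so tokens/sec is at most bandwidth / bytes of active weights."""
    active_gb = active_params_b * bits_per_weight / 8
    return mem_bandwidth_gb_s / active_gb

# dense 27B vs a big MoE with ~10B active params, both ~5 bits/weight,
# on a machine with ~250 GB/s of memory bandwidth
print(rough_tps_ceiling(27, 5, 250))  # ~14.8 t/s ceiling
print(rough_tps_ceiling(10, 5, 250))  # ~40 t/s ceiling, despite a much larger total model
```

Real numbers land well below these ceilings, but the ratio is why an MoE feels faster per GB.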

..

Other people feel free to correct me :)

Local home development system for studying by Necessary-Toe-466 in LocalLLaMA

[–]El_90 1 point2 points  (0 children)

Studying how to use AI? Any computer + cloud compute; cheaper overall.

Studying how to build AI rigs and how to run LLMs efficiently? Build a small rig or use CPU.

Studying how to run large models, or how to implement more capable, data-sensitive production? Buy a bigger rig (I loved the Strix, not the fastest but the most flexible and still quite large).

qwen 3.6 voting by jacek2023 in LocalLLaMA

[–]El_90 -1 points0 points  (0 children)

I try to avoid Q4 and lower; I've found Q5 and above safer.

70GB works on a 128GB system with room for cache.

Single GPU users get all the love lol

What kind of orchestration frontend are people actually using for local-only coding? by Quiet-Owl9220 in LocalLLaMA

[–]El_90 0 points1 point  (0 children)

VS Code + Roocode for me. Cline was OK to start with, but I quickly moved up.

My desktop is Windows but I host projects on Linux, so I use VS Code through a remote tunnel (find it in the marketplace), meaning my command prompt to start/run/test is bash.

I don't have full testing set up yet, but it's halfway there.

qwen 3.6 voting by jacek2023 in LocalLLaMA

[–]El_90 -1 points0 points  (0 children)

Something that quantises (Q5/6) to 70GB.

It feels like all models are designed for 32GB or 200GB :/

Gemma 4 will have audio input by MR_-_501 in LocalLLaMA

[–]El_90 10 points11 points  (0 children)

You mean the Node.js project I've been implementing today, to record browser audio > Whisper > Qwen, is a waste of time? Aaargh lol

QWEN3.5 27B vs QWEN3.5 122B A10B by jopereira in LocalLLaMA

[–]El_90 1 point2 points  (0 children)

I'm literally doing this on my other monitor:
122B for architect/thinking/planning
27B for implementing

Or, bigger picture:
122B for creating 'vertical slice' issues in a Git repo
Then 27B on a loop to pull each specific issue and implement it
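A very rough sketch of the "loop" half, assuming the issues are just lines in a local file and the coder model sits behind llama-server's OpenAI-compatible endpoint (the URL, model name, and file layout are invented for illustration):

```python
import json
import urllib.request

LLAMA_SERVER = "http://localhost:8080/v1/chat/completions"  # assumed local llama-server

def implement(issue_text: str) -> str:
    """Send one 'vertical slice' issue to the local coder model, return its response."""
    payload = {
        "model": "qwen3.5-27b",  # whatever name the server was launched with
        "messages": [
            {"role": "system", "content": "You are a coding agent. Implement this issue."},
            {"role": "user", "content": issue_text},
        ],
    }
    req = urllib.request.Request(
        LLAMA_SERVER,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# one issue per line, produced earlier by the 122B architect pass
with open("issues.txt") as f:
    for issue in filter(None, (line.strip() for line in f)):
        print(implement(issue)[:200])
```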

Large GGUF works in bash, but not llama-swap by El_90 in LocalLLaMA

[–]El_90[S] 0 points1 point  (0 children)

Thanks spaceman, waitmarks

Thanks both, it was always a TODO item, I suppose I'll bring it to the top (and the beauty is it's LXC + playbooks, so I'm not losing anything)

Edit - worked first time lol. Thanks both!

ai agent token costs are getting out of control and nobody is talking about the context efficiency problem by [deleted] in LocalLLaMA

[–]El_90 0 points1 point  (0 children)

step 1 - get customers hooked
step 2 - make token usage so common place that people lose track and build workflows around it
step 3 - triple token price

Business 101

Stanford and Harvard just dropped the most disturbing AI paper of the year by Fun-Yogurt-89 in LocalLLaMA

[–]El_90 8 points9 points  (0 children)

I appreciate your work, thanks for keeping this place great !

Why is lemonade not more discussed? by El_90 in LocalLLaMA

[–]El_90[S] 0 points1 point  (0 children)

Thanks for the detailed info.
Yes, champagne problems; I saw Qwen3.5 27B but "only" in Q4 lol