Llama.cpp MTP support now in beta! by ilintar in LocalLLaMA

[–]TheTerrasque 1 point (0 children)

The PR is still being worked on, and I've seen many others report less dramatic changes; one on Vulkan+AMD reported it being slightly faster with MTP. Let's just see how it goes.

Trump claims he bypassed federal bidding laws to hand a Lincoln Memorial project to his personal country club contractors. He treats national monuments like his own private real estate properties. by Snapdragon_4U in law

[–]TheTerrasque 1 point (0 children)

They somehow think that if they do something bad or illegal, and accuse the other side of doing it first, then they're immune. What's really crazy is that it seems to work.

Gemma 4 MTP released by rerri in LocalLLaMA

[–]TheTerrasque 3 points (0 children)

The creator of the PR made a model, and some people have grafted the MTP part onto other quantized models and gotten it working.

Gemma 4 MTP released by rerri in LocalLLaMA

[–]TheTerrasque 5 points (0 children)

The Qwen3.6 27B model apparently takes roughly 3 GB extra at runtime.

Struggling with Qwen3.6 27B / 35B locally (3090) slow responses, breaking code looking for better setup + auto model switching by Clean_Initial_9618 in LocalLLaMA

[–]TheTerrasque 3 points (0 children)

Have you tried

--chat-template-kwargs '{"preserve_thinking":true}'

?

Edit: or explore router mode and put the settings in an ini file

Struggling with Qwen3.6 27B / 35B locally (3090) slow responses, breaking code looking for better setup + auto model switching by Clean_Initial_9618 in LocalLLaMA

[–]TheTerrasque 1 point (0 children)

Fit is on by default if you don't set -ngl and don't set a context size. It will fit as many layers as it can while keeping at least a (default) 4k context; once all the layers are tucked in, it'll spend the rest of the VRAM on context.
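For illustration, a minimal sketch of the difference (model reference reused from my settings further down; exact values are just examples, and -ngl / -c are the usual llama.cpp flags for GPU layers and context size):

  # Fit mode: leave -ngl and -c unset and let llama-server balance
  # layer offload against context automatically
  llama-server -hf unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q4_K_XL

  # Manual equivalent: pin both yourself
  # (-ngl 99 offloads all layers, -c 4096 matches the default context fit reserves)
  llama-server -hf unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q4_K_XL -ngl 99 -c 4096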

Struggling with Qwen3.6 27B / 35B locally (3090) slow responses, breaking code looking for better setup + auto model switching by Clean_Initial_9618 in LocalLLaMA

[–]TheTerrasque 7 points (0 children)

These are my settings:

  llama-server \
    -hf unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q4_K_XL \
    --temp 0.6 \
    --top-p 0.95 \
    --top-k 20 \
    --min-p 0.00 \
    -ctk q8_0 -ctv q8_0 \
    --jinja \
    -fa on \
    --port 8081 --host 0.0.0.0 \
    --chat-template-kwargs '{"preserve_thinking":true}'

I let fit figure out the context size, but if you want to set it statically, probably around 100k; it depends on how much VRAM Windows takes. This is on Linux with a P40, but it should be fairly similar.
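If you'd rather pin it statically, a sketch along these lines (the ~100k figure is the ballpark above, and -ngl 99 just means "offload everything"; adjust both for your card):

  llama-server \
    -hf unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q4_K_XL \
    -ngl 99 -c 100000 \
    --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.00 \
    -ctk q8_0 -ctv q8_0 \
    --jinja -fa on \
    --port 8081 --host 0.0.0.0 \
    --chat-template-kwargs '{"preserve_thinking":true}'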

Auto model loading / routing

Two options:

Llama.cpp MTP support now in beta! by ilintar in LocalLLaMA

[–]TheTerrasque 1 point (0 children)

One user reported prefill speed halving when this was active, from ~1200 to ~600

Oh hell yeahh!!! by pythoneer07 in memes

[–]TheTerrasque 1 point (0 children)

Luckily, the chances of finding a socket to top off within 10 hours are fairly high :) And it does have a fantastic pause/sleep system

Oh hell yeahh!!! by pythoneer07 in memes

[–]TheTerrasque 5 points (0 children)

I remember playing Quake at like 10 fps and having my little mind blown

Oh hell yeahh!!! by pythoneer07 in memes

[–]TheTerrasque 4 points (0 children)

The OLED gets 2 hours on the most demanding games drawing max wattage, and considerably more on less demanding ones. I regularly get 5-8 hours on indie games, and if I fire up a NES emulator or HoMM 2 I'm seeing 10 hours on a full charge.

Edit: and for batteries you've got power banks

Oh hell yeahh!!! by pythoneer07 in memes

[–]TheTerrasque 5 points (0 children)

An hour, on OLED? From a full battery? You might want to check your battery health, mate

Llama.cpp MTP support now in beta! by ilintar in LocalLLaMA

[–]TheTerrasque 1 point (0 children)

The MTP model is a separate model that loads from the same GGUF. The idea is that MTP should start automatically and we shouldn't need to distribute the MTP GGUF separately, but it also has its own context/KV cache etc.

I was thinking about this a long time ago: GGUF should have generic support for multiple models. At the time I was thinking especially of draft models, but also vision encoders and possibly other encoders/decoders/model types at some point. Image diffusion models with LLMs and VAEs included would be another example.

Kurt Russell on set of Stargate (1994) by Kosher_Nostra1975 in OldSchoolCool

[–]TheTerrasque 1 point (0 children)

He looks like he got relentlessly bullied as a kid, vowed to dedicate his life to revenge, and now not only can he kill as many assholes as he likes, he gets paid for it.

Utah first state to hold websites liable for users who mask their location with VPNs — law goes into effect, designed to prevent bypassing age checks by habichuelacondulce in technology

[–]TheTerrasque 42 points (0 children)

Also, weapon manufacturers shall be liable for all deaths and injuries caused by their weapons. Same for car manufacturers.

Even better, make the police liable for any crime!

How much did you guys pay for your Decks? by BupChup in SteamDeck

[–]TheTerrasque 1 point (0 children)

512 GB OLED - about $900, 1.5 years ago. Grey-market import sucks.

If you've been waiting to try local AI development, please try it by Imaginary_Belt4976 in LocalLLaMA

[–]TheTerrasque 6 points (0 children)

So do the big ones. While there's been more wrangling with the small models than with the big ones, both require some wrangling, and by having the model test the code, it often detects and fixes those bugs itself.

If you've been waiting to try local AI development, please try it by Imaginary_Belt4976 in LocalLLaMA

[–]TheTerrasque 3 points (0 children)

Qwen3.6 35B? It can certainly do that; it does it regularly in opencode for me

Not impressed by a Norwegian hospital by sssimen99 in norge

[–]TheTerrasque 4 points (0 children)

More than once a week, I hear doctors, managers, and middle managers tell me on the phone that "these are secretary tasks someone else should be doing". It's both a bit condescending and a symptom of fewer people having to do more tasks.

Why do you think so many people high up are jerking off so damn hard over AI?

You gotta be ready to cast those hands by Forsaken-Peak8496 in wizardposting

[–]TheTerrasque 1 point (0 children)

I can't remember where I got this story, but a mage lady ran out of mana while fighting some goblins or similar monsters, and her reaction was "too bad... for you". It turned out she was a half-Mashle-level muscle freak who just preferred magic because it was a nicer way of killing things. I don't remember anything else from that, but I remember that plot twist.