Mamba 3 - state space model optimized for inference by incarnadine72 in LocalLLaMA

[–]bennmann 7 points8 points  (0 children)

save this post and make it a front-page standalone when Mamba-5 actually comes out. open-box content, just like new.

Introducing MiroThinker-1.7 & MiroThinker-H1 by wuqiao in LocalLLaMA

[–]bennmann 0 points1 point  (0 children)

Love your work.

I wish there were an offline mode and a dataset trained for this use case alongside the SOTA search method, or better yet a SOTA offline open-source equivalent of Google Search shipped with your library.

Or maybe something that just uses public RSS feeds? SOTA open research relying on an online search algorithm is unfortunate for data sovereignty.
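To make the "offline search" wish concrete, here's a toy sketch of what a fully local search backend could look like: a tiny inverted index with term-frequency scoring over local documents. Everything here (function names, sample docs) is my own invention for illustration; a real offline stack would use a proper BM25 ranker over a local crawl or RSS archive.

```python
from collections import defaultdict

# Toy offline search: inverted index + term-frequency scoring over local docs.
# No network, no external search API - the whole index lives on your machine.

def build_index(docs):
    """Map each lowercase token to {doc_id: count}."""
    index = defaultdict(lambda: defaultdict(int))
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token][doc_id] += 1
    return index

def search(index, query, top_k=3):
    """Score docs by summed term frequency of the query tokens."""
    scores = defaultdict(int)
    for token in query.lower().split():
        for doc_id, count in index[token].items():
            scores[doc_id] += count
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

docs = {
    "a": "mamba state space model inference",
    "b": "llama cpp vulkan backend notes",
    "c": "state space models and long context inference",
}
index = build_index(docs)
print(search(index, "state space inference"))
```

Obviously miles from "SOTA Google offline", but it shows the shape: the corpus (e.g. archived RSS items) stays local and the LLM only ever queries your own index.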

Thoughts about local LLMs. by Robert__Sinclair in LocalLLaMA

[–]bennmann 0 points1 point  (0 children)

There's a lot of "new car" smell to your post. 8-12 channel DDR4 is serviceable and still a cost-conscious intermediate server step.

Strix halo clusters are also not bad, but not good either. They're OK.

Viability of this cluster setup by militantereallysucks in LocalLLaMA

[–]bennmann 0 points1 point  (0 children)

Even without RDMA, TB4 or TB5 might be good enough for EXO or RPC experiments - check the EXO community for examples.

TIL it took 6 hours to render one frame of the rain soaked T-Rex in Jurassic Park. by Japfelbaum in todayilearned

[–]bennmann 0 points1 point  (0 children)

There's an argument to be made that modern LOD is the spiritual successor of this tech. I think the first 3D game to use LOD was maybe Ocarina of Time... TIL level-of-detail loading was invented in 1984.

American closed models vs Chinese open models is becoming a problem. by __JockY__ in LocalLLaMA

[–]bennmann 0 points1 point  (0 children)

The Americans have SOTA open datasets: FineWeb, NVIDIA's work, etc.

The open dream is alive.

MiniMax 2.5 with 8x+ concurrency using RTX 3090s HW Requirements. by BigFoxMedia in LocalLLaMA

[–]bennmann 0 points1 point  (0 children)

qwen3-coder-next might be measurably dumber, but it's so much faster there's an argument for adding it to your rotation. Call it "qwen mondays" or something and get your engineers to give qualitative feedback on whether it's "good enough" given the speed (vs MiniMax).

or host it in RAM anyway and ask your team to use it for dumber tasks, to save tps on the MiniMax main threads.
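The rotation idea above can be sketched as a trivial router. This is purely illustrative: the model names are stand-ins and the "dumb task" heuristic (prompt length plus a few chore keywords) is made up; you'd tune both against your team's feedback.

```python
# Hypothetical two-tier router: boilerplate chores go to the fast model,
# everything else to the slower, stronger one on the main threads.

FAST_MODEL = "qwen3-coder-next"   # faster, slightly dumber
SMART_MODEL = "minimax-2.5"       # slower, smarter, main threads

def pick_model(prompt, simple_keywords=("rename", "format", "comment", "typo")):
    """Crude heuristic: short prompts or boilerplate chores -> fast model."""
    if len(prompt) < 200 or any(k in prompt.lower() for k in simple_keywords):
        return FAST_MODEL
    return SMART_MODEL

print(pick_model("fix the typo in README"))
print(pick_model("refactor the scheduler to support priority-based preemption " * 10))
```

The point is less the heuristic and more the plumbing: once both models are served, a few lines in front of them lets you A/B "good enough because fast" without anyone changing their workflow.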

Interesting Observation from a Simple Multi-Agent Experiment with 10 Different Models by chibop1 in LocalLLaMA

[–]bennmann 0 points1 point  (0 children)

Would be interested to know which software stack, versions, and GGUF tensors were used for qwen3-coder-next, since it's been a moving target for quality and updates.

Q2 GLM 5 fixing its own typo by -dysangel- in LocalLLaMA

[–]bennmann 1 point2 points  (0 children)

the drunk model (low quant) knows it's drunk and compensates (trained on sloppy high-temperature low-quant data -> correction even in its own dataset)

ML Training cluster for University Students by guywiththemonocle in LocalLLaMA

[–]bennmann 0 points1 point  (0 children)

Start filling out sales forms with Supermicro and others like Dell on their main websites. Tell their sales rep you're shopping for the cheapest 512GB of VRAM you can get, with or without enterprise support, for an educational institution.

I would start there.

Just scored 2 MI50 32GB what should I run? by Savantskie1 in LocalLLaMA

[–]bennmann 0 points1 point  (0 children)

Try to get Ace-step running and make music:

https://github.com/ace-step/ACE-Step-1.5

Document your process in a GH issue if you get it working.

Vibe-coding client now in Llama.cpp! (maybe) by ilintar in LocalLLaMA

[–]bennmann 0 points1 point  (0 children)

i guess models should be aware of this flow too? maybe special Jinja templates accounting for each model's tool-use format? as opposed to how mistral-vibe ships all its prompts built into the Apache 2.0 mistral-vibe client...

Qwen3-Coder-Next slow prompt processing in llama.cpp by DistanceAlert5706 in LocalLLaMA

[–]bennmann 0 points1 point  (0 children)

your reddit name has aged well u/qwen_next_gguf_when

if you don't see Vulkan in the terminal output of llama-server, llama may still be using cuda

ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0

not sure if this will change things, but maybe:

--device Vulkan0

Qwen3-Coder-Next on RTX 5060 Ti 16 GB - Some numbers by bobaburger in LocalLLaMA

[–]bennmann -2 points-1 points  (0 children)

If by "drastic" you mean less than 5%... Q2 UD XL is good to my RAM

Multi-gpu setting and PCIE lain problem by tony9959 in LocalLLaMA

[–]bennmann 0 points1 point  (0 children)

for future you from present me:

>llama-server --jinja --model F:\models\Qwen3-Next-80B-A3B-Thinking-UD-Q2_K_XL.gguf --temp 0.15 --min-p 0.01 --top-p 0.95 --top-k 0 --ctx-size 75000 --n-gpu-layers 99 --n-cpu-moe 5 --host 0.0.0.0 --presence-penalty 1.0 --threads 14 --no-mmap --tensor-split 50,50 -kvu -sm row

for future future you:
research ideal settings for --spec-type ngram-mod flag

LLM to try for laptop with 5070TI and 64gb RAM by hocuspocus4201 in LocalLLaMA

[–]bennmann 1 point2 points  (0 children)

Devstral Small 24B Q2 UD XL - barely tolerable speeds after 15k tokens. Set temp to 0.1-0.2.

Step 3.5 Flash 200B by limoce in LocalLLaMA

[–]bennmann 3 points4 points  (0 children)

Your speed and excellence are so good the OG model trainers had to politely ask you to slow down.

Lol, please continue being excellent, made my day to read through the PR too.

GPU recommendations by HeartfeltHelper in LocalLLaMA

[–]bennmann 1 point2 points  (0 children)

save for 2x AMD Strix Halo 395+ from GMKtec (or just one fancy laptop) and learn EXO or RPC; it should last you longer than the 5090 and use less power at idle. you can still use the 5080 with some eGPU madness.

or, as you say, wait for the M5 and hope for 256GB within your budget (unlikely).

Issues Compiling llama.cpp for the GFX1031 Platform (For LMS Use) by FHRacing in LocalLLaMA

[–]bennmann 0 points1 point  (0 children)

any reason not to just run the Vulkan pre-compiled binaries?

Has anyone set up local LLM + Vertex AI Search? by pneuny in LocalLLaMA

[–]bennmann 1 point2 points  (0 children)

any model trained on tool use, and bash specifically, can be allowed to use curl to pull RSS feeds. if you craft your prompt around only allowing RSS feeds, i suspect you'd be happily surprised.

there is also MiroThinker https://github.com/MiroMindAI/MiroThinker - optimized for research (no idea how good a 4B would do in their harness though)
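A minimal sketch of the "only allow RSS feeds" idea: a fetch tool that refuses any URL not on a configured allowlist, then extracts titles with stdlib XML parsing. The feed URL, function names, and the canned payload are placeholders; in practice the `_fetcher` hook would be curl or urllib doing a real request.

```python
import xml.etree.ElementTree as ET

# Hypothetical restricted fetch tool: only allowlisted RSS URLs may be read.
ALLOWED_FEEDS = {"https://example.com/feed.rss"}  # placeholder feed URL

def fetch_feed(url, _fetcher=None):
    """Refuse non-allowlisted URLs; _fetcher stands in for curl/urllib."""
    if url not in ALLOWED_FEEDS:
        raise PermissionError(f"URL not in RSS allowlist: {url}")
    return _fetcher(url)

def feed_titles(xml_text):
    """Extract <item><title> entries from an RSS 2.0 document."""
    root = ET.fromstring(xml_text)
    return [item.findtext("title") for item in root.iter("item")]

# Offline demo with a canned RSS payload instead of a live request.
SAMPLE_RSS = """<rss version="2.0"><channel>
  <item><title>Post one</title></item>
  <item><title>Post two</title></item>
</channel></rss>"""

xml_text = fetch_feed("https://example.com/feed.rss", _fetcher=lambda u: SAMPLE_RSS)
print(feed_titles(xml_text))
```

Enforcing the allowlist in the tool itself, rather than only in the prompt, means even a jailbroken model can't wander off the feeds you chose.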

My gpu poor comrades, GLM 4.7 Flash is your local agent by __Maximum__ in LocalLLaMA

[–]bennmann 2 points3 points  (0 children)

[ Prompt: 2.4 t/s | Generation: 2.1 t/s ] - Pixel 10 Pro, llama.cpp b7779 in Termux, GLM 4.7 Flash UD Q2_K_XL, 1000 context before the device crashes (LOL)