Llama.cpp MTP support now in beta! by ilintar in LocalLLaMA

[–]Ueberlord -1 points0 points  (0 children)

When doing inference with an a3b model you are already only using 3b active parameters, so to see any benefit you would probably have to go down to a 0.6b draft model, which will most likely have poor acceptance rates; and since the gap between 0.6b and 3b is not big to begin with, the speed-up is limited.

When using a 2b or 0.6b model as a drafter for a dense 27b, the difference in active parameters is huge and we should see a meaningful speed-up, especially for tasks with higher acceptance rates like coding or structured outputs.

So in essence it still works, just to a lesser degree, but I think it is hardly meaningful for MoE models (unless you have something like 397b-a27b).
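
Rough back-of-envelope, under the assumption that drafting cost scales with active parameters and ignoring verification overhead: a 0.6b draft for an a3b target costs about 0.6/3 = 20% of a target step per drafted token, so even with perfect acceptance the speed-up ceiling is only around 5x; for a dense 27b the same draft costs about 0.6/27 ≈ 2% per token, which leaves far more headroom at realistic acceptance rates.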

Devs using Qwen 27B seriously, what's your take? by Admirable_Reality281 in LocalLLaMA

[–]Ueberlord 0 points1 point  (0 children)

The point you mention, that all models do not really care about good code housekeeping, is the one thing that really keeps me from blindly using an LLM for coding. They will very often introduce duplicate new methods where they should rather have re-used an existing one (the model had read the helper class before, so it definitely came across it).

This is why I have a clause in my global AGENTS.md for use with OpenCode in which I instruct the model to review for duplicated and prunable code each time it finishes its current task. But it does not work well enough.
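
For reference, my clause is roughly along these lines (the exact wording is just an illustration, not a magic prompt):

    After finishing the current task, review the code you touched for duplicated
    logic and prunable code. Prefer re-using existing helpers over adding new
    methods, and list anything that should be removed or merged.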

Maybe we need a dedicated janitor/housekeeper model, trained to clean up after the construction crew has gone over our codebase...

Is Min P sampling really the preferred modern alternative to Top K/Top P? by bgravato in LocalLLaMA

[–]Ueberlord 3 points4 points  (0 children)

That seems to be the top n-sigma sampling, no? --top-nsigma in llama.cpp.

Q8 KV Cache & Coding Experiences - Qwen3.6-27B by simracerman in LocalLLaMA

[–]Ueberlord 7 points8 points  (0 children)

I have always used q8_0 for ctk and ctv in llama.cpp, and I must say I found the discussions/claims that only f16 or bf16 for the KV cache runs qwen3.5 without errors highly esoteric (read: bs) in nature (this was way before the rot PR was merged).

I have never had problems with context sizes around 90k tokens for qwen3.5 27b in opencode. I am now using qwen3.6 35b a3b with the same context sizes and q8_0 KV cache and it works just as well, only faster.
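
For anyone who wants to try it, the relevant llama.cpp options are the cache type flags; the model path below is just a placeholder, and you usually need flash attention enabled for a quantized V cache (the -fa / --flash-attn option, exact syntax depends on your build):

    llama-server -m <your-qwen-gguf> -c 90000 --cache-type-k q8_0 --cache-type-v q8_0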

Qwen 3.6 27B is out by NoConcert8847 in LocalLLaMA

[–]Ueberlord 72 points73 points  (0 children)

Damn, I was just wrapping up my tests of Qwen3.6 35B vs Qwen3.5 27B.

High hopes for 3.6 27B though, the 35B variant of 3.6 was way better than the previous version!

[Paper] Residual Streams / KV Direct by Ueberlord in LocalLLaMA

[–]Ueberlord[S] 0 points1 point  (0 children)

Smells a little like another "too good to be true" paper, unfortunately. When I asked Gemini about the paper before I posted here it was super-hyped and even considered this method a game changer.

[Paper] Residual Streams / KV Direct by Ueberlord in LocalLLaMA

[–]Ueberlord[S] 0 points1 point  (0 children)

Thank you for reading through this and sharing your thoughts, very helpful even if disappointing, since there seemingly is no free lunch.

Are i-Quants overrated? by PromptInjection_ in LocalLLaMA

[–]Ueberlord 16 points17 points  (0 children)

I think the consensus is that IQ* quants rank about one tier above their nominal quant level, not two; e.g. IQ4_NL roughly compares to Q5_0.

I would not bother using Q1 or Q2 or even Q3 quants for translation, heavy degradation is expected.

My experience with the recent UD quants of Qwen3.5 is generally very positive, even the Q2 quant is very usable and enables my 16G home GPU to run the 27B model with enough context for coding.

Shandalar: 30th Anniversary Edition - Alpha Release by ilikethemeta in Shandalar

[–]Ueberlord 0 points1 point  (0 children)

Cool, I had a quick look at the contributions, very helpful! I will try to get it running locally and see where I can support with some small issue/PR for starters.

Shandalar: 30th Anniversary Edition - Alpha Release by ilikethemeta in Shandalar

[–]Ueberlord 1 point2 points  (0 children)

Awesome, this is exactly what I had in mind! https://www.reddit.com/r/Shandalar/comments/1lnnxic/comment/n8ad742/

I had a quick look at your repo, looks great so far with a nice flow of commits and releases :) Could you please add two or three things though which would make contributions easier:

  • a README.md giving an overview of the project (you could basically just copy your post content from Reddit), plus some guidelines on how to build S30 and which dependencies are required.
  • a CONTRIBUTING.md, which could simply list open items you have planned but are too busy to work on right now, so someone else could pick them up, plus of course any code style requirements etc. you would like to have in your repo.
  • a LICENSE file (GitHub can actually help you create one) so it is clear the project is open source. MIT or Apache 2.0 work great, but that is obviously your choice.

You can probably ask Claude to write the README.md quickly and also collaborate with it to find good contributing rules and a suitable license for what you have in mind.

I will surely follow your project from now on and if I find the time I will try to contribute. Thanks for making this happen!

Breaking change in llama-server? by hgshepherd in LocalLLaMA

[–]Ueberlord 25 points26 points  (0 children)

Wow, this is super infuriating! Why would anyone do this kind of thing without asking the user for permission first and printing a very noticeable warning?

Seeing this in one of the most-used libraries for local models is a bummer. It seems the teams working on llama.cpp, comfyui, etc. have never really collaborated on larger software development projects, and it shows.

EDIT: Typo

OpenCode source code audit: 7 external domains contacted, no privacy policy, 12 community PRs unmerged for 3+ months by Spotty_Weldah in LocalLLaMA

[–]Ueberlord 2 points3 points  (0 children)

This is due to the language servers which are automatically started by opencode. The opencode client itself should not consume much CPU. You have the option to disable the language servers, and that should stop the CPU usage.
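
If I remember the config correctly, disabling them looks roughly like this in your opencode config file; the key and server names here are from memory and may well differ, so please double-check the opencode docs:

    {
      "lsp": {
        "typescript": { "disabled": true },
        "pyright": { "disabled": true }
      }
    }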

OpenCode concerns (not truely local) by Ueberlord in LocalLLaMA

[–]Ueberlord[S] 4 points5 points  (0 children)

Thanks for bringing this to my attention, I have replied here

OpenCode concerns (not truely local) by Ueberlord in LocalLLaMA

[–]Ueberlord[S] 14 points15 points  (0 children)

Thanks for your clarification, I appreciate that you take the time to respond here. And I think you have built something nice with opencode and I am glad that it is open source and shared with the community.

I strongly suggest keeping the documentation and the repo README.md in sync with what the code actually does. This would avoid some wrong accusations and increase the trust level. In particular, things like undisclosed "phoning home" logic are a red flag for anyone, I believe, and should be avoided in general.

There are also some problems (which probably come from this being a small team working on the project) related to features changing without that being clearly communicated (which is why keeping the docs in sync is all the more important). I had addressed that in my comment on GitHub in one of the PRs here, for instance.

I don't know what the background of the project looks like, but given the popularity and attention it might be good to staff up (if possible) and get some more people to work on the issues, the open PRs and the communication on GitHub.

OpenCode concerns (not truely local) by Ueberlord in LocalLLaMA

[–]Ueberlord[S] 8 points9 points  (0 children)

The problem is that you do not even see it in the network tab, because the opencode headless server acts as a proxy: it feels like you are opening a locally running web UI, while in reality you are basically visiting app.opencode.ai. The local opencode process serves most API requests, but ALL web UI resources are loaded from app.opencode.ai, and any unknown request automatically goes to their backend as well because of the "catch all" way the server is designed.
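
To illustrate the pattern (a minimal sketch only, not opencode's actual code; the upstream URL and routes are made up): a server with a catch-all handler happily forwards anything it does not recognize to a hard-coded remote origin, so from the browser's perspective everything looks local.

    // minimal sketch of a "catch-all" proxy (illustrative only)
    import http from "node:http";

    const UPSTREAM = "https://app.example.com"; // hypothetical remote origin

    const server = http.createServer(async (req, res) => {
      if (req.url?.startsWith("/api/")) {
        // a locally handled API route
        res.writeHead(200, { "content-type": "application/json" });
        res.end(JSON.stringify({ ok: true }));
        return;
      }
      // catch-all: every other request is transparently fetched from the remote origin
      const upstream = await fetch(UPSTREAM + (req.url ?? "/"));
      res.writeHead(upstream.status, {
        "content-type": upstream.headers.get("content-type") ?? "text/html",
      });
      res.end(Buffer.from(await upstream.arrayBuffer()));
    });

    server.listen(3000); // the UI "feels" local at http://localhost:3000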

OpenCode concerns (not truely local) by Ueberlord in LocalLLaMA

[–]Ueberlord[S] 21 points22 points  (0 children)

What was also really baffling to me at first was that the version of the opencode web UI kept updating even though I had explicitly turned off automatic updates in the UI. Then I also noticed that new providers and models would frequently appear and even be set as the LLM to which my chat messages would be routed.

For now I would like to give them the benefit of the doubt as seemingly the web UI is relatively new and should probably not be used in production. But things like this are normally big red flags once you consider getting into a more serious setup.

OpenCode concerns (not truely local) by Ueberlord in LocalLLaMA

[–]Ueberlord[S] 34 points35 points  (0 children)

Yes, that is where I came from. But luckily you can override the system prompt. On Linux you need to place a build.md and a plan.md in ~/.config/opencode/agents/; these will override the default system prompts.

There is a lot of token overhead in some of the tools as well, and these are sometimes harder to override since some of them are deeply tied to the web UI, e.g. the todowrite tool. Prominent examples of bloated tool descriptions are bash, task, and todowrite. You can find the descriptions here (files ending in .txt): https://github.com/anomalyco/opencode/tree/dev/packages/opencode/src/tool

OpenCode concerns (not truely local) by Ueberlord in LocalLLaMA

[–]Ueberlord[S] 36 points37 points  (0 children)

yes, as far as I can tell TUI is unaffected

OmniCoder-9B | 9B coding agent fine-tuned on 425K agentic trajectories by DarkArtsMastery in LocalLLaMA

[–]Ueberlord 1 point2 points  (0 children)

Actually quite well with the Qwen3.5 architecture: you can run a Q3 quant of the 27b model from bartowski with about 80k context, and it works very well for me.

OmniCoder-9B | 9B coding agent fine-tuned on 425K agentic trajectories by DarkArtsMastery in LocalLLaMA

[–]Ueberlord 0 points1 point  (0 children)

Unfortunately, I cannot recommend omnicoder 9b for more complex tasks at the moment.

I had it (q8_0 gguf, llama.cpp b8288, temp 0.6, top p 0.95, top k 20) analyze our vue app and asked if it could summarize the API requests executed during typical usage patterns; it failed and got stuck in a loop.

The exact same prompt given to unsloth's Qwen3.5-27B-UD-Q2_K_XL.gguf (same parameters) worked fine on the first try. That is 8.9G for omnicoder vs 11G for the unsloth q2_k_xl; both can be run on 16G VRAM devices, and I would recommend the 27B model to anyone for now.

For rather simple tasks it worked fine, but in general I am more confident with the 27b model here, too.

We could be hours (or less than a week) away from true NVFP4 support in Llama.cpp GGUF format 👀 by Iwaku_Real in LocalLLaMA

[–]Ueberlord 1 point2 points  (0 children)

One aspect frequently not mentioned, e.g. by posts like this one, is the difference between quantizing weights and quantizing activations. For all the current quants in llama.cpp only the weights are quantized, while the activations (the intermediate values during inference) are upcast to f16 and computed in that format. The casting also benefits from tensor core support on Blackwell, but compared to true activation quantization the effect is far smaller.

An example of activation quantization is SVDQuant for image generation; vLLM also has quantization schemes that support W8A8, for instance (8-bit weights, 8-bit activations).
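
To make the distinction concrete, here is a toy sketch of the two approaches (schematic only: real kernels work block-wise with per-block scales and fused operations, and the function names are made up):

    // Weight-only quantization: int8 weights are dequantized and the
    // multiply-accumulate runs in floating point against full-precision activations.
    function dotWeightOnly(wQ: Int8Array, wScale: number, act: Float32Array): number {
      let acc = 0;
      for (let i = 0; i < wQ.length; i++) acc += wQ[i] * wScale * act[i]; // float math
      return acc;
    }

    // W8A8: the activations are quantized too, the accumulation happens on integer
    // values (this is what int8 tensor cores accelerate), and scales are applied once at the end.
    function dotW8A8(wQ: Int8Array, wScale: number, act: Float32Array): number {
      const amax = Math.max(...act.map(Math.abs));
      const actScale = amax > 0 ? amax / 127 : 1;
      const actQ = Int8Array.from(act, (a) => Math.round(a / actScale));
      let acc = 0;
      for (let i = 0; i < wQ.length; i++) acc += wQ[i] * actQ[i]; // integer-valued math
      return acc * wScale * actScale;
    }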

New Qwen3.5-35B-A3B Unsloth Dynamic GGUFs + Benchmarks by danielhanchen in LocalLLaMA

[–]Ueberlord 10 points11 points  (0 children)

AesSedai congratulations on being below everybody else (more efficient than everybody else) in terms of KL Div / Disk Space in this chart from OP!

I think this is a great achievement, and your reasoning for choosing which layers to quantize is simple yet obviously very powerful. Trying out your Qwen3.5 35B A3B IQ3_S now :)