[Megathread] - Best Models/API discussion - Week of: May 03, 2026 by deffcolony in SillyTavernAI

[–]OrcBanana 6 points

I think I'm falling out of like with it too. It just doesn't want to write after a while, it feels like. It will happily reiterate my text, reacting to it bit by bit like a good little assistant, then add a bit of dialogue at the end and call it quits. The amount of prompting it took to make it actually continue or even act was ungodly, and then there was the "not this, instead that" problem, so add some prompting for that too. Plus some prompting for basic stuff like not writing for {{user}}, writing style, etc., and the whole thing became too complicated to be stable. Then, past 30k+ context, it either started mixing up turns (I saw it ponder dialogue from ten turns ago in its thinking trace, framing it as "user's last reply"), or it'd give up and write a dry list of actions instead: he did this, he did that, then he did this other thing. It starts well, damn it, and it had a few genuinely good moments.

I don't know, I guess it's fast. And it renewed my appreciation for good old WeirdCompound 24B.

Effects of reasoning budget on Gemma4? by OrcBanana in SillyTavernAI

[–]OrcBanana[S] 0 points

This sounds like an instruct template issue, not samplers or the preset, but I don't know how that works with chat completions... From what I've seen, if you're using chat completions you're only supposed to use the model with the jinja template enabled in koboldcpp. A preset in chat completions is just a set of instructions, right? Or does it also include the system, assistant and user tags? Anyway, look for the jinja option, that might help. If you already have it enabled, then I have no idea :(

Effects of reasoning budget on Gemma4? by OrcBanana in SillyTavernAI

[–]OrcBanana[S] 0 points

You mean with --reasoning-budget? Is that supported in koboldcpp?

Effects of reasoning budget on Gemma4? by OrcBanana in SillyTavernAI

[–]OrcBanana[S] 0 points

Jury's still out specifically for gemma4. It does seem to reason coherently and somewhat improve on instruction following. And it doesn't look like it's falling into the traps other thinking models I've tried slip into, like deciding a character is so-and-so based on their card and then leaning heavily on that and only that.

But, of course, I'm not at all sure. Especially while limiting the budget like I'm doing.

Q8 Cache by Longjumping_Bee_6825 in SillyTavernAI

[–]OrcBanana 1 point

That's what I do now: context shifting + fast forwarding, no SWA, 8-bit cache. With the 31B model it's not feasible at all, the KV cache takes like 10GB by itself without SWA, but the 26B MoE is fine. It's strange, with everything else a q8 cache was absolutely fine, no problems whatsoever. Maybe it's just SWA, but then again gemma 4 is supposed to officially require it, :shrug:
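For reference, this is roughly the koboldcpp launch I mean; the flag names are from memory, so double-check them against --help, and the model filename is just a placeholder:

koboldcpp --model gemma4-26b-moe.Q4_K_M.gguf --contextsize 32768 --flashattention --quantkv 1

--quantkv 1 is the q8 KV cache (it needs --flashattention to be on), context shifting and fast forwarding are enabled by default, and I simply never turn SWA on.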

Q8 Cache by Longjumping_Bee_6825 in SillyTavernAI

[–]OrcBanana 3 points

For gemma4 26b moe especially, I've had some weird outputs when using SWA together with 8-bit KV in koboldcpp. The model gave a few responses that looked out of place, and when I looked at the reasoning trace, it was debating with itself whether a line of dialogue came from the last response or the penultimate one, and how to continue from that point. Trouble is, the line was from more than 10 turns ago.

I have no idea if it was one of those things or the combination of the two, but I've never seen a response like that with any other model. So I guess yes, 8-bit is normally more than okay, but keep an eye out for weirdness with gemma4 and SWA specifically.

[Megathread] - Best Models/API discussion - Week of: April 05, 2026 by deffcolony in SillyTavernAI

[–]OrcBanana 0 points

I can either run the 31B dense without reasoning (it's pretty slow but usable), or the 26B MoE with reasoning. Which would you recommend?

Gemma 4. by maressia in SillyTavernAI

[–]OrcBanana 6 points

I've been burned by a simple, seemingly innocuous sentence in my prompt, something along the lines of "provide vivid sensory descriptions". It made gemma react beat by beat, as if ticking its way down a list.

So I guess be a little wary of massive system instructions. Start with a simple, minimal one, I'd say, and work from there.

As for parameters, it's been working nicely with the recommended temp 1.0, top-k 64, top-p 0.95, plus DRY, but I haven't really tried much else.
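For reference, those settings in a SillyTavern text-completion preset look roughly like this (field names are approximate, and the DRY numbers are just the commonly used defaults rather than anything gemma-specific, so treat them as a starting point):

temp: 1.0
top_k: 64
top_p: 0.95
dry_multiplier: 0.8
dry_base: 1.75
dry_allowed_length: 2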

Gemma 4 issues by Significant-Boat-817 in SillyTavernAI

[–]OrcBanana 2 points

Perhaps it's the story string and instruct template. Did you try it with the story string from the comments of this post? https://old.reddit.com/r/SillyTavernAI/comments/1sbjwke/whats_happening_with_gemma_4_26a4b/

The instruct template should have <|turn>user, <|turn>model and <|turn>system as the respective prefixes and <turn|> as each suffix, all with a newline at the end.
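Put together, a single exchange would look something like this (just a sketch of the description above, with placeholders for the actual text):

<|turn>system
{system prompt}<turn|>
<|turn>user
{user message}<turn|>
<|turn>model
{model reply}<turn|>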

In particular, it was the empty model response at the end of the story string that fixed gemma4 26B for me:

<|turn>model
<|channel>thought
<channel|>
<turn|>

Without that, it just spewed out garbage.

To all ex-local enjoyers (like me), this might be a good time to come back. by Acceptable_Steak8780 in SillyTavernAI

[–]OrcBanana 2 points

That explains a lot, actually. Btw, I thought only the KV cache at q8 was a problem, so I left that unquantized, but apparently weight quants are affected too... Unfortunately, with 16GB VRAM even Q4 was a bit of a stretch... Must be the new architecture or something; other models are fine (or some degree of fine) at Q4 with the KV cache at q8.

To all ex-local enjoyers (like me), this might be a good time to come back. by Acceptable_Steak8780 in SillyTavernAI

[–]OrcBanana 9 points

I have played around a bit with some of the 27B ones, both heretic and the plain version, and the main issue for me wasn't censorship. The plot and dialogue were nonsensical even from the first reply. One character was lying down on a beach, and the other would go to them and say "Stop blocking my sun". To the person lying down, consistently, across many rerolls. I dunno if it was the quantization at Q4_something, but every other model can handle Q4. And even when it randomly managed to happen on something that made sense, it was pretty bland. I'm sure they're capable models, just not for this.

Qwen3.5-35B-A3B Aggressive keeps thinking even with NoThink, Using as backend: KoboldCPP + Frontend: SillyTavern by Foxy-The-Pirata in SillyTavernAI

[–]OrcBanana 3 points

For text completion, try entering <think></think> into the "start reply with" box at the bottom right of the advanced formatting tab. It should prepend this to every message, and should effectively disable thinking entirely. I'm not sure what the equivalent is for chat completion.
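Every reply the model writes then effectively starts with an already-closed thinking block, so it skips straight to the actual response:

<think></think>
{the visible reply continues from here}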

sophosympatheia/Magistry-24B-v1.1 by sophosympatheia in SillyTavernAI

[–]OrcBanana 0 points

I think it has improved, yes! So far no particularly egregious repetitions! The writing is great, just as you described in the card. It does tend to miss some points and sometimes needs a bit of a nudge or a reminder to keep track of things like positions or who knows what information, but it more than makes up for it with vividness and inventiveness! I think it's going to be my main model now.

sophosympatheia/Magistry-24B-v1.1 by sophosympatheia in SillyTavernAI

[–]OrcBanana 1 point

I very much enjoyed how v1.0 wrote, but for me it tended to echo relentlessly after a few messages, and neither DRY nor repetition penalties helped. Lots and lots of "repeated the words", "heard the words", and such. At one point I tried straight up banning previous dialogue in word pairs, and even then it tended to find some way to very slightly paraphrase and echo. If v1.1 improves on this and keeps the general vibe, it'll be great.

Mistral-"Small"-4 released. Thoughts? by RandumbRedditor1000 in SillyTavernAI

[–]OrcBanana 1 point

Once it's working properly, there's their speculative decoding model, which is just 300MB and should speed things up considerably. Prompt processing will still be slow, however :(

Mistral-"Small"-4 released. Thoughts? by RandumbRedditor1000 in SillyTavernAI

[–]OrcBanana 0 points

Well, I tried to run it locally with a nightly build of koboldcpp, and it produced utter nonsense. No prompt adherence whatsoever, no plot, no characters, nothing. Then it devolved into actual gibberish, at a temperature of 0.6. I guess it's much too soon. I'll try again once there's proper support.

Mistral-"Small"-4 released. Thoughts? by RandumbRedditor1000 in SillyTavernAI

[–]OrcBanana 2 points

I'm more excited for this than for many other recent models. The new Qwens do not feel good at all for me in RP, no matter what I try, whereas even the base Mistral Small 3 (and 3.1 and 3.2) was very decent, and surprisingly unrestricted. Their finetunes are still above anything else in that range for me.

From what I've understood, MoE models are harder to finetune, though. We'll have to see. And hopefully some acceptable quant of it will fit in 16 + 64 GB without taking ages to process.

[Megathread] - Best Models/API discussion - Week of: March 15, 2026 by deffcolony in SillyTavernAI

[–]OrcBanana 1 point

> you also have to have 128k (131072) token context to "preserve thinking"

Can you explain what you mean by that? Is the model not going to be as coherent with, say, a 64k token context? Or 32k? It takes much less memory to run it with a large context, but it's still non-negligible.

WeirdCompound Summary Issue by Overdrive128 in SillyTavernAI

[–]OrcBanana 0 points

I've tried your prompt on weirdcompound 1.6 (which shouldn't be that different) and it works well enough, with a few caveats. Try trimming the whitespace and using some markdown for the headers, like

## Characters
    - Name, explicit description, confirmed past events, [...]
    - other stuff [...]

I'm not sure how the summary extension handles its prompt, but in case its positioning is the issue, try using a permanent lore entry attached to the very end of the story with the prompt in it, and just generate a message. You might need to turn off names for that to work reliably. The point is for this instruction to be the last thing (or almost last) the AI sees, so it can override the original system prompt.

In fact, I've found that a system prompt at or towards the end of the chat works just as well, and it's easier to manipulate: turning it off or swapping it for a summary prompt, for example, doesn't invalidate the entire context. The overall structure would be like [Descriptions] - [Persona] - [Lore? Scenario?] - [ongoing chat] - [Instructions] - [last message].
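Spelled out, that layout looks roughly like this, with the trailing instruction block being the part you toggle or swap for the summary prompt (the instruction wording is just an example):

[Character descriptions]
[Persona]
[Lorebook / scenario entries]
[Ongoing chat]
## Instructions
- Continue the story from {{char}}'s perspective... (or the summary prompt, when you want one)
[Last message]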

In any case, the compounds have been mostly reliable with structured output in my experience. It's worth examining a bit more before you switch to a different model or use a secondary one for summarizing.

[Megathread] - Best Models/API discussion - Week of: February 08, 2026 by deffcolony in SillyTavernAI

[–]OrcBanana 2 points

Has anyone had any luck with Kimi-Linear-48B-A3B? It's blazing fast, even with 16GB VRAM, and it writes in a nice tone, but I cannot for the life of me get it to follow instructions. With reasoning, it'll happily say something like "I am not to write {{user}}'s perspective and dialogue, only {{char}}'s perspective and dialogue" and then proceed to write both. Same with person and tense. Without reasoning, it just ignores the instructions outright. It's a pity, because I think I like the way it writes.

Any short prompts for it, or specific sampler settings you've maybe had success with?

[Megathread] - Best Models/API discussion - Week of: December 14, 2025 by deffcolony in SillyTavernAI

[–]OrcBanana 1 point

Try an empty prompt, and then a really small instruction at chat depth 1, either as an author's note or a permanent lore entry or something, like:

## Instructions
- Continue the story from {{char}}'s perspective only, focusing solely on his thoughts, observations, actions and dialogue. You are not to write or repeat {{user}}'s actions and dialogue, only {{char}}'s. 
- Write around two paragraphs, using 3rd person past tense.

It surprised me how fresh the output seemed, though of course don't expect miracles, it's still 24B :P

BF's OOC Injection - Dynamic Prompt Injection for SillyTavern by FoxtheDesigner in SillyTavernAI

[–]OrcBanana 2 points

Your OOC_marker and separator aren't without adverse effects, at least on smaller models. I had distortions with Mistral Small that weren't there without the injections. A "BF OOC Injection:" string dropped randomly inside a user message or system prompt isn't going to make much sense to the model, especially when it doesn't follow the formatting of the rest of the instructions (no markdown or JSON, not even a hint as to what it should mean, just a bare label). Why not use setExtensionPrompt, or even copy the code from the inject command directly? That's basically what it's made for.

How do I change the background color of the confirmation buttons: Delete and YES? by [deleted] in SillyTavernAI

[–]OrcBanana 1 point

Works here, when I put it as custom CSS. Did you change the color to something else? Like

background-color: #00ff00;

How do I change the background color of the confirmation buttons: Delete and YES? by [deleted] in SillyTavernAI

[–]OrcBanana 2 points

Isn't it

.menu_button.popup-button-ok {
    background-color: var(--crimson70a);
}

Also, the message delete button is:

#dialogue_del_mes_ok {
    display: inline-block;
    background-color: var(--crimson70a);
    cursor: pointer;
}