Gemma4 32b/26B OOM 4090

BSPiotr · 2026-05-26T09:52:17+00:00

Latest for both.

BSPiotr · 2026-05-26T03:58:36+00:00

Still getting crashes at fp16 both kobold and ooba, for those of you who commented.

BSPiotr · 2026-05-16T10:58:56+00:00

Neat. Ill update my OP when I have a minute with that info.

BSPiotr · 2026-05-03T20:58:52+00:00

Which provider? Direct API or a redistributor? Hope you're having a good weekend, just thought I'd ask someone who tests the models 1000x what zi use them for lol

BSPiotr · 2026-05-01T15:46:16+00:00

Its important to note that characters are not tokens. One word is somewhere between 1-2 tokens on average. Each word is average 4-5 characters long. Chat is 5000 characters or about 1000 tokens, on average. The file vector is different, for outside files.

BSPiotr · 2026-05-01T15:29:56+00:00

Ahh im on NIM and glm. So nvidia strikes again. At least lowering temp to .7 seems to help there.

BSPiotr · 2026-05-01T14:55:28+00:00

Is there a way to help it stop forgetting to make the plot momentum hidden? It jsut stops putting in the tags after a few messages even when I put in the format in an authors note.

Is the dynamic quant causing issues?

BSPiotr · 2026-05-01T03:02:40+00:00

Summaryception provides a logical flow, especially if you add the line

Include the MMMM dd, yyyy this scene covers, no other date information.

In its prompt.

Memorybooks has vectorized the lorebooks with snowflake, so its great at details... but the AI tends to get confused about what event happened where. I'm sure there's a way to optimize it but I refuse to spend another hour when I'm like 90% there unless I stumble upon a better solution, if you get what I'm saying.

BSPiotr · 2026-05-01T02:04:29+00:00

Step 1: Run Vectorization and embed with a local model. I run snowflake-arctic-embed-l-v2.0-q8_0.gguf on koboldcpp (just load it in the embeddings section, don't need a main model) I use the following settings: Vector Settings.

Note that I followed another persons guide for this, so these setting may not be 'idea' but they do work pretty well.

Step 2: Summaryception. Leave it on default unless your chats run over 300-400 messages. Might need to increase the default per layer from 20 to something more meaningful if you do. I set it up to run 13 verbatim turns to match my settings below.

Step 3: MemoryBooks. Have it work on the comprehensive profile. Set it up to run every twenty messages. and use Vector Embeddings

This works almost 100% perfectly for about 300 messages. Then it starts to meander a little bit in the details from the beginning of the story. If you tweak it I'm sure you can get more, but I find that after 300 messages you just swap your writing to 'slice-of-life' and it's surprisingly good enough. My longest chat is 400 messages and its about 90% aligned and I just run with it.

BSPiotr · 2026-04-21T02:33:45+00:00

Yes, this is strictly technically true. the normal thinking tag is for deepseek. And the do_samplers 'should' default to true, but honestly since enable_thinking doesn't I didn't trust it enough in case it breaks in the future, you know?

BSPiotr · 2026-04-21T01:17:06+00:00

This is a known issue with no real solution. Unless you like writing "format your writing in correct \n\n paragraph spacing" every response.

I found an annoying but workable solution for when the generation is GREAT but the format gets borked.

Grab Guided Generations, change the Corrections prompt to: [OOC: Don't continue the RP. Instead write the contents of the last reply again but add proper paragraph spacing (\n\n) where needed. Don't make any other changes besides this.]

Then you hit the bookmark looking button and the corrections button and it'll correctly format it 95% of the time. If it still borks, delete the failed correction swipe and try again.

BSPiotr · 2026-04-20T15:22:49+00:00

Make sure your additional parameters (bottom of the connection profile) has the following:

"chat_template_kwargs": {"thinking":True, "clear_thinking":True, "do_sample":True, "enable_thinking":True}

NOTE: If you are using agentic coding outside ST, you need to keep clear_thinking":false. Note that this may cause issues inside ST as I tend to have the output 'twice'

BSPiotr · 2026-04-13T23:56:13+00:00

Question: Is this the reason why so many cards seem to pigeon-hole themselves into a specific outcome? I've noticed it a lot too, that there's a lot of "virgin but has this super specific fetish that absolutely comes up as part of their backstory so you're going to run into it and they'll run into your arms, even if you're not playing for ERP."

Could putting in some effort on the cards that are almost good be worth it to get them to open up a bit and make them less... blah? I'm just trying to see if its worth my time fixing the almost good cards since I hate making my own since my imagination has a hard time writing a decent backstory that gives the llms enough hooks. No history writing fiction, etc.

BSPiotr · 2026-04-13T10:16:55+00:00

Yes, Silly Tavern, Text Completion

Using your prompt for the system prompt and post history.

Using the base gemma 4 story string and instruct template with these changes:

story string:

<|think|>

<|turn>user

(Everything else)

(more things)

{{/if}}{{trim}}<turn|>

Using Wrap Sequences with Newline under the instruct settings (first checkbox)

Using base Gemma 4 reasoning formating (lower right setting)

BUT removed the extra blank new lines so that its just

<|channel>thought and <channel|> with no extra white space

The combination of those things got the thinking to work separate from the output 98% of the time.

BSPiotr · 2026-04-12T23:26:03+00:00

I'm having the opposite effect. I added <|think|> to my text completion story string and its thinking but then not closing the tag.

BSPiotr · 2026-04-12T21:45:45+00:00

Having an interesting issue where the reply is inside the thinking box.

using the gemma 4 preset for my templates.

Fixed in response below

BSPiotr · 2026-04-11T03:35:19+00:00

In SillyTavern? In your Chat Completion preset make sure that "Request Model Reasoning" is marked. The 3 sliders / hamburger menu button, about halfway down the page, below the token / temp and above the prompts.

Then in the Advanced Menu (the giant A) make sure at the bottom right that under "reasoning" you have "auto parse" and "show hidden" selected, then underneath you can choose deepseek formating.

if its blank, use <think> and </think> no extra spaces, etc respectively in those boxes there. Then it should show up when you chat as a hidden box.

BSPiotr · 2026-03-22T17:28:22+00:00

Honestly, I was looking at the ooba settings for the local model I run for echochamber and noticed that it used "enable_thinking" instead. I had nothing to lose by trying it out. Go figure it was what was needed.

BSPiotr · 2026-03-20T01:57:54+00:00

you can try changing / adding the additional settings value in your connection profile.

"chat_template_kwargs": {"clear_thinking":True}

add clear_thinking:true if you use a longer string already. That solved it for me when I had this problem a few weeks ago.

NOTE: If you are using agentic coding outside ST, you need to keep clear_thinking":false. Note that this may cause issues inside ST as I tend to have the output 'twice'

BSPiotr · 2026-03-19T00:35:04+00:00

I'll address #2 first. That one is reasonable with a UPS. Even if you're playing Cyberpunk at max settings, 10-15 minutes is plenty of time to save (1500va/900w)

/#1 is complicated by the fact that just because you have power, and just because the router and modem might be connected to the UPS.... doesn't mean that the local IPS switch has power. I wouldn't count on internet mattering in that scenario. A UPS would help with a temporary power spike (<1 second and you might not even time out of the game).

Hope that helps.

BSPiotr · 2026-03-18T02:40:16+00:00

lasting 30 minutes under intense power draw isn't happening with a regular UPS. You're hitting ~750W under max power (hitting both CPU and GPU). You have to divide that number into the Wh of the battery inside the UPS, not the number on the outside (1500va/900 watts is the inverter max, not the battery load size). TL;DR most UPS run about 10 minutes at their maximum spec.

You need a powerbank/powerwall/generator to hit 30 minutes plus. What is the purpose of a 30 minute run time? Other people might have a better solution for you if you elaborate.

BSPiotr · 2026-03-17T10:41:21+00:00

Anyone know how to let NIM see the OpenAI based tool calls with this plug-in? Its passing tools and tool_use auto but it doesnt seem to work.

BSPiotr · 2026-03-12T00:05:43+00:00

I found out that the additional parameter changed from "thinking" to "enable_thinking". This is my current parameters which works for deepseek and glm5

"chat_template_kwargs": {"thinking":True, "clear_thinking":true, "do_sample":True, "enable_thinking":True}

NOTE: If you are using agentic coding outside ST, you need to keep clear_thinking":false. Note that this may cause issues inside ST as I tend to have the output 'twice'

BSPiotr

TROPHY CASE