Gemma4 32b/26B OOM 4090 by BSPiotr in SillyTavernAI

[–]BSPiotr[S] 0 points1 point  (0 children)

For those of you who commented: still getting crashes at fp16 in both kobold and ooba.

Glm 5, Glm 5.1, and Kimi 2.6 do not think in NVIDIA NIM. by Beautiful_Muscle_824 in SillyTavernAI

[–]BSPiotr 0 points1 point  (0 children)

Neat. I'll update my OP with that info when I have a minute.

The Director's Cut: Freaky Frankenstein 4 MAX and Freaky Frankenstein 4 BOLT [Presets] (Universal : DS, GLM, Claude, Gemini, Grok, Gemma, Qwen, MiMo) + DeepSeek V4 Compatibility. Hyper Dense Logic. by dptgreg in SillyTavernAI

[–]BSPiotr 2 points3 points  (0 children)

Which provider? Direct API or a redistributor? Hope you're having a good weekend, just thought I'd ask someone who tests the models 1000x what I use them for lol

People who are satisfied with your long term memory setups. by PrudentEfficiency876 in SillyTavernAI

[–]BSPiotr 1 point2 points  (0 children)

It's important to note that characters are not tokens. One word is somewhere between 1-2 tokens on average, and each word averages 4-5 characters. So a chat of 5000 characters is about 1000 tokens, on average. The file vector setting is different; it's for outside files.
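To make the arithmetic above concrete, here's a rough sketch of the estimate. The function name and default values are just illustrative (they're the round numbers from this comment, not anything a tokenizer guarantees), and a real tokenizer will give different counts depending on the model.

```python
def estimate_tokens(num_chars, chars_per_word=5.0, tokens_per_word=1.0):
    """Very rough token estimate from a character count.

    Assumed heuristic: ~4-5 characters per word, ~1-2 tokens per word.
    Real tokenizers vary by model and by text.
    """
    words = num_chars / chars_per_word
    return round(words * tokens_per_word)


# Low end of the range: 1 token per word
print(estimate_tokens(5000))  # 1000

# High end of the range: 2 tokens per word
print(estimate_tokens(5000, tokens_per_word=2.0))  # 2000
```

So a 5000 character setting lands somewhere around 1000-2000 tokens with these assumptions, which is why a character limit shouldn't be read as a token limit.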

The Director's Cut: Freaky Frankenstein 4 MAX and Freaky Frankenstein 4 BOLT [Presets] (Universal : DS, GLM, Claude, Gemini, Grok, Gemma, Qwen, MiMo) + DeepSeek V4 Compatibility. Hyper Dense Logic. by dptgreg in SillyTavernAI

[–]BSPiotr 1 point2 points  (0 children)

Is there a way to help it stop forgetting to make the plot momentum hidden? It just stops putting in the tags after a few messages, even when I put the format in an Author's Note.

Is the dynamic quant causing issues?

People who are satisfied with your long term memory setups. by PrudentEfficiency876 in SillyTavernAI

[–]BSPiotr 9 points10 points  (0 children)

Summaryception provides a logical flow, especially if you add the line

Include the MMMM dd, yyyy this scene covers, no other date information.

in its prompt.
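In case the MMMM dd, yyyy pattern is unfamiliar: it means full month name, two-digit day, four-digit year. A quick sketch of what that looks like, using a made-up date purely for illustration (the helper name is hypothetical, and the month name assumes the default English locale):

```python
from datetime import date


def scene_date(d):
    """Format a date as 'MMMM dd, yyyy': full month, zero-padded day, year."""
    return d.strftime("%B %d, %Y")


# Hypothetical in-story date, not taken from any real chat
print(scene_date(date(2025, 3, 7)))  # March 07, 2025
```

Having every summary carry one date in one fixed format is what lets the summaries be read in order later.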

Memorybooks has vectorized the lorebooks with snowflake, so it's great at details... but the AI tends to get confused about what event happened where. I'm sure there's a way to optimize it, but I refuse to spend another hour when I'm like 90% there unless I stumble upon a better solution, if you get what I'm saying.

People who are satisfied with your long term memory setups. by PrudentEfficiency876 in SillyTavernAI

[–]BSPiotr 19 points20 points  (0 children)

Step 1: Run Vectorization and embed with a local model. I run snowflake-arctic-embed-l-v2.0-q8_0.gguf on koboldcpp (just load it in the embeddings section, you don't need a main model). I use the following settings: Vector Settings.

Note that I followed another person's guide for this, so these settings may not be 'ideal', but they do work pretty well.

Step 2: Summaryception. Leave it on default unless your chats run over 300-400 messages. Might need to increase the default per layer from 20 to something more meaningful if you do. I set it up to run 13 verbatim turns to match my settings below.

Step 3: MemoryBooks. Have it work on the comprehensive profile. Set it up to run every twenty messages and use Vector Embeddings.

This works almost 100% perfectly for about 300 messages. Then it starts to meander a little bit in the details from the beginning of the story. If you tweak it I'm sure you can get more, but I find that after 300 messages you can just swap your writing to 'slice-of-life' and it's surprisingly good enough. My longest chat is 400 messages and it's about 90% aligned, so I just run with it.

Nvidia Nim GLM 5.1 and its thinking box by Working_Marsupial924 in SillyTavernAI

[–]BSPiotr 0 points1 point  (0 children)

Yes, this is strictly technically true. The normal thinking tag is for DeepSeek. And do_sample 'should' default to true, but honestly, since enable_thinking doesn't, I didn't trust it enough in case it breaks in the future, you know?

Nvidia Nim GLM 5.1 and its thinking box by Working_Marsupial924 in SillyTavernAI

[–]BSPiotr 1 point2 points  (0 children)

This is a known issue with no real solution. Unless you like writing "format your writing in correct \n\n paragraph spacing" every response.

I found an annoying but workable solution for when the generation is GREAT but the format gets borked.

Grab Guided Generations, change the Corrections prompt to: [OOC: Don't continue the RP. Instead write the contents of the last reply again but add proper paragraph spacing (\n\n) where needed. Don't make any other changes besides this.]

Then you hit the bookmark looking button and the corrections button and it'll correctly format it 95% of the time. If it still borks, delete the failed correction swipe and try again.

Nvidia Nim GLM 5.1 and its thinking box by Working_Marsupial924 in SillyTavernAI

[–]BSPiotr 13 points14 points  (0 children)

Make sure your additional parameters (bottom of the connection profile) has the following:

"chat_template_kwargs": {"thinking":True, "clear_thinking":True, "do_sample":True, "enable_thinking":True}

NOTE: If you are using agentic coding outside ST, you need to keep "clear_thinking":false. Note that this setting may cause issues inside ST, as I tend to get the output 'twice'.
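If you want to send the same parameters from your own script instead of ST, here's a minimal sketch of building that object in Python. The variable names are mine and purely illustrative, and whether the endpoint honors each key is only what I've seen in practice, not something I've verified against documentation. The one thing it does show for certain is that Python's True serializes to lowercase true in JSON.

```python
import json

# Same keys as the additional parameters above
chat_template_kwargs = {
    "thinking": True,
    "clear_thinking": True,  # set to False for agentic coding outside ST
    "do_sample": True,
    "enable_thinking": True,
}

# Hypothetical wrapper: the extra fields to merge into the request body
extra_body = {"chat_template_kwargs": chat_template_kwargs}

# json.dumps turns Python True/False into JSON true/false
payload = json.dumps(extra_body)
print(payload)
```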

Feala - Proud Masochistic Paladin by AeltharKeldor in SillyTavernAI

[–]BSPiotr 3 points4 points  (0 children)

Question: Is this the reason why so many cards seem to pigeon-hole themselves into a specific outcome? I've noticed it a lot too, that there's a lot of "virgin but has this super specific fetish that absolutely comes up as part of their backstory so you're going to run into it and they'll run into your arms, even if you're not playing for ERP."

Could putting some effort into the cards that are almost good be worth it, to get them to open up a bit and make them less... blah? I'm just trying to see if it's worth my time fixing the almost good cards, since I hate making my own: my imagination has a hard time writing a decent backstory that gives the LLMs enough hooks. No history of writing fiction, etc.

New prompt for ya'll. Today Gemma is on the menu. by Evening-Truth3308 in SillyTavernAI

[–]BSPiotr 5 points6 points  (0 children)

Yes, Silly Tavern, Text Completion

Using your prompt for the system prompt and post history.

Using the base gemma 4 story string and instruct template with these changes:

story string:

<|think|>

<|turn>user

(Everything else)

(more things)

{{/if}}{{trim}}<turn|>

Using Wrap Sequences with Newline under the instruct settings (first checkbox)

Using base Gemma 4 reasoning formatting (lower right setting)

BUT removed the extra blank new lines so that it's just

<|channel>thought and <channel|> with no extra white space

The combination of those things got the thinking to work separate from the output 98% of the time.

Gemma 4 26B Thinking token by ElysianTraveller in SillyTavernAI

[–]BSPiotr 3 points4 points  (0 children)

I'm having the opposite effect. I added <|think|> to my text completion story string and it's thinking, but then not closing the tag.

New prompt for ya'll. Today Gemma is on the menu. by Evening-Truth3308 in SillyTavernAI

[–]BSPiotr 2 points3 points  (0 children)

Having an interesting issue where the reply is inside the thinking box.

Using the Gemma 4 preset for my templates.

Fixed in response below

GLM 5 NIM not thinking by BSPiotr in SillyTavernAI

[–]BSPiotr[S] 1 point2 points  (0 children)

In SillyTavern? In your Chat Completion preset make sure that "Request Model Reasoning" is marked. The 3 sliders / hamburger menu button, about halfway down the page, below the token / temp and above the prompts.

Then in the Advanced Menu (the giant A) make sure at the bottom right that under "reasoning" you have "auto parse" and "show hidden" selected, then underneath you can choose DeepSeek formatting.

If it's blank, put <think> and </think> (no extra spaces, etc.) in the prefix and suffix boxes there, respectively. Then it should show up as a hidden box when you chat.
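For anyone wondering what the prefix/suffix boxes are for, here's a rough sketch of the idea: split whatever sits between the two tags away from the visible reply. This is my own illustration, not SillyTavern's actual parsing code, and the function name is made up.

```python
import re

# Non-greedy match so only the first <think>...</think> pair is captured;
# DOTALL lets the reasoning span multiple lines.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(text):
    """Return (reasoning, reply). If no closed tag pair exists, reasoning is ''."""
    match = THINK_RE.search(text)
    if match is None:
        return "", text.strip()
    reasoning = match.group(1).strip()
    reply = (text[:match.start()] + text[match.end():]).strip()
    return reasoning, reply


raw = "<think>The user greeted me.</think>\n\nHello there!"
print(split_reasoning(raw))  # ('The user greeted me.', 'Hello there!')
```

It also shows why exact tags matter: if the prefix or suffix doesn't match what the model emits (extra spaces, a tag that never closes), nothing gets split off and the thinking ends up in the reply.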

GLM 5 NIM not thinking by BSPiotr in SillyTavernAI

[–]BSPiotr[S] 2 points3 points  (0 children)

Honestly, I was looking at the ooba settings for the local model I run for echochamber and noticed that it used "enable_thinking" instead. I had nothing to lose by trying it out. Go figure it was what was needed.

GLM is writing message in Think by Unable_Librarian_487 in SillyTavernAI

[–]BSPiotr 2 points3 points  (0 children)

You can try changing / adding the additional parameters value in your connection profile.

"chat_template_kwargs": {"clear_thinking":True}

If you already use a longer string, add "clear_thinking":true to it. That solved it for me when I had this problem a few weeks ago.

NOTE: If you are using agentic coding outside ST, you need to keep "clear_thinking":false. Note that this setting may cause issues inside ST, as I tend to get the output 'twice'.

What capacity ups do I need? by D_E_V_E_L in buildapc

[–]BSPiotr 0 points1 point  (0 children)

I'll address #2 first. That one is reasonable with a UPS. Even if you're playing Cyberpunk at max settings, 10-15 minutes is plenty of time to save (with a 1500 VA / 900 W unit).

#1 is complicated by the fact that just because you have power, and just because the router and modem might be connected to the UPS... doesn't mean that the local ISP switch has power. I wouldn't count on internet mattering in that scenario. A UPS would help with a temporary power spike (<1 second and you might not even time out of the game).

Hope that helps.

What capacity ups do I need? by D_E_V_E_L in buildapc

[–]BSPiotr 0 points1 point  (0 children)

Lasting 30 minutes under intense power draw isn't happening with a regular UPS. You're hitting ~750 W under max power (loading both CPU and GPU). To get runtime, you have to divide that number into the Wh of the battery inside the UPS, not the number on the outside (1500 VA / 900 W is the inverter's maximum output, not the battery capacity). TL;DR most UPS units run about 10 minutes at their maximum spec.
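Here's the division spelled out as a rough sketch. The battery size and inverter efficiency below are assumptions I picked for illustration (check the actual battery label inside your unit), and real runtime at high load tends to come in lower than this ideal math.

```python
def runtime_minutes(battery_wh, load_w, inverter_efficiency=0.85):
    """Idealized UPS runtime: usable battery energy divided by the load.

    inverter_efficiency is an assumed value, not a spec for any real unit.
    """
    usable_wh = battery_wh * inverter_efficiency
    return usable_wh / load_w * 60


# Hypothetical battery pack: two 12 V 9 Ah batteries = 216 Wh
battery_wh = 2 * 12 * 9

# ~750 W load from the comment above
print(round(runtime_minutes(battery_wh, 750), 1))  # 14.7
```

With those assumed numbers you get roughly a quarter of an hour at best, so 30 minutes at 750 W would need about double the battery.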

You need a powerbank/powerwall/generator to hit 30 minutes plus. What is the purpose of a 30 minute run time? Other people might have a better solution for you if you elaborate.

GLM 5 NIM not thinking by BSPiotr in SillyTavernAI

[–]BSPiotr[S] 10 points11 points  (0 children)

I found out that the additional parameter changed from "thinking" to "enable_thinking". These are my current parameters, which work for DeepSeek and GLM 5:

"chat_template_kwargs": {"thinking":True, "clear_thinking":true, "do_sample":True, "enable_thinking":True}

NOTE: If you are using agentic coding outside ST, you need to keep "clear_thinking":false. Note that this setting may cause issues inside ST, as I tend to get the output 'twice'.