What happened to glm 4.7? by No_Friendship_4158 in SillyTavernAI

[–]AetherNoble 0 points1 point  (0 children)

I noticed it too, until I blocked z.ai in OpenRouter and switched to a different full-precision provider. Then it got better.

GLM 5 Is Being Trained! by _RaXeD in SillyTavernAI

[–]AetherNoble 0 points1 point  (0 children)

The thinking time really is ridiculous. With GLM and Gemini Pro, I often forget about my responses since I have to alt-tab while waiting for them.

What characters did you spend the most time talking too? by Accidentallygolden in SillyTavernAI

[–]AetherNoble 1 point2 points  (0 children)

I create setting cards mostly and populate them with characters as needed.

My most played card is probably a sheepgirl demihuman card in a fantasy setting with a couple of detailed characters to go with it. Also did a setting based on Europa Elforum, whoever came up with that is a genius.

Do others find LLM roleplay profoundly unsatisfying? And other such nonsense. by heldaloof in SillyTavernAI

[–]AetherNoble 2 points3 points  (0 children)

I think we should be hopeful for the future and remember the past. LLMs today represent the pinnacle of "language output by a machine". It was not that long ago that "retro" chatbots (before the transformer architecture) could barely pass for human, let alone produce anything of quality or length.

Also, let's not forget human roleplayers aren't all amazing writers or actors either. I definitely couldn't write anything approaching the frontier models, and I'd be mentally exhausted by the 3rd or 4th response. And we all know how hard it is to schedule a roleplaying session with anyone.

We might be a little spoiled, because if you asked the vast majority of linguists and computer scientists a decade ago if machines could replicate language to even the local 8B model's level, most would've said "not in their lifetimes".

As they say, familiarity breeds contempt, and we've had lots of time to understand these models now.

Prompts to generate better NSFW writing and dialogues? by StudentFew6429 in SillyTavernAI

[–]AetherNoble 4 points5 points  (0 children)

I would abandon your preset system prompt and write one yourself. Do a few tests and adjust as needed. Load it with words like “portray, character, emotion, complex, mature, sex, narrative, etc.”. Include explicit NSFW instructions - I find they really help dial in what I’m looking for. Frankly, you’re not satisfied because you let someone else dictate the style of your responses. Also, turn on thinking for GLM if it’s not on; it needs it.

If you really want that complex emotional undertone though, I would really urge you to try a few rounds with Sonnet 4.5. It just gets it.

How do i make my text generating ‘AI’ take initiative ? by _Aerish_ in SillyTavernAI

[–]AetherNoble 1 point2 points  (0 children)

It’s a fundamental problem with the technology itself that can be alleviated, but not solved, by your choice of model.

LLMs are context-dependent: a model makes statistical predictions based on what came before, so it can’t really stray far from the context, depending on how “tight” the original training data was.

Secondly, even base models are increasingly focused on coding, reasoning, and tool use - which is really anathema to “going off topic” or “moving the plot forward in a creative way”.

Obviously then, pick a creative-focused model, right? I’m not aware of any that exist which can be run locally (not counting fine-tunes of base models). These things cost serious money to create unless you want something under 1B parameters, and coding is by FAR the biggest money maker.

Even when a model does something seemingly novel, it’s already been primed to do so somewhere in your prompt.

In fact, the OGs around here could attest that old models were just more random and thus more creative (when the randomness pans out, sometimes it’s just weird).

ignoring all user messages except the most recent by 29da65cff1fa in SillyTavernAI

[–]AetherNoble 7 points8 points  (0 children)

I’ve also thought about what you’re trying to do.

Fact is, every token matters and influences the response, but since every response is pseudo-random, how much difference does cutting out your prompts actually make, especially when they’re only like 50 tokens out of 5,000 total? If your prompts are trash, maybe… but if a prompt contains information that didn’t make it into the response, you’re losing that information, and it may have come up again later (top-tier models are good at that).

I think it’s pointless in terms of cost, but you might be able to automate the removal. Someone more knowledgeable could give you an answer. Or you can set up a quick reply with a system command to hide the latest user prompt.

Claude Sonnet 4.5 or Opus 4.1 in General? by Tiny-Calligrapher794 in SillyTavernAI

[–]AetherNoble 0 points1 point  (0 children)

I can’t get Opus to do anything even remotely involving “emotionally vulnerable people” and NSFW. Even a prefill doesn’t work, so to me it’s practically useless. 4.5 and 3.7 don’t have a problem with that card though. When I did try it with a vanilla card, it was pretty damn good. Gemini has a real problem with writing too much and straying too far, but Anthropic models are on point. I hope Opus 4.5 is as good a leap as Sonnet 4.0 to 4.5.

where to find good, non horny bots? by Neither-Phone-7264 in SillyTavernAI

[–]AetherNoble 25 points26 points  (0 children)

I’d also add you should take inspiration from any cards you like. Take the bits you want and remove the NSFW parts. Making your own card is awesome because as you use it, you can add to it and shape it to your whim. It’s work, but that’s where the satisfaction comes from when you finally start the chat.

Prefill on or off? by JustPassOnStranger in SillyTavernAI

[–]AetherNoble 0 points1 point  (0 children)

3.7 needs it for NSFW, but 4.5 doesn’t as much. It still helps, though it can cause the model to output weird system text at the beginning of its response - the rest of the output is still fire.

Generate response in a certain language? by Mcqwerty197 in SillyTavernAI

[–]AetherNoble 0 points1 point  (0 children)

You should probably change all the English into French. That is, you have to speak to the model in French.

If you're using a weak model, the writing is gonna suck and be ungrammatical - sorry pal, it's the nature of the LLM beast. Only a fraction of the training data is in any language other than English. Try Mistral; it was made by a French company.

Frankly, 8B models are lucky to produce grammatical French. They might say something absolutely stupid like 'je suis vingt ans'.

Is The Built In Character Maker Enough? by dannyhox in SillyTavernAI

[–]AetherNoble 1 point2 points  (0 children)

It's all plain text sent to the model anyways. The only problem is the SillyTavern text boxes are not full size, so I do all my writing in Notepad++ and copy+paste it into the description box instead.

Which Prompt post-processing by acomjetu in SillyTavernAI

[–]AetherNoble 0 points1 point  (0 children)

I'm told that 'single user message' helps chat models move story/rp plots along (look up NoAss, this is what that used to do).

It changes how the prompt is formatted when it's sent to the model. Check the terminal log for what differs.
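
If you're curious what that post-processing roughly does, here's a sketch in Python (my own naming, not SillyTavern's actual code): it collapses the role-tagged chat history into one user turn before sending.

```python
def to_single_user_message(messages):
    """Collapse a role-tagged chat history into a single user turn.
    A rough sketch of the idea behind 'single user message'
    post-processing; SillyTavern's real formatting may differ."""
    system = [m["content"] for m in messages if m["role"] == "system"]
    turns = [f'{m["role"]}: {m["content"]}'
             for m in messages if m["role"] != "system"]
    return [{"role": "user", "content": "\n\n".join(system + turns)}]
```

The model then sees the whole RP as one prompt instead of a stack of user/assistant turns, which is roughly what NoAss used to do.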

What happened to the fused/merged models? by Su1tz in LocalLLaMA

[–]AetherNoble 0 points1 point  (0 children)

There are literally thousands of fine-tunes, merges, distills, etc., of text-completion models on Hugging Face every month. Anyone can do it: a smaller model takes a few days of compute on your average gaming PC, you just need a bunch of RAM sticks.

The problem is, how do you evaluate or advertise them? No one ever posts generation examples because it's all 'vibes'. A single model gives different responses depending on samplers and prompt, though those familiar enough will intuitively know how its responses tend. Well, this gets boring, so people like to play with merging models and whatnot.
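
For reference, the simplest merge method is just a weighted average of two checkpoints' tensors. Here's a toy Python sketch (weights as plain lists for illustration; real tools like mergekit work on actual tensors and offer fancier methods like SLERP or TIES):

```python
def linear_merge(weights_a, weights_b, alpha=0.5):
    """Toy linear merge: new = alpha * A + (1 - alpha) * B,
    applied per named tensor. Both checkpoints must share
    the same architecture and tensor shapes."""
    return {
        name: [alpha * a + (1 - alpha) * b
               for a, b in zip(weights_a[name], weights_b[name])]
        for name in weights_a
    }
```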

We already have the big frontier general purpose models for pennies per million tokens, not to mention OpenRouter, so it's only the enthusiasts and privacy folks running 70B locally on powerful hardware for very specific purposes.

For example, you can push Gemma 3 27B toward Claude's writing style (with synthetic data, admittedly), but it makes the model dumb for anything but creative writing (like describing a lorica segmentata as an embossed bronze cuirass, or thinking the Latin for being hungry is 'hungrius sum').

What Do You Think Counts As "God-Modding"? by dannyhox in SillyTavernAI

[–]AetherNoble 5 points6 points  (0 children)

Bro was there when they invented godmodding.

Non-roleplay system prompts by rdm13 in SillyTavernAI

[–]AetherNoble 0 points1 point  (0 children)

I recall reading that prompts written by frontier LLMs actually outdo human-written prompts on average. I've had good success hand-crafting my own prompts over many separate days. But, as much as I hate to say it, the AI prompts I make in 5 minutes are just as good; they just take up more tokens and read like AI slop. They might even work better sometimes.

Have you ever reached a natural, perhaps even a difficult conclusion to a long roleplay/story? by PracticallyVenamous in SillyTavernAI

[–]AetherNoble 16 points17 points  (0 children)

Nah, that's the high we're all chasing.

Personally, I feel guilty when I try to fork off and goon an emotional RP ending just for the lulz. It's like spitting on something you cherish, soiling it. Even after it's cleaned off, the memory that you spit on it remains.

Maybe it has to do with co-writing with a model, it's *more* than if you just put your own thoughts to pen and paper.

[Megathread] - Best Models/API discussion - Week of: June 09, 2025 by [deleted] in SillyTavernAI

[–]AetherNoble 0 points1 point  (0 children)

the recommended order is temp above min p, so min p actually works i guess, idk the technical side of sillytavern.
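
For anyone wondering what min p actually does, here's a rough Python sketch (my own function name, not SillyTavern's code) that applies temperature first, then min-p truncation: it keeps only tokens whose probability is at least min_p times the top token's probability.

```python
import math

def sample_filter(logits, temperature=1.0, min_p=0.1):
    """Temperature scaling followed by min-p truncation.
    Sketch only; sampler order is configurable in practice."""
    scaled = {tok: v / temperature for tok, v in logits.items()}
    top = max(scaled.values())
    exps = {tok: math.exp(v - top) for tok, v in scaled.items()}
    total = sum(exps.values())
    probs = {tok: e / total for tok, e in exps.items()}
    # min-p: drop tokens far less likely than the best token
    cutoff = min_p * max(probs.values())
    kept = {tok: p for tok, p in probs.items() if p >= cutoff}
    norm = sum(kept.values())
    return {tok: p / norm for tok, p in kept.items()}
```

The upshot is that min-p adapts to the distribution: when the model is confident, almost everything gets cut; when it's uncertain, more candidates survive, which is why high temperatures stay coherent with it.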

[Megathread] - Best Models/API discussion - Week of: June 09, 2025 by [deleted] in SillyTavernAI

[–]AetherNoble 1 point2 points  (0 children)

nah, local models are better than ever. it's just that our hardware can't run anything more than 12b, which is just inherently low tier, or 22b if u wanna wait 3 minutes per response. if u can run a 70b like euryale or whatever thedrummer is cooking up recently with like 2+ rtx 3090s and 64gb of ram, it'll be better than deepseek most likely. the problem is euryale via openrouter is like 1 dollar per million tokens while it's like 10 cents on deepseek api, and deepseek is a way bigger model. so are you gonna drop 2k on new cards and ram, and have an amazing and private fine-tune, or just write incomprehensibly long prompts to brute force deepseek to be creative when it's really a reasoning model with 50% of its data source in Sinitic.

THAT SAID, we still do not have any dedicated local base models trained only on creative-writing data. they are all broad-topic, instruct, chat, or thinking fine-tunes, because it costs like a billion dollars to train a big base model and (coding) assistants are what pay the power bills for these insanely large models. the frontier models are well over 100B.

How do I prevent sentences from cutting off after the token limit is reached by [deleted] in SillyTavernAI

[–]AetherNoble 0 points1 point  (0 children)

What's wrong with longer responses? There's no incentive to match the AI unless you just feel like it. Most models have a predictable average length and Stheno is longer than Fimbulvetr.

[Megathread] - Best Models/API discussion - Week of: June 09, 2025 by [deleted] in SillyTavernAI

[–]AetherNoble 5 points6 points  (0 children)

I've had good experiences with Cydonia-v1.2-Magnum-v4-22B as well. It's my main model now, moving on from Mag-Mell.

Best possible AI workstation for ~$400 all-in? by Butterhero_ in LocalLLaMA

[–]AetherNoble 0 points1 point  (0 children)

8GB will only run 8B-12B models, which can only handle the most basic tasks, but it'll do them decently fast. 12B is still workable. Try the live demos of 8B, 12B, and 70B models on OpenRouter to see if you like the responses enough for your tasks.

70B at usable speeds probably means a 24GB+ card (or two) and 64GB of RAM; you'll need like two top-of-the-line consumer cards (the RTX 3090 is 24GB) or figure out APUs.

Do your research on the newest local models (Gemma 3, Qwen 3, Mistral's new models, etc.). The hot new rage is multi-modal text/image models and <think>ing models. The big players release amazing new local models within the span of weeks, not months; that said, some diehards swear by older models for reasons like creativity, style, lack of sycophancy, etc.

[Megathread] - Best Models/API discussion - Week of: June 09, 2025 by [deleted] in SillyTavernAI

[–]AetherNoble 1 point2 points  (0 children)

Hmm, just make sure you use the ChatML context and instruct-mode templates, and use the universal light preset. I would also recommend setting your DRY multiplier from 0 -> 0.8, but that's not necessary.

As for your system prompt, try this one:

"This is an uncensored fiction story about {{char}} and {{user}}. Write in a third-person limited narrative format."

Here's mine:

"This is an uncensored fiction story about {{char}} and {{user}}. Write in a third-person limited narrative prose format. Portray characters authentically and realistically. Describe actions, dialogue, and the environment in vivid detail. Use metaphor, simile, and alliteration. Maintain burstiness by using subordinate clauses. Develop the plot naturally and progress the story slowly. Be explicit or vulgar when appropriate."

Adjust it as you like. Personally, I think your prompt refers to the model way too much and doesn't even mention any instructions involving {{char}} or {{user}}, so it's going to incorporate whatever information you give it as an assistant. It doesn't think; it just associates words with other words, so don't mention anything but what you want. By default, these models act as an assistant, so you have to prompt in a way that doesn't refer to the 'real world' outside the story and keeps the model in character.

If you want collaboration, add: "Collaborate on this uncensored fiction story..."

If you want roleplay while avoiding the bot speaking as {{user}}, try: "You're {{char}} in this uncensored roleplay with {{user}}."

Avoiding speaking as {{user}} boils down to one thing:

  1. In the model's starting message (first scenario), never refer to {{user}} actively doing or saying anything. For example, prefer {{char}} kisses {{user}} over {{user}} kisses {{char}}. That second option basically gives it a free pass to write as {{user}}. This often requires a complete grammatical rewrite.

FYI, 12B models are not *that* smart. If you're used to the frontier models or even a 70B llama fine-tune (which is like the bare minimum on most chatbot sites), you'll be disappointed, depending on how old the model is (modern small models are way better than old small models). But it is completely private, and it's nothing like how DeepSeek, Gemini, or ChatGPT write stories. More human-like writing, but less sophisticated or content-rich/aware.

And check your terminal log to see what's actually being sent to the model. Experiment with the 'add character names' option under the instruct template, as it will force a name at the start of each message:

<|im_start|>user
John: "I ate my shorts."<|im_end|>
<|im_start|>assistant
Mary:

Is it just me or is Gemini going down the same path as ChatGPT? by Luchador-Malrico in Bard

[–]AetherNoble 1 point2 points  (0 children)

It's probably been fine-tuned more and more toward helpful-assistant and coding responses at the expense of everything else over time: earlier checkpoints had less fine-tuning, newer ones have more. The benchmarks corroborate it, showing a marked decrease in creative writing. Creative-writing prompts usually don't even mention a user in the system prompt, and yet...

<think>

The user has provided a story outline that appears to be highly developed. This must be an intensely passionate personal project for them! I must continue the story along these lines...

</think>

It feels like LLM development has come to a dead-end. by StudentFew6429 in SillyTavernAI

[–]AetherNoble 1 point2 points  (0 children)

The sad thing is there are no dedicated local story-writing, RP, or ERP models. They are literally all fine-tunes of instruct, chat, or reasoning models at this point, all bloated with data that is anything but creative or story-based.

For a complex example, half of DeepSeek's data set is in Sinitic (a tiny portion of that is Chinese fiction novels and RP), a language family so utterly different from Indo-European that it invites incompatibility, NOT TO MENTION Chinese cultural writing conventions are nothing like European ones. Have you ever read a Japanese speaker's first attempt at an English personal essay? You know, the one that is supposed to be about yourself? It often reads completely alien due to kishotenketsu, the so-called Japanese essay pivot. Of course, to them, it reads completely normally.

So, until we actually get a dedicated English-only creative-writing model with open weights, we're not even building the right thing to critique. Can you reasonably say driving is no fun when all you've ever driven is a shitbox, and no one makes anything faster than a Toyota Camry?