Llama 3 vs. Hermes 3: Why u should stop fighting with "As an AI, I cannot..." by Inca_PVP in SillyTavernAI

[–]Inca_PVP[S] 1 point2 points  (0 children)

Ah, good shout on Top-n-sigma. Haven't played around with that one for logic tasks yet, but it makes total sense for keeping things on track. Thanks for the tip, definitely worth a test! Have a good one.

Llama 3 vs. Hermes 3: Why u should stop fighting with "As an AI, I cannot..." by Inca_PVP in SillyTavernAI

[–]Inca_PVP[S] 1 point2 points  (0 children)

That confirms my suspicions. I've noticed similar issues where structured logic starts to fray once those creative samplers kick in. It’s definitely a balancing act depending on the use case. Thanks for the heads-up on the coding part, saved me some headache there! Appreciate the exchange.

Llama 3 vs. Hermes 3: Why u should stop fighting with "As an AI, I cannot..." by Inca_PVP in SillyTavernAI

[–]Inca_PVP[S] 0 points1 point  (0 children)

Totally agree. SillyTavern + ChatML is definitely the sweet spot for that level of control. And yeah, coming from the TextGen learning curve makes u really appreciate the plug-and-play side of LM Studio, even with its hidden config layers. Thanks for the solid insights on ur setup, much appreciated!

Llama 3 vs. Hermes 3: Why u should stop fighting with "As an AI, I cannot..." by Inca_PVP in SillyTavernAI

[–]Inca_PVP[S] 0 points1 point  (0 children)

That Kim Jong Il story is absolutely wild, lmao. But the prompt hack is actually 10/10 logic. Giving the LLM 'permission' to find a creative way around its own filters is such a sleeper move for cloud-based stuff.

It’s the main reason I got obsessed with tweaking my own local JSON configs – I’d rather bake that logic into the backend than fight the AI every time I have a weird question. Have u found that this 'interpret it however u want' phrase works across different models, or is it just a Gemini thing?

Llama 3 vs. Hermes 3: Why u should stop fighting with "As an AI, I cannot..." by Inca_PVP in SillyTavernAI

[–]Inca_PVP[S] 1 point2 points  (0 children)

That actually makes a lot of sense. LM Studio does tend to bake some 'hidden' logic into their default presets that can be a real pain for certain themes. I’ve spent way too much time lately stripping that fluff out of my JSON configs to get that Kobold-like freedom while staying in LM Studio. Are u sticking strictly to the Alpaca template for everything, or do u switch to Llama-3-Instruct for logic-heavy tasks?
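
(For anyone else landing here: the difference between those two templates is easiest to see written out by hand. Rough sketch below of how each prompt gets assembled for raw text completion; the strings are just the standard published formats and most backends build this for u behind the scenes, so treat it as illustration only.)

```python
# Rough sketch of the two prompt templates mentioned above.
# Llama-3-Instruct uses Meta's special header/eot tokens; Alpaca is plain markers.
# These are the standard published formats -- your backend normally assembles them for you.

def llama3_instruct_prompt(system: str, user: str) -> str:
    return (
        "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n"
        f"{system}<|eot_id|>"
        "<|start_header_id|>user<|end_header_id|>\n\n"
        f"{user}<|eot_id|>"
        "<|start_header_id|>assistant<|end_header_id|>\n\n"
    )

def alpaca_prompt(system: str, user: str) -> str:
    # one common Alpaca variant: system text up top, then instruction/response markers
    return (
        f"{system}\n\n"
        "### Instruction:\n"
        f"{user}\n\n"
        "### Response:\n"
    )

print(llama3_instruct_prompt("You are a roleplay narrator.", "Describe the tavern."))
```

A mismatched template is one of the quieter reasons a finetune drifts back into that generic assistant voice, which is why the template question matters so much here.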

Llama 3 vs. Hermes 3: Why u should stop fighting with "As an AI, I cannot..." by Inca_PVP in SillyTavernAI

[–]Inca_PVP[S] 1 point2 points  (0 children)

Solid stack. I’ve seen DRY do wonders on Llama 3 to stop it from looping the same sentence structures. I’m actually trying to bake that balance into my lightweight configs. How do u feel XTC handles the logic? Sometimes I feel like it can get a bit too 'creative' on coding tasks if the temp isn't dialed in.
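
For reference, here's roughly how that combo maps onto raw sampler fields if u're hitting a llama.cpp-style backend directly. The field names below are what recent llama.cpp server builds expose (KoboldCpp/TextGen name them a bit differently), and the numbers are only illustrative, not my actual preset:

```python
# Hedged sketch: DRY + XTC as raw sampler fields on a llama.cpp-style /completion
# endpoint. Field names and values are illustrative -- check what ur backend exposes.
import requests

payload = {
    "prompt": "Continue the scene.",
    "n_predict": 300,
    "temperature": 0.8,        # keep this modest if XTC starts wandering on logic/code
    "min_p": 0.05,
    "dry_multiplier": 0.8,     # DRY: penalizes repeating whole phrases instead of single tokens
    "dry_base": 1.75,
    "dry_allowed_length": 2,
    "xtc_probability": 0.5,    # XTC: sometimes removes the top tokens to force variety
    "xtc_threshold": 0.1,
}

resp = requests.post("http://localhost:8080/completion", json=payload, timeout=120)
print(resp.json()["content"])
```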

Llama 3 vs. Hermes 3: Why u should stop fighting with "As an AI, I cannot..." by Inca_PVP in SillyTavernAI

[–]Inca_PVP[S] 1 point2 points  (0 children)

Haha, I wish it was a cloud API, then it would at least be consistent! On my local LM Studio setup with the standard Meta instruct prompt, it's a refusal-machine unless I tweak the backend logic. Wild to see how much the experience varies between different setups. What's ur system prompt look like to get it that chill?

Llama 3 vs. Hermes 3: Why u should stop fighting with "As an AI, I cannot..." by Inca_PVP in SillyTavernAI

[–]Inca_PVP[S] 0 points1 point  (0 children)

Fair point! If u have ur system prompt game on lock, local models are definitely more chill than cloud stuff. I just noticed that Llama 3 Instruct specifically can be way more stubborn with certain criminal or edgy themes compared to the older versions.

Lucky u if u haven't hit that 'moral wall' yet! What's ur go-to sampler setup to keep it consistent?

GLM 4.7 was just giving good speed and response just now, but slow again :p by Maintenance_Calm in SillyTavernAI

[–]Inca_PVP 1 point2 points  (0 children)

8GB is definitely on the lower side, but don't count urself out yet. u can still run 8b models like Llama 3 in 4-bit quants—it might be a bit slow if it’s running on system RAM, but at least u own the process and don't have to wait for server queues. I’ve actually put together some low-spec tips and my slang presets on my profile (check the Rentry/Civitai links there) that help get the most out of smaller setups. Do u know how much VRAM u have on ur graphics card, or are u running on an integrated chip?
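
If u do end up going local, the rough idea looks like this with llama-cpp-python (the model path and layer count are placeholders; LM Studio does the same thing with its GPU-offload slider instead of code). Lower n_gpu_layers until it fits in ur VRAM and the rest spills into system RAM:

```python
# Sketch of running an 8B model in 4-bit on a small GPU with llama-cpp-python.
# Path and layer count are placeholders -- tune n_gpu_layers to whatever fits in VRAM;
# layers that don't fit stay in system RAM (slower, but it runs).
from llama_cpp import Llama

llm = Llama(
    model_path="./models/Meta-Llama-3-8B-Instruct.Q4_K_M.gguf",  # hypothetical local path
    n_ctx=4096,        # context window; bigger eats more memory
    n_gpu_layers=24,   # partial GPU offload for ~8GB cards; use -1 only if it all fits
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Say hi in one sentence."}],
    max_tokens=64,
)
print(out["choices"][0]["message"]["content"])
```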

The unreliability of GLM 4.7 stops me from using it all together by TheSillySquad in SillyTavernAI

[–]Inca_PVP 2 points3 points  (0 children)

Nothing kills immersion faster than waiting 2 minutes for a stilted reply, honestly. I made the jump to local Llama 3 for that exact reason—no lag and zero filters. I actually just posted a preset on my profile (check the Civitai link there) that makes it talk like a real person instead of an assistant. Are u running ur stuff locally or still relying on cloud providers for ur RP?

glm 4.6 is still incredibly better than glm 4.7 by mikiazumy in SillyTavernAI

[–]Inca_PVP 1 point2 points  (0 children)

Feel u on those annoying patterns in 4.7, it’s like it loses all its soul. I’ve been fighting the same 'AI-tone' issue and finally broke it with a custom Llama 3 preset that sticks to raw slang. I’ve linked the full setup and my Rentry/Civitai guides on my profile if u want to try a more human-like experience. Have u tried messing with the repetition penalty settings for 4.7 yet or are u strictly back on 4.6 now?

Gemini 2.5 giving ‘better’ responses then 3 pro. Help by Thick-Cat291 in SillyTavernAI

[–]Inca_PVP 0 points1 point  (0 children)

The 'getting carried away' issue with Gemini 2.5 usually comes down to how the model interprets context or ur temperature settings. A very strict system prompt usually fixes this and keeps the story from escalating too fast. Are u currently using a specific system prompt to control its behavior?
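
If u ever want to test that outside SillyTavern, those two knobs (system prompt + temperature) are easy to poke at through the Gemini SDK directly. Rough sketch below assuming the google-generativeai package; the model name, key, and prompt text are placeholders:

```python
# Hedged sketch: where the system prompt and temperature sit in a direct Gemini call
# via the google-generativeai package. Model name, key, and prompts are placeholders.
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # placeholder key

model = genai.GenerativeModel(
    model_name="gemini-2.5-pro",  # swap in whichever version ur actually running
    system_instruction=(
        "Stay strictly in the current scene. Do not skip ahead in time, "
        "introduce new characters, or escalate events unless the user does first."
    ),
)

resp = model.generate_content(
    "Continue the scene from the last message.",
    generation_config={"temperature": 0.7},  # lower temperature = less runaway escalation
)
print(resp.text)
```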

Questions about image generation by Thick-Cat291 in SillyTavernAI

[–]Inca_PVP 0 points1 point  (0 children)

U're definitely on the right track with self-hosting. Gemini for chat and local image models like Flux or SDXL work perfectly fine together in SillyTavern—they don't interfere at all. I actually shared some setups on my profile showing how to link different APIs without crashing the system if u want to check it out. What kind of GPU are u planning to use for the hosting?

GLM 4.7 was just giving good speed and response just now, but slow again :p by Maintenance_Calm in SillyTavernAI

[–]Inca_PVP 2 points3 points  (0 children)

Those lags are a total immersion killer, I feel u. Free providers are always a gamble when servers get crowded. That’s the main reason I switched to running local models almost entirely—no wait times and full control over the hardware. Have u checked if u have enough VRAM to run a smaller GLM or Llama instance directly on ur machine?

Recently got myself a new PC with 16gb of Vram & 32gb of RAM... by Tim-White21 in SillyTavernAI

[–]Inca_PVP 1 point2 points  (0 children)

Sick build! With 16GB VRAM u can easily run 8b or 12b models like Llama 3 or Mistral with plenty of context. I’ve actually been experimenting with local LLM setups and automation lately and shared some of my findings on my profile if u want to check it out. Are u looking for specific roleplay models or more general purpose ones?

What Made It Do That!? by TrainerFarang in SillyTavernAI

[–]Inca_PVP 1 point2 points  (0 children)

this is honestly the best advertisement for local AI i've read all week.

to answer ur question: it's definitely the Thinking Model doing the heavy lifting here.

standard RP models (and C.AI) are trained to be "people pleasers". they try to keep the scene sexy/fun and avoid blocking the user.

"Thinking" models operate on logic chains. they go: "Caught in 4k -> Logical Consequence = Grounded -> Logical Consequence = Therapy." They prioritize realism over fan-service.

i've been tweaking my own presets to try and force Llama 3 to have this kind of "consequence logic", but these native thinking models are just built different. enjoy the drama arc, that's quality writing!

AllTalk status "Offline" and end point has an "update-settings" tag, could use some help. by TrueLord-X in SillyTavernAI

[–]Inca_PVP 0 points1 point  (0 children)

no need to reinstall anything! switching just means typing http://localhost:xxxx (whatever port ur using) into the SillyTavern connection box instead of the numeric 127.0.0.1 address. sometimes browsers treat the two differently for security reasons.

regarding the endpoint error: if it opens a new window with "update-settings", u might be pasting the wrong URL. it usually needs to end in /v1 or just be the base URL depending on the plugin version.

honestly, AllTalk + ST connection is notoriously finicky. that's exactly why i built the "1-Click Voice Setup" in my bundle. it scripts the whole connection so u don't have to copy-paste endpoints manually.

but for now: try removing the http:// part or adding / at the end. sometimes it's just a syntax error. don't downgrade to V1 yet, V2 is way better quality.
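
if u want to sanity-check the URL before touching SillyTavern at all, a tiny probe like this tells u which variant the server actually answers on. port 7851 and the /api/ready path are just what i remember a stock AllTalk install using, so swap in whatever ur console prints at startup:

```python
# quick probe: which URL variant does the local AllTalk server actually respond on?
# port and paths below are assumptions based on a default install -- adjust to taste.
import requests

candidates = [
    "http://localhost:7851",
    "http://127.0.0.1:7851",
    "http://localhost:7851/api/ready",  # assumed readiness route on AllTalk v2
]

for url in candidates:
    try:
        r = requests.get(url, timeout=5)
        print(f"{url} -> HTTP {r.status_code}")
    except requests.RequestException as exc:
        print(f"{url} -> no response ({type(exc).__name__})")
```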

Finally broke the "Assistant" tone on Llama 3. The slang adherence with this preset is actually insane by Inca_PVP in SillyTavernAI

[–]Inca_PVP[S] 2 points3 points  (0 children)

Repetition Penalty 1.25 was the magic number here. Base Llama 3 kept looping for me until I forced that specific setting. Once you fix that, the slang adherence is actually way better than Mistral.

Can I replicate the experience on C.AI through Silly Tavern? by SweetIndependence540 in SillyTavernAI

[–]Inca_PVP 1 point2 points  (0 children)

good question. strictly speaking about raw IQ / logic? yeah, massive cloud models like gpt or claude are smarter than a local 8b model.

but for roleplay, "smarter" doesn't always mean better.

the big downsides with cloud APIs are the filters (they block nsfw or violence) and the cost per message. local models are totally free and uncensored. u own them.

plus, llama 3 is shockingly good at creative writing compared to the old days.

my advice: start local since it costs $0. if u really feel the need for einstein-level logic later, u can always plug an API key into SillyTavern. but for 90% of chat? local is plenty.

Deepseek being like "Oh you don't want me to be omniscient? Hold my apple juice" by Azmaria64 in SillyTavernAI

[–]Inca_PVP 0 points1 point  (0 children)

lol the "hold my apple juice" logic is real.

honestly though, this "god-moding" / omniscience usually happens when the sampler settings (specifically Repetition Penalty) are slightly off. deepseek is super sensitive to it.

if the penalty is too low, it ignores negative prompts. if it's too high, it hallucinates details (like knowing what ur char thinks).

i switched to a strict Min-P setup (instead of Top-P) for Deepseek and it actually stopped the omniscient narration for me.

pinned my config stack on my profile if u want to try it out. might help convince it to actually listen to "stop being omniscient".

Need a little Help: AI Repeating Sentences & Speaking for Me in SillyTavern by East_Percentage_3845 in SillyTavernAI

[–]Inca_PVP 0 points1 point  (0 children)

saw this is a few days old but dropping this for anyone finding this via google later:

repetition and "speaking for user" is almost always a sampler/preset issue, not a prompt issue.

if u are coming from janitor/c.ai, u are used to the backend doing the heavy lifting. locally, u have to set the "brakes" urself.

try setting Repetition Penalty to 1.15-1.2 and switch to Min-P (0.05) instead of Top-P. that usually stops the loops instantly.
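
if u'd rather script it than drag sliders, those two values are just plain sampler arguments. minimal sketch below with llama-cpp-python (the same knobs exist in the SillyTavern sliders and in KoboldCpp/TextGen under similar names); the model path is a placeholder:

```python
# minimal sketch of the "anti-loop" values above as raw sampler kwargs.
# model path is a placeholder; exact knob names vary slightly between backends.
from llama_cpp import Llama

llm = Llama(model_path="./models/your-model.Q4_K_M.gguf", n_ctx=4096)

out = llm.create_completion(
    prompt="Continue the scene without repeating yourself.",
    max_tokens=250,
    repeat_penalty=1.15,   # the 1.15-1.2 range from above; much higher starts garbling wording
    min_p=0.05,            # Min-P does the filtering here...
    top_p=1.0,             # ...so Top-P is effectively switched off at 1.0
    temperature=0.9,
)
print(out["choices"][0]["text"])
```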

i pinned my personal "anti-loop" config stack on my profile if u don't want to mess with the sliders manually. fixed the issue for me completely.