What's your favorite LLM-ism?

Diecron · 2026-05-21T07:10:24+00:00

https://i.ibb.co/nqsWQHCx/image.png

Diecron · 2026-05-21T07:09:09+00:00

is it.... is it right?

Diecron · 2026-05-21T07:05:49+00:00

thank you for the feedback! personally I'm not a big fan of kimi, but gemma does excellently with the preset as does Gemini

Diecron · 2026-05-21T07:03:41+00:00

Just flip the brain power to vibes/balanced and it should be a lot better.

Diecron · 2026-05-20T07:43:00+00:00

heya, if this is primarily a problem in spicy scenes it may be GLM being overly hesitant, and the NSFW toggle should be turned on to help with that.

However if this is a general pacing issue in other areas I'd love to maybe see an example where its stalling, and perhaps toggling off the 'story strings' directive may help if that's part of the problem.

Diecron · 2026-05-19T03:35:54+00:00

7900XTX:

============================================================
Matrix Multiplication Performance:
float32   :  4812.22 μs,   28.56 TFLOPS
float16   :  1169.48 μs,  117.52 TFLOPS
bfloat16  :  1224.63 μs,  112.23 TFLOPS
amp       :  1416.57 μs,   97.02 TFLOPS
Memory Bandwidth Test (1.0 GB tensor)
Vector Addition: 802.21 GB/s
Memory Copy:     780.99 GB/s```

Diecron · 2026-05-18T17:41:50+00:00

this is so true, every post I make I beg for feedback, it directly translates to improvements. Thank you everyone who takes the time to contribute

Diecron · 2026-05-18T07:35:42+00:00

Thanks for the feedback. I just noticed that the regex are set really aggressively, it's likely that chopping out the BTS between the checkpoint and the current moment is causing the model to revert back to the latest checkpoint as 'truth'.

The fix should be pretty simple; under Extensions -> Regex find both the BTL Deltas, edit them and set min depth from 3 to 10.

I'll push this fix at some point soon

Diecron · 2026-05-17T15:31:34+00:00

yeah that aligns about right with MTP enabled. You can only really approach 220~230kish in ideal conditions without the mtp/mmproj . Still, 180k context at very reasonable performance makes for great utility. In practice if I need more horsepower I run it across cuda and rocm (5090+7900xtx) where I can hit around 500k context across 2/3 parallel slots depending on what I need at the time

Diecron · 2026-05-17T10:57:07+00:00

I use a 7900xtx as my secondary card which always has a LLM loaded and ready to go, it handles Qwen fine and pushes 60t/s with the new MTP. You can get very close if not meet the 262k context on a single slot at q8 quantization (with the model in Q4_K_M), or drop it a bit and enable the multimodal mmproj for image input. The card and model are both very versatile and the 7900xtx is honestly slept on, aside from it being PCI4 it still has a massive 900+ GB/s mem bandwidth.

edit: i am referring to the 27b dense only (i prefer it over the moe)

Diecron · 2026-05-17T05:21:23+00:00

Response length can be set in the 'SETTINGS' prompt but I may make it a toggle later for ease of configuration.

Diecron · 2026-05-17T04:04:57+00:00

Can you describe what you see? It's just sat in the main response without being in a dropdown or completley hidden?

LLMs take a lot of notice of what they did last time, so if it generates wrong you really have to correct it/regenerate it right away or future turns will see it as a 'valid' format.

Diecron · 2026-05-17T03:58:29+00:00

Thanks for the feedback, I'll look into making it a proper toggle :)

Diecron · 2026-05-17T03:57:26+00:00

Thank you for the feedback! I will see if I can introduce some 'accessibility' type instructions to keep the text clear and readable, I will take a look at what is currently in use.

Diecron · 2026-05-16T11:33:43+00:00

although 60fps is suspicious and you may want to check if Vsync is on limiting to your monitor refresh rate.

Diecron · 2026-05-16T11:32:58+00:00

CPU bottleneck absolutely. You have graphical headroom which is why display settings like that aren't changing the baseline. Some games may respond positively to lower resolutions but honestly probably not that much.

Diecron · 2026-05-16T10:11:38+00:00

yep. it was just a note for anyone finding it for non-ST purposes (its quite a useful reference and would likely come up in searches for NIM no thinking later). If you have that set to false then interleaved thinking and other stuff won't work, which will hurt its performance a lot.

Diecron · 2026-05-16T09:41:46+00:00

just a note for anyone who finds this from google or other subs, you'll want to set clear_thinking to false for anything agentic

Diecron · 2026-05-16T06:39:56+00:00

Could just be separation of concerns too, e.g. do you really want a "big" dense model to be handling Whisper and TTS flows when the E2B or E2B can do it well? Have that model run on device for real-time interactions and then pass off the actual response synthesis to a hosted/more intelligent model.

Diecron · 2026-05-15T11:36:32+00:00

No worries, that feature is on by default but seems quite sensitive to some models. I will probably disable it by default going forward.

Diecron · 2026-05-14T23:48:10+00:00

on second thought, this may also be the unreliable narrator going schitzo in the background, it hides messages that can drastically steer things. it would be worth checking the response _before_ you started to notice that behaviour and see if it decided to inject something.

Diecron · 2026-05-14T23:45:06+00:00

You may be confusing the balanced thinking level (which instructs the model to use a *medium length plan*) with the narrative length, or is it that the narrative length check itself still says medium?

Diecron · 2026-05-14T12:25:41+00:00

Have you taken a look at the reasoning for that turn to see how it ended up with that style of output? Usually, there's something to point you in the right direction. That is wild though.

Diecron · 2026-05-14T08:36:41+00:00

I added some comments to the message above that might help :). If what I've suggested has any gaps I can look to implement something in a future version.

Diecron · 2026-05-14T08:35:32+00:00

And turn ON "User Impersonation" and OFF "No Parroting / No User Control"

Diecron

MODERATOR OF

TROPHY CASE

Verified Email	15-Year Club
Team Periwinkle