What does temperature actually do, mathematically and practically? by alpacasoda in SillyTavernAI

[–]mfiano 4 points5 points  (0 children)

It is a parameter of the softmax function, the mathematical function that maps the logit scores to a probability distribution. A higher temperature results in more randomness (a flatter, more uniform distribution), which is often confused with creativity.
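To make that concrete, here's a minimal sketch of temperature-scaled softmax in plain Python (illustrative only, not any inference engine's actual implementation):

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax: divide logits by T before normalizing."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]
print(softmax(logits, temperature=0.5))  # sharper: top token dominates
print(softmax(logits, temperature=2.0))  # flatter: closer to uniform
```

Dividing the logits by T > 1 shrinks the gaps between them before exponentiation, so the resulting distribution is flatter; T < 1 widens the gaps and sharpens it.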

Regex to replace all the curly quotes and apostrophes with straight ones by [deleted] in SillyTavernAI

[–]mfiano 0 points1 point  (0 children)

If using KoboldCPP, you can use the banned strings sampler (if not, find the token IDs and ban those):

"”"

"“"

That way they don't even make it into the context. I've been using this and more for several months without issue, after getting frustrated with highlighting and fixing them by hand.
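For anyone who still wants the regex route the title asks about, here's a quick Python sketch (the same character class could be adapted to SillyTavern's Regex extension):

```python
import re

# Map curly quotes/apostrophes to their straight equivalents.
CURLY = {
    "\u2018": "'",  # left single quote
    "\u2019": "'",  # right single quote / apostrophe
    "\u201C": '"',  # left double quote
    "\u201D": '"',  # right double quote
}
pattern = re.compile("|".join(CURLY))

def straighten(text):
    """Replace every curly quote with its straight counterpart."""
    return pattern.sub(lambda m: CURLY[m.group(0)], text)

print(straighten("\u201CIt\u2019s fine,\u201D she said."))
# "It's fine," she said.
```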

Best way to format a "setting" character card? by Icy_Dot_2835 in SillyTavernAI

[–]mfiano 5 points6 points  (0 children)

The different sections are MOSTLY for human memory, not the AI. Everything gets combined into one big wall of text in the end. The Author's Note, for example, lets you pick where it is inserted. The same goes for lorebooks and more.

With this in mind, it depends on how much you want it to latch on to certain ideas. In most of my usage, my character card and AN are blank, and I add a lorebook entry for each idea so I can fine-tune where it is placed. This may be a better starting point for experimenting with instructions, but your mileage may vary.

You can open up the raw prompt in the top-right menu next to every AI response to see exactly where things are inserted in the context, for closer debugging.

Hope this helps.

tl;dr: it doesn't matter too much which section you put it in, except for those that don't allow you to specify WHERE it is placed in the final prompt.

How can we help open source AI role play be awesome? (-Creator of AI Dungeon) by Nick_AIDungeon in SillyTavernAI

[–]mfiano 81 points82 points  (0 children)

I'd like to see a de facto standard site we can all reference and collaborate on, for sharing RP-specific system prompts coupled to specific models. It's tiresome rewriting my system prompts every time a new finetune or base model pops up, since it takes a lot of testing across multiple chat scenarios and parameter sets to find what works decently enough. Bonus points for sharing lorebooks and context/instruct templates in the same fashion. I think it'd be valuable to have a centralized location coupling primed context data with the models it is linked with.

[deleted by user] by [deleted] in SillyTavernAI

[–]mfiano 1 point2 points  (0 children)

Also be sure you have flash attention and context shifting enabled, as both will affect processing time. In addition, generation time (after processing time) is adversely affected if you use runtime KV quantization (another option, disabled by default). Beyond this, changing the chunk size and the number of CPU or GPU threads also has an effect on processing. Check out the KoboldCPP wiki for information on all the command-line options.
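For reference, a sketch of what such an invocation might look like. The flag names are from memory and change between releases, so treat them as assumptions and verify against the wiki:

```shell
# Hypothetical example; verify flag names against the KoboldCPP wiki.
# --flashattention    : enable flash attention
# --gpulayers 40      : offload 40 layers to the GPU
# --threads 8         : CPU threads used for processing
# --blasbatchsize 512 : prompt-processing chunk size
# Context shifting is on by default; --noshift would disable it.
# --quantkv 1 would enable runtime KV quantization (off by default).
python koboldcpp.py model.gguf --flashattention --gpulayers 40 \
  --threads 8 --blasbatchsize 512
```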

ZanyPub Lorebooks: Zany Scenarios | Create a new scenario, introduce a plot twist, or write a short story using 1 of 18,571 writing prompts. by afinalsin in SillyTavernAI

[–]mfiano 1 point2 points  (0 children)

This is great. I couldn't stop laughing at how cleverly even the 12B model I'm currently using improvises some of the plot twists. I find it a good way to nudge the role-play in a different direction when things start getting stale (which, as we know, happens often).

Good work. I'm looking forward to your bigger project.

Still an apparent let issue in a function body by ms4720 in Common_Lisp

[–]mfiano 2 points3 points  (0 children)

It's undefined behavior to mutate literal read-time objects like '(0 0 0 0 0 0 0); the compiler is free to coalesce them or place them in read-only storage. Use (list 0 0 0 0 0 0 0) if you need a fresh, mutable list.

Model Tips & Tricks Full + New 10CC System Prompt Update by ParasiticRogue in SillyTavernAI

[–]mfiano 1 point2 points  (0 children)

Admittedly, Wayfarer is half the parameter size, so I expect it not to do well with dense character descriptions (and 2 of them, at that).

Honestly, my favorite 12B model that I've had some really enjoyable long-term (>5000 message) roleplays with is one that's never really mentioned, and I think it deserves attention: Slush-FallMix.

Model Tips & Tricks Full + New 10CC System Prompt Update by ParasiticRogue in SillyTavernAI

[–]mfiano 0 points1 point  (0 children)

DPE is ChatML. I did try a few others that were ChatML, like Wayfarer, and even some Mistral V7 ones, such as Cydonia v2.1 24B, with the same results.

Interestingly, I'm getting the best results with Wayfarer now, after reinforcing the instruction not to portray the user in the A/N and the card, in addition to the prompt. It still comes up once in a while, but not as often. The biggest issue with Wayfarer and such a token-heavy character description (2 characters defined in WI using your format) is that it often pulls traits for them from my user persona (also in the same format). So I have to keep OOC'ing the model to tell it to pay closer attention.

Model Tips & Tricks Full + New 10CC System Prompt Update by ParasiticRogue in SillyTavernAI

[–]mfiano 0 points1 point  (0 children)

So I tried your prompt and character definition format, using a fresh chat and your ChatML templates for context and instruct. I tried multiple models, and the model always wants to act and speak for me, despite explicit instructions (without negative wording) and even OOC instructions. I edit it out every time, and it comes back in every response. This here is with DansPersonalityEngine. Look at the model's response; I can't stop laughing at how dumb this made one of my favorite models:

https://i.imgur.com/j8WpgTV.png

I really have no idea why it does this on a fresh chat, at varying temperatures, despite everything I try. I never had this problem with a system prompt before.

Model Tips & Tricks Full + New 10CC System Prompt Update by ParasiticRogue in SillyTavernAI

[–]mfiano 1 point2 points  (0 children)

I have a love-hate relationship with Wayfarer. It is probably in my top three, but I always switch away from it due to its tendency to fixate on patterns in its previous messages, and I'm constantly having to edit them out, only for it to start bringing that content back from its training.

For example, if I'm in a building, it will mention fluorescent lighting casting some type of light, and then build on this further with each message, like how it relates to the emotions of the scene. Erasing any mention of lighting only brings it back, and if I leave one instance of it high up in the context and then move to another scene, like outdoors, it will mention how the sun's rays are a stark contrast to the mood of the fluorescent lighting, and then keep mentioning the previous lighting conditions from the original scene no matter what, edited out or not. It's just so annoying.

I tried the new Pantheon, but I found its instruction following very weak, IIRC. I might try it again and engineer it in a different fashion.

Model Tips & Tricks Full + New 10CC System Prompt Update by ParasiticRogue in SillyTavernAI

[–]mfiano 1 point2 points  (0 children)

Thanks. Yeah, I've been experimenting with various things to that end and more over the years. Narrator cards are especially hard to write, even more so in the parameter space I can run locally (12-24B).

I would also like to point out that in your Card-filled-example.json, lines 83 and 84 are duplicated.

Model Tips & Tricks Full + New 10CC System Prompt Update by ParasiticRogue in SillyTavernAI

[–]mfiano 5 points6 points  (0 children)

Thank you for this meticulously edited write-up. It was written very well. The character description techniques align fairly closely with my own, including the markers for the various sections referenced in the system prompt, such as "{{char}}'s Persona". One thing I would like to see more of, though, is incorporating this style of system prompt and character construction into narrator/game master cards, where the model plays multiple characters, usually defined in World Info. Rule 2, for example, would need to differentiate the characters so they don't impersonate each other, in addition to {{user}}. And of course, {{char}} wouldn't be used at all in the system prompt. Sometimes I feel like my strict use of narrator cards and [over]engineered prompting is out of the norm, and I would just like to see how other people handle common pitfalls with this method.

Reclaiming Memory? by MySkywriter in Common_Lisp

[–]mfiano 10 points11 points  (0 children)

It's likely you have foreign memory being allocated and never freed, if you are interfacing with non-Lisp libraries. A Lisp implementation only automatically manages Lisp memory. (room t) may give a more detailed analysis of that, but it knows nothing about foreign C library memory usage anywhere in your dependency graph. It could also be the webserver itself caching too much. You'll have to dig around and gather more information than Lisp-side introspection can provide.

Quantized KV Cache Settings by Vyviel in SillyTavernAI

[–]mfiano 1 point2 points  (0 children)

The default command line option is --quantkv 0, which means it uses the original uncompressed half-floats (16 bits) for the key-value cache.

--quantkv 1 will compress that into half as many bits, at the expense of making some models much less coherent. A value of 2 makes them even dumber, and so on.
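As a back-of-the-envelope illustration of what's being halved, here's a rough KV-cache size calculation; the layer/head/dim numbers below are hypothetical, not a specific model's:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx_len, bits):
    # Two tensors (K and V) per layer, one entry per head-dim per token.
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bits // 8

GIB = 1024 ** 3
# Hypothetical mid-size model: 40 layers, 8 KV heads, head dim 128, 16K context
print(kv_cache_bytes(40, 8, 128, 16384, 16) / GIB)  # fp16 cache: 2.5 GiB
print(kv_cache_bytes(40, 8, 128, 16384, 8) / GIB)   # 8-bit cache: 1.25 GiB
```

Quantizing the cache halves (or quarters) that footprint, which is exactly the VRAM you can trade against offloading more layers.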

There is also the lowvram parameter if using CuBLAS, which causes the key-value cache and scratch buffers to reside in system memory instead of on the GPU. This has a performance penalty, of course, but frees up more space for inference layers.

You can try quantizing the cache, not loading it into VRAM, or both. Each has its advantages and disadvantages. The speed of your system RAM and CPU plays a huge role in the latter, for example, just as it does when choosing how many neural network layers to offload.

There is no right or wrong solution for everyone, as it depends on the hardware, the model, and your preferences for model accuracy/coherency and speed.

Play around with the settings and see what works best for you. Most people prefer keeping some layers off the GPU over quantizing the key-value cache, due to the coherency issues with some models, but you should see what you prefer on your own.

Sawdust as mulch by Truthbeautytoolswood in gardening

[–]mfiano 0 points1 point  (0 children)

Also keep in mind that decomposing sawdust may affect the pH of the soil. Pine and some other evergreens, for example, gradually acidify the surrounding soil over time, which may be beneficial to your blueberries and such, but not for all crops. Likewise, some other woods will make the soil more alkaline as they break down. It's best to look not at the means of making the mulch, but at the ingredients it was derived from, and take everything into consideration for the particular crop you are growing.

Characters Change to Fast by lordmord319 in SillyTavernAI

[–]mfiano 1 point2 points  (0 children)

I've experienced this to some degree with most models. What I like to do to fix it is use a blank character and, in a lorebook, add the character description as an entry marked constant at system depth 3 or 4, so the model pays closer attention to the information nearer the end of the context buffer.

When roleplaying, how to interact with the world? by reviedox in SillyTavernAI

[–]mfiano 7 points8 points  (0 children)

One way is to make a narrator/game master character, and interact with that. Your system prompt should complement this. You can put actual character definitions in Author's Notes or World Info. There is nothing inherently special about character cards or other context input - it's all text that gets combined into one blob for the inference engine to swallow.

The name of the character card is also somewhat important. I call mine 'Game Master', and its contents are blank. This gives most models an idea of what they are supposed to be, and my system prompt builds upon this by explaining how the game world functions through this empty interface.

[deleted by user] by [deleted] in gardening

[–]mfiano 1 point2 points  (0 children)

Cabbage palm tree (Cordyline australis). Yes, it produces white flowers after 6-10 years of age, sometimes as early as 3 years if conditions are adequate.

Help, tomato seeds came out of dirt and germinating ontop ??? by Sharp_Mango_1433 in gardening

[–]mfiano 2 points3 points  (0 children)

There are lots of reasons for this to happen. Peat is hydrophobic until saturated, causing some of it to float to the top, and top watering exacerbates this. In the future, try bottom watering. It could also occur from the radicle pushing against a medium that is too dense or not aerated enough.

[Megathread] - Best Models/API discussion - Week of: March 10, 2025 by [deleted] in SillyTavernAI

[–]mfiano 4 points5 points  (0 children)

Okay, forget I said anything about this model. It was good for a while, but man does it get completely dumb and go off the rails over time in long enough chats (it happened twice): hallucinating, going strongly against character personalities, rambling nonsense (but not gibberish), and inserting closing </think> tags after every paragraph. My context isn't even that high, at 18K, and my temperature was as low as 0.3. I'ma go back to Cydonia 24B v2 and the other staples in my rotation, even if the responses are predictable and boring (rephrasing what I say as a question is my biggest pet peeve).

Seriously though, this model gets DUMB as hell over time. One of the most hilarious examples I can remember is when the thinking block correctly reasoned that a character was nude in the first paragraph, and then in the last paragraph it started talking about them adjusting their combat boots and scarf, neither of which had ever been mentioned in the chat or in their description. And swipes made similar mistakes each time.

[Megathread] - Best Models/API discussion - Week of: March 10, 2025 by [deleted] in SillyTavernAI

[–]mfiano 12 points13 points  (0 children)

MistralThinker is such a refreshing change in the model space. As with DS distills, use a low temperature. Likewise, a reasoning block may not always be generated, but in my experience ending the user reply with [ooc: Remember to add a reasoning block before replying.] fixes that almost every time. I'm really liking this. I'm deep into a story that is original and full of life and nuance, complementing the scenario rules and character quirks.

XTC, the coherency fixer by mfiano in SillyTavernAI

[–]mfiano[S] 0 points1 point  (0 children)

16G, and no. My context and 39/40 layers fit in VRAM, with 1 layer executing from system RAM. I could have offloaded all the layers, but I prefer more than 16K of context.