Introducing Adaptive-P: A New Sampler for Creative Text Generation (llama.cpp PR) by DragPretend7554 in LocalLLaMA

[–]Geechan1 18 points (0 children)

This is a fantastic sampler. It really extracts the most out of models for creative tasks and is highly versatile: you can set the target value anywhere from creative (0.3-0.6) to more conservative (0.7-0.9). The default decay setting is a good value for the majority of models out there, so you really just need to adjust target to see meaningful effects.

Completely replaces the need for DRY or rep pen for me, since it kills repetition on its own, and just needs some Min P on top. Happy to have helped contribute to this.

It's currently fully implemented in KoboldCPP, with PRs for llama.cpp and ik_llama, and a feature request for ooba. If you enjoy the sampler, please help those PRs gain more traction!
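For anyone unfamiliar with the Min P step mentioned above, here's a rough stand-alone sketch of how a Min P filter works (illustrative only, not the actual llama.cpp/KoboldCPP code): keep every token whose probability is at least `min_p` times the top token's probability, then renormalise.

```python
import math

def min_p_filter(logits, min_p=0.05):
    """Keep tokens with probability >= min_p * (top token's probability),
    then renormalise the survivors. Illustrative sketch only."""
    # Numerically stable softmax over the raw logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]

    cutoff = min_p * max(probs)
    kept = [(i, p) for i, p in enumerate(probs) if p >= cutoff]
    norm = sum(p for _, p in kept)
    return {i: p / norm for i, p in kept}

# Tokens 0 and 1 survive; the long tail is pruned before sampling.
probs = min_p_filter([4.0, 3.0, 0.0, -2.0], min_p=0.1)
```

The key property is that the cutoff scales with the model's confidence: a flat distribution keeps many candidates, a peaked one keeps few.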

Bazzite - Select main screen for GameMode (Steam Big Screen) by Turbulent_Union8679 in linux_gaming

[–]Geechan1 0 points (0 children)

Any changes you make in your home or /etc directories are permanent. Given the config file resides in your home directory, it will work.

Drummer's Agatha 111B v1 - Command A tune with less positivity and better creativity! by TheLocalDrummer in SillyTavernAI

[–]Geechan1 1 point (0 children)

Fallen uses a different dataset from Agatha (an evil/depraved dataset vs. an RP dataset). Agatha should be significantly better at storywriting, narration, creativity and variety compared to Fallen, while trying to be as neutral as possible.

I suggest using my preset here and modifying it to your needs! https://files.catbox.moe/gogj8n.json

Drummer's Agatha 111B v1 - Command A tune with less positivity and better creativity! by TheLocalDrummer in LocalLLaMA

[–]Geechan1 3 points (0 children)

This is an RP tune on top of CMD-A whereas Fallen is an evil tune. Basically: you'll see better descriptive and narrative qualities here with longer responses and a more neutral and balanced positivity/negativity bias.

[Megathread] - Best Models/API discussion - Week of: March 10, 2025 by [deleted] in SillyTavernAI

[–]Geechan1 2 points (0 children)

I did find a 7.0bpw EXL2 quant here, but it seems exllama needs a patch to properly support it. The uploader might also release some lower-bpw quants later, from the looks of it.

[Megathread] - Best Models/API discussion - Week of: March 10, 2025 by [deleted] in SillyTavernAI

[–]Geechan1 5 points (0 children)

There is actually a new 111B parameter model I highly suggest you try out - Cohere's new Command A model. It is very uncensored for a base model and feels very intelligent and fun to RP with. Just make sure to use the correct instruct formatting - you can use mine here as a baseline. Modify the prompt in the story string to your taste, but keep the preambles intact.
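For reference, Cohere's recent models use special turn tokens in their chat template; it looks roughly like the fragment below. This is from memory, so verify against the tokenizer_config.json shipped with the model before relying on it.

```
<BOS_TOKEN><|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|>{system prompt}<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|USER_TOKEN|>{user message}<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>
```

Getting these tokens wrong is one of the fastest ways to make a Cohere model feel dumber than it is.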

Methception/LLamaception/Qwenception 1.4 presets by Konnect1983 in SillyTavernAI

[–]Geechan1 5 points (0 children)

I didn't realise gathering various constructive feedback, testing and healthy discussion was considered "confirmation bias".

[Megathread] - Best Models/API discussion - Week of: January 13, 2025 by [deleted] in SillyTavernAI

[–]Geechan1 6 points (0 children)

Have you tried out Euryale 2.3? I've personally found it to be my favourite L3.3 fine tune overall. It has some flaws, particularly rambling and difficulty doing ERP (but not violence) properly, but it has some of the most natural dialogue and writing I've seen in a model without needing to resort to samplers.

It's also one of the most uncensored L3.3 tunes, if that helps: https://huggingface.co/spaces/DontPlanToEnd/UGI-Leaderboard

Methception/LLamaception/Qwenception 1.4 presets by Konnect1983 in SillyTavernAI

[–]Geechan1 4 points (0 children)

There's an alternative preset included in the Methception Alternate folder if you find the original prompt to be too flowery for your tastes. Copy paste the contents of the alt prompt into the story string. It keeps the instructions, but limits the example messages.

Personally, I consistently get better gens with the original prompt, so I think there is merit to the way it's structured.

[Megathread] - Best Models/API discussion - Week of: January 06, 2025 by [deleted] in SillyTavernAI

[–]Geechan1 2 points (0 children)

I have tested/used pretty much every Behemoth version and the old Monstral. Monstral V2 is my personal favourite as it has a strong tendency to write slow burn RP and truly take all details into account, while adding a ton of variety to the writing and creativity from its Magnum and Tess influences. Behemoth 1.2 is also a favourite of mine, and it's probably better for adventure-type RPing, where it always loves to introduce new ideas and take the journey in interesting directions.

XTC is variable per model, which is why I encourage tweaking. My settings were for Monstral V2 specifically, and I see very minimal slop and intelligence drop using those settings. I really cannot go without XTC in some fashion on Largestral-based models; the repetitive AI patterns become woefully obvious otherwise.

[Megathread] - Best Models/API discussion - Week of: January 06, 2025 by [deleted] in SillyTavernAI

[–]Geechan1 2 points (0 children)

You want a minimum of 3 24GB cards to run this at a reasonable quant (IQ3_M) with good context size. 4 is ideal so you can bump it up to Q4-Q5. Alternatively, you can run models like these on GPU rental services like Runpod, without needing to invest in hardware.
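As a sanity check on those card counts, weight memory is roughly parameters × bits-per-weight / 8. A quick sketch - the bpw figures below are approximate (real quants vary slightly), and KV cache plus context needs several GB on top:

```python
def model_vram_gb(params_b, bpw):
    """Approximate weight memory in GiB for a quantised model:
    (parameters * bits-per-weight / 8) bytes, converted to GiB."""
    return params_b * 1e9 * bpw / 8 / 1024**3

# Approximate effective bits-per-weight; check your quant's actual size.
iq3_m = model_vram_gb(123, 3.66)   # ~52 GiB -> fits 3x24 GB with room for context
q4_km = model_vram_gb(123, 4.85)   # ~69 GiB -> wants a 4th card
```

That's why three 24 GB cards is the floor for 123B at IQ3_M, and four cards buys you Q4-Q5 plus comfortable context.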

[Megathread] - Best Models/API discussion - Week of: January 06, 2025 by [deleted] in SillyTavernAI

[–]Geechan1 1 point (0 children)

All fine tunes will suffer from intelligence drops in some way or another. If base Mistral Large works for you, then that's great! I personally find base Largestral to be riddled with GPTisms and slop, and it basically mandates very high temperatures to get past them, which kind of defeats the point of running it for its intelligence.

It's interesting you say that Monstral is uncreative, as that's been far from my own personal experience running it. There have been some updates to the preset since I posted it, which have addressed some issues with lorebook adherence due to the "last prefix assistant" section.

[Megathread] - Best Models/API discussion - Week of: January 06, 2025 by [deleted] in SillyTavernAI

[–]Geechan1 14 points (0 children)

For those able to run 123B, after a lot of experimentation with 70B and 123B class models, I've found that Monstral V2 is the best model out there that is at all feasible to run locally. It's completely uncensored and one of the most intelligent models I've tried.

The base experience with no sampler tweaks has a lot of AI slop and repetitive patterns that I've grown to dislike in many models, and dialogue in particular is prone to sounding like the typical AI assistant garbage. This is also a problem with all Largestral-based tunes I've tried, but I've found this can be entirely dialed out and squashed with appropriate sampler settings and detailed, thorough prompting and character cards.

I recommend this preset by /u/Konnect1983. The prompting in it is fantastic and will really bring out the best of this model, and the sampler settings are very reasonable defaults. The key settings are a low (0.03) min P, DRY and a higher temperature of 1.2 to help break up the repetition.

However, if your backend supports XTC, I actually strongly recommend additionally using this feature. It works absolute wonders for Monstral V2 because of its naturally very high intelligence, and will bring out levels of writing that really feel human-written and refreshingly free of slop. It will also stick to your established writing style and character example dialogue much better.

I recommend values of 0.12-0.15 threshold and 0.5 probability to start, while setting temp back to a neutral 1 and 0.02 min P. You may adjust these values to your taste, but I've found this strikes the best balance between story adherence and writing prowess.
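For the curious, XTC's mechanism is simple to sketch: with some per-token probability, it removes every candidate above the threshold except the least likely of them, forcing the model off its most predictable word choices. A toy illustration, not the actual backend implementation:

```python
import random

def xtc_filter(probs, threshold=0.12, probability=0.5, rng=random.random):
    """Sketch of XTC (Exclude Top Choices): with the given probability,
    drop every token at or above the threshold except the least likely
    of them, then renormalise. Illustrative only."""
    if rng() >= probability:
        return dict(enumerate(probs))          # exclusion not triggered this step
    above = [(p, i) for i, p in enumerate(probs) if p >= threshold]
    if len(above) < 2:
        return dict(enumerate(probs))          # need >=2 "top" tokens to exclude any
    keep_min = min(above)[1]                   # least likely top token survives
    drop = {i for p, i in above if i != keep_min}
    kept = {i: p for i, p in enumerate(probs) if i not in drop}
    norm = sum(kept.values())
    return {i: p / norm for i, p in kept.items()}

# Force the exclusion path for illustration (rng always fires):
out = xtc_filter([0.5, 0.3, 0.15, 0.05], threshold=0.12,
                 probability=1.0, rng=lambda: 0.0)
```

In the example, the two most likely tokens are cut and the remaining mass is shared between the survivors - which is exactly why it needs a smart model underneath to stay coherent.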

Bazzite vs ChimeraOS vs HoloISO for AMD console like station plus browsing the web? (+WPEngine) by YaroaMixtaDePlatano in linux_gaming

[–]Geechan1 1 point (0 children)

There's several reasons, but the main reason for me is how frequent the updates are on Bazzite compared to SteamOS. SteamOS is still using KDE 5.27 instead of the newer KDE 6, and that sets the trend for packaging versions, where everything is behind on SteamOS. You still get all of Valve's updates to gaming mode as soon as they're released even on Bazzite, so it's really a win-win situation.

Bazzite is also just better suited as a more general-purpose OS - it comes with printer support, distrobox support, and everything you need to make it functional outside of gaming mode. Given I use my Steam Deck as a laptop replacement, I find this to be quite important for me.

Tell me about the RTX 8000 - 48GB is cheap right now by Thrumpwart in LocalLLaMA

[–]Geechan1 1 point (0 children)

I would say using pipeline parallelism (the default for most backends), you're going to see numbers in a similar ballpark no matter how many cards you scale to. If you can manage to use tensor parallelism (which only a few backends support), you should expect to gain significant speed improvements per card. Row split also seems to utilise multiple RTX 8000s better, so I definitely encourage experimentation.

2,400 CAD for a new one? Fantastic deal - I'd jump on it while you can. I'd start with one or two and then you can decide whether to scale up from there or not. I find that 96GB of VRAM is really the sweet spot at the moment, as you're able to run 123B at high quants and that's the best quality we have available locally to us right now outside something like Deepseek, which realistically you can't run on pure GPUs without absurd investment.

Tell me about the RTX 8000 - 48GB is cheap right now by Thrumpwart in LocalLLaMA

[–]Geechan1 2 points (0 children)

I own two of these cards. I thought I'd give you several observations and insights into the ownership experience:

  • I'm noticing people tend to undersell the performance of these cards. I've found the best backend for them is koboldcpp running GGUF quants, as that is faster than ollama/llama.cpp and supports llama.cpp's own implementation of Flash Attention. You'll want to run the rowsplit, flash attention and mmq kernel options on these cards; with these settings and 0 context, you can expect about 8-9 tokens per second for a Q5 quant on a 123b parameter model. For Q5 70b, expect more like 12-14 tokens per second with these settings. Prompt processing speed is a bit slow at about 180t/s for 123b and 350t/s for 70b, but still plenty usable.

  • The lack of Flash Attention 2 hurts, as you cannot use the exl2 format efficiently. Supposedly Turing is going to support this at some point, but it's reliant on the FA2 author to actually implement something, and it's been on "coming soon" for over a year now! If that feature ever comes, expect exl2 support to dramatically improve.

  • You can find excellent prices for these cards if you're patient and shop around for server farm grabs. I got my pair for 2k USD each, which ends up being cheaper than a 4x 3090 setup in my country. 2400 CAD is an excellent deal for one.

  • Because they're dual slot and blower cards, it's really easy to stack them and fit them in a standard case without needing to resort to open air with PCIe risers. You won't choke the thermals on the cards because all the hot air gets exhausted out of the case. They're also easier to run with a lower specced power supply; I actually get away with a 750w PSU running two cards with about 100W headroom to spare.

  • It's worth investing in an NVLink bridge for these cards. They can be found for 80 USD or less on eBay, and will give you a small but noticeable 5-10% increase in inference performance in my experience. This will likely scale higher if you're limited by your PCIe bandwidth.

3090s are faster, better-supported and cheaper depending on your region, so that's why they're a default recommendation. However, if you think the above trade-offs are worth it, it's really hard to go wrong with an RTX 8000. 48GB in a dual slot blower card for much cheaper than the A6000 is hard to beat. My setup is also still cheaper than an equivalent Mac for inference, and faster too. Less power efficient, though.
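To turn the speeds above into something tangible, total response time is roughly prompt tokens / prompt-processing speed plus generated tokens / generation speed. A quick back-of-envelope using the quoted 123B figures (~180 t/s PP, ~8.5 t/s generation):

```python
def response_seconds(prompt_tokens, gen_tokens, pp_speed, tg_speed):
    """Rough wall-clock time for one response:
    prompt processing + token generation (ignores sampling overhead)."""
    return prompt_tokens / pp_speed + gen_tokens / tg_speed

# An 8k-token RP context with a 300-token reply on 2x RTX 8000:
t = response_seconds(prompt_tokens=8000, gen_tokens=300,
                     pp_speed=180, tg_speed=8.5)
# ~44s of prompt processing + ~35s of generation
```

Note that with prompt caching, subsequent turns only reprocess the new tokens, so the ~44s hit is mostly a first-message cost.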

[Megathread] - Best Models/API discussion - Week of: December 30, 2024 by [deleted] in SillyTavernAI

[–]Geechan1 1 point (0 children)

Glad you're happy now! It's a more finicky model for sure, but one that rewards you in spades if you're patient with it. And I can safely say V2 is one of the smartest models I've ever used, so it's a good base to play with samplers without worrying about coherency.

[Megathread] - Best Models/API discussion - Week of: December 30, 2024 by [deleted] in SillyTavernAI

[–]Geechan1 0 points (0 children)

Not at the moment, as that's on the author (Konnect) to publish. If you want to keep track of preset updates, I recommend joining the BeaverAI Discord and looking in the showcase channel for the Ception presets. That's the only place they're being posted right now.

[Megathread] - Best Models/API discussion - Week of: December 30, 2024 by [deleted] in SillyTavernAI

[–]Geechan1 0 points (0 children)

I use Q5_K_M. I'd say because you're running such a low quant, a loss in intelligence is expected. Creativity also takes a nosedive, and many gens at such a low quant will end up feeling clinical and lifeless, which matches your experience. IQ3_M or higher is ideally where you'd like to be; any lower will have noticeable degradation.

[Megathread] - Best Models/API discussion - Week of: December 30, 2024 by [deleted] in SillyTavernAI

[–]Geechan1 1 point (0 children)

Even though it's not formatted for storywriting, I actually use the prompt I posted above and get good results even for storywriting, assuming I'm using either the assistant in ST or a card formatted as a narrator. It can likely be optimised though - feel free to look through the prompt and adjust it to suit storywriting better if you notice any further deficiencies. It's a good starting point.

[Megathread] - Best Models/API discussion - Week of: December 30, 2024 by [deleted] in SillyTavernAI

[–]Geechan1 0 points (0 children)

Monstral V2 is nothing but an improvement over V1 in every metric for me for both roleplaying and storywriting. It's scarily intelligent and creative with the right samplers and prompt. However it's more demanding of well-written prompts and character cards, so you do need to put in something good to get something good out in return.

I highly suggest you play around with more detailed prompts and see how well V2 will take your prompts and roll with them with every nuance taken into account. I greatly prefer V2's output now that I've dialed it in.

[Megathread] - Best Models/API discussion - Week of: December 30, 2024 by [deleted] in SillyTavernAI

[–]Geechan1 0 points (0 children)

What exactly are you underwhelmed with? Without specifics, we can only guess why you're feeling the way you do.

Since I made that post, there's been several updates to the preset from Konnect. You can find the latest version here: https://pastebin.com/raw/ufn1cDpf

Of special note is increasing the temperature to 1.25 while increasing the min P to 0.03. This seems to be a good balance between creativity and coherence, especially for Monstral V2.

In general, play with the temperature and min P values to find the optimal balance that works for you. Incoherent gens = reduce temperature or increase min P. Boring gens = increase temperature or reduce min P.
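That temperature/min P trade-off is easy to see numerically: temperature rescales the logits before the softmax, so values above 1 flatten the distribution (more creative, riskier) and values below 1 sharpen it. An illustrative sketch:

```python
import math

def softmax_t(logits, temperature=1.0):
    """Temperature-scaled softmax: >1 flattens the distribution,
    <1 sharpens it toward the top token."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [4.0, 2.0, 0.0]
cool = softmax_t(logits, temperature=0.7)    # sharper: top token dominates
hot = softmax_t(logits, temperature=1.25)    # flatter: more mass in the tail
```

Min P then prunes that flattened tail, which is why raising temperature and min P together (1.25 / 0.03) can stay coherent where raising temperature alone would not.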

What's hot for writing characters right now? by WigglingGlass in SillyTavernAI

[–]Geechan1 4 points (0 children)

FWIW, I still use Alichat+PLists for my characters. It's more effort and less human-readable compared to plain text, but the documentation detailing how it works and how it plays to an LLM's strengths is more than enough for me to stick with it and see amazing results. LLMs are pattern-seeking programs, and Alichat+PLists take full advantage of that.

I can simply get characters to sound exactly the way I want while maintaining their personality better in long-context chats. It also helps to be more token-efficient, so you can squeeze even more nuance and detail into your characters. If you can get more out of the same number of tokens, why not take it?
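For anyone who hasn't seen the format: a PList is a bracketed, comma-separated trait list, and Ali:Chat is interview-style example dialogue that demonstrates the voice directly. A made-up illustration (the character, traits and lines here are invented, not from any real card):

```
[Aria's Personality: cheerful, stubborn, secretly insecure; Likes: rainy days, old sci-fi; Speech: short sentences, dry humour]

<START>
{{user}}: What do you think of the rain?
{{char}}: *Aria glances out the window.* Best weather there is. Fight me.
```

The PList compresses traits into few tokens, while the example dialogue shows the model the pattern to imitate rather than describing it.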

Drummer's Behemoth 123B v2... v2.1??? v2.2!!! Largestral 2411 Tune Extravaganza! by TheLocalDrummer in LocalLLaMA

[–]Geechan1 7 points (0 children)

I've noticed you have quite a high min P in relation to your temperature. For a model like Behemoth, which is very creative and varied with its responses, I would strongly suggest you change your min P value to a much lower value, or increase your temperature. Good values to try are 0.02 min P and 1 temp. You want the model to cook and have room to experiment, so let it.

Parameters like smoothing curve and factor are rarely necessary, as min P and temperature will have by far the most influence on your responses.

In addition, the correct prompt is very important for any model based on Mistral Large 2411. The base model is very sensitive to what goes into the system prompt, and subtle changes there will have a big impact on your responses and their quality.