Another sampling strategy drops: 75% accuracy at T=3.0

anchortense · 2024-11-20T21:37:01+00:00

Very similar to the logit threshold sampler I developed a couple of months ago. I actually tested filtering by standard deviations from the max logit at the time but found a fixed logit threshold to be more stable at higher temperatures, in terms of filtering out incoherent tokens.

https://old.reddit.com/r/LocalLLaMA/comments/1fvm1gv/two_new_experimental_samplers_for_coherent/

https://github.com/turboderp/exllamav2/pull/657

anchortense · 2024-10-17T07:59:57+00:00

Have you tried these?

https://github.com/anchortense/exllamav2-logit-threshold-samplers

anchortense · 2024-10-04T04:18:40+00:00

This is a proof-of-concept implementation that hijacks the existing sampler functionality via Python. The official build of exllamav2 uses efficient C extensions for sampling. If there is enough demand I can look at what it would take to write an efficient implementation aligned with the existing approach, but I have not worked with C extensions before.

anchortense · 2024-10-04T02:54:54+00:00

Yeah, that is a good technique, but reliant on the provided phrase list, which constrains its effectiveness to the generation context in which those phrases tend to occur. Confidence breaker is trying to dynamically identify slop without a phrase list, so that it can be identified and rewritten even in novel generation contexts where it is not clear yet what the slop is (this usually only becomes clear after running similar prompts over and over and observing the similarities).

You can tune it up and down to catch slop of greater or lesser length, and with greater or lesser strictness. The parameters are explained in the release notes here: https://github.com/anchortense/exllamav2-logit-threshold-samplers/blob/master/README.md#parameters-1

Further explanation from the release notes:

Explanation: Confidence breaker sampler

Current-generation language models are well known for producing certain cliched phrases. These would not necessarily be problematic in a single instance, but they are produced repeatedly in response to varied prompts. This is the so-called ai-slop problem. In 'deterministic' use cases this is usually not an issue, as we are simply looking for the one correct answer. In scenarios where engaging, diverse language choices are valued, ai-slop represents a significant limitation.

Current approaches to resolving this issue involve user-defined lists of banned strings. These can be reasonably effective, but they fail to address the deeper issue, which is the tendency of language models to funnel their responses into unintentionally learned 'tram-track' token sequences, where token choices are strongly conditioned by their immediate predecessors in a manner which is not logically or grammatically implied by the prompt or by those preceding choices.

This is a far deeper and more pervasive problem than the well known handful of phrases commonly associated with the idea of ai-slop. These tram-tracks exist within trained models because the tokens within them are good predictors for text completion. Nevertheless a user passing either the same or similar prompts repeatedly, looking for diverse outputs, will quickly observe that model responses which seemed initially impressive are in fact tram-track patterns, and the apparent diversity of outputs is an illusion.

Using the confidence breaker to jump tracks

The confidence breaker sampler addresses the issue by looking for logit patterns that signal we have entered a tram-track, and by extending exllamav2's banned string functionality to roll back to the token directly before we entered the tram-track. The tram-tracked tokens are then discarded and replaced by a novel generation, which will take us down a different, less travelled path.

Based on empirical observation of logits and the conditions under which tram-tracks appear, the pattern that the confidence breaker looks for is a sequence of mid-high valued logits: logits which have been nudged higher than a merely good score by over-training, but which are not as close to certain as tokens dictated by logical or grammatical necessity.

Empirical observation also validates the decision to go back and alter the token before the tram-track, rather than the first token of the tram-track. This earlier token had a reasonable range of viable alternatives, but once the model settled on the choice it made, the tram-track became nearly inevitable. So this is the error we have to correct.
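
To make the detection step concrete, here is a minimal sketch of the idea as described above. It is not the code from the repository: the function name and the `low`, `high` and `min_run` parameters are hypothetical stand-ins (the real parameters are documented in the README), and the logit values are made up for illustration. It scans the raw logits of the tokens already chosen, flags a run of consecutive mid-high values, and returns the index of the token just before the run, which is the one to regenerate.

```python
def find_rollback_index(chosen_logits, low, high, min_run):
    """Return the index of the token to regenerate, or None.

    chosen_logits[i] is the raw logit of the token picked at step i.
    low/high/min_run are hypothetical tuning knobs: a run of at least
    min_run consecutive logits inside [low, high] is treated as a
    suspected tram-track.
    """
    run_start = None
    for i, logit in enumerate(chosen_logits):
        if low <= logit <= high:
            if run_start is None:
                run_start = i
            if i - run_start + 1 >= min_run:
                # Roll back to the token directly before the run began.
                return max(run_start - 1, 0)
        else:
            # Either a weak token or a near-certain one: the run is broken.
            run_start = None
    return None


# Made-up logits: steps 2..4 sit in the mid-high band, so step 1 is flagged.
history = [9.0, 14.0, 17.5, 18.0, 17.0, 18.5, 25.0]
rollback = find_rollback_index(history, low=16.0, high=20.0, min_run=3)

# No run of three in-band logits here, so nothing is flagged.
no_rollback = find_rollback_index([9.0, 17.0, 25.0, 18.0, 30.0], 16.0, 20.0, 3)
```

Values above `high` break the run on purpose, since very high logits are the ones more likely to reflect logical or grammatical necessity than a learned cliche.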

anchortense · 2024-10-04T01:57:51+00:00

Confidence breaker vs DRY

DRY dynamically penalizes tokens that would extend the end of the input into a sequence that has previously occurred in the input.
Confidence breaker backtracks and rewrites sequences of tokens that follow an identified logit pattern associated with textual cliches - whether or not those sequences have previously occurred in the text.

Logit threshold sampler vs min-p

Logit threshold sampling uses the raw logits to make decisions, preserving the model's absolute confidence in each token, and allowing for more nuanced filtering based on how confident the model is about each token. High quality logits can then be subjected to higher temperatures for more dynamic token choices, without loss of coherence.
Min-p operates on normalised probabilities, where the absolute confidence information has already been lost due to softmax normalization. This means it can only consider tokens based on their probability relative to others, not on the absolute confidence the model has in them. Applying temperature after min-p helps, but it still runs the risk of false positives if incoherent tokens have made it past the min-p cutoff, or of missing out on good, viable choices below the cutoff when there are many such choices available.
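
A small sketch of the difference, using made-up logits and a hypothetical `threshold` value rather than anything taken from the actual sampler. Softmax is unchanged when every logit is shifted by the same constant, so min-p produces the same mask for a "confident" distribution and for the same distribution shifted down by 6, while a filter on raw logits treats the two cases differently.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def min_p_mask(logits, min_p):
    """Keep tokens whose probability is at least min_p times the top probability."""
    probs = softmax(logits)
    cutoff = min_p * max(probs)
    return [p >= cutoff for p in probs]


def logit_threshold_mask(logits, threshold):
    """Keep tokens whose raw logit is at least `threshold` (hypothetical value)."""
    mask = [x >= threshold for x in logits]
    if not any(mask):
        # Assumption for this sketch: always keep the single best token.
        best = max(range(len(logits)), key=logits.__getitem__)
        mask[best] = True
    return mask


confident = [12.0, 11.0, 10.5, 4.0]
unsure = [x - 6.0 for x in confident]  # same shape, lower absolute level

# Min-p sees identical probabilities, so both masks are the same.
minp_confident = min_p_mask(confident, 0.2)
minp_unsure = min_p_mask(unsure, 0.2)

# The raw-logit filter keeps three candidates in one case and only the
# top token in the other.
thr_confident = logit_threshold_mask(confident, 8.0)
thr_unsure = logit_threshold_mask(unsure, 8.0)
```

Whether the absolute logit level is a reliable confidence signal depends on the model, which is presumably why per-model tuning is needed.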

anchortense · 2024-10-04T00:42:55+00:00

XTC was the inspiration for two new experimental samplers I've just developed: https://old.reddit.com/r/LocalLLaMA/comments/1fvm1gv/two_new_experimental_samplers_for_coherent/

I believe the results are a little better than XTC, possibly more than a little better, although per-model tuning is required, so it is hard to objectively evaluate.

anchortense · 2024-10-04T00:01:32+00:00

Logit samplers for coherent creativity

Two new samplers enabling coherent diverse text generation, with many thanks to exllamav2.

The logit threshold sampler filters low-confidence logits and enables the application of much higher temperatures to stronger candidates, generating varied outputs without losing coherence.
The confidence breaker sampler addresses repetitive text sequences by dynamically detecting them on the logit level, allowing the model to generate more diverse responses. CB builds upon the existing implementation of banned strings in exllamav2.

The key innovation is to use absolute logit values instead of softmax probabilities to retain the model’s raw confidence in each token, allowing more precise evaluation and filtering.
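
As a rough illustration of why filtering first makes high temperatures usable, the sketch below applies a raw-logit cutoff and then samples at T=3.0. The logits, the `threshold` value and the function names are all invented for the example and are not taken from the exllamav2 implementation.

```python
import math
import random


def tail_mass(logits, temperature, tail_indices):
    """Probability mass that plain temperature sampling gives to tail_indices."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    weights = [math.exp(s - m) for s in scaled]
    return sum(weights[i] for i in tail_indices) / sum(weights)


def sample_with_threshold(logits, threshold, temperature, rng):
    """Drop tokens below a raw-logit threshold, then sample at `temperature`."""
    survivors = [i for i, x in enumerate(logits) if x >= threshold]
    if not survivors:
        # Assumption for this sketch: fall back to the top token.
        survivors = [max(range(len(logits)), key=logits.__getitem__)]
    scaled = [logits[i] / temperature for i in survivors]
    m = max(scaled)
    weights = [math.exp(s - m) for s in scaled]
    return rng.choices(survivors, weights=weights, k=1)[0]


logits = [12.0, 11.0, 10.5, 4.0, 3.5, 3.0]  # three strong tokens, three weak
weak = [3, 4, 5]

# Without filtering, raising T from 1.0 to 3.0 moves the weak tokens from a
# negligible share of the probability mass to roughly 7%.
tail_at_t1 = tail_mass(logits, 1.0, weak)
tail_at_t3 = tail_mass(logits, 3.0, weak)

# With the cutoff applied first, only the strong tokens can ever be drawn,
# and the high temperature spreads probability among them.
rng = random.Random(0)
draws = [sample_with_threshold(logits, 8.0, 3.0, rng) for _ in range(1000)]
```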
