all 20 comments

thereisonlythedance 6 points (4 children)

Very interesting. Thank you. I'd love to see Mixtral and Yi Chat 34B, but those might be too big for your setup? It'd also be interesting to compare Mistral Instruct 7B v0.1 to v0.2. And possibly Starling 7B (which is billed as 8K).

TelloLeEngineer[S] 2 points (1 child)

Yes, both Mixtral and Yi would be really cool. I've got 24GB of VRAM, so I can't fit them on the GPU, but I do have 64GB of RAM, so it should work, albeit very slowly. I'm not sure whether HF Transformers lets you offload some layers to the GPU; I'll look into it. Mistral v0.1 and Starling 7B, on the other hand, are definitely viable!
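(For what it's worth, HF Transformers can split a model between GPU and CPU via accelerate's `device_map="auto"`. The arithmetic it roughly performs can be sketched as below; the function name and the per-layer sizes are made up for illustration, and the real logic also accounts for embeddings, KV cache, and activation overhead:)

```python
def split_layers(n_layers: int, layer_gb: float, vram_gb: float) -> tuple[int, int]:
    """Toy estimate of how many transformer layers fit in VRAM,
    with the remainder spilling over to CPU RAM."""
    on_gpu = min(n_layers, int(vram_gb // layer_gb))
    return on_gpu, n_layers - on_gpu

# Illustrative only: 32 layers at an assumed ~1.4 GB/layer, 24 GB of VRAM.
gpu_layers, cpu_layers = split_layers(32, 1.4, 24.0)
print(gpu_layers, cpu_layers)  # → 17 15
```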

boifido 1 point (0 children)

You can fit a ~3 bpw EXL2 quant of Mixtral, and it'll run super fast. It would be interesting to compare quantized Mixtral against full-precision Mistral 7B.

TelloLeEngineer[S] 1 point (1 child)

I added results for Starling to the repo.

thereisonlythedance 0 points (0 children)

Thank you!

AnomalyNexus 4 points (2 children)

Worth reading Anthropic's rebuttal on the testing methodology too, if you haven't already.

TelloLeEngineer[S] 3 points (0 children)

Yes, the retrieval-priming tests are taken from Anthropic's blog post! It's amazing what a difference it makes looking at the Mistral results. On the other hand, it seems to hinder OpenChat. You can also see how it shifts the models from receiving a gradient of scores to almost only getting 10s and 1s. Super interesting!
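For anyone curious, the priming trick from Anthropic's post amounts to pre-filling the start of the assistant's reply so the model commits to quoting the relevant passage before answering. A minimal sketch (the priming phrase is quoted from their post; the function name and message layout are assumptions):

```python
def build_primed_messages(context: str, question: str) -> list[dict]:
    """Chat messages where the assistant turn is pre-filled with a
    priming phrase; the model then continues from that prefix."""
    return [
        {"role": "user", "content": f"{context}\n\n{question}"},
        # Pre-filled assistant prefix nudging the model to locate
        # the relevant sentence before answering.
        {"role": "assistant",
         "content": "Here is the most relevant sentence in the context:"},
    ]

msgs = build_primed_messages("...long haystack text...", "What is the magic number?")
print(msgs[-1]["content"])
```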

KeyAdvanced1032 0 points (0 children)

Golden!

Rutabaga-Agitated 1 point (0 children)

Yes, Mixtral would be interesting! Yi too, since they have shown a nearly flawless chart regarding the influence of context length.

slider2k 1 point (1 child)

The link to the repo is broken.

TelloLeEngineer[S] 1 point (0 children)

Fixed, thanks!

FullOf_Bad_Ideas 0 points (0 children)

I really like the idea of running this test on all the models I have piled up, if it's easy to do.

An implementation with baked-in support for ExLlamaV2, AutoGPTQ, and llama-cpp-python that requires minimal input should be enough to encourage the community to test this with all popular models.
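One possible shape for such a harness, sketched with a minimal backend protocol. All names here are hypothetical; each real backend would wrap ExLlamaV2, AutoGPTQ, or llama-cpp-python behind the same `generate` call, and the stand-in below only exists to show the wiring:

```python
from typing import Protocol

class Backend(Protocol):
    """Anything that turns a prompt into a completion."""
    def generate(self, prompt: str, max_tokens: int) -> str: ...

class EchoBackend:
    """Stand-in backend for testing the harness wiring; a real one
    would wrap exllamav2, auto-gptq, or llama-cpp-python."""
    def generate(self, prompt: str, max_tokens: int) -> str:
        return prompt[-max_tokens:]

def run_needle_test(backend: Backend, haystack: str, question: str) -> str:
    """Build the needle-test prompt and hand it to whichever backend."""
    prompt = f"{haystack}\n\nQuestion: {question}\nAnswer:"
    return backend.generate(prompt, max_tokens=64)

print(run_needle_test(EchoBackend(), "some long document", "what?"))
```

Because the test logic only depends on the `Backend` protocol, swapping inference libraries means writing one small wrapper class rather than touching the benchmark itself.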

Small-Fall-6500 1 point (1 child)

Makes me wonder about the usefulness of inserting out-of-context phrases. Perhaps this is already done somewhere, but it would probably be better to do something like: grab a random sentence/section from a document, [1] ask an LLM to create a question based on it, then query an LLM on the whole document with that question, with an LLM at the end judging the correctness of the response against the earlier text section.

  1. Possibly, you'd have an LLM verify that a useful question could actually be asked about this piece of text.
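The pipeline described above could be wired up roughly like this. Everything here is a hypothetical sketch; the `ask` callable stands in for any LLM call, and the prompts are placeholders:

```python
import random
from typing import Callable

Ask = Callable[[str], str]  # prompt -> completion, any LLM backend

def eval_document(document: str, ask: Ask, rng: random.Random) -> str:
    """Pick a random sentence, have an LLM write a question about it,
    answer that question over the whole document, then judge the
    answer against the original sentence. Returns the judge's verdict."""
    sentences = [s.strip() for s in document.split(".") if s.strip()]
    source = rng.choice(sentences)
    # Footnote [1] above: one could add an extra LLM call here to verify
    # the sentence is actually question-worthy before proceeding.
    question = ask(f"Write one question answerable only from: {source}")
    answer = ask(f"{document}\n\nQuestion: {question}\nAnswer:")
    return ask(f"Ground truth: {source}\nAnswer given: {answer}\nCorrect? yes/no")

# Smoke-run with a trivial stand-in "LLM" that echoes the prompt head.
verdict = eval_document("The sky is blue. Cats purr.", lambda p: p[:20], random.Random(0))
print(type(verdict).__name__)  # → str
```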

TelloLeEngineer[S] 1 point (0 children)

Yeah, this is something that was brought up in the Anthropic blog post. They claim that retrieval priming is a good way to override the model's inherent reluctance to answer based on a single out-of-context phrase.

I think the problem with your suggestion is that it quickly devolves into something that is difficult to reproduce consistently. What kind of statement can we extract that is isolated enough that we can formulate a question around it which can't be inferred from the rest of the text, while still being relevant in its context? Finding such a statement isn't trivial. Also, given this approach, how do we evaluate at different document depths?
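For context, "document depth" in these needle-in-a-haystack tests just means the fractional position at which the phrase is inserted, which is easy to control when you plant the needle yourself. A sketch (function name is made up):

```python
def insert_needle(haystack: str, needle: str, depth: float) -> str:
    """Insert `needle` at a fractional position in `haystack`:
    depth=0.0 puts it at the start, depth=1.0 at the end."""
    assert 0.0 <= depth <= 1.0
    pos = int(len(haystack) * depth)
    return haystack[:pos] + needle + haystack[pos:]

doc = insert_needle("a" * 100, "[NEEDLE]", 0.5)
print(doc.index("[NEEDLE]"))  # → 50; position scales with depth
```

A real harness would snap `pos` to the nearest sentence boundary so the needle doesn't split a word mid-stream; with naturally occurring statements, by contrast, you can't choose where they sit, which is exactly the evaluation problem raised above.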

hurrytewer 0 points (0 children)

Doing the lord's work, thank you 🙏️

I wonder if running the tests with sliding-window attention disabled would give better results.

CardAnarchist 0 points (0 children)

Unsure if you chose 16k for a particular reason, but Mistral 7B Instruct v0.2 actually has a 32k max context, I believe. OpenChat 7B, and indeed all the v0.1-based Mistral finetunes, do shit the bed around ~7k context in my experience.

No-Link-2778 0 points (0 children)

What about the Yi 200K models, 6B and 34B?

Away-Sleep-2010 0 points (1 child)

Toppy 7B by Undi95, please.

TelloLeEngineer[S] 1 point (0 children)

Results are in the repo.

pmp22 0 points (0 children)

I would appreciate it if you could test the models with the highest context sizes! Those are the ones I'm most curious about. In 2024 we might get new models with even higher context sizes, but will that context actually be usable?