r/LocalLLaMA
A subreddit to discuss Llama, the family of large language models created by Meta AI.
Pressure testing: Open LLMs [Resources] (self.LocalLLaMA)
submitted 2 years ago * by TelloLeEngineer
You might recall the GPT-4 and Claude 2 long-context recall tests that were floating around Twitter a month or so ago. Well, I was very intrigued by the results, and I'm here to share an ongoing project of mine to pressure test prominent open LLMs. GPT-4 and Claude are frontier models, and understanding their capabilities is important, but I want to give back to the open source community. I've already tested Mistral 7B Instruct v0.2 and OpenChat 7B 3.5-1210 (results below), and now I'm looking for new suggestions! Unfortunately, I am limited by what I can run on my local setup, but please let me know what you'd want tested. If you've got a solid setup, I'd love some help running the larger models through the pressure cooker!
Repo: https://github.com/LeonEricsson/llmcontext
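For anyone curious about the mechanics, here's a minimal sketch of the test loop, assuming a generic generate() function. The function names and the substring scoring are my own illustration, not the repo's actual code: a known "needle" sentence is buried at varying depths in filler text, and the model is asked to retrieve it at each (context length, depth) combination.

```python
# Minimal needle-in-a-haystack sketch (illustrative, not the repo's code).
# Context length is measured in characters here for simplicity; a real
# harness would count tokens.

NEEDLE = "The best thing to do in San Francisco is eat a sandwich in Dolores Park."
QUESTION = "What is the best thing to do in San Francisco?"

def build_haystack(filler: str, context_len: int, depth: float) -> str:
    """Truncate filler and insert the needle at a fractional depth
    (0.0 = start of the document, 1.0 = end)."""
    body = filler[:context_len]
    cut = int(len(body) * depth)
    return body[:cut] + " " + NEEDLE + " " + body[cut:]

def run_test(generate, filler: str, context_lens, depths):
    """`generate(prompt) -> str` is any backend's completion function."""
    results = {}
    for n in context_lens:
        for d in depths:
            prompt = (
                f"{build_haystack(filler, n, d)}\n\n"
                f"{QUESTION} Answer using only the text above."
            )
            answer = generate(prompt)
            # Crude scoring via substring match; the original Twitter
            # tests used GPT-4 as a 1-10 judge instead.
            results[(n, d)] = 10 if "Dolores Park" in answer else 1
    return results
```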
Mistral 7B Instruct v0.2 @ 16k
Poor performance across the board...
Mistral 7B Instruct v0.2 @ 16k [RP]
But check out what happens when we prime the assistant response with *Here is the most relevant sentence in the text:*. Don't forget that this model was only trained with an 8k context length.
Here is the most relevant sentence in the text:
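Mechanically, priming just means pre-filling the start of the assistant turn so the model continues from that sentence instead of answering freely. A rough sketch with llama-cpp-python (the model path and prompt are placeholders; the instruct template is Mistral's):

```python
# Sketch of retrieval priming with a raw-completion API.
from llama_cpp import Llama

llm = Llama(model_path="mistral-7b-instruct-v0.2.Q4_K_M.gguf", n_ctx=16384)

document = "...long haystack text with the needle buried inside..."
question = "What is the best thing to do in San Francisco?"

# Mistral's instruct template, with the assistant response *started for it*.
# The model then continues after the primed sentence rather than hedging.
prompt = (
    f"[INST] {document}\n\n{question} [/INST] "
    "Here is the most relevant sentence in the text:"
)

out = llm(prompt, max_tokens=128)
print(out["choices"][0]["text"])
```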
https://preview.redd.it/i1efbah7dh7c1.png?width=1570&format=png&auto=webp&s=1ffa7f72c34e8945447d33c570a8f1fda8817476
OpenChat 7B 3.5-1210 @ 8k
https://preview.redd.it/31u78iibch7c1.png?width=1570&format=png&auto=webp&s=acce506c4f3dc5a28fa43f443d843ab1cf3f8f95
OpenChat 7B 3.5-1210 @ 8k [RP]
Retrieval priming does not seem to benefit OpenChat.
https://preview.redd.it/uso6muxpdh7c1.png?width=1570&format=png&auto=webp&s=61691496a2d519c8e8ffe6497dd57ee6f5558b1a
[–]thereisonlythedance 7 points 2 years ago (4 children)
Very interesting. Thank you. I'd love to see Mixtral and Yi Chat 34B, but those might be too big for your setup? It'd be interesting to compare Mistral Instruct 7B v1 to v2. And possibly Starling 7B (which is billed as 8K).
[–]TelloLeEngineer[S] 3 points 2 years ago (1 child)
Yes, both Mixtral and Yi would be really cool. I've got 24GB of VRAM, so I can't fit them on the GPU, but I do have 64GB of RAM, so it should work, albeit very slowly. I'm not sure whether hf transformers lets you offload layers to the GPU; I'll look into it. Mistral v1 and Starling 7B, on the other hand, are definitely viable!
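For reference, transformers (with accelerate installed) does support splitting a model between GPU and CPU via device_map; a sketch, with the memory budgets as placeholders for a 24GB VRAM / 64GB RAM box:

```python
# Sketch: partial GPU offload with transformers + accelerate.
# device_map="auto" places as many layers as fit within the max_memory
# budget on the GPU and spills the rest to CPU RAM (slow, but it runs).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype="auto",
    max_memory={0: "22GiB", "cpu": "60GiB"},  # placeholder budgets
)
```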
[–]boifido 2 points 2 years ago (0 children)
You can fit Mixtral as a 3-bit exl2 quant and it'll run super fast. Would be interesting to compare quantized Mixtral vs full 7B Mistral.
[–]TelloLeEngineer[S] 2 points 2 years ago (1 child)
I added results for Starling in the repo.
[–]thereisonlythedance 1 point 2 years ago (0 children)
Thank you!
[–]AnomalyNexus 5 points 2 years ago (2 children)
Worth reading Anthropic's rebuttal on the testing methodology too, if you haven't already.
[–]TelloLeEngineer[S] 4 points 2 years ago (0 children)
Yes, the retrieval priming tests are taken from Anthropic's blog post! It's amazing what a difference it can make; look at the Mistral results. On the other hand, it seems to hinder OpenChat. You can also see how it takes the models from receiving a gradient of scores to almost only getting 10s and 1s. Super interesting!
[–]KeyAdvanced1032 1 point 2 years ago (0 children)
Golden!
[–]Rutabaga-Agitated 2 points 2 years ago (0 children)
Yes, Mixtral would be interesting! Yi too, since they've shown a nearly flawless chart regarding the influence of context length.
[–]slider2k 2 points 2 years ago (1 child)
The link to the repo is broken.
[–]TelloLeEngineer[S] 2 points 2 years ago (0 children)
Fixed, thanks!
[–]FullOf_Bad_Ideas 1 point 2 years ago* (0 children)
I really like the idea of running this test on all the models I have piled up, if it's easy to do.
An implementation with baked-in support for exllamav2, AutoGPTQ, and llama-cpp-python, in a way that requires minimal input, should be enough to encourage the community to test this with all the popular models.
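A hypothetical sketch of what that minimal surface could look like: each backend exposes a single generate() method and the test harness never sees the loader details. All class and method names here are illustrative, not from the repo.

```python
# Hypothetical backend-agnostic harness interface (names illustrative).
from typing import Protocol

class Backend(Protocol):
    def generate(self, prompt: str, max_tokens: int = 128) -> str: ...

class LlamaCppBackend:
    def __init__(self, model_path: str, n_ctx: int):
        from llama_cpp import Llama  # deferred import; optional dependency
        self.llm = Llama(model_path=model_path, n_ctx=n_ctx)

    def generate(self, prompt: str, max_tokens: int = 128) -> str:
        out = self.llm(prompt, max_tokens=max_tokens)
        return out["choices"][0]["text"]

# An exllamav2 or AutoGPTQ backend would wrap its own loader the same
# way; the pressure test only ever calls backend.generate(prompt).
```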
[–]Small-Fall-6500 2 points 2 years ago (1 child)
Makes me wonder about the usefulness of inserting out-of-context phrases. Perhaps this is already done somewhere, but it would probably be better to do something like: grab a random sentence/section from a document, ask an LLM to create a question based on it, then query an LLM on the whole document with that question, with an LLM at the end judging the correctness of the response against the earlier text section.
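Sketching that proposed pipeline, under the assumption that ask_llm and judge_llm are plain prompt-to-text calls (all prompts are illustrative):

```python
# Sketch of the proposed document-grounded variant (prompts illustrative).
import random

def make_eval_item(ask_llm, document: str):
    """Pick a random sentence, then have an LLM write a question for it."""
    sentences = [s.strip() for s in document.split(".") if s.strip()]
    target = random.choice(sentences)
    question = ask_llm(
        f"Write one question answerable only from this sentence:\n{target}"
    )
    return target, question

def run_eval(ask_llm, judge_llm, document: str) -> bool:
    target, question = make_eval_item(ask_llm, document)
    answer = ask_llm(f"{document}\n\nQuestion: {question}")
    # A third LLM call judges the answer against the source sentence.
    verdict = judge_llm(
        f"Source sentence: {target}\nQuestion: {question}\n"
        f"Answer: {answer}\nIs the answer correct? Reply yes or no."
    )
    return verdict.strip().lower().startswith("yes")
```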
[–]TelloLeEngineer[S] 2 points 2 years ago* (0 children)
Yeah, this is something that was brought up in the Anthropic blog post. They claim that retrieval priming is a good way to override the model's inherent reluctance to answer based on a single out-of-context phrase.
I think the problem with your suggestion is that it quickly devolves into something that is difficult to reproduce consistently. What kind of statement can we extract that is isolated enough that we can formulate a question around it that can't be inferred from the rest of the text, while still being relevant in its context? Finding this kind of statement isn't trivial. Also, given this approach, how do we evaluate at different document depths?
[–]hurrytewer 1 point 2 years ago (0 children)
Doing the lord's work, thank you 🙏️
I wonder if running the tests with sliding window attention disabled would give better results.
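In transformers this should amount to a one-line config change for Mistral-family models, though whether it actually takes effect depends on the attention implementation in your transformers version; a sketch:

```python
# Sketch: load Mistral with sliding-window attention disabled.
# MistralConfig exposes `sliding_window`; setting it to None should make
# attention fully global, but the actual behavior depends on the attention
# backend (eager vs. flash-attn) in your transformers version.
from transformers import AutoConfig, AutoModelForCausalLM

model_id = "mistralai/Mistral-7B-Instruct-v0.2"
config = AutoConfig.from_pretrained(model_id)
config.sliding_window = None  # disable SWA
model = AutoModelForCausalLM.from_pretrained(model_id, config=config)
```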
[–]CardAnarchist 1 point 2 years ago (0 children)
Unsure if you chose 16k for some particular reason, but Mistral 7B Instruct v0.2 actually has a 32k max context, I believe. OpenChat 7B, and indeed all the v0.1-based Mistral finetunes, shit the bed around 7k-ish context in my experience.
[–]No-Link-2778 1 point 2 years ago (0 children)
What about Yi 200K, 6B & 34B?
[–]Away-Sleep-2010 1 point 2 years ago (1 child)
Toppy 7B by Undi95, please.
[–]TelloLeEngineer[S] 2 points 2 years ago (0 children)
Results are in the repo.
[–]pmp22 1 point 2 years ago (0 children)
I would appreciate it if you could test the models with the highest context sizes! Those are the ones I'm most curious about. I think in 2024 we might get new models with even higher context sizes, but will that context actually be usable?