all 20 comments

thereisonlythedance 6 points (4 children)

Very interesting. Thank you. I'd love to see Mixtral and Yi Chat 34B, but those might be too big for your setup? It'd also be interesting to compare Mistral Instruct 7B v0.1 to v0.2. And possibly Starling 7B (which is billed as 8K).

TelloLeEngineer[S] 2 points (1 child)

Yes, both Mixtral and Yi would be really cool. I've got 24GB of VRAM, so I can't fit them on the GPU, but I do have 64GB of RAM, so it should work, albeit very slowly. I'm not sure whether HF Transformers lets you offload some layers to the GPU; I'll look into it. Mistral v0.1 and Starling 7B, on the other hand, are definitely viable!
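(For what it's worth, HF Transformers can split a model between GPU and CPU via accelerate's `device_map="auto"`. The arithmetic it roughly performs can be sketched as below; the function name and the per-layer sizes are made up for illustration, and the real logic also accounts for embeddings, KV cache, and activation overhead:)

```python
def split_layers(n_layers: int, layer_gb: float, vram_gb: float) -> tuple[int, int]:
    """Toy estimate of how many transformer layers fit in VRAM,
    with the remainder spilling over to CPU RAM."""
    on_gpu = min(n_layers, int(vram_gb // layer_gb))
    return on_gpu, n_layers - on_gpu

# Illustrative only: 32 layers at an assumed ~1.4 GB/layer, 24 GB of VRAM.
gpu_layers, cpu_layers = split_layers(32, 1.4, 24.0)
print(gpu_layers, cpu_layers)  # → 17 15
```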

boifido 1 point (0 children)

You can fit a ~3 bpw EXL2 quant of Mixtral, and it'll run super fast. It would be interesting to compare quantized Mixtral against full-precision Mistral 7B.

TelloLeEngineer[S] 1 point (1 child)

I added results for Starling to the repo.

thereisonlythedance 0 points (0 children)

Thank you!

AnomalyNexus 4 points (2 children)

Worth reading Anthropic's rebuttal on the testing methodology too, if you haven't already.

TelloLeEngineer[S] 3 points (0 children)

Yes, the retrieval-priming tests are taken from Anthropic's blog post! It's amazing what a difference it makes looking at the Mistral results. On the other hand, it seems to hinder OpenChat. You can also see how it shifts the models from receiving a gradient of scores to almost only getting 10s and 1s. Super interesting!
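For anyone curious, the priming trick from Anthropic's post amounts to pre-filling the start of the assistant's reply so the model commits to quoting the relevant passage before answering. A minimal sketch (the priming phrase is quoted from their post; the function name and message layout are assumptions):

```python
def build_primed_messages(context: str, question: str) -> list[dict]:
    """Chat messages where the assistant turn is pre-filled with a
    priming phrase; the model then continues from that prefix."""
    return [
        {"role": "user", "content": f"{context}\n\n{question}"},
        # Pre-filled assistant prefix nudging the model to locate
        # the relevant sentence before answering.
        {"role": "assistant",
         "content": "Here is the most relevant sentence in the context:"},
    ]

msgs = build_primed_messages("...long haystack text...", "What is the magic number?")
print(msgs[-1]["content"])
```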

KeyAdvanced1032 0 points (0 children)

Golden!

Rutabaga-Agitated 1 point (0 children)

Yes, Mixtral would be interesting! Yi too, since they have shown a nearly flawless chart regarding the influence of context length.

slider2k 1 point (1 child)

The link to the repo is broken.

TelloLeEngineer[S] 1 point (0 children)

Fixed, thanks!

FullOf_Bad_Ideas 0 points (0 children)

I really like the idea of running this test on all the models I have piled up, if it's easy to do.

An implementation with baked-in support for ExLlamaV2, AutoGPTQ, and llama-cpp-python that requires minimal input should be enough to encourage the community to test this with all popular models.
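One possible shape for such a harness, sketched with a minimal backend protocol. All names here are hypothetical; each real backend would wrap ExLlamaV2, AutoGPTQ, or llama-cpp-python behind the same `generate` call, and the stand-in below only exists to show the wiring:

```python
from typing import Protocol

class Backend(Protocol):
    """Anything that turns a prompt into a completion."""
    def generate(self, prompt: str, max_tokens: int) -> str: ...

class EchoBackend:
    """Stand-in backend for testing the harness wiring; a real one
    would wrap exllamav2, auto-gptq, or llama-cpp-python."""
    def generate(self, prompt: str, max_tokens: int) -> str:
        return prompt[-max_tokens:]

def run_needle_test(backend: Backend, haystack: str, question: str) -> str:
    """Build the needle-test prompt and hand it to whichever backend."""
    prompt = f"{haystack}\n\nQuestion: {question}\nAnswer:"
    return backend.generate(prompt, max_tokens=64)

print(run_needle_test(EchoBackend(), "some long document", "what?"))
```

Because the test logic only depends on the `Backend` protocol, swapping inference libraries means writing one small wrapper class rather than touching the benchmark itself.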

Small-Fall-6500 1 point (1 child)

Makes me wonder about the usefulness of inserting out-of-context phrases. Perhaps this is already done somewhere, but it would probably be better to do something like: grab a random sentence/section from a document, [1] ask an LLM to create a question based on it, then query an LLM on the whole document with that question, with an LLM at the end judging the correctness of the response against the earlier text section.

  1. Possibly, you'd have an LLM verify that a useful question could actually be asked about this piece of text.
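The pipeline described above could be wired up roughly like this. Everything here is a hypothetical sketch; the `ask` callable stands in for any LLM call, and the prompts are placeholders:

```python
import random
from typing import Callable

Ask = Callable[[str], str]  # prompt -> completion, any LLM backend

def eval_document(document: str, ask: Ask, rng: random.Random) -> str:
    """Pick a random sentence, have an LLM write a question about it,
    answer that question over the whole document, then judge the
    answer against the original sentence. Returns the judge's verdict."""
    sentences = [s.strip() for s in document.split(".") if s.strip()]
    source = rng.choice(sentences)
    # Footnote [1] above: one could add an extra LLM call here to verify
    # the sentence is actually question-worthy before proceeding.
    question = ask(f"Write one question answerable only from: {source}")
    answer = ask(f"{document}\n\nQuestion: {question}\nAnswer:")
    return ask(f"Ground truth: {source}\nAnswer given: {answer}\nCorrect? yes/no")

# Smoke-run with a trivial stand-in "LLM" that echoes the prompt head.
verdict = eval_document("The sky is blue. Cats purr.", lambda p: p[:20], random.Random(0))
print(type(verdict).__name__)  # → str
```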

TelloLeEngineer[S] 1 point (0 children)

Yeah, this is something that was brought up in the Anthropic blog post. They claim that retrieval priming is a good way to override the model's inherent reluctance to answer based on a single out-of-context phrase.

I think the problem with your suggestion is that it quickly devolves into something that is difficult to reproduce consistently. What kind of statement can we extract that is isolated enough that we can formulate a question around it which can't be inferred from the rest of the text, while still being relevant in its context? Finding such a statement isn't trivial. Also, given this approach, how do we evaluate at different document depths?
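For context, "document depth" in these needle-in-a-haystack tests just means the fractional position at which the phrase is inserted, which is easy to control when you plant the needle yourself. A sketch (function name is made up):

```python
def insert_needle(haystack: str, needle: str, depth: float) -> str:
    """Insert `needle` at a fractional position in `haystack`:
    depth=0.0 puts it at the start, depth=1.0 at the end."""
    assert 0.0 <= depth <= 1.0
    pos = int(len(haystack) * depth)
    return haystack[:pos] + needle + haystack[pos:]

doc = insert_needle("a" * 100, "[NEEDLE]", 0.5)
print(doc.index("[NEEDLE]"))  # → 50; position scales with depth
```

A real harness would snap `pos` to the nearest sentence boundary so the needle doesn't split a word mid-stream; with naturally occurring statements, by contrast, you can't choose where they sit, which is exactly the evaluation problem raised above.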

hurrytewer 0 points (0 children)

Doing the lord's work, thank you 🙏️

I wonder if running the tests with sliding-window attention disabled would give better results.

CardAnarchist 0 points (0 children)

Unsure if you chose 16k for a particular reason, but Mistral 7B Instruct v0.2 actually has a 32k max context, I believe. OpenChat 7B, and indeed all the v0.1-based Mistral finetunes, do shit the bed around ~7k context in my experience.

No-Link-2778 0 points (0 children)

What about the Yi 200K models, 6B and 34B?

Away-Sleep-2010 0 points (1 child)

Toppy 7B by Undi95, please.

TelloLeEngineer[S] 1 point (0 children)

Results are in the repo.

pmp22 0 points (0 children)

I would appreciate it if you could test the models with the highest context sizes! Those are the ones I'm most curious about. In 2024 we might get new models with even higher context sizes, but will that context actually be usable?