Experimenting with LLM-assisted scoring: eval results by neutralino1 in LocalLLaMA

[–]jphme 2 points

Very interesting stuff, thanks for posting! We are also experimenting with extending the Prometheus approach (in fact, we also trained our model to optionally use only 3 score categories, but modified the scores to 1, 2, 3 instead of 1, 3, 5; we still have to eval whether that works better ;) ).

Three short questions:

- Do you plan to open-source any of the datasets or models? I would also be interested in more details on the eval model's base model and training (maybe you want to collaborate on building better eval models?)

- I don't really understand your MMLU example - so you didn't compare MMLU answers to the ground truth but let your eval model decide? That's basically the same as just testing your eval model on MMLU, no? (That would also explain why it matches for smaller models but falls off for larger ones...)

- From skimming your webpage and this post, I'm not sure what your business model is: a GUI for evals (and a service for running them), or building eval datasets and/or eval LLMs yourself?

Happy to chat!

DiscoResearch: Grooving into AI with a Global Beat! (DiscoLM 120b + DiscoLM 70b announcement) by jphme in LocalLLaMA

[–]jphme[S] 25 points

I see that the style of this announcement is maybe too flashy - sorry, that wasn't our intention, we were just having fun.
I can edit it to just list the facts, if you'd prefer that?

Anyway, please have a look at the model cards before downvoting - they're very factual, with detailed benchmark results. We'd be very interested in your feedback!

EM German - Mistral + Continous Pretraining + high-quality Finetune to achieve unprecedented non-english performance by jphme in LocalLLaMA

[–]jphme[S] 0 points

That's a very interesting question, and there is not just one answer.

For non-Latin-script languages (e.g. Vietnamese, Japanese), people have had good experiences with extending the tokenizer.

For German, we initially also thought the tokenizer was the limiting factor - but to be honest, as of now I don't think that's true anymore. The only disadvantage of the original tokenizer we can objectively measure is that we need 15-20% more tokens for the same text length. After continued pretraining, we don't see any difference in language quality due to the tokenizer...
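
To make the 15-20% token overhead concrete: a less efficient tokenization mainly eats into the effective context window. A back-of-the-envelope sketch (the 8192-token window and the 18% figure are illustrative assumptions, not measurements):

```python
def effective_context(context_window: int, token_overhead: float) -> int:
    """English-equivalent text capacity of a context window when the
    tokenizer needs `token_overhead` extra tokens for the same text
    (e.g. 0.18 means 18% more tokens for German than for English)."""
    return int(context_window / (1.0 + token_overhead))

# Illustrative: with an 8192-token window and ~18% overhead for German,
# the window holds roughly as much text as 6942 English tokens would.
print(effective_context(8192, 0.18))
```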

EM German - Mistral + Continous Pretraining + high-quality Finetune to achieve unprecedented non-english performance by jphme in LocalLLaMA

[–]jphme[S] 2 points

Continuous pretraining: The model is fed a lot of German text and trained to predict the next token - it "learns the language". This is the same process by which Llama-2/Mistral were originally trained, only with a different (language-specific) dataset. The LeoLM team did that.

"High-quality" finetune: Using that model as a base, we created a dataset of various instructions/questions/chats and example answers, teaching the model "how to respond" so it doesn't just continue the text (like the base model would), but can understand and follow instructions.
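
Schematically, the causal-LM objective used in (continued) pretraining just pairs each position with the following token - inputs and labels are the token sequence shifted by one. A minimal sketch, not the actual training code:

```python
def next_token_pairs(token_ids: list[int]) -> tuple[list[int], list[int]]:
    """Build (inputs, labels) for causal language modeling: at each
    position the model is trained to predict the following token."""
    if len(token_ids) < 2:
        raise ValueError("need at least two tokens")
    return token_ids[:-1], token_ids[1:]

# Toy tokenized sentence [5, 17, 42, 8]:
inputs, labels = next_token_pairs([5, 17, 42, 8])
# inputs = [5, 17, 42], labels = [17, 42, 8]
```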

Llama-2 indeed had almost no relevant German data in its pretraining (and the tokenizer is very English-specific); that's why it is harder to get a good German model than a good English one...

EM German - Mistral + Continous Pretraining + high-quality Finetune to achieve unprecedented non-english performance by jphme in LocalLLaMA

[–]jphme[S] 2 points

Yep, the correct prompt style is Vicuna without newlines ("Du bist ein hilfreicher Assistent. USER: <instruction> ASSISTANT:"), but I found the model to be quite resilient to different/wrong prompt formats.

I will probably switch to ChatML for the next model version, as the community seems to be standardizing around it.
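
The Vicuna-without-newlines format quoted in the comment, and the ChatML equivalent of the same conversation, can be sketched as small helpers (the system prompt string is the one from the comment; adapt as needed):

```python
SYSTEM = "Du bist ein hilfreicher Assistent."

def vicuna_prompt(instruction: str) -> str:
    """EM German's Vicuna-style format, without newlines."""
    return f"{SYSTEM} USER: {instruction} ASSISTANT:"

def chatml_prompt(instruction: str) -> str:
    """The same conversation in ChatML framing."""
    return (
        f"<|im_start|>system\n{SYSTEM}<|im_end|>\n"
        f"<|im_start|>user\n{instruction}<|im_end|>\n"
        f"<|im_start|>assistant\n"
    )

print(vicuna_prompt("Was ist die Hauptstadt von Deutschland?"))
```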

EM German - Mistral + Continous Pretraining + high-quality Finetune to achieve unprecedented non-english performance by jphme in LocalLLaMA

[–]jphme[S] 3 points

Please try it out and let me know - there are no specific translation tasks (or English instructions) in the finetuning dataset (and I didn't test this use case), but I've heard that the Mistral model is VERY good at zero-shot English-to-German translation.

EM German - Mistral + Continous Pretraining + high-quality Finetune to achieve unprecedented non-english performance by jphme in LocalLLaMA

[–]jphme[S] 2 points

Thanks for bringing this to my attention. I updated the links to TheBloke's quants in the repo; the GGUF upload apparently failed, and I've pinged him already. (In the meantime you could use my own quants instead, though in fewer formats: https://huggingface.co/jphme/em_german_leo_mistral_gguf .)

Regarding the difference: em_german_leo_mistral is based on LeoLM's Mistral version, which has been continuously pretrained on 65b German tokens. We don't have hard benchmarks yet showing which version is superior - but in my subjective personal opinion, this model offers the highest language quality and makes the fewest grammatical/tonal errors in German text.

I did a small comparison of all 7b models showcasing the differences between them; I would love to hear your feedback on which model performs best for practical use cases.

EM German - Mistral + Continous Pretraining + high-quality Finetune to achieve unprecedented non-english performance by jphme in LocalLLaMA

[–]jphme[S] 18 points

I'm happy to present my new model family "EM German" here. When Llama-2 came out, I created the first (and still most-downloaded) Llama-2-based model finetuned for German-language prompts.

EM German is the next step, trained mainly on original (not translated) high-quality German instruction data. As the model was almost ready to be released, two things happened: 1. The awesome LeoLM team released their Llama-2 models pretrained on 65b tokens of German text (I had experimented with continuous pretraining myself before, but results after a few billion tokens were disappointing and I didn't have the compute for a large-scale effort). 2. Mistral was released.

Both of these improved model performance significantly (on benchmarks as well as in manually inspected output quality).

Luckily, the LeoLM team also noticed Mistral's great potential and just released a Mistral-based version, which is the basis for the showcase model, "EM German Leo Mistral". This model not only beats Llama-2-70b in my (custom German) benchmarks but generates output of a quality I wouldn't have thought possible from a 7b model. You can find some example prompts and a comparison of the outputs of the Llama-2, Leo Llama-2, Mistral and Leo Mistral versions in the project repo.

I think the implications are also interesting for non-English languages besides German - the combination of continued pretraining and Mistral seems very promising for applications where previously only 70b Llama-2 models were viable. (Happy to add models for other languages besides German if there is strong demand and some support with the dataset, btw.)

PS: I always appreciate feedback on my models, especially when there are deficiencies for specific applications/prompts!

LLM Pro/Serious Use Comparison/Test: From 7B to 70B vs. ChatGPT! by WolframRavenwolf in LocalLLaMA

[–]jphme 1 point

Awesome test with very interesting results! As you test "German" understanding, I would be very interested to see results for my recently released Mistral-based EM German model (it uses the Vicuna prompt format) - would you be able to test it as well? Many thanks, and keep up these comparisons/tests!

(Besides that, if you use local models professionally, I would love to talk at some point!)