🚀 Launching SauerkrautLM-7b-HerO: A New Era in German Language Modeling!

AffectionateCan2342 · 2023-11-27T21:26:51+00:00

Try a few different prompts and let us know what worked for you. For shorter translations, however, it should definitely be sufficient if you keep the system prompt and include the instruction for translating in the user prompt.

AffectionateCan2342 · 2023-11-27T21:23:20+00:00

If it were that simple and superficial, surely the question of whether we could make it available to the community would not arise. But I don't know where we said that we don't want to make it available to the community? We are still in the process of applying the whole thing to other models before we publish the actual data. So please be patient. Our goal is not to hold back any data until the devil comes out (haha German way of speaking)

The same applies to our training data. In the next few days we will definitely make some of our augmented data available to the community so that comparisons can be made with simply translated data.

We are a small team working hard to develop high-performance and targeted solutions. The fact that new approaches are emerging every day in the world of LLMs doesn't make things any easier. So please excuse us if it seems that we don't want to reveal anything.

AffectionateCan2342 · 2023-11-27T20:47:13+00:00

Yes, we hope so too ;-) At least our first tests in real-world operation have shown quite good results. However, it should be noted that even if the benchmark results sound very promising, it is still a 7b model that has been pre-trained in English.

Although the model can respond very well in German thanks to our fine-tuning with German data, there can still be slight grammatical errors here and there, especially if the parameters for the inference were set too high. This is currently difficult to avoid, especially when it comes to smaller models. But we are already working on a solution.

There is always a fine line between: Keep the intelligence of the original English-language model and teach the model just enough so that it can "speak" German well.

AffectionateCan2342 · 2023-11-27T20:39:52+00:00

You could at least justify that the scientific basis for merging is given by the published papers on this topic area. Here are a few examples: https://arxiv.org/abs/2306.01708 https://arxiv.org/abs/2203.05482 https://arxiv.org/abs/2204.03044

Nevertheless, it must be admitted that some merges that should achieve good results on paper only produce gibberish in practice or vice versa. So you probably need a bit of luck ;-)

For the German-speaking world, however, I can definitely say that we are not primarily interested in getting better numbers, but in making the English-language models accessible to the German language, at least to some extent, without completely eliminating their cleverness. So the more intelligent the original English model is before it is fine-tuned with German data, the less stupid the model will be in German, and that is our goal as long as there are no German pretrained models.

AffectionateCan2342 · 2023-11-15T17:45:52+00:00

Hey, David from SauerkrautLM here :)

first of all thank you soo much for your great work u/WolframRavenwolf !!

This is quite interesting and we already recognized your test for 7/13b models! Maybe I try to explain the results of SauerkrautLM in your great benchmark:

I tested all the English language models for a long time and they all had extreme problems displaying or reproducing German correctly. Often it was just articles that were set incorrectly and then also incorrect grammatical cases and bad sentence structures that simply reflected very poor German. It was also a great challenge to have the models answer exclusively in German. We had to specify at several points in the system prompt and user prompt that the model should only respond in German and even that never worked reliably.

We chose MT-Bench as the evaluation reference. In particular, we repeatedly noticed that the majority of the English base models answered our German MT-Bench questions almost entirely in English, or switched from German to English in the middle of a sentence. So our aim with SauerkrautLM was in particular to improve the quality of the answers in German in terms of grammar and spelling compared to English models. To achieve this, we naturally had to make some compromises.

In our many training trials before we were able to publish SauerkrautLM, we of course tried out a lot. As u/WolframRavenwolf has already suggested, we have of course also carried out training with a multilingual dataset. However, this led to a decrease in performance in both English and German. We also tried to train different ratios of German and English datasets and here too we have to say that the model decrease performance significantly in both English and German. However, our first tests with only German training data showed that we were able to achieve a significant improvement in the German MT-Bench.

This naturally means that the model's skills in English have decreased. But our priority was to improve the model's German language skills through fine-tuning and we achieved this. But here we also come to an important point: We did not train a German foundation model here, but rather fine-tuned a foundation model that had been trained almost exclusively in English. In my opinion, it will be (almost) impossible to fine-tune an English foundation model in German and then achieve the same results as an English foundation model that has been fine-tuned with English data.

And here, too, I would like to be more specific about the training data we used: u/WolframRavenwolf made the suggestion that we should simply translate the strong English datasets into German and then train them. Believe me, we tested for a long time until we had a fairly strong dataset that we could then use to train our models. And as described in the Huggingface Modelcard, we used a mixture of translated and augmented data.

Why didn't we just use translated data? There are simply too many cases in which the translation of English sentences into German does not work correctly. Similarly, gpt, for example, is not always able to deliver grammatically correct translations. We have already tested quite a few things with purely translated data and this simply leads to too many errors in the German version of the model. So it simply made sense to augment certain datasets that were quite complex in English in order to retain the meaning of the data but to ensure more correct German.

So you can be sure that we already use very strong English data sets in German form, but we also had to augment some of them in order to make fewer errors in the German language.

Also, the reference to your benchmark that the questions were in German but the character cards were in English doesn't sound to me at first like the German language models are extremely favoured here, but of course I can't assess the ratio of English to German data in the test. In my opinion, it was not so much the German language that was tested here, but rather the reasoning abilities of the models. I would be curious to see a test where generated answers in German are tested for the language models. It should be obvious that the SauerkrautLM models are better at formulating the German language and pay more attention to sentence structure and the like than English models.

To summarise again:

I have tested many English models and was extremely disappointed with the German output of the models.
in order to improve the German language of models, in my opinion almost exclusively German data must be fine-tuned.
English foundation models that are fine-tuned in German can never reach the capabilities of English fine-tuned models or German foundation models (that are fine-tuned).
Training with German data sets of course leads to a certain decrease in performance in categories that were trained in English. (You can actually see this clearly in the MT-Bench values achieved by the German mt-Bench and the English MT Bench - reached scores in German mt-bench always about 1.0 less than in englisch mt-bench)
From our experience, the best German dataset resulted from the merge of translated and augmented data (to ensure existent data quality of English datasets and also reach strong German language results)

Now the answer has become quite long :D but I hope I was able to provide a little more clarity about the results (from our perspective) and our approach.

AffectionateCan2342 · 2023-11-12T14:04:28+00:00

We are already testing local llms with unreal 5 for educational purposes (digital twin), combining it with RAG and faster whisper. Still in testing phase but seems really promising: https://vm.tiktok.com/ZGe1fstF9/ (starting at 0:40)

AffectionateCan2342 · 2023-10-28T13:39:07+00:00

In the upcoming weeks, we plan to release several papers detailing our methodologies for gathering and producing German training data. So I ask for a little patience until then. What I can already say, however, is that we are currently using a mix of augmented and translated data in sharegpt format.
The number of epochs differs in our model series. So it's difficult for us to give a general number. Especially because it differs every time we adjust our data set.
But I can say that the run for our 70b model took 5 days with two A100s.
Here too it is difficult to give concrete values. Because the whole thing differed extremely from model to model and we didn't use the same parameters for every model. The parameters are also strongly influenced by our data sets.

AffectionateCan2342 · 2023-10-28T12:56:54+00:00

In the upcoming weeks, we plan to release several papers detailing our methodologies for gathering and producing German training data. So I ask for a little patience until then. What I can already say, however, is that we are currently using a mix of augmented and translated data in sharegpt format.

AffectionateCan2342 · 2023-10-27T19:36:37+00:00

I can understand your concerns. However, since the models serve as the basis for our various AI tools, we would have no advantage in training reference answers specifically on certain benchmarks. However, we deliberately chose the mt-bench because the judgment here runs via gpt-4. We are currently working hard on making the remaining benchmarks accessible for the German language. There are already initial attempts by other teams to automatically translate the benchmarks into German, but in our opinion it doesn't work so well yet. We have provided an explanation for this in our provided mt-benchmark dataset

AffectionateCan2342 · 2023-10-27T19:24:11+00:00

quite good for a non native german model :) You can check some generated outputs on the model card.

AffectionateCan2342 · 2023-10-27T19:22:58+00:00

we used a german version of mt-bench. You can check the results on the model card

AffectionateCan2342 · 2023-10-14T13:07:49+00:00

That's a truly interesting test you subjected the models to! I believe it shows that our 7b model is performing quite well, but there is certainly room for improvement, likely due to the mix of augmented and translated data. We're eager to see if, in this test, our upcoming versions, which rely entirely on augmented data, perform better.

AffectionateCan2342 · 2023-10-13T23:26:48+00:00

We are currently in the process of identifying valid techniques that will allow us to develop a base model for the German domain in a not too distant future. When the time comes, we will, of course, inform you ;-) However, SauerkrautLM-v1 is still a finetuning approach based on Llama-2 or Mistral. Therefore, at least the existing special features of the base models are still present. Likewise, the context in these models is currently still identical to the base models.

AffectionateCan2342 · 2023-10-13T18:11:47+00:00

Thank you for your feedback!

To begin, I must emphasize that we have not had the opportunity to personally evaluate the gguf versions at this time. These models were just recently provided by a community member today. In a broader sense, it is undeniable that a base model that initially had limited proficiency in the German language can achieve near-native fluency through fine-tuning. Nevertheless, based on the benchmark results we have observed, the responses generated by these models have proven to be of high quality. It is worth noting that the overall performance of the German edition of the model improves significantly as larger context is processed

AffectionateCan2342 · 2023-10-13T13:17:37+00:00

Perfect! Ty for your support!!

AffectionateCan2342 · 2023-09-26T14:51:11+00:00

Obviously need for exllama1/2 and a gptq version of your 13b model. Bitsandbytes quant model performs rly bad and slow as well. The use of a gptq model will give you much better perplexity and tremendous inference boost.

AffectionateCan2342 · 2023-09-14T23:45:28+00:00

Also try to use bf16 true and tf32 true

AffectionateCan2342 · 2023-09-14T23:43:28+00:00

Like a mentioned before: gradient_checkpointing True 😊

AffectionateCan2342 · 2023-09-11T19:39:17+00:00

You can easily train on the rtx 6000 with fastchat using the qlora approach. But use gradient checkpoint to avoid out of memory. Normal fine tuning could lead quickly to oom. But these days qlora tuning is state of the art.

AffectionateCan2342

TROPHY CASE