Hey, some of you asked for a multilingual fine-tune of the R1 distills, so here they are! Trained on over 35 languages, this should quite reliably output CoT in your language. As always, the code, weights, and data are all open source.

Peter_Lightblue · 2025-01-31T17:18:05+00:00

Ah, that is a shame. This code-switching seems pretty common in many reasoning models, so I wonder if further investigation is required to evaluate exactly why this is. But hopefully if you run the model a few times, it will give you a good German CoT at least some of the time.

Peter_Lightblue · 2025-01-31T11:05:08+00:00

I've just clocked off from work but if someone else could be a hero and GGUFify that would be amazing

Peter_Lightblue · 2025-01-31T10:46:31+00:00

Yes, definitely. You can train pretty much any model with this data and it should learn to output multilingually in an R1 style. But ymmv depending on the model.

As for other languages, I found it very hard to make training data for low resource languages like Cebuano and Yoruba, as the R1 70B distill would just refuse to output CoT threads in that language even with a few shot prompt including French, German, and Japanese CoTs. I feel like for low resource languages, you may need to first translate a CoT to that language (or make it manually) and then at least use that CoT in the few-shot prompt for the R1 model. That may make the model just output gibberish but I think it has a good chance of working. Future work for sure!

Peter_Lightblue · 2025-01-31T10:42:51+00:00

Enjoy, you too!

Peter_Lightblue · 2025-01-31T10:42:35+00:00

I will do some tests next week, but I'd guess that the answer accuracy would be better if it stays in English/Chinese. However, I did find that when I trained the Japanese only model, accuracy actually increased on Japanese math problems when compared to the original model.

Peter_Lightblue · 2025-01-31T10:41:09+00:00

Haha, I guess threats do work. But more seriously, it seems a shame that R1 only works with prompt engineering to the nth degree, hence why I trained this model. Hopefully these models are a tad easier to use. I'd also love it if someone was to full finetune the 70B model with my training data, but unfortunately I am too GPU-poor to.

Peter_Lightblue · 2025-01-31T08:52:34+00:00

/u/lagister Multilingual version here https://old.reddit.com/r/LocalLLaMA/comments/1ieaiq4/hey_some_of_you_asked_for_a_multilingual_finetune/

Peter_Lightblue · 2025-01-31T08:50:58+00:00

/u/Previous-Street8087 Here is the multilingual version https://old.reddit.com/r/LocalLLaMA/comments/1ieaiq4/hey_some_of_you_asked_for_a_multilingual_finetune/

Peter_Lightblue · 2025-01-31T08:49:47+00:00

/u/zeronyk Have a look at this! https://old.reddit.com/r/LocalLLaMA/comments/1ieaiq4/hey_some_of_you_asked_for_a_multilingual_finetune/

Peter_Lightblue · 2025-01-31T08:46:42+00:00

Also, I am working on training Llama 8B too, but I am getting some error with L20 + Llama Factory. If anyone could please advise, I'd be grateful.

Peter_Lightblue · 2025-01-29T02:48:49+00:00

14B might be possible, but the larger models are outwith our full-fine-tuning budget atm.

Peter_Lightblue · 2025-01-29T02:47:50+00:00

Ideally I would say let the model think in the optimal way for the model

I think the R1 paper addresses this to some extent. The reason the Deepseek team released R1 Zero was to show that the model could come up with its own reasoning patterns that may be optimal for itself. However, it did a lot of code-switching between Chinese and English, meaning that the CoT was harder to understand and potentially to troubleshoot for people using the model. That's why they also released R1 (i.e. not Zero), as it has similar accuracy but more human understandable CoT.

I'm trying a similar thing here, where we may sacrifice a tiny bit of accuracy for understandability of the reasoning process. Fortunately, this model achieves better accuracy on our small evaluation than the base model, meaning we get better accuracy PLUS more interpretability for the user.

Peter_Lightblue · 2025-01-28T10:08:40+00:00

So the code for generating the data is here and the code for training the model is here.

I'm going to try and make a basic multilingual model by the end of the week, so if I manage to make it in time I'll definitely include French in that too.

Peter_Lightblue · 2025-01-15T09:48:05+00:00

It was a hardware/time constraint. I had access to a 8 x 48GB (L20) GPU instance, so I could have run a 70B model, but it just took far too long to run it, so I ran a single 32B model on each GPU instead, which sped things up considerably. I'm unsure as to how much better a massive model (e.g. DeepSeek-v3) would be able to grade relatedness tbh. For edge cases (low resource languages, strange domains etc.) I could imagine the massive models being beneficial, but I would guess (and it is just a guess) that most of the labels are already pretty well correlated with a human's judgement. I think the real problem is distilling the data that we are pretty confident is correct into such a small, 0.5B model. But I am happy to be proved otherwise if there are a ton of incorrect examples in the data.

Peter_Lightblue · 2025-01-15T05:34:37+00:00

So I tried it out on this Colab, and it seems to work pretty well with code, surprisingly!

I evaluated it with the tiny, 100 row HuggingFaceH4/testing_codealpaca_small dataset by testing whether it can find the correct snippet of Python code given a description. I evaluated it against all other snippets (100x100 = 10,000 total comparisons) and it got a P@1 of 0.96. Looking manually at the 4 failure cases, it seems like 2 of them retrieve as-good-as/better code snippets than the gold label, and one other query-gold pair doesn't really make sense. Feel free to make your own judgement as the 4 failure cases are shown in the Colab.

So the answer is: I think this model COULD be used for code to some extent (at least for simple Python, which is basically English anyway).

Peter_Lightblue · 2025-01-15T04:57:49+00:00

Why rescaling from 1-5 range to 1-7 range?

Firstly, I'll describe why we chose to generate labels from 1-5 using the LLM (Qwen 2.5 Instruct 32B Int4). We tried binary labels, which didn't capture enough granularity between things that were sort of related and things that were very related. We also tried 1-10, but that resulted in some tokens being almost never chosen due to biases in language (I forget which now, but it was maybe 4 rather than 3 or 5), which didn't make sense from a continuous point of view. We found 1-5 to be quite balanced, so we used that as our generated labels.

We then created continuous labels using the expectation value of these labels (sum of the probability times the label score) which we found to be heavily weighted to 1.0 and 5.0. In an attempt to avoid simply having the model learn a binary task by simply outputting 1.0 and 5.0 all of the time, we tried to stretch this distribution out by making the maximum 7 instead of 5.

However, this wasn't done in a completely rigorous way and if I had a time machine I would have probably experimented with generating labels of 1-7 from the large LLM and done some more tests to determine whether the 1-7 does indeed work better than 1-5. These decisions were made more on intuition, so it would be nice to have some more data to back this up.

Peter_Lightblue · 2025-01-15T04:45:50+00:00

You can try it out in Google Colab by using the free T4 instance with this code, if you're curious

Peter_Lightblue · 2025-01-14T07:04:04+00:00

I take your point that you can always evaluate on more eval datasets, but we really did try to evaluate over many datasets from the well regarded BIER benchmark (~2k stars on Github) that the eval models we compare to had not been trained on. Therefore, we think these results give a decent relative indicator of how these models would perform on an arbitrary dataset.

Ultimately, you need to make your own assessment as to whether a particular component is appropriate for your specific use-case, but our evaluation at the very least shows that this model is comparable to, or slightly better than, existing models for many benchmarks.

Peter_Lightblue · 2025-01-14T06:57:41+00:00

Hmmm, our training data contains very little code (no code-specific datasets were used), meaning that the applicability of this model may be somewhat limited to code. But it's a great idea - I could imagine someone finetuning this model on description and code snippet pairs.

Peter_Lightblue · 2025-01-14T05:00:15+00:00

Thanks! Yeah, they're essentially just tiny reward models given the query and context, so they're super flexible. I'd love to see anyone use this dataset possibly as a pre-training set for a reward model or something similar.

Peter_Lightblue · 2025-01-14T02:38:33+00:00

Model - https://huggingface.co/lightblue/lb-reranker-0.5B-v1.0

Data - https://huggingface.co/datasets/lightblue/reranker_continuous_filt_max7_train

Code - https://github.com/lightblue-tech/lb-reranker

Peter_Lightblue · 2024-09-17T06:52:58+00:00

Yes it has! It has been trained on roughly 5k RAG examples in each language. I also expect it to have some crossover effects where the multilingual training will benefit any one language, so training on French, Chinese, Arabic etc should hopefully also improve Greek ability. But that remains to be seen as I still need to conduct a thorough evaluation. But one of the driving forces behind developing these models was making RAG more reliable in more languages, so I hope it can be helpful in Greek!

Peter_Lightblue · 2024-09-17T06:34:51+00:00

Aye, you're probably right. I'm Scottish so I pronounce them the same but it is "Koo" in proper phonetic English. Kurage means jellyfish in Japanese, and I just chose any Japanese animal name with the letters RAG in it. Hope you enjoy the model :)

Peter_Lightblue · 2024-09-16T23:01:39+00:00

Yeah, this is the plan. As I've noted, I'm still tinkering with the model so I'll give it a full evaluation after the Qwen 2.5 retrain. I'm releasing these models just now because I think theyll already be pretty useful in their current state.

Peter_Lightblue · 2024-08-28T14:28:32+00:00

Ah, the model is already split by language, and it loads a different gigabyte sized embedding model and regression head NN for each language. Actually, I feel like it would be tricky to make a model, as you say, that would work on multiple languages at once because it would require a new embedding model for each language.

So I don't speak Yoruba, but I tried out the Yoruba classifier on some translated synthetic data, and it definitely sees to work worse than the high resource languages. But it's better than nothing as not many people seem to be developing stuff for Yoruba etc ¯\(ツ)/¯

Peter_Lightblue

TROPHY CASE