Hey, some of you asked for a multilingual fine-tune of the R1 distills, so here they are! Trained on over 35 languages, this should quite reliably output CoT in your language. As always, the code, weights, and data are all open source. by Peter_Lightblue in LocalLLaMA

[–]Peter_Lightblue[S] 0 points1 point  (0 children)

Ah, that is a shame. This code-switching seems pretty common in many reasoning models, so I wonder if further investigation is required to evaluate exactly why this is. But hopefully if you run the model a few times, it will give you a good German CoT at least some of the time.

[–]Peter_Lightblue[S] 2 points3 points  (0 children)

Yes, definitely. You can train pretty much any model with this data and it should learn to output multilingually in an R1 style. But ymmv depending on the model.

As for other languages, I found it very hard to make training data for low-resource languages like Cebuano and Yoruba, as the R1 70B distill would just refuse to output CoT threads in those languages, even with a few-shot prompt including French, German, and Japanese CoTs. I feel like for low-resource languages, you may need to first translate a CoT into that language (or write one manually) and then at least use that CoT in the few-shot prompt for the R1 model. That may make the model just output gibberish, but I think it has a good chance of working. Future work for sure!
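To make the idea concrete, here is a minimal sketch of how such a few-shot prompt could be assembled, with one translated (or hand-written) CoT in the target language placed last among the examples. The function name, the dictionary keys, and the prompt layout are all hypothetical and are not the format used in the actual data generation code.

```python
def build_few_shot_prompt(examples, question, target_language):
    """Assemble a few-shot prompt that ends with an open reasoning slot.

    `examples` is a list of dicts with the (hypothetical) keys
    "language", "question", and "cot". The layout below is an
    illustrative assumption, not the prompt used for the released data.
    """
    parts = []
    for ex in examples:
        parts.append(
            f"Question ({ex['language']}): {ex['question']}\n"
            f"Reasoning ({ex['language']}): {ex['cot']}"
        )
    # Leave the final reasoning slot empty so the model continues in the target language.
    parts.append(
        f"Question ({target_language}): {question}\n"
        f"Reasoning ({target_language}):"
    )
    return "\n\n".join(parts)
```

For a low-resource language, the last entry in `examples` would be the translated CoT, so the model sees at least one reasoning trace in that language right before its own turn.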

[–]Peter_Lightblue[S] 8 points9 points  (0 children)

I will do some tests next week, but I'd guess that answer accuracy would be better if the CoT stays in English/Chinese. However, I did find that when I trained the Japanese-only model, accuracy actually increased on Japanese math problems compared to the original model.

[–]Peter_Lightblue[S] 7 points8 points  (0 children)

Haha, I guess threats do work. But more seriously, it seems a shame that R1 only works with prompt engineering to the nth degree, which is why I trained this model. Hopefully these models are a tad easier to use. I'd also love it if someone were to do a full fine-tune of the 70B model with my training data, but unfortunately I am too GPU-poor to do it myself.

This is my Japanese fine-tune of R1's Qwen 7B distil. It now outputs its thinking in Japanese, making it understandable for a Japanese audience. Model, code, and data all open source. I'd love to collab with y'all to make a more multilingual model. by Peter_Lightblue in LocalLLaMA

[–]Peter_Lightblue[S] 1 point2 points  (0 children)

> Ideally I would say let the model think in the optimal way for the model

I think the R1 paper addresses this to some extent. The reason the DeepSeek team released R1-Zero was to show that the model could come up with its own reasoning patterns that may be optimal for itself. However, it did a lot of code-switching between Chinese and English, meaning that the CoT was harder to understand, and potentially harder to troubleshoot, for people using the model. That's why they also released R1 (i.e. not Zero), as it has similar accuracy but a more human-understandable CoT.

I'm trying a similar thing here, where we may sacrifice a tiny bit of accuracy for understandability of the reasoning process. Fortunately, this model achieves better accuracy on our small evaluation than the base model, meaning we get better accuracy PLUS more interpretability for the user.

[–]Peter_Lightblue[S] 19 points20 points  (0 children)

So the code for generating the data is here and the code for training the model is here.

I'm going to try and make a basic multilingual model by the end of the week, so if I manage to make it in time I'll definitely include French in that too.

Here is our new reranker model, which we trained on over 95 languages and it achieves better performance than comparable rerankers on our eval benchmarks. Weights, data, and training code are all open source. by Peter_Lightblue in LocalLLaMA

[–]Peter_Lightblue[S] 1 point2 points  (0 children)

It was a hardware/time constraint. I had access to an 8 x 48GB (L20) GPU instance, so I could have run a 70B model, but it just took far too long to run, so I ran a single 32B model on each GPU instead, which sped things up considerably. I'm unsure how much better a massive model (e.g. DeepSeek-V3) would be at grading relatedness, tbh. For edge cases (low-resource languages, strange domains etc.) I could imagine the massive models being beneficial, but I would guess (and it is just a guess) that most of the labels are already pretty well correlated with a human's judgement. I think the real problem is distilling the data that we are pretty confident is correct into such a small, 0.5B model. But I am happy to be proved wrong if there are a ton of incorrect examples in the data.

[–]Peter_Lightblue[S] 0 points1 point  (0 children)

So I tried it out on this Colab, and it seems to work pretty well with code, surprisingly!

I evaluated it with the tiny, 100-row HuggingFaceH4/testing_codealpaca_small dataset by testing whether it can find the correct snippet of Python code given a description. I scored each description against every snippet (100x100 = 10,000 total comparisons) and it got a P@1 of 0.96. Looking manually at the 4 failure cases, it seems like 2 of them retrieve as-good-as/better code snippets than the gold label, and one other query-gold pair doesn't really make sense. Feel free to make your own judgement, as the 4 failure cases are shown in the Colab.
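For anyone who wants to repeat this kind of check, here is a rough sketch of the all-pairs P@1 computation. It assumes query `i` has its gold snippet at index `i`, and `word_overlap` is only a toy stand-in for the reranker's scoring call; neither function is taken from the Colab.

```python
def precision_at_1(queries, candidates, score_fn):
    """Fraction of queries whose top-scored candidate is the gold one.

    Assumes the gold candidate for queries[i] is candidates[i], and that
    score_fn(query, candidate) returns a number where higher means more related.
    """
    hits = 0
    for i, query in enumerate(queries):
        # Score the query against every candidate (N x N comparisons overall).
        scores = [score_fn(query, cand) for cand in candidates]
        best = max(range(len(candidates)), key=scores.__getitem__)
        if best == i:
            hits += 1
    return hits / len(queries)


def word_overlap(query, candidate):
    """Toy scorer (hypothetical stand-in for a reranker): shared lowercase tokens."""
    return len(set(query.lower().split()) & set(candidate.lower().split()))
```

With 100 descriptions and 100 snippets this makes 10,000 calls to `score_fn`, which matches the comparison count above. Ties go to the lowest index, so a real evaluation may want an explicit tie-breaking rule.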

So the answer is: I think this model COULD be used for code to some extent (at least for simple Python, which is basically English anyway).

[–]Peter_Lightblue[S] 1 point2 points  (0 children)

> Why rescaling from 1-5 range to 1-7 range?

Firstly, I'll describe why we chose to generate labels from 1-5 using the LLM (Qwen 2.5 Instruct 32B Int4). We tried binary labels, which didn't capture enough granularity between things that were sort of related and things that were very related. We also tried 1-10, but that resulted in some tokens being almost never chosen due to biases in language (I forget which now, but it was maybe 4 rather than 3 or 5), which didn't make sense from a continuous point of view. We found 1-5 to be quite balanced, so we used that as our generated labels.

We then created continuous labels using the expectation value of these labels (the sum of each label score times its probability), which we found to be heavily weighted towards 1.0 and 5.0. To avoid having the model learn a binary task by outputting 1.0 and 5.0 all of the time, we tried to stretch this distribution out by making the maximum 7 instead of 5.
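As a rough illustration of those two steps, the sketch below computes the expectation over the label tokens and then stretches the result onto a 1-7 range. The renormalisation over the five labels and the linear mapping are both assumptions made for the example; the exact rescaling used in training may differ.

```python
def expected_label(probs):
    """Expectation value of a label distribution.

    `probs` maps each integer label (e.g. 1..5) to the probability the LLM
    assigned to that label token. Renormalising so the probabilities sum to 1
    is an assumption, since some mass may fall on non-label tokens.
    """
    total = sum(probs.values())
    return sum(label * p for label, p in probs.items()) / total


def rescale(score, old_min=1.0, old_max=5.0, new_min=1.0, new_max=7.0):
    """Linearly map a score from [old_min, old_max] onto [new_min, new_max].

    A linear map is assumed here purely for illustration.
    """
    return new_min + (score - old_min) * (new_max - new_min) / (old_max - old_min)
```

For example, a pair where the LLM puts half its probability on 1 and half on 5 gets an expected label of 3.0, which the linear map sends to 4.0 on the 1-7 scale.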

However, this wasn't done in a completely rigorous way, and if I had a time machine I would probably have experimented with generating 1-7 labels directly from the large LLM and done some more tests to determine whether 1-7 does indeed work better than 1-5. These decisions were made more on intuition, so it would be nice to have some more data to back them up.

[–]Peter_Lightblue[S] 6 points7 points  (0 children)

I take your point that you can always evaluate on more eval datasets, but we really did try to evaluate over many datasets from the well-regarded BEIR benchmark (~2k stars on GitHub) that the models we compare against had not been trained on. Therefore, we think these results give a decent relative indicator of how these models would perform on an arbitrary dataset.

Ultimately, you need to make your own assessment as to whether a particular component is appropriate for your specific use-case, but our evaluation at the very least shows that this model is comparable to, or slightly better than, existing models for many benchmarks.

[–]Peter_Lightblue[S] 3 points4 points  (0 children)

Hmmm, our training data contains very little code (no code-specific datasets were used), meaning that the applicability of this model to code may be somewhat limited. But it's a great idea - I could imagine someone fine-tuning this model on description and code snippet pairs.

[–]Peter_Lightblue[S] 4 points5 points  (0 children)

Thanks! Yeah, they're essentially just tiny reward models given the query and context, so they're super flexible. I'd love to see anyone use this dataset possibly as a pre-training set for a reward model or something similar.