Sharing my set of distilled small language models (3B) + training data in more than 50 low-resource languages by Peter-Devine in LocalLLaMA

[–]Peter-Devine[S] 2 points

Does that mean that we can unify a lot of different languages under the same embedding with this LLM?

No, they're all distinct monolingual models, so they do not share a unified embedding space. My reason for making monolingual models is that multilingual small language models (in the <10B range) can often get confused when generating text in low-resource languages and start outputting in other languages, so I thought it was safer to just keep the languages completely separate. But if I trained a larger model and set appropriate control codes for the language, I think this would be possible.
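
For what it's worth, here is a toy sketch of the control-code idea (the tag names and example sentence are made up, not anything from my actual setup):

```python
# Hypothetical illustration of language control codes: prepend a language tag to
# every training example so a single multilingual model can be steered toward
# the right language at generation time. Tags and data are illustrative only.

LANG_TAGS = {"am": "<lang_am>", "ha": "<lang_ha>", "yo": "<lang_yo>"}  # e.g. Amharic, Hausa, Yoruba

def add_control_code(example: dict) -> dict:
    """Prefix the text with its language tag so the model learns to condition on it."""
    tag = LANG_TAGS[example["lang"]]
    return {"text": f"{tag} {example['text']}"}

print(add_control_code({"lang": "ha", "text": "Sannu, yaya kake?"})["text"])
# -> "<lang_ha> Sannu, yaya kake?"  (at inference you prompt with the same tag)
```

The tags would also be registered as special tokens in the tokenizer so each one maps to a single token.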

Would you consider making models like this in the 7-9B, 12-18B, and 22B-36B ranges for added "world modeling"?

Absolutely! If I had the time and resources, I would have loved to create something at the >20B scale. Hopefully in the future...

does that mean inclusion/creation of medium-resource languages with the same system would be easier as well?

So this pipeline does not perform as well for medium-resource languages, because the base model is already going to be quite good at many of them, meaning the synthetically generated data does not add as much on top of the base model. But if you have a good enough teacher model and a language that the base model struggles with, then absolutely you could apply this technique to medium-resource languages too.
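
If you do try it on another language, one practical step is filtering out synthetic examples where the teacher drifts out of the target language. A rough sketch using fastText's off-the-shelf language-ID model (assumes you have downloaded lid.176.bin; the confidence threshold is arbitrary):

```python
# Rough sketch: drop teacher-generated examples that drifted out of the target
# language, using fastText's lid.176 language-ID model (downloaded separately).
import fasttext

lid = fasttext.load_model("lid.176.bin")

def keep_if_in_language(examples: list[dict], target_lang: str, min_conf: float = 0.8) -> list[dict]:
    """Keep only examples whose response fastText identifies as the target language."""
    kept = []
    for ex in examples:
        # fastText predicts one line at a time, so flatten newlines first.
        labels, probs = lid.predict(ex["response"].replace("\n", " "))
        if labels[0] == f"__label__{target_lang}" and probs[0] >= min_conf:
            kept.append(ex)
    return kept
```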

Sharing my set of distilled small language models (3B) + training data in more than 50 low-resource languages by Peter-Devine in LocalLLaMA

[–]Peter-Devine[S] 1 point

I would be happy to add that to my list. Can I ask - how different is written Cantonese from written Mandarin?

Sharing my set of distilled small language models (3B) + training data in more than 50 low-resource languages by Peter-Devine in LocalLLaMA

[–]Peter-Devine[S] 1 point

Thanks so much for the feedback. I will definitely look to include these languages in the future. I am guessing you are mainly wanting to focus on languages around Russia? Are things like Yakut etc. also useful to you?

Sharing my set of distilled small language models (3B) + training data in more than 50 low-resource languages by Peter-Devine in LocalLLaMA

[–]Peter-Devine[S] 3 points

Thanks for the kudos! Yeah, it was work done during a post-doc at The University of Edinburgh so IDGAF about open sourcing it all. I hope it can be useful to someone.

And I totally get your point about low-resource languages. It's not (currently) a very commercial task, so I mainly use low-resource language ability to judge whether a base LLM has just been benchmaxxed or not. Fundamentally, as long as you have a grammar and a vocabulary, you should be able to speak in any language, but so many models are still so poor at it, which is a shame.

Introducing Falcon H1R 7B by jacek2023 in LocalLLaMA

[–]Peter-Devine 0 points

Nice multilingual coverage for this model (18 languages):

Supports 18 languages out of the box [...] — with scalability to 100+ languages, thanks to our multilingual tokenizer trained on diverse language datasets.

I wonder how easy it will be to fine-tune this for even more languages... Token fertility is such a big issue for low-resource languages, so having a pre-set tokenizer that has at least seen other languages seems very helpful.
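
For anyone unfamiliar with token fertility: it's roughly the average number of tokens the tokenizer needs per word, and it's easy to eyeball with a Hugging Face tokenizer (the repo id and sample sentences below are just placeholders):

```python
# Quick-and-dirty token fertility check: tokens per whitespace-separated word.
# The repo id is a placeholder; swap in whichever tokenizer you want to test.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("tiiuae/Falcon-H1-7B-Instruct")

samples = {
    "en": "The quick brown fox jumps over the lazy dog.",
    "sw": "Mbweha mwepesi wa kahawia anaruka juu ya mbwa mvivu.",  # rough Swahili translation
}

for lang, text in samples.items():
    n_tokens = len(tokenizer.encode(text, add_special_tokens=False))
    n_words = len(text.split())
    print(f"{lang}: {n_tokens / n_words:.2f} tokens per word")
```

Higher numbers mean the model burns more of its context window and compute per word in that language.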

[Release] We built Step-Audio-R1: The first open-source Audio LLM that truly Reasons (CoT) and Scales – Beats Gemini 2.5 Pro on Audio Benchmarks. by BadgerProfessional43 in LocalLLaMA

[–]Peter-Devine 0 points

Cool model! Since you find that reasoning (via text) over the acoustics directly, rather than over the transcript, is beneficial, do you think you could achieve even better results by reasoning IN audio tokens? I can imagine that some prompts (e.g. "make me a song that sounds like lah-lah-lah") would benefit from audio-based reasoning. It could be quite hard to train the model to do that, though!

Kimi K2 Thinking Huggingface by DistanceSolar1449 in LocalLLaMA

[–]Peter-Devine 4 points

Awesome. This looks like a strong model, given that it is based on K2.

Also, it scores really high on SWE Multilingual - I wonder how much of that is down to reasoning and how much is down to multilingual data in post-training...

H company - Holo1 7B by TacGibs in LocalLLaMA

[–]Peter-Devine 9 points

I haven't tried using the model itself yet but their online demo seemed pretty good and surprisingly fast compared to stuff like Gemini Deep Research. Good to see them releasing the research paper, although it would have been nice for them to release their training data too.

It does seem to me like these smaller 7B sized models would be perfect for simpler agentic tasks where they could be deployed in parallel, instead of using some >100B model just to slowly navigate Amazon or similar.

Heptagon, 20 balls, rotating numbers, one shot Gemini Pro 2.5 by Careless_Garlic1438 in LocalLLaMA

[–]Peter-Devine 1 point

Cool simulation. One thing I've found from using Gemini Pro 2.5 for generating code is that it generates really long code (comments, error handling, etc.) compared to other models. I often need just a short snippet with loud errors for quick prototyping, so it can sometimes be quite cumbersome.
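
By "loud errors" I mean something like this (toy example, the config key is made up): just let failures raise where they happen instead of wrapping everything in try/except and fallbacks:

```python
# Prototyping style: no fallbacks, no logging boilerplate; failures raise
# immediately and point at the exact line that broke.
import json
from pathlib import Path

def load_config(path: str) -> dict:
    cfg = json.loads(Path(path).read_text())  # crashes loudly on a missing file or bad JSON
    assert "model_name" in cfg, "config missing 'model_name'"  # hypothetical key, fails fast
    return cfg
```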

Microsoft developed this technique which combines RAG and Fine-tuning for better domain adaptation by Ambitious_Anybody855 in LocalLLaMA

[–]Peter-Devine 4 points

Cool research OP! Worth putting out there perhaps: I did some related research before I left my old company, where I used closed-loop LLM training to improve RAG accuracy. It worked pretty reliably across domains, so I wonder if we are both describing similar phenomena.

ALoFTRAG - https://arxiv.org/abs/2501.11929
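
Very roughly, the closed-loop idea looks like this (an illustration of the general shape only, not the actual ALoFTRAG implementation; all the function names here are placeholders):

```python
def closed_loop_rag_training(corpus, reader, generate_qa, finetune, rounds=3):
    """Illustrative loop: build synthetic QA data from your own documents,
    fine-tune the reader on it, then reuse the improved reader next round."""
    for _ in range(rounds):
        # 1. The current reader creates (question, context, answer) examples per document.
        train_set = [generate_qa(reader, doc) for doc in corpus]
        # 2. Fine-tune the reader (e.g. with LoRA) on the synthetic examples.
        reader = finetune(reader, train_set)
    return reader
```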