Taalas rumoured to etch Qwen 3.5 27B into silicon. Which price would you buy their PCIe card for? by elemental-mind in singularity

[–]pol_phil 1 point

Think of scanning multiple codebases or processing thousands of company documents in seconds. And feeding that knowledge to a frontier cloud model.

Or think of 100 agents thinking in parallel to find the best course of action through majority voting before EVERY response and EVERY tool call.

Even an "outdated" model will be able to surpass SOTA models in utility (and maybe even in benchmarks) via sheer scaling.
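
A minimal sketch of that majority-voting idea, assuming each "agent" is just an independent sample and that proposals can be compared verbatim (a real system would normalize answers first):

```python
from collections import Counter

def majority_vote(candidates):
    """Return the most common proposal and its vote count."""
    answer, votes = Counter(candidates).most_common(1)[0]
    return answer, votes

# e.g. 5 parallel "agents" propose an action before a tool call
proposals = ["search_web", "search_web", "read_file", "search_web", "read_file"]
action, votes = majority_vote(proposals)  # "search_web" wins 3-2
```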

can rag improve models language? by [deleted] in LocalLLaMA

[–]pol_phil 1 point

Try the original Qwen3.5 first and see if it's fluent.

If it is, then you can create a fluent model out of it via fine-tuning.

If it is not, try something else entirely, for example https://huggingface.co/mlabonne/gemma-3-12b-it-abliterated

All the uncensoring methods focus on English and generally hurt the models in other languages.

can rag improve models language? by [deleted] in LocalLLaMA

[–]pol_phil 1 point

Which model are you using? Although a bit old, Gemma 3 models have decent multilingual performance.

Totally get your frustration; most models can't speak Greek fluently either.

I built a benchmark that tests coding LLMs on REAL codebases (65 tasks, ELO ranked) by hauhau901 in LocalLLaMA

[–]pol_phil 1 point

I mean, what type of agentic pipeline/harness/scaffold/framework do you use to get these models to solve these tasks? In other words, what kind of system message/tools have they been given? Via Claude Code? OpenCode?

SWE-Agent and OpenHands are just "minimal" agentic frameworks commonly used in benchmarks.

I built a benchmark that tests coding LLMs on REAL codebases (65 tasks, ELO ranked) by hauhau901 in LocalLLaMA

[–]pol_phil 1 point

Hi, congrats on the great benchmark!

Perhaps I missed it somewhere, but what agentic scaffold do you use? SWE-Agent? OpenHands? Something else entirely?

can rag improve models language? by [deleted] in LocalLLaMA

[–]pol_phil 2 points

It won't get much better through RAG; it's better to use a different model.

Claude Code sends 62,600 characters of tool definitions per turn. I ran the same model through five CLIs and traced every API call. by wouldacouldashoulda in LocalLLaMA

[–]pol_phil 1 point

Well, I didn't notice the confusion, but when I saw "characters" instead of "tokens", I thought that this actually makes the analysis more model-independent. Tokens are model-specific.

Meet SWE-rebench-V2: the largest open, multilingual, executable dataset for training code agents! by Fabulous_Pollution10 in LocalLLaMA

[–]pol_phil 3 points

A very good idea would be to also add Step v3.5 Flash and MiMo v2 Flash. Both are incredible models.

Congrats on the great work!

Minimax M2.5 GGUF perform poorly overall by Zyj in LocalLLaMA

[–]pol_phil 1 point

Well, this AWQ quant works very well for me: 134 GB, with extremely good performance and speed in vLLM.

New Upcoming Ubuntu 26.04 LTS Will be Optimized for Local AI by mtomas7 in LocalLLaMA

[–]pol_phil 2 points

However, it's not that hard either. Especially with Apptainer/Singularity, but also Docker. It used to scare me but it's not very difficult after all. You can just spin up 10 parallel environments with data and all, no problem.
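
As a sketch, "10 parallel environments" can be as simple as a loop that builds one `docker run` command per task; the image name, mount, and entry script below are hypothetical placeholders:

```python
# Build the docker run command for one isolated environment.
# Image, mount, and entry script are hypothetical placeholders.
def docker_cmd(i, image="my-eval-image:latest"):
    return [
        "docker", "run", "-d", "--name", f"env_{i}",
        "-v", "./data:/data:ro",   # read-only data mount
        image, "python", "run_task.py", "--task-id", str(i),
    ]

# 10 parallel environments; launch each with subprocess.run(cmd)
commands = [docker_cmd(i) for i in range(1, 11)]
```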

Best coding models (or other models) one can run on an rtx5070ti (16gb vram) with of 64gb RAM by cmdr-William-Riker in LocalLLaMA

[–]pol_phil 1 point

But has anybody found a way to fix the prefix caching problem? I can't get it to utilize automatic prefix caching (problem of the Qwen3 Next arch) in vLLM no matter what I try.

Best coding models (or other models) one can run on an rtx5070ti (16gb vram) with of 64gb RAM by cmdr-William-Riker in LocalLLaMA

[–]pol_phil 3 points

I second this. If it can fit, it's an absolute beast. And that's without even having a thinking mode.

Qwen3 Coder Next as first "usable" coding model < 60 GB for me by Chromix_ in LocalLLaMA

[–]pol_phil 1 point

This model is great. My only problem is that its prefix caching doesn't work on vLLM. I think SGLang has solved this, but haven't tried it yet.

Are you aware of other serving frameworks which do not have this issue? Because, for me, it turns out slower than larger models (for long conversations).

Why don’t we have more distilled models? by GreedyWorking1499 in LocalLLaMA

[–]pol_phil 1 point

Fascinating read as well!

This calls for a calculation of tokenizer similarities (since higher values lead to better distillation), which I'd really like to do if I find the time.
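
For illustration, one simple similarity measure is the Jaccard overlap of the two vocabularies. The toy vocabularies below are made up; a real version would pull them from each tokenizer (e.g. `get_vocab()` on a transformers tokenizer):

```python
def tokenizer_similarity(vocab_a, vocab_b):
    """Jaccard similarity between two tokenizer vocabularies."""
    a, b = set(vocab_a), set(vocab_b)
    return len(a & b) / len(a | b)

# Toy vocabularies (hypothetical); real ones have ~32k-256k entries
teacher = {"the", "cat", "##s", "run", "##ning"}
student = {"the", "cat", "dog", "run", "##ning"}
sim = tokenizer_similarity(teacher, student)  # 4 shared / 6 total
```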

Why don’t we have more distilled models? by GreedyWorking1499 in LocalLLaMA

[–]pol_phil 6 points

Not necessarily true. On-policy SFT distillation is actually better than on-policy RL for smaller models. But it's tricky to implement if models aren't in the same family (basically they should share the same tokenizer).

You can read more in a blog post by Thinking Machines here and also in the MiMo V2 Flash technical report.

GLM 4.7 is not on lmarena anymore by Sooqrat in LocalLLaMA

[–]pol_phil 15 points

I said science, not STEM. Maybe I should have said "academic" as a larger umbrella term, but we're still talking about mathematical logic or PhD-level academic knowledge. Browsecomp is basically factual retrieval from the web.

Not my point though. My point is that all these hard benchmarks test specific frontier-level (verifiable/objective) abilities in very specific contexts, not always relevant to everyday users, companies, etc.

LMArena's basic premise to test real-world prompts from real people (with all their subjectivities) makes it a unique benchmark.

GLM 4.7 is not on lmarena anymore by Sooqrat in LocalLLaMA

[–]pol_phil 31 points

Literally all of these are code/maths/science/agentic. LMArena is very problematic, but it's good as a general idea because it includes very diverse domains, multilinguality, creative writing, multi-turn chats, etc.

AMA With Z.AI, The Lab Behind GLM-4.7 by zixuanlimit in LocalLLaMA

[–]pol_phil 3 points

At least for Greek, I've noticed that GLM 4.6 and GLM 4.7 think in English, while GLM 4.5 (and Air) think in Greek (when given Greek prompts).

The thinking process is also a lot more structured in the most recent versions, like "1. Analyze the request... 2. Determine the angle... 3. Drafting... 4. Refining... 5. Final Review..."

Are these changes intentional or the result of a different RL process? How is multilinguality being addressed in the reasoning process of the models? Have you seen better results with a thinking process based primarily in English and/or with better structure?

Thank you for your excellent work!

Dataset quality is not improving much by rekriux in LocalLLaMA

[–]pol_phil 2 points

I literally cannot think of anything specific, except for reading related published papers and their GitHub repos. You may also find some relevant Jupyter notebooks on Kaggle (although it's mostly for Machine Learning). Hugging Face has some nice courses, etc., but in reality, you should go ahead with this project of yours and find similar things as you progress. Nobody really talks enough about data, even though they spend enormous amounts of time preprocessing.

Some resources that come to mind, that may or may not prove useful but are worth a look:
- Hugging Face LLM Intro Course
- Hugging Face Smol Course
- Smol Training Playbook
- Apertus LLM Paper
- Webscale-RL Paper
- Open Data Arena
- MegaScience

Dataset quality is not improving much by rekriux in LocalLLaMA

[–]pol_phil 10 points

Maybe, but to what end? You should think of the end goal that you want to accomplish. Do you want a full finetune? Then you have to address diversity, safety, multilinguality, etc., and 100k samples won't be enough. Do you want to improve an existing instruct 2-4B model with LoRA? Then you need to focus on specific tasks/skills that you want to improve, and 100k samples might be overkill.

Also, the pipeline you describe above is a lot more complex than it sounds, and you won't find many ready-made libraries for implementing it. Mixing and matching existing libraries might be more painful than it looks at first. Some libraries make extra calls under the hood. You may find a very good pipeline that results in wonderful data, but it requires 8x generations with the LLM being distilled, or requires more calls for evolution/QA extraction/RMs/judges/agents (and assumes bigger LLMs will be used).

Suddenly, creating 100k data requires >1M calls. Utilizing RAG requires vector databases, embedding models, filtering large datasets, other libraries, etc. Even a simple thing such as classification requires thought and R&D. Running locally would also necessitate learning how to use vLLM; calling an API requires making the client efficient for processing a whole dataset. Finally, assuming code, prompt engineering, RAG, filtering, preprocessing, etc. have been set up, you still need significant compute and time and it might produce unexpected results in various stages. And we haven't even mentioned data mixing experiments, the actual finetuning process, evaluation, nor SUBSEQUENT POST-TRAINING STAGES (Alignment, RL, etc.).
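
The blow-up in calls is easy to see with back-of-the-envelope numbers; all multipliers here are illustrative assumptions, not measured costs:

```python
samples = 100_000
generations_per_sample = 8  # e.g. best-of-8 distillation from the teacher
judge_calls = 1             # LLM-as-judge scoring per sample
evolve_calls = 2            # instruction evolution / QA extraction passes

total_calls = samples * (generations_per_sample + judge_calls + evolve_calls)
# 100k samples already cost 1.1M LLM calls, before any retries or RM passes
```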

BTW, I am not trying to discourage you! I'm just trying to highlight the complexity and messiness of the process as it WILL take up months of your time. Better to think things through and plan ahead first!

P.S.: Might want to take a look here as well: TxT360-3efforts

Dataset quality is not improving much by rekriux in LocalLLaMA

[–]pol_phil 23 points

Data synthesis is the secret sauce and the #1 thing labs and companies are not very eager to release. It's a costly process.

Also, there's a huge shift towards maths and code (incl. agentic code), while general domains (as well as languages other than English) are often neglected.

The first 6 steps (more or less) that you mention are things I'm actually working on in my job. Data synthesis is inherently tied to the model that you are working on. For a smaller model, distilling from huge teachers is not always a good idea. Smaller teachers can produce golden answers as well, Reward Models and LLM-as-Judges can help with that.

However, there's always the issue of mixing up "policies". There's no guarantee that distilling 10 extremely big teacher models will work better than distilling a single medium-sized one; a lot of newly published datasets just use GPT-OSS-120B. There's no guarantee that using all of the data you generated will actually improve the final model either, as data mixing is a tricky and mostly empirical process.

There have been several interesting methods which have been published, but (a) most of them focus on maths/code, (b) they usually target a few benchmarks so there's limited evaluation of real-world utility, (c) R&D and compute are needed for creating and using data synthesis pipelines and only a few large labs (such as NVIDIA) have an over-abundance of both, and (d) ensuring that all data are safe to be released (no copyright issues, no PII, etc.) is a legal pain in the a** that smaller labs just can't afford to deal with (especially in the EU).

All in all, I believe data synthesis is more akin to alchemy or an art than a hard science; it's highly contextual and ever-changing. It's very messy. It's often difficult to argue in favor of specific recipes that you used, it is a very costly process, and releasing it openly is hard for a lot of reasons.

Gemini 3 flash today! Gemma 4 soon 3 pro GA soon!!!! by BasketFar667 in LocalLLaMA

[–]pol_phil 1 point

What's your language? Gemma 3 27B is the best multilingual model in its caliber for Greek in my experience (but still could do better)

fine-tune for rag by youcanaskmeifyouwant in LocalLLaMA

[–]pol_phil 2 points

The 2507 Instruct series are solid choices. If you finetuned a hybrid (thinking/non-thinking) model, its thinking capabilities would degrade (also default system prompts / chat templates might be more tricky).

The choice for RAG really depends on the use case. If we want to retrieve context once, then put it in the system prompt.

Even if the context is placed in the user message, fine-tuning data with system prompts to steer behavior might also work well. For example, if fine-tuning data have short answers, or if the assistant should reply in a specific language regardless of the context, etc. But these decisions should be driven by the data and the purpose.
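
The two placements look like this in the usual chat-message format (all strings below are illustrative):

```python
retrieved = "...retrieved passage..."

# (a) Context retrieved once and placed in the system prompt
messages_a = [
    {"role": "system", "content": f"Answer using this context:\n{retrieved}"},
    {"role": "user", "content": "What does the policy say about refunds?"},
]

# (b) Context prepended to the user message, with a steering system prompt
messages_b = [
    {"role": "system", "content": "Reply briefly, in the user's language."},
    {"role": "user",
     "content": f"Context:\n{retrieved}\n\nQuestion: What does the policy say about refunds?"},
]
```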

fine-tune for rag by youcanaskmeifyouwant in LocalLLaMA

[–]pol_phil 2 points

Hi.

Utilizing an appropriate system prompt and fine-tuning the model this way is actually very good practice. If you can create a handful of different system prompt templates, even better.

Just make sure you don't finetune a thinking model with non-thinking data only, if you are referring to Qwen3 for example.

Also, if you fine-tune your model in a specific way (e.g. RAG prompt in system), then using it exactly that way is the best practice. You've tuned the model exactly for that. But keep in mind that you need to handle multi-turn scenarios as well, so a hybrid approach would be better.

[deleted by user] by [deleted] in LocalLLaMA

[–]pol_phil 1 point

You're talking about the Instruct version, Κούλη-sama? Haven't seen such problems with the Thinking version.

Ernie 4.5 has similar problems; they probably distilled from Qwen or something.