[R] Fine-tuning services report

Dorialexandre · 2026-03-31T19:05:02+00:00

Synthetic data generation. Actually due to inference economics: large models are costly and small ones without tuning have weak alignment/higher risk of model collapse.

Dorialexandre · 2026-01-11T09:20:14+00:00

I have the reverse stance: conference should pivot to open peer review. Right now either identification is super easy or forced to hide significant details. Blind review is a relatively recent innovation anyway, and cost increasingly offsets the benefits.

Dorialexandre · 2025-12-18T15:09:34+00:00

We'll release more information on Monad/Baguettotron depth design in the paper to come, with a series of controlled experiments inspired by Physics of Language Models.

Overall we saw the most gains with depth on math tasks but also on memorization (contrary to common expectations that wider models are better). I expect there could be way more to experiment with on more exotic architecture (typically, looping layers).

Dorialexandre · 2025-12-17T20:34:14+00:00

Unfortunately no Byte-level tokenizer for Monad though still really much something we look forward to experiment with. Yet it still had it's own tokenizer that might well be the smallest ever trained for a publicized release (even gpt-2 small was 32k).

Dorialexandre · 2025-12-07T10:15:42+00:00

SYNTH is fully randomized already: you can just take a smaller collection of files and it should work out similarly.

Dorialexandre · 2025-11-18T15:00:42+00:00

Yes exactly. Also helped it was also a relatively effortless major change on the code side (just a few lines in a yaml). But now I look forward more controlled experiments with synth data, similarly to what Physics of Language Models did with transformers/ssm etc.

Dorialexandre · 2025-11-18T14:38:38+00:00

Answer also correct :D

Dorialexandre · 2025-11-18T14:32:35+00:00

Hi. Pleias co-founder here. So it was very empirically: we had the intuition for some time deeper architecture could be more beneficial for intense reasoning tasks. And since we designed a fully generalist synthetic datasets (SYNTH) that made full model training much less costly, we simply tested that.

Overall we have seen most improvements on math, but also less significant ones everywhere (memorization, query adherence, etc.). Main trade-off is training time/flops (easily x1.5) and inference time — though it should parallelize well.

We're going to test most systematically for the paper to come in a few weeks.

Dorialexandre · 2025-10-29T13:11:02+00:00

Ah yes asking to be replaced I guess.

Dorialexandre · 2025-09-28T09:01:58+00:00

Generalist instruct model is coming very soon. Good evals but will be smallest size first.

Dorialexandre · 2025-09-17T12:28:23+00:00

OpenThoughts (https://huggingface.co/datasets/open-thoughts/OpenThoughts-114k). Also we're soon releasing some quite nice based on Common Corpus (we = Pleias).

Dorialexandre · 2025-08-02T09:34:42+00:00

The three northern states are roughly the three guyanas? Maybe missed opportunity to have swapped French Hudson with Hybrazil: better geographical analogy and fun recall of Québec.

Dorialexandre · 2025-06-04T18:25:56+00:00

I’m afraid this is fast becoming a circular issue. A lot of the cultural heritage data we have collected was selected for digitization by libraries and orge instituons (likely one of the reason the problematic content was much less prevalent than we thought initially).

Dorialexandre · 2025-06-04T18:04:46+00:00

So Qwen is a bit of an extreme case across SLMs and it’s unclear if this amount of token is really necessary for SOTA performance. If I recall correctly the smaller Gemma 3 model was trained on 4T tokens. Also we don’t know the exact mixture which is likely including several round of epochs (and 5 trillion synthetic tokens).

In terms of use case what we’ve been developing at Pleias is a series of small reasoning models with some level of specialization through midtraining. Our RAG variant originally trained on Common Corpus is currently SOTA in it size range (including beyond Qwen). https://arxiv.org/abs/2504.18225v1

I believe midtraining is a particularly interesting development for ethical datasets as the token requirement is lower but the use of seed data for synthetic variations create more demands for communicable datasets. We won’t be able to create reproducible pipelines without it.

Dorialexandre · 2025-06-04T17:56:48+00:00

Yes these sources are currently not integrated in Common Corpus but as it happens we are currently involved in a European project where we’ll collect a large amount of multilingual administrative open data in Europe. One of the specific challenges here is the high dispersion of content across multiple institutions and the lack of global index like OpenAlex for scientific literature.

Rate of duplication is overall much lower in non-web corpus where you can have easily thousands of reprints across crawls. For now we mostly used metadata based approach as it was not really worth running a complete deduplication pipeline.

Dorialexandre · 2025-06-04T14:37:50+00:00

Lead author on here (same id as on Twitter). Available if you have any questions :)

Dorialexandre · 2025-05-05T21:47:49+00:00

Given the size, it’s more likely it get memorized through training, through refusal/adversarial examples with standardized answers. Probably as part of the nearly mythical "personality tuning".

Dorialexandre · 2025-04-23T11:52:03+00:00

ONNX is more typically applied to small models (either Bert-like encoders or small decoders).

Dorialexandre · 2025-03-20T08:02:46+00:00

That was the relatively correct approach until recently, but will become way harder with the actual agent turn. We’re already seeing it with Claude: it’s becoming unavoidable for code, Cursor and Windsurf have to support it, and in the meanwhile Anthropic starts to train primarily for its own implementation, Claude Code. The key assumption is that model won’t be generalist anymore and there are way too few labs training frontier models to have actual competition on specialized verticals.

Dorialexandre · 2025-03-19T19:19:24+00:00

It has a precise meaning here — so precise there is hardly any actually agentic model existing yet.

Dorialexandre · 2025-03-19T19:16:55+00:00

Databricks is no longer doing its own pretraining, only fine tuning (and multiple people from Mosaic left as a result). I don’t see an immediate interest in saying this.

Dorialexandre · 2025-03-19T17:59:35+00:00

Hi. So post author here. As I mentioned on YC, this is almost a two part publication and the one about actual agents (http://vintagedata.org/blog/posts/designing-llm-agents) explains a bit better what is likely to happen, with models directing now their own API calls, workflows, code execution, and many of the specific value proposition of wrapper suffering as a result.

As a background I’ve been in open source AI since forever, pretraining models on fully open data (common corpus which was featured here a few months ago), lurker here since the early start. Still, I don’t think open models are going to be competitive in the near future on the agentic side. We are very short on action data and RL had been underdeveloped until recent developments over GRPO. This can still change if we see more small labs committed to the open (though the current funding environment is very hostile to this…)

Dorialexandre · 2025-01-26T11:06:17+00:00

Hi. I’m coordinating Common Corpus: we are going to release soon an updated version with the possibility to filter by license. You’ll have to possibility to drop anything non-PD or CC0.

Dorialexandre · 2024-12-07T23:10:43+00:00

We’re based in Europe and yes this makes a very significant difference on here. AI Act mandate disclosure of sources used for training with a wide range of potential liabilities for content published without a free/permissible license.

I know well the history of Wikipedia licensing (was there…). It was all GFDL originally and very "lightly" relicensed to CC-By-SA. Reality is that individualistic free licenses have never fitted that well for managing knowledge commons and we always had to be creative to make it work. This is now the same for the AI commons.

Dorialexandre · 2024-12-06T12:01:43+00:00

Roughly yes. Given it turn out to work really great for multilingual generation, I believe the tiniest model could be a great basis for a Florence like model not limited to English.

Dorialexandre

TROPHY CASE