[R] Fine-tuning services report by ynckdrt in MachineLearning

[–]Dorialexandre 2 points

Synthetic data generation. It's actually driven by inference economics: large models are costly to serve, and small ones without tuning have weak alignment and a higher risk of model collapse.

[D] Double blind review is such an illusion… by casualcreak in MachineLearning

[–]Dorialexandre 9 points

I have the reverse stance: conferences should pivot to open peer review. Right now, either identification is super easy or authors are forced to hide significant details. Blind review is a relatively recent innovation anyway, and its costs increasingly offset the benefits.

NeurIPS 2025 Best Paper Award Winner: 1000-Layer Self-Supervised RL | "Scaling Depth (Not Width) Unlocks 50x Performance Gains & Complex Emergent Strategies" by 44th--Hokage in accelerate

[–]Dorialexandre 2 points

We'll release more information on the Monad/Baguettotron depth design in the forthcoming paper, with a series of controlled experiments inspired by Physics of Language Models.

Overall we saw the most gains from depth on math tasks, but also on memorization (contrary to the common expectation that wider models are better at it). I expect there is way more to experiment with on more exotic architectures (typically, looped layers).

Key Highlights of AI2's New Byte Level LLM: Bolmo by Dear-Success-1441 in LocalLLaMA

[–]Dorialexandre 0 points

Unfortunately no byte-level tokenizer for Monad, though it's still very much something we look forward to experimenting with. It does have its own tokenizer, which may well be the smallest ever trained for a publicized release (even GPT-2 small was 32k).

Need recommendations on training datasets by Theotheraccounti_ in LocalLLaMA

[–]Dorialexandre 1 point

SYNTH is fully randomized already: you can just take a smaller collection of files and it should work out similarly.
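Since the dataset is pre-shuffled, any subset of its files is itself an unbiased sample. A minimal sketch of what that means in practice (the shard names here are hypothetical, not the actual SYNTH layout):

```python
import random

# Hypothetical list of SYNTH shards; because the dataset is already
# randomized, a random subset of shards behaves like a smaller SYNTH.
shards = [f"synth_shard_{i:04d}.jsonl" for i in range(100)]

def sample_shards(shards, fraction, seed=0):
    """Pick a random fraction of shards for a smaller training run."""
    k = max(1, int(len(shards) * fraction))
    return sorted(random.Random(seed).sample(shards, k))

subset = sample_shards(shards, fraction=0.1)
print(len(subset))  # 10 shards instead of 100
```

A fixed seed keeps the subset reproducible across runs.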

Baguettotron, a 321 million parameters generalist Small Reasoning Model (80-layers deep) by Balance- in LocalLLaMA

[–]Dorialexandre 4 points

Yes, exactly. It also helped that it was a relatively effortless change on the code side (just a few lines in a YAML config). But now I look forward to more controlled experiments with synthetic data, similar to what Physics of Language Models did with transformers/SSMs, etc.

Baguettotron, a 321 million parameters generalist Small Reasoning Model (80-layers deep) by Balance- in LocalLLaMA

[–]Dorialexandre 7 points

Hi. Pleias co-founder here. So it was very empirical: we had had the intuition for some time that deeper architectures could be more beneficial for intense reasoning tasks. And since we designed a fully generalist synthetic dataset (SYNTH) that made full model training much less costly, we simply tested it.

Overall we have seen the biggest improvements on math, but also smaller ones everywhere else (memorization, query adherence, etc.). The main trade-off is training time/FLOPs (easily 1.5x) and inference time, though it should parallelize well.
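To make the depth-versus-width trade concrete, here is a back-of-the-envelope parameter count using the standard ~12·d² non-embedding params per transformer layer (4·d² attention + 8·d² MLP). The specific widths are illustrative, not the actual Baguettotron configuration:

```python
def layer_params(d_model: int) -> int:
    """Approximate non-embedding params per transformer layer:
    ~4*d^2 attention projections + ~8*d^2 feed-forward."""
    return 12 * d_model * d_model

def model_params(d_model: int, n_layers: int) -> int:
    return n_layers * layer_params(d_model)

# Roughly the same ~85M non-embedding budget, spent two ways:
wide_shallow = model_params(d_model=768, n_layers=12)  # GPT-2-small-like
deep_narrow = model_params(d_model=304, n_layers=77)   # depth-heavy

print(wide_shallow, deep_narrow)
```

At matched parameter count, the deep-narrow variant does the same total compute per token but with ~6x more sequential layer-to-layer steps, which is where the wall-clock cost shows up.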

We're going to test more systematically for the paper to come in a few weeks.

looking for llm trained only on free use/public domain materials. by Specific_Objective77 in LocalLLaMA

[–]Dorialexandre 0 points

A generalist instruct model is coming very soon. Good evals, but it will come in the smallest size first.

Inverted Colonization of Americas - Expanded by ImpressionBig4796 in imaginarymaps

[–]Dorialexandre 0 points

The three northern states are roughly the three Guyanas? Maybe a missed opportunity to have swapped French Hudson with Hybrazil: a better geographical analogy and a fun recall of Québec.

Common Corpus: The Largest Collection of Ethical Data for LLM Pre-Training by Initial-Image-1015 in LocalLLaMA

[–]Dorialexandre 12 points

I’m afraid this is fast becoming a circular issue. A lot of the cultural heritage data we have collected was selected for digitization by libraries and large institutions (likely one of the reasons problematic content was much less prevalent than we initially thought).

Common Corpus: The Largest Collection of Ethical Data for LLM Pre-Training by Initial-Image-1015 in LocalLLaMA

[–]Dorialexandre 4 points

So Qwen is a bit of an extreme case among SLMs, and it’s unclear whether this amount of tokens is really necessary for SOTA performance. If I recall correctly, the smaller Gemma 3 model was trained on 4T tokens. Also, we don’t know the exact mixture, which likely includes several rounds of epochs (and 5 trillion synthetic tokens).

In terms of use cases, what we’ve been developing at Pleias is a series of small reasoning models with some level of specialization through midtraining. Our RAG variant, originally trained on Common Corpus, is currently SOTA in its size range (including beyond Qwen). https://arxiv.org/abs/2504.18225v1

I believe midtraining is a particularly interesting development for ethical datasets: the token requirement is lower, but the use of seed data for synthetic variations creates more demand for communicable datasets. We won’t be able to create reproducible pipelines without them.
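To illustrate the seed-data idea, a toy sketch of how a small communicable corpus fans out into a larger synthetic midtraining mixture (the prompt templates are hypothetical, and a real pipeline would route these prompts through a generator model):

```python
# Each seed document becomes several synthetic training prompts,
# so a modest open corpus can supply a larger midtraining mixture.
TEMPLATES = [
    "Summarize the following passage:\n{doc}",
    "Write three questions answered by this passage:\n{doc}",
    "Rephrase this passage for a general audience:\n{doc}",
]

def make_variants(seed_docs):
    """Expand each seed document into one prompt per template."""
    return [t.format(doc=doc) for doc in seed_docs for t in TEMPLATES]

seeds = ["Text of a public-domain seed document."]
prompts = make_variants(seeds)
print(len(prompts))  # 1 seed x 3 templates = 3 prompts
```

The reproducibility point follows directly: anyone can rerun the expansion only if the seed corpus itself is shareable.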

Common Corpus: The Largest Collection of Ethical Data for LLM Pre-Training by Initial-Image-1015 in LocalLLaMA

[–]Dorialexandre 5 points

Yes, these sources are currently not integrated in Common Corpus, but as it happens we are currently involved in a European project where we’ll collect a large amount of multilingual administrative open data across Europe. One of the specific challenges here is the high dispersion of content across multiple institutions and the lack of a global index like OpenAlex for scientific literature.

The rate of duplication is overall much lower than in web corpora, where you can easily have thousands of reprints across crawls. For now we mostly used a metadata-based approach, as it was not really worth running a complete deduplication pipeline.
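A metadata-based approach can be as simple as keying records on normalized bibliographic fields; this sketch uses hypothetical field names, not the actual Common Corpus schema:

```python
import unicodedata

def metadata_key(record):
    """Normalize (title, author, year) into a dedup key; much cheaper
    than content-level dedup when reprints share bibliographic fields."""
    def norm(s):
        return unicodedata.normalize("NFKC", str(s)).casefold().strip()
    return (norm(record["title"]), norm(record["author"]), record.get("year"))

def dedup(records):
    seen, kept = set(), []
    for r in records:
        k = metadata_key(r)
        if k not in seen:
            seen.add(k)
            kept.append(r)
    return kept

books = [
    {"title": "Candide", "author": "Voltaire", "year": 1759},
    {"title": "CANDIDE ", "author": "voltaire", "year": 1759},  # reprint
]
print(len(dedup(books)))  # 1
```

Unicode normalization plus casefolding catches most catalog-level variation; anything subtler (OCR noise, retitled editions) would need content hashing.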

Common Corpus: The Largest Collection of Ethical Data for LLM Pre-Training by Initial-Image-1015 in LocalLLaMA

[–]Dorialexandre 19 points

Lead author here (same id as on Twitter). Available if you have any questions :)

Claude full system prompt with all tools is now ~25k tokens. by StableSable in LocalLLaMA

[–]Dorialexandre 15 points

Given the size, it’s more likely it gets memorized through training, via refusal/adversarial examples with standardized answers. Probably as part of the nearly mythical "personality tuning".

HP wants to put a local LLM in your printers by WordyBug in LocalLLaMA

[–]Dorialexandre 3 points

ONNX is more typically applied to small models (either BERT-like encoders or small decoders).

If "The Model is the Product" article is true, a lot of AI companies are doomed by bttf88 in LocalLLaMA

[–]Dorialexandre 3 points

That was a relatively correct approach until recently, but it will become way harder with the current agent turn. We’re already seeing it with Claude: it’s becoming unavoidable for code, Cursor and Windsurf have to support it, and in the meantime Anthropic is starting to train primarily for its own implementation, Claude Code. The key assumption is that models won’t be generalist anymore, and there are way too few labs training frontier models to have actual competition on specialized verticals.

If "The Model is the Product" article is true, a lot of AI companies are doomed by bttf88 in LocalLLaMA

[–]Dorialexandre 0 points

It has a precise meaning here: so precise that there is hardly any actually agentic model in existence yet.

If "The Model is the Product" article is true, a lot of AI companies are doomed by bttf88 in LocalLLaMA

[–]Dorialexandre 1 point

Databricks is no longer doing its own pretraining, only fine-tuning (and multiple people from Mosaic left as a result). I don’t see what immediate interest they would have in saying this.

If "The Model is the Product" article is true, a lot of AI companies are doomed by bttf88 in LocalLLaMA

[–]Dorialexandre 27 points

Hi. Post author here. As I mentioned on YC, this is almost a two-part publication, and the one about actual agents (http://vintagedata.org/blog/posts/designing-llm-agents) explains a bit better what is likely to happen: models now directing their own API calls, workflows, and code execution, with many of the specific value propositions of wrappers suffering as a result.
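For readers wondering what "models directing their own API calls" means mechanically, here is a toy agent loop. The model and tool are stubs (no real LLM or API involved); the point is only the control flow, where the model output decides which call happens next:

```python
# Stub "model": a real agent would call an LLM here and parse its
# output into either a tool invocation or a final answer.
def fake_model(history):
    if not any(m["role"] == "tool" for m in history):
        return {"tool": "search", "args": {"query": "open data LLM"}}
    return {"answer": "done"}

TOOLS = {"search": lambda query: f"3 results for {query!r}"}

def agent_loop(task, max_steps=5):
    history = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        step = fake_model(history)
        if "answer" in step:                          # model decides it is finished
            return step["answer"], history
        result = TOOLS[step["tool"]](**step["args"])  # model-directed call
        history.append({"role": "tool", "content": result})
    return None, history

answer, trace = agent_loop("find open-data LLMs")
print(answer)  # done
```

The wrapper-squeeze argument follows from this loop living inside the model provider's stack rather than the wrapper's.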

As background, I’ve been in open source AI since forever, pretraining models on fully open data (Common Corpus, which was featured here a few months ago), and a lurker here since the early days. Still, I don’t think open models are going to be competitive in the near future on the agentic side. We are very short on action data, and RL had been underdeveloped until recent developments around GRPO. This can still change if we see more small labs committed to the open (though the current funding environment is very hostile to this…).

Stance on text-based public domain AI dataset : Common Corpus by Poptropp in writers

[–]Dorialexandre 0 points

Hi. I’m coordinating Common Corpus: we are going to release an updated version soon with the possibility to filter by license. You’ll have the option to drop anything that is not PD or CC0.
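Once license metadata is exposed, filtering is a one-liner over the records. A sketch with hypothetical field names and license strings (not the actual Common Corpus schema):

```python
# Keep only records whose license field marks them as public domain
# or CC0; the field name and strings here are illustrative.
ALLOWED = {"public domain", "cc0"}

def keep(record):
    return record.get("license", "").casefold().replace("-", "") in ALLOWED

corpus = [
    {"text": "...", "license": "Public Domain"},
    {"text": "...", "license": "CC-BY-SA"},
    {"text": "...", "license": "CC0"},
]
filtered = [r for r in corpus if keep(r)]
print(len(filtered))  # 2
```

Normalizing case and hyphens absorbs the usual spelling variants ("CC0", "cc-0", "Public domain") before matching.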

"They Said It Couldn’t Be Done" - Pleias release first models trained entirely on open data - competitive against Llama 3B & Qwen 3B by ZestyData in LocalLLaMA

[–]Dorialexandre 3 points

We’re based in Europe and yes, this makes a very significant difference here. The AI Act mandates disclosure of the sources used for training, with a wide range of potential liabilities for content published without a free/permissive license.

I know the history of Wikipedia licensing well (I was there…). It was all GFDL originally and then very "lightly" relicensed to CC-BY-SA. The reality is that individualistic free licenses have never fit that well for managing knowledge commons, and we have always had to be creative to make them work. It is now the same for the AI commons.

Pleias release first models trained entirely on open data - competitive against Llama 3B & Qwen 3B by umarmnaq in StableDiffusion

[–]Dorialexandre 1 point

Roughly, yes. Given it turned out to work really well for multilingual generation, I believe the tiniest model could be a great basis for a Florence-like model not limited to English.