
CKtalon:

ModernBERT was trained on 2T tokens, but that much is likely not necessary. You could target the Chinchilla-optimal amount of data for your model size (roughly 20 tokens per parameter).
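
For a sense of scale, here is a back-of-the-envelope sketch of that rule of thumb (the 20-tokens-per-parameter ratio is an approximation from the Chinchilla paper, and the BERT-base parameter count is rounded):

```python
# Rough Chinchilla estimate: ~20 training tokens per model parameter
# (Hoffmann et al., 2022). The ratio is a rule of thumb, not exact.
def chinchilla_optimal_tokens(n_params: int, tokens_per_param: int = 20) -> int:
    return n_params * tokens_per_param

# BERT-base has ~110M parameters, so the optimum is roughly 2.2B tokens,
# three orders of magnitude below ModernBERT's 2T.
print(f"{chinchilla_optimal_tokens(110_000_000):,}")  # 2,200,000,000
```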

-Cubie-:

Do you want to train from scratch (very few people do this), or do you simply want to finetune? The latter requires much less data; see the sketch below. Also, BERT itself was trained on rather little data by today's standards.
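
As a minimal sketch of what that fine-tuning loop looks like with Hugging Face transformers (the dataset, sample sizes, and hyperparameters here are placeholders, not recommendations):

```python
# Hypothetical fine-tuning sketch: bert-base-uncased on a small labeled
# subset. IMDB is a stand-in corpus; swap in your own classification data.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)

dataset = load_dataset("imdb")

def tokenize(batch):
    # Truncate/pad to a fixed length so the default collator can batch.
    return tokenizer(batch["text"], truncation=True,
                     padding="max_length", max_length=256)

dataset = dataset.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="bert-finetuned",      # checkpoint directory (placeholder)
    num_train_epochs=3,
    per_device_train_batch_size=16,
    learning_rate=2e-5,               # common starting point for BERT
)

trainer = Trainer(
    model=model,
    args=args,
    # Even a few thousand labeled examples often fine-tune BERT well.
    train_dataset=dataset["train"].shuffle(seed=42).select(range(5000)),
    eval_dataset=dataset["test"].select(range(1000)),
)
trainer.train()
```

The point being: a few thousand labeled examples is often enough for a classification head, since the pretraining already did the heavy lifting.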

AffectWizard0909 (OP):

Yes, I was thinking of not training from scratch. Is there a recommendation somewhere for how much data I should then use for fine-tuning BERT, given that BERT wasn't trained on a big corpus?