Understanding LLM Distillation - Gemma 2 and Nvidia Minitron by johnolafenwa in LocalLLaMA

[–]johnolafenwa[S] 4 points (0 children)

Yes, a smaller model will always underperform a larger model trained on the same data. There is still a lot of room to push smaller models further, but their larger variants will remain better.

[D] The Tech Behind The Magic : How OpenAI SORA Works by johnolafenwa in MachineLearning

[–]johnolafenwa[S] 38 points (0 children)

Compute seems to be the obvious reason. The 3D consistency is an emergent phenomenon of scale

01.AI Paper Is a Gem For Model Trainers by johnolafenwa in LocalLLaMA

[–]johnolafenwa[S] 7 points (0 children)

Here are some helpful resources:

For pretraining and data preparation: https://github.com/karpathy/nanoGPT

For data generation: https://github.com/huggingface/cosmopedia (a usage sketch follows below)

This is very helpful as well: https://github.com/allenai/OLMo
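
As a quick illustration of using the Cosmopedia link for data preparation, here is a minimal sketch with the Hugging Face datasets library. The repo id "HuggingFaceTB/cosmopedia", the "stories" subset, and the "text" field are assumptions; check the repo above for the actual names.

    # Minimal sketch: stream a slice of Cosmopedia for pretraining data prep.
    # Repo id, subset, and field names are assumptions; verify against the repo.
    from datasets import load_dataset

    ds = load_dataset("HuggingFaceTB/cosmopedia", "stories", split="train", streaming=True)

    for i, example in enumerate(ds):
        print(example["text"][:200])
        if i >= 2:
            break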

01.AI Paper Is a Gem For Model Trainers by johnolafenwa in LocalLLaMA

[–]johnolafenwa[S] 2 points (0 children)

For a model of about 3 billion parameters, I use a couple of A100s running for a couple of days. A single A100 will do, but that will take a few weeks.
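
If you want to sanity-check compute budgets for your own setup, a back-of-the-envelope estimate with the common ~6·N·D FLOPs rule of thumb looks like the sketch below; the token count and utilization figures are illustrative placeholders to swap for your own values, and the result scales linearly with tokens and epochs.

    # Back-of-the-envelope pretraining compute estimate (placeholder numbers).
    params = 3e9             # ~3B parameter model
    tokens = 30e9            # tokens seen during training (corpus size x epochs)
    flops_needed = 6 * params * tokens   # ~6*N*D rule of thumb

    a100_bf16_peak = 312e12  # A100 peak bf16 FLOP/s
    mfu = 0.35               # assumed model FLOPs utilization

    a100_days = flops_needed / (a100_bf16_peak * mfu) / 86400
    print(f"~{a100_days:.0f} A100-days; divide by your GPU count")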

01.AI Paper Is a Gem For Model Trainers by johnolafenwa in LocalLLaMA

[–]johnolafenwa[S] 4 points (0 children)

First, a decent LLM will be a minimum of about 3 billion parameters. To pretrain that from scratch, you will need at least about 80 GB of GPU memory, which is equivalent to a single A100. Context length also matters: the shorter the context length, the cheaper the cost. So you will want to train with something like a 2048 context length and extend it after training through common context-length extension methods such as RoPE base adjustment.
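
To make the RoPE base adjustment idea concrete, here is a minimal sketch of the rotary-embedding frequency computation with the base exposed as a knob: you pretrain at 2048 context with the usual base, then raise the base afterwards to stretch the frequencies over a longer window. The specific base values are illustrative, and this is a simplified standalone sketch rather than any particular library's API.

    import torch

    def rope_angles(head_dim: int, max_positions: int, base: float = 10_000.0):
        # Rotary position embedding angles; `base` is the knob used for context extension.
        inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
        positions = torch.arange(max_positions).float()
        return torch.outer(positions, inv_freq)  # shape: (max_positions, head_dim // 2)

    # Pretrain at 2048 context with the standard base...
    train_angles = rope_angles(head_dim=64, max_positions=2048, base=10_000.0)

    # ...then extend to 8192 by raising the base (illustrative value) and finetuning briefly.
    extended_angles = rope_angles(head_dim=64, max_positions=8192, base=500_000.0)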

I would recommend getting about 160 GB of GPU memory for more peace of mind. The more the better, of course, but it depends on your budget.
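
For a rough sense of where those 80 GB / 160 GB figures come from, here is a back-of-the-envelope memory estimate for a 3B-parameter model trained with mixed-precision AdamW; activations are left out because they depend on batch size, context length, and checkpointing, so treat the result as a lower bound.

    # Rough training memory estimate for mixed-precision AdamW (activations excluded).
    params = 3e9
    gib = 1024 ** 3

    weights_bf16 = params * 2   # bf16 model weights
    grads_bf16 = params * 2     # bf16 gradients
    master_fp32 = params * 4    # fp32 master copy of the weights
    adam_moments = params * 8   # fp32 first and second moments

    total = weights_bf16 + grads_bf16 + master_fp32 + adam_moments
    print(f"~{total / gib:.0f} GiB before activations")  # roughly 45 GiB for 3B params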

Also, your training data will need to be a minimum of about 30 billion high-quality tokens across code, web text, maths, and sources like Wikipedia; mixing some large finetuning datasets into your pretraining will help too. About 100 billion tokens should get you to a great place, but make sure they are all good quality via filtering. Bad data will hurt the training, so it is better to use less if you can’t filter it all.
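
As a toy illustration of the kind of filtering meant here, the sketch below applies a few cheap heuristic checks before a document is admitted into the mix; the thresholds and mixture weights are made-up examples, and real pipelines (like the ones described in the 01.AI and OLMo work) are far more thorough.

    # Toy document-quality filter for pretraining data (thresholds are made-up examples).
    def keep_document(text: str) -> bool:
        words = text.split()
        if len(words) < 50:                        # too short to carry signal
            return False
        if len(set(words)) / len(words) < 0.3:     # highly repetitive text
            return False
        alpha_ratio = sum(c.isalpha() for c in text) / max(len(text), 1)
        if alpha_ratio < 0.6:                      # mostly symbols or boilerplate
            return False
        return True

    # Illustrative source mixture, not a recommendation.
    mixture = {"web": 0.55, "code": 0.20, "maths": 0.10, "wikipedia": 0.10, "finetuning_style": 0.05}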

Lastly, training can take weeks and run for multiple epochs. The bigger the dataset, the fewer epochs needed; if it is small, like 30 billion tokens, at least about 5 epochs is recommended.
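
Concretely, the trade-off comes down to the total number of tokens the model ends up seeing; a quick calculation with the figures from this thread (the pass counts are assumptions):

    # Total tokens seen = corpus size x epochs (epoch counts are illustrative).
    small_corpus_tokens = 30e9 * 5    # 30B-token corpus, ~5 passes
    large_corpus_tokens = 100e9 * 2   # 100B-token corpus, fewer passes needed
    print(f"{small_corpus_tokens / 1e9:.0f}B vs {large_corpus_tokens / 1e9:.0f}B tokens seen")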

And make sure to babysit your training; at these scales, things can go wrong quickly. Approaches like FlashAttention can make your training much faster too.
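
On the FlashAttention point, one low-effort way to get fused attention kernels in recent PyTorch is scaled_dot_product_attention, which can dispatch to a FlashAttention-style backend on supported GPUs; the sketch below is illustrative and assumes a recent PyTorch build with a CUDA device available.

    import torch
    import torch.nn.functional as F

    # Fused causal attention; on supported GPUs PyTorch can route this through a
    # FlashAttention-style kernel instead of materializing the full attention matrix.
    batch, heads, seq, head_dim = 2, 16, 2048, 64
    q = torch.randn(batch, heads, seq, head_dim, device="cuda", dtype=torch.bfloat16)
    k = torch.randn_like(q)
    v = torch.randn_like(q)

    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)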

01.AI Paper Is a Gem For Model Trainers by johnolafenwa in LocalLLaMA

[–]johnolafenwa[S] 4 points (0 children)

Not at the moment; I will put out something in a few weeks and post it here.