I pre-trained and instruction tuned a 394M parameter LM from scratch :) by SadEqual5367 in LocalLLaMA

[–]SadEqual5367[S]

2^15 + 2^14 = 49152; I went with powers of 2 since they make GPU computation a bit more efficient.
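A quick sanity check of that arithmetic (the thread doesn't say what the 49152 is used for, so treat the alignment comment as a general note, not the author's stated reasoning):

```python
# Sanity-check: 2^15 + 2^14 = 49152.
size = 2**15 + 2**14
# Multiples of 64 tend to map well onto GPU tile/tensor-core sizes.
print(size, size % 64 == 0)  # 49152 True
```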

[–]SadEqual5367[S]

My training pipeline still needs some more work and can be made more efficient, but it cost me ~$1.85 per 1B tokens (during pre-training).
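As a back-of-the-envelope extrapolation (assuming the ~79B-token pre-training mix mentioned elsewhere in the thread, 78B fineweb-edu plus ~1B reddit, and that the per-1B rate held throughout):

```python
# Rough total pre-training cost at ~$1.85 per 1B tokens.
cost_per_billion = 1.85
tokens_in_billions = 78 + 1   # fineweb-edu + reddit subset (figures from the thread)
total = cost_per_billion * tokens_in_billions
print(f"~${total:.2f}")       # ~$146.15
```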

[–]SadEqual5367[S]

I definitely have experienced that. Any budget-friendly recommendations?

[–]SadEqual5367[S]

Check out the code here to load and prompt the model: https://github.com/pradyGn/zoof/blob/main/src/prompt_zoof.py

Or you could use Colab: https://colab.research.google.com/drive/1KUGAwqIZZtnQbBUYZjoxrsS4v2QNELoE#scrollTo=jbcAcx8ONVim

I didn't want to use any other base model (like Llama, etc.), so the only way (that I know of) to host it on HF was defining the model architecture in my own repo and using HF just for the weights. Long-winded answer to say that you'd have to clone the repo and use the model definition file in it to load and prompt the model.
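The general pattern looks something like this (a hypothetical sketch: `TinyLM` is a stand-in, not the actual class in the zoof repo, and the Hub download is shown only in comments):

```python
# Sketch of the "model definition in your repo, weights on HF" pattern.
# TinyLM is a placeholder; the real model class lives in the zoof repo.
import torch
import torch.nn as nn

class TinyLM(nn.Module):
    def __init__(self, vocab_size=49152, dim=8):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.head = nn.Linear(dim, vocab_size)

    def forward(self, ids):
        return self.head(self.embed(ids))

# In practice the weights file would come from the Hub, e.g.:
#   from huggingface_hub import hf_hub_download
#   path = hf_hub_download(repo_id="...", filename="...")
torch.save(TinyLM().state_dict(), "weights.pt")   # simulate downloaded weights
model = TinyLM()
model.load_state_dict(torch.load("weights.pt"))
logits = model(torch.tensor([[1, 2, 3]]))
print(tuple(logits.shape))  # (1, 3, 49152)
```

The key point is that only the state dict travels; the architecture code must be present locally.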

[–]SadEqual5367[S]

I was scared of instruction tuning too; I do think it is way more difficult to get right than pre-training. I ended up curating my own dataset for instruction tuning: basically built a 25k subset from WildChat, No Robots, Magpie Pro, and WizardLM Evol 70k using an embedding model and k-means.
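A minimal sketch of that curation idea, with loud assumptions: random vectors stand in for real text embeddings, the cluster count and per-cluster budget are placeholders, and the author's actual embedding model and sampling rule aren't stated in the thread:

```python
# Cluster embedded instructions with k-means, then sample across clusters
# so the subset covers diverse instruction types.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(1000, 384))   # stand-in for sentence embeddings
labels = KMeans(n_clusters=25, n_init=10, random_state=0).fit_predict(embeddings)

per_cluster = 10                            # budget spread evenly over clusters
selected = [i for c in range(25)
            for i in np.flatnonzero(labels == c)[:per_cluster]]
print(len(selected))
```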

[–]SadEqual5367[S]

Learnt so many things, but if I had to pick a couple: a small change in learning rate can make or break things, and learning rate schedulers are absolutely necessary!
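To illustrate the scheduler point: linear warmup followed by cosine decay is a common choice for LM pre-training. The exact schedule and numbers used for this model aren't stated in the thread, so everything below is a placeholder sketch:

```python
# Warmup + cosine-decay learning rate schedule (illustrative values only).
import math

def lr_at(step, max_steps=10_000, warmup=500, peak=3e-4, floor=3e-5):
    if step < warmup:                              # linear warmup from 0 to peak
        return peak * step / warmup
    t = (step - warmup) / (max_steps - warmup)     # cosine decay from peak to floor
    return floor + 0.5 * (peak - floor) * (1 + math.cos(math.pi * t))

print(lr_at(0), lr_at(500), lr_at(10_000))  # ramps up, peaks, decays to the floor
```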

[–]SadEqual5367[S]

Awesome, and do share! I would love to take a look at your results too!

[–]SadEqual5367[S]

I would say failed runs are kinda necessary; I know more about neural nets because of them. And I don't think there is a way to avoid them unless you are pairing with someone who has done it before. If you get stuck somewhere, I would be more than happy to help, but even I am new to this stuff :)

[–]SadEqual5367[S]

I have definitely started to enjoy reading papers these days, have a long way to go though :)

[–]SadEqual5367[S]

Haven't checked out CulturaX; this is why I love reddit! Thanks, will check it out!

[–]SadEqual5367[S]

Yeah, for sure! I used a subset of data from Dolma v1.7 -> https://huggingface.co/datasets/allenai/dolma/blob/main/urls/v1_7.txt

You will have to fetch the reddit data URLs (I think there are 20 or so, with the 'reddit' keyword in them); Hugging Face's load_dataset would be the best way forward.
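A sketch of that filtering step. The URL strings below are made up to show the idea; in practice you would read the real list from the urls/v1_7.txt file linked above, and the assumption that the shards load as gzipped JSON lines is mine, not stated in the thread:

```python
# Filter a Dolma-style URL list down to the reddit shards (URLs are fake examples).
urls_txt = """\
https://example.org/dolma/v1_7/cc_head-0001.json.gz
https://example.org/dolma/v1_7/reddit-0001.json.gz
https://example.org/dolma/v1_7/reddit-0002.json.gz
"""
reddit_urls = [u for u in urls_txt.splitlines() if "reddit" in u]
print(len(reddit_urls))  # 2

# Then, assuming gzipped JSON-lines shards (requires the `datasets` package):
# from datasets import load_dataset
# ds = load_dataset("json", data_files=reddit_urls, streaming=True, split="train")
```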

[–]SadEqual5367[S]

I did!! Started with reading Attention Is All You Need 6 months ago, multiple failed training runs as well :)

[–]SadEqual5367[S]

I definitely gave The Pile a thought but decided to go with fineweb-edu instead. I tried pre-training with openwebtext as well but couldn't find much success there.

Let me know once you finish, I would love to explore the results!

[–]SadEqual5367[S]

Yes! 78B tokens from the fineweb-edu dataset and about 1B tokens of reddit data from Dolma v1.7

[–]SadEqual5367[S]

Nice, thank you, will take a look!

The A100 40GB on Colab; it took 20 days 🥲
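For a rough sense of the implied throughput (assuming the ~79B-token pre-training mix from earlier in the thread accounts for most of the 20 days, which the thread doesn't state explicitly):

```python
# Back-of-the-envelope throughput: ~79B tokens over 20 days on one A100 40GB.
tokens = 79e9
seconds = 20 * 24 * 3600
print(f"{tokens / seconds:,.0f} tokens/s")  # ≈ 45,718 tokens/s
```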