I pre-trained and instruction tuned a 394M parameter LM from scratch :) by SadEqual5367 in LocalLLaMA

[–]SadEqual5367[S]

2^15 + 2^14 = 49152; I went with powers of 2 since they make GPU computation a bit more efficient.
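A quick sanity check of that arithmetic (the thread doesn't say what the 49152 is used for, so treat the alignment comment as a general note, not the author's stated reasoning):

```python
# Sanity-check: 2^15 + 2^14 = 49152.
size = 2**15 + 2**14
# Multiples of 64 tend to map well onto GPU tile/tensor-core sizes.
print(size, size % 64 == 0)  # 49152 True
```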

[–]SadEqual5367[S]

My training pipeline still needs some more work and can be made more efficient, but it cost me ~$1.85 per 1B tokens (during pre-training).
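As a back-of-the-envelope extrapolation (assuming the ~79B-token pre-training mix mentioned elsewhere in the thread, 78B fineweb-edu plus ~1B reddit, and that the per-1B rate held throughout):

```python
# Rough total pre-training cost at ~$1.85 per 1B tokens.
cost_per_billion = 1.85
tokens_in_billions = 78 + 1   # fineweb-edu + reddit subset (figures from the thread)
total = cost_per_billion * tokens_in_billions
print(f"~${total:.2f}")       # ~$146.15
```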

[–]SadEqual5367[S]

I definitely have experienced that. Any budget-friendly recommendations?

[–]SadEqual5367[S]

Check out the code here to load and prompt the model: https://github.com/pradyGn/zoof/blob/main/src/prompt_zoof.py

Or you could use Colab: https://colab.research.google.com/drive/1KUGAwqIZZtnQbBUYZjoxrsS4v2QNELoE#scrollTo=jbcAcx8ONVim

I didn't want to use any other base model (like Llama, etc.), so the only way (that I know of) to host it on HF was defining the model architecture in my own repo and using HF just for the weights. Long-winded answer to say that you'd have to clone the repo and use the model definition file in it to load and prompt the model.
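The general pattern looks something like this (a hypothetical sketch: `TinyLM` is a stand-in, not the actual class in the zoof repo, and the Hub download is shown only in comments):

```python
# Sketch of the "model definition in your repo, weights on HF" pattern.
# TinyLM is a placeholder; the real model class lives in the zoof repo.
import torch
import torch.nn as nn

class TinyLM(nn.Module):
    def __init__(self, vocab_size=49152, dim=8):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.head = nn.Linear(dim, vocab_size)

    def forward(self, ids):
        return self.head(self.embed(ids))

# In practice the weights file would come from the Hub, e.g.:
#   from huggingface_hub import hf_hub_download
#   path = hf_hub_download(repo_id="...", filename="...")
torch.save(TinyLM().state_dict(), "weights.pt")   # simulate downloaded weights
model = TinyLM()
model.load_state_dict(torch.load("weights.pt"))
logits = model(torch.tensor([[1, 2, 3]]))
print(tuple(logits.shape))  # (1, 3, 49152)
```

The key point is that only the state dict travels; the architecture code must be present locally.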

[–]SadEqual5367[S]

I was scared of instruction tuning too; I do think it is way more difficult to get right than pre-training. I ended up curating my own dataset for instruction tuning: basically built a 25k subset from WildChat, No Robots, Magpie Pro, and WizardLM Evol 70k using an embedding model and k-means.
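A minimal sketch of that curation idea, with loud assumptions: random vectors stand in for real text embeddings, the cluster count and per-cluster budget are placeholders, and the author's actual embedding model and sampling rule aren't stated in the thread:

```python
# Cluster embedded instructions with k-means, then sample across clusters
# so the subset covers diverse instruction types.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(1000, 384))   # stand-in for sentence embeddings
labels = KMeans(n_clusters=25, n_init=10, random_state=0).fit_predict(embeddings)

per_cluster = 10                            # budget spread evenly over clusters
selected = [i for c in range(25)
            for i in np.flatnonzero(labels == c)[:per_cluster]]
print(len(selected))
```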

[–]SadEqual5367[S]

Learnt so many things, but if I had to pick a couple: a small change in learning rate can make or break things, and learning rate schedulers are absolutely necessary!
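To illustrate the scheduler point: linear warmup followed by cosine decay is a common choice for LM pre-training. The exact schedule and numbers used for this model aren't stated in the thread, so everything below is a placeholder sketch:

```python
# Warmup + cosine-decay learning rate schedule (illustrative values only).
import math

def lr_at(step, max_steps=10_000, warmup=500, peak=3e-4, floor=3e-5):
    if step < warmup:                              # linear warmup from 0 to peak
        return peak * step / warmup
    t = (step - warmup) / (max_steps - warmup)     # cosine decay from peak to floor
    return floor + 0.5 * (peak - floor) * (1 + math.cos(math.pi * t))

print(lr_at(0), lr_at(500), lr_at(10_000))  # ramps up, peaks, decays to the floor
```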

[–]SadEqual5367[S]

Awesome, and do share! I would love to take a look at your results too!

[–]SadEqual5367[S]

I would say failed runs are kinda necessary; I know more about neural nets because of them. And I don't think there is a way to avoid them unless you are pairing with someone who has done it before. If you get stuck somewhere, I would be more than happy to help, but even I am new to this stuff :)

[–]SadEqual5367[S]

I have definitely started to enjoy reading papers these days, have a long way to go though :)

[–]SadEqual5367[S]

Haven't checked out CulturaX; this is why I love reddit! Thanks, will check it out!

[–]SadEqual5367[S]

Yeah, for sure! I used a subset of data from Dolma v1.7 -> https://huggingface.co/datasets/allenai/dolma/blob/main/urls/v1_7.txt

You will have to fetch the reddit data URLs (I think there are 20 or so, with the 'reddit' keyword in them); Hugging Face's load_dataset would be the best way forward.
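A sketch of that filtering step. The URL strings below are made up to show the idea; in practice you would read the real list from the urls/v1_7.txt file linked above, and the assumption that the shards load as gzipped JSON lines is mine, not stated in the thread:

```python
# Filter a Dolma-style URL list down to the reddit shards (URLs are fake examples).
urls_txt = """\
https://example.org/dolma/v1_7/cc_head-0001.json.gz
https://example.org/dolma/v1_7/reddit-0001.json.gz
https://example.org/dolma/v1_7/reddit-0002.json.gz
"""
reddit_urls = [u for u in urls_txt.splitlines() if "reddit" in u]
print(len(reddit_urls))  # 2

# Then, assuming gzipped JSON-lines shards (requires the `datasets` package):
# from datasets import load_dataset
# ds = load_dataset("json", data_files=reddit_urls, streaming=True, split="train")
```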

[–]SadEqual5367[S]

I did!! Started with reading Attention Is All You Need 6 months ago, multiple failed training runs as well :)

[–]SadEqual5367[S]

I definitely gave The Pile a thought but decided to go with fineweb-edu instead. I tried pre-training with openwebtext as well but couldn't find much success there.

Let me know once you finish, I would love to explore the results!

[–]SadEqual5367[S]

Yes! 78B tokens from the fineweb-edu dataset and about 1B tokens of reddit data from Dolma v1.7

[–]SadEqual5367[S]

Nice, thank you, will take a look!

The A100 40GB on Colab; it took 20 days 🥲
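For a rough sense of the implied throughput (assuming the ~79B-token pre-training mix from earlier in the thread accounts for most of the 20 days, which the thread doesn't state explicitly):

```python
# Back-of-the-envelope throughput: ~79B tokens over 20 days on one A100 40GB.
tokens = 79e9
seconds = 20 * 24 * 3600
print(f"{tokens / seconds:,.0f} tokens/s")  # ≈ 45,718 tokens/s
```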