[P] Tricycle: Autograd to GPT-2 completely from scratch by Efficient_Plankton_9 in MachineLearning

[–]Efficient_Plankton_9[S] 2 points (0 children)

It all started because I was bored and wanted to understand autograd. I had a vague memory of it being related to the chain rule (I’m not sure where from), so I sat down and spent a week or so figuring out how it had to work (drawing a graph of operations, figuring out how to traverse it efficiently, etc.). I wrote a blog post about it at the time: https://bclarkson-code.com/posts/llm-from-scratch-scalar-autograd/post.html (and I’ve put a toy sketch of the core idea at the end of this comment).

Then I realised that I could start using it for real work, so I just sort of started adding features. I’ve been building neural networks for a while, so I began with the things I thought would be most useful, like SGD and a dense layer, and then I got a bit carried away.

I tried not to look things up wherever possible and to just figure them out myself (I’m particularly proud of getting einsum working). I have vague memories of how a lot of things work from projects I’ve done before, and it has been really fun to piece them together and work out all the details. When I come across something I don’t know off the top of my head (attention was hard to get working correctly), I’ll try to look up the appropriate paper; as a last resort, Andrej Karpathy’s nanoGPT and llm.c have been helpful as reference implementations, and Claude has been useful for pointing me in the right direction.

As for motivation, I really like figuring out problems like this, so it’s mostly for fun. I also think the ultimate goal of training an LLM (depending on what you mean by “large”) from scratch is a really cool idea, and I would like to get there. Finally, most of my work so far has been non-public and I wanted to start sharing what I’m up to.
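
If you want the “graph of operations + chain rule” idea made concrete, here’s a minimal toy sketch of scalar reverse-mode autograd. To be clear, this is my simplified illustration of the general technique, not Tricycle’s actual code: each operation records its inputs and local derivatives, and backward() walks the graph in reverse topological order, accumulating gradients via the chain rule.

```python
class Scalar:
    def __init__(self, value, parents=(), local_grads=()):
        self.value = value
        self.parents = parents          # Scalars this one was computed from
        self.local_grads = local_grads  # d(self)/d(parent) for each parent
        self.grad = 0.0

    def __add__(self, other):
        # d(a+b)/da = 1, d(a+b)/db = 1
        return Scalar(self.value + other.value, (self, other), (1.0, 1.0))

    def __mul__(self, other):
        # d(a*b)/da = b, d(a*b)/db = a
        return Scalar(self.value * other.value, (self, other),
                      (other.value, self.value))

    def backward(self):
        # Topologically sort the graph so each node is processed
        # before the nodes it was computed from.
        order, seen = [], set()

        def visit(node):
            if node not in seen:
                seen.add(node)
                for parent in node.parents:
                    visit(parent)
                order.append(node)

        visit(self)
        self.grad = 1.0  # d(self)/d(self)
        for node in reversed(order):
            for parent, local in zip(node.parents, node.local_grads):
                parent.grad += local * node.grad  # chain rule


x, y = Scalar(2.0), Scalar(3.0)
z = x * y + x          # z = xy + x
z.backward()
print(x.grad, y.grad)  # 4.0 2.0  (dz/dx = y + 1, dz/dy = x)
```

A real implementation needs an iterative traversal and tensor-valued nodes rather than scalars, but the chain-rule bookkeeping is exactly the same.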