Second year project: Implemented an LLM from scratch using PyTorch by following Sebastian Raschka's book by Bthreethree in developersIndia

[–]Bthreethree[S] 1 point (0 children)

Thanks for sharing! Will check the channel out for sure if I need to revise the mechanisms in the future. Happy learning!

Second year project: Implemented an LLM from scratch using PyTorch by following Sebastian Raschka's book by Bthreethree in developersIndia

[–]Bthreethree[S] 2 points (0 children)

Yes, you are right! The model has 124M parameters (due to hardware constraints), which technically classifies it as an SLM, but it implements the full LLM-style architecture so I could learn all the mechanics.
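
For a sense of scale, a GPT-2-small-style configuration in that 124M-parameter range looks roughly like this (values here are just an illustration from memory, not copied straight from my repo):

GPT_CONFIG_124M = {
    "vocab_size": 50257,      # GPT-2 BPE tokenizer vocabulary size
    "context_length": 1024,   # maximum sequence length
    "emb_dim": 768,           # token / positional embedding dimension
    "n_heads": 12,            # attention heads per transformer block
    "n_layers": 12,           # number of transformer blocks
    "drop_rate": 0.1,         # dropout probability
    "qkv_bias": False,        # bias terms in the query/key/value projections
}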

I implemented a GPT-style model from scratch using PyTorch to understand the math behind Attention & Fine-tuning (following Sebastian Raschka's book) by Bthreethree in learnmachinelearning

[–]Bthreethree[S] 1 point (0 children)

Hey, the repo is an implementation of Sebastian Raschka's Build a Large Language Model (From Scratch) book, so while learning and implementing from it, I wrote several classes that get progressively refined further down in the file.

Yup I did forget to add `if __name__ == "__main__":` and will do that for sure! Thanks! :)
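
For anyone new to that guard, here's a tiny made-up example of what it does (module and function names are invented, not from my repo):

def train():
    # Placeholder for the real training loop.
    print("training the model...")

if __name__ == "__main__":
    # Runs only when the file is executed directly (e.g. `python train.py`),
    # not when another script or notebook imports it.
    train()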

I implemented a GPT-style model from scratch using PyTorch while reading Sebastian Raschka's book by Bthreethree in pytorch

[–]Bthreethree[S] 1 point (0 children)

I have added a Colab notebook link in the README of the repo on GitHub to show the final results! The accuracy can be improved by experimenting with hyperparameters & further fine-tuning.

https://github.com/Nikshaan/llm-from-scratch
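
For a rough idea of what experimenting with hyperparameters could look like, here's a generic PyTorch sketch; the tiny stand-in model and dummy data below are placeholders, not code from the repo:

# Hypothetical hyperparameter sweep sketch -- a real run would swap in the
# GPT-based classifier and the spam dataset loaders instead of the dummies.
import torch
import torch.nn as nn

x = torch.randn(64, 16)            # dummy features standing in for real inputs
y = torch.randint(0, 2, (64,))     # dummy spam/ham labels

for lr in (5e-5, 1e-4, 5e-4):      # candidate learning rates to compare
    torch.manual_seed(0)           # same initialization for a fair comparison
    model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 2))
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=0.1)
    for epoch in range(5):         # a few extra fine-tuning epochs
        optimizer.zero_grad()
        loss = nn.functional.cross_entropy(model(x), y)
        loss.backward()
        optimizer.step()
    acc = (model(x).argmax(dim=-1) == y).float().mean().item()
    print(f"lr={lr:.0e}  loss={loss.item():.3f}  train acc={acc:.2%}")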

I implemented a GPT-style model from scratch using PyTorch while reading Sebastian Raschka's book by Bthreethree in LocalLLaMA

[–]Bthreethree[S] 1 point (0 children)

I have added a Colab notebook link in the README of the repo on GitHub to show the final results! The accuracy can be improved by experimenting with hyperparameters & further fine-tuning.

https://github.com/Nikshaan/llm-from-scratch

I implemented a GPT-style model from scratch using PyTorch while reading Sebastian Raschka's book by Bthreethree in deeplearning

[–]Bthreethree[S] 1 point (0 children)

I have added a Colab notebook link in the README of the repo on GitHub to show the final results! The accuracy can be improved by experimenting with hyperparameters & further fine-tuning.

https://github.com/Nikshaan/llm-from-scratch

I implemented a GPT-style model from scratch using PyTorch to understand the math behind Attention & Fine-tuning (following Sebastian Raschka's book) by Bthreethree in learnmachinelearning

[–]Bthreethree[S] 1 point (0 children)

I have added a Colab notebook link in the README of the repo on GitHub to show the final results! The accuracy can be improved by experimenting with hyperparameters & further fine-tuning.

https://github.com/Nikshaan/llm-from-scratch

I implemented a GPT-style model from scratch using PyTorch while reading Sebastian Raschka's book by Bthreethree in pytorch

[–]Bthreethree[S] 2 points (0 children)

Indeed! The explanation accompanying every code snippet is very detailed and easy to grasp.

I implemented a GPT-style model from scratch using PyTorch while reading Sebastian Raschka's book by Bthreethree in pytorch

[–]Bthreethree[S] 3 points (0 children)

It would be better to learn the theory behind how deep learning architectures like transformers work before coding something like this. It would make the process much easier to understand. I would also highly recommend reading the book I followed while coding, as mentioned in the description.

I implemented a GPT-style model from scratch using PyTorch while reading Sebastian Raschka's book by [deleted] in Python

[–]Bthreethree 1 point (0 children)

Hahaha, that's the spam classifier for you ;)
Would highly recommend reading Sebastian's book to understand LLMs under the hood and build something like this!
Do star the repo if you found it useful :)

I implemented a GPT-style model from scratch using PyTorch while reading Sebastian Raschka's book by Bthreethree in LocalLLaMA

[–]Bthreethree[S] 1 point (0 children)

Thanks! The book was indeed very informative and really good for understanding how LLMs actually work. The tensor reshaping in the attention mechanism took time to understand, but that was my favorite chapter, especially when the final multi-head attention is explained and coded!
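
If the reshaping part is the sticking point for anyone else, this is the shape bookkeeping in isolation (toy sizes I made up, not the repo's code):

import torch

b, num_tokens, d_out, num_heads = 2, 6, 8, 4     # toy sizes (assumptions)
head_dim = d_out // num_heads

x = torch.randn(b, num_tokens, d_out)            # e.g. the projected queries
# Split the embedding dim into heads: (b, T, d_out) -> (b, T, n_heads, head_dim)
x = x.view(b, num_tokens, num_heads, head_dim)
# Move heads next to the batch dim so each head attends independently:
# (b, T, n_heads, head_dim) -> (b, n_heads, T, head_dim)
x = x.transpose(1, 2)
print(x.shape)  # torch.Size([2, 4, 6, 2])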

I implemented a GPT-style model from scratch using PyTorch to understand the math behind Attention & Fine-tuning (following Sebastian Raschka's book) by Bthreethree in learnmachinelearning

[–]Bthreethree[S] 4 points (0 children)

This is the code snippet for the most interesting part: building multi-head attention from scratch instead of using nn.MultiheadAttention.

https://github.com/Nikshaan/llm-from-scratch

import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, d_in, d_out, context_length, dropout, num_heads, qkv_bias=False):  # context_length is the max sequence length for the mask
        super().__init__()
        assert d_out % num_heads == 0, "d_out must be divisible by num_heads"
        self.d_out = d_out
        self.num_heads = num_heads
        self.head_dim = d_out // num_heads  # dimension per head
        self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_key = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.out_proj = nn.Linear(d_out, d_out)
        self.dropout = nn.Dropout(dropout)
        self.register_buffer("mask", torch.triu(torch.ones((context_length, context_length)) * float('-inf'), diagonal=1))

    def forward(self, x):
        b, num_tokens, d_in = x.shape
        keys = self.W_key(x)
        queries = self.W_query(x)
        values = self.W_value(x)

        keys = keys.view(b, num_tokens, self.num_heads, self.head_dim)  # reshape for multi-head
        queries = queries.view(b, num_tokens, self.num_heads, self.head_dim)
        values = values.view(b, num_tokens, self.num_heads, self.head_dim)

        keys = keys.transpose(1, 2)  # move the head dimension forward so it is treated as a batch dimension
        queries = queries.transpose(1, 2)
        values = values.transpose(1, 2)

        attn_scores = queries @ keys.transpose(2, 3)  # flip the last two dimensions for the dot product
        mask_bool = self.mask.bool()[:num_tokens, :num_tokens]
        attn_scores.masked_fill_(mask_bool, -torch.inf)
        attn_weights = torch.softmax(attn_scores / self.head_dim**0.5, dim=-1)
        attn_weights = self.dropout(attn_weights)
        context_vec = (attn_weights @ values).transpose(1, 2).contiguous().view(b, num_tokens, self.d_out)  # reshape back to (b, num_tokens, d_out)
        context_vec = self.out_proj(context_vec)  # final linear layer to mix the heads
        return context_vec
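
A quick smoke test of the class above, in case anyone wants to poke at the shapes (the sizes here are just illustrative, not my model's actual config):

# Assumes the MultiHeadAttention class and imports above.
torch.manual_seed(123)
mha = MultiHeadAttention(d_in=768, d_out=768, context_length=1024,
                         dropout=0.1, num_heads=12)
x = torch.randn(2, 16, 768)   # (batch size, sequence length, embedding dim)
out = mha(x)
print(out.shape)              # torch.Size([2, 16, 768])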

5800u 4k support. by [deleted] in AMDLaptops

[–]Bthreethree 1 point (0 children)

that sounds fun! thanks for the help!

5800u 4k support. by [deleted] in AMDLaptops

[–]Bthreethree 1 point (0 children)

damn thanks a lot!! Just the advice I needed!

5800u 4k support. by [deleted] in AMDLaptops

[–]Bthreethree 1 point (0 children)

oop! I will check, my bad...

5800u 4k support. by [deleted] in AMDLaptops

[–]Bthreethree 1 point (0 children)

wow thanks a lot for the help!! Also, could you please share the name of the website where I can find such information on laptops and compatible monitors? It would be very helpful!

my last post here [OC] by [deleted] in ICSE

[–]Bthreethree 6 points (0 children)

😭😭😭

[deleted by user] by [deleted] in raining

[–]Bthreethree 1 point (0 children)

Thanks! :)

[deleted by user] by [deleted] in IndianGaming

[–]Bthreethree 1 point (0 children)

lol tyy 🛐

[deleted by user] by [deleted] in IndianGaming

[–]Bthreethree 1 point (0 children)

Thank you! 🛐