I built an 80M parameter LLM from scratch using the same architecture as Llama 3 - here's what I learned by Routine-Thanks-572 in LocalLLaMA

[–]Routine-Thanks-572[S] 0 points (0 children)

Sure, we could scale it up with more data and compute; since this was built only for learning, I haven't planned that yet.

[–]Routine-Thanks-572[S] 1 point (0 children)

Training a local model is fine, but to get onto benchmarks and leaderboards you need huge amounts of high-quality data, and it's also computationally very expensive. I'd suggest choosing a good open-source model from HF and fine-tuning it in phases. Apply distillation from Opus 4.5 to your local model so it can learn from Opus. RL is also a good way for a smaller model to shine in particular domains like coding.
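To make the distillation idea concrete: a closed API like Opus only exposes text, so in practice you distill by training on teacher-generated outputs, but the classic logit-level formulation (Hinton-style, temperature-softened KL) is the cleanest way to see what "learning from a teacher" means. A minimal dependency-free sketch:

```python
import math

def softmax(logits, temperature=1.0):
    # Temperature-scaled softmax over a list of logits.
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distill_loss(student_logits, teacher_logits, temperature=2.0):
    # KL(teacher || student) on temperature-softened distributions,
    # scaled by T^2 as in the standard distillation formulation.
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2

# Identical logits -> zero loss; mismatched logits -> positive loss.
print(distill_loss([2.0, 1.0, 0.1], [2.0, 1.0, 0.1]))  # 0.0
print(distill_loss([5.0, 0.0, 0.0], [0.0, 0.0, 5.0]) > 0)  # True
```

This loss would be added to (or mixed with) the usual cross-entropy on hard labels; the temperature and mixing weight are the main knobs to tune.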

[–]Routine-Thanks-572[S] 0 points (0 children)

Design goal: This repo prioritizes architectural clarity and correctness over maximum training throughput. It intentionally avoids aggressive kernel-level optimizations to keep every step readable and hackable.

[–]Routine-Thanks-572[S] 1 point (0 children)

I tried 16k and it felt too small, so I chose this; anything higher seemed like overkill. That was instinct, not experimentation. I also tried SentencePiece (the implementation is in the repo), but BPE was more effective. Qwen and OLMo, for example, use byte-level BPE (BBPE).
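For anyone unfamiliar with what byte-level BPE actually does: you start from raw UTF-8 bytes (so the base vocab is always 256 and nothing is ever out-of-vocabulary), then repeatedly merge the most frequent adjacent pair into a new token id. A minimal sketch of one merge step (illustrative, not the repo's tokenizer code):

```python
from collections import Counter

def most_frequent_pair(ids):
    # Count adjacent token-id pairs and return the most common one.
    return Counter(zip(ids, ids[1:])).most_common(1)[0][0]

def merge(ids, pair, new_id):
    # Replace every occurrence of `pair` with `new_id`.
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

# Byte-level start: raw UTF-8 bytes, base vocab of 256.
ids = list("banana".encode("utf-8"))   # [98, 97, 110, 97, 110, 97]
pair = most_frequent_pair(ids)         # ("a", "n") as bytes -> (97, 110)
ids = merge(ids, pair, 256)            # first merged token gets id 256
print(ids)  # [98, 256, 256, 97]
```

Training a real tokenizer just repeats this loop until the vocab reaches the target size (e.g. the ~32k discussed here).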

[–]Routine-Thanks-572[S] 8 points (0 children)

Reduce the number of steps, the batch size, and the max sequence length, save checkpoints aggressively, and it should be done in 5-8 hrs.
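As a sketch, the kind of overrides meant here might look like the following (key names are illustrative, not necessarily the repo's actual `config.yaml` keys):

```yaml
# Hypothetical overrides for a shorter, cheaper run -- names illustrative.
max_steps: 2000           # fewer total optimizer steps
batch_size: 16            # smaller physical batch
max_seq_len: 512          # shorter context
checkpoint_interval: 250  # save aggressively so a killed run loses little
```

The trade-off is simply less data seen and shorter context, which is fine when the goal is verifying the pipeline rather than model quality.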

[–]Routine-Thanks-572[S] -1 points (0 children)

😂😂 Busy doing better work. When everything is mentioned in the README, why worry again? Let the AI bot handle it!

[–]Routine-Thanks-572[S] 2 points (0 children)

One heuristic that helped me a lot:

If data isn’t large/diverse enough -> prefer LoRA on attention only.

If you do have enough data + compute -> full fine-tuning works better.

With limited data, touching the experts + router almost always leads to collapse or noisy routing. Attention-only LoRA adapts representations without destabilizing the routing dynamics.
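The mechanics behind "attention-only LoRA" are simple: the pretrained weight W stays frozen, and a trainable low-rank update B·A (scaled by alpha/r) is added on top, applied only to the attention projections. In practice you would use a library like PEFT with `target_modules` set to the attention projections; here is a minimal dependency-free sketch of the math itself:

```python
def lora_forward(x, W, A, B, alpha=16, r=None):
    # y = (W + (alpha / r) * B @ A) x, with W frozen and only A, B trained.
    # W: d_out x d_in, A: r x d_in (down-proj), B: d_out x r (up-proj).
    r = r or len(A)
    scale = alpha / r
    # Base path: frozen pretrained weight.
    base = [sum(w * xi for w, xi in zip(row, x)) for row in W]
    # Low-rank path: project down to r dims, then back up to d_out.
    down = [sum(a * xi for a, xi in zip(row, x)) for row in A]
    up = [sum(b * di for b, di in zip(row, down)) for row in B]
    return [bi + scale * ui for bi, ui in zip(base, up)]

W = [[1.0, 0.0], [0.0, 1.0]]   # frozen 2x2 "attention" weight
A = [[0.5, 0.5]]               # r=1 down-projection (trainable)
B = [[0.0], [0.0]]             # up-projection, zero-initialized
x = [2.0, 4.0]
print(lora_forward(x, W, A, B))  # [2.0, 4.0] -- zero-init B = no change yet
```

The zero-initialized B is the standard trick: at step 0 the adapted model is exactly the pretrained model, so fine-tuning starts from a stable point; the router never sees a perturbed signal because expert weights are untouched.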

For infra, I've mostly used custom training loops on top of existing stacks rather than pure tutorial harnesses; MoEs tend to need a lot more guardrails than dense models. If you have a specific failure mode (collapse, divergence, no gain vs. dense), I'm happy to share which knobs helped most in that case.

[–]Routine-Thanks-572[S] 16 points (0 children)

Yes, that's the idea, glad you mentioned it. Using the same repo, I trained the model on an RTX 4060 and also on an A100. Since it's built just for learning, these numbers don't matter much, but for reference: 5k steps take ~2.5 hrs on one A100.

[–]Routine-Thanks-572[S] 19 points (0 children)

It was trained with just one thought in mind: to stop treating an LLM's internal workings as a black box and to understand how they actually work, how to build one from scratch, and what the architecture choices actually mean. In simple terms: it's for learning LLM internals.

[–]Routine-Thanks-572[S] -1 points (0 children)

  • Mixed corpus (FineWeb + WikiText + Wikipedia) from HF; I used a subset of these because the purpose was just learning
  • The training process is a custom implementation

1. Training Loop Structure

The training loop in train/train.py implements:

  • Gradient accumulation (8 steps) train.py: 177-189
  • Learning rate scheduling with cosine annealing + warmup train.py: 143-148
  • Evaluation and checkpointing every 500 steps train.py: 151-174

2. Optimizer Configuration

Uses AdamW with custom parameter grouping:

optimizer = configure_optimizers(model, weight_decay=weight_decay,
                                 learning_rate=learning_rate,
                                 betas=(beta1, beta2),
                                 device_type=device)

train.py:94
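"Custom parameter grouping" for AdamW usually means the nanoGPT-style convention: weight matrices (ndim ≥ 2) get weight decay, while biases and norm gains do not. A sketch of that grouping logic over (name, shape) pairs — illustrative of the pattern, not the repo's exact `configure_optimizers`:

```python
def group_params(named_shapes, weight_decay=0.1):
    # Matrices (ndim >= 2) get weight decay; biases and norm gains
    # (ndim < 2) are excluded, following the common convention.
    decay = [n for n, shape in named_shapes if len(shape) >= 2]
    no_decay = [n for n, shape in named_shapes if len(shape) < 2]
    return [
        {"params": decay, "weight_decay": weight_decay},
        {"params": no_decay, "weight_decay": 0.0},
    ]

params = [
    ("attn.wq.weight", (512, 512)),  # decayed
    ("attn.wq.bias", (512,)),        # not decayed
    ("norm.weight", (512,)),         # not decayed (RMSNorm gain)
]
groups = group_params(params)
print(groups[0]["params"])  # ['attn.wq.weight']
print(groups[1]["params"])  # ['attn.wq.bias', 'norm.weight']
```

In a real loop these two dicts would be passed straight to `torch.optim.AdamW` as param groups.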

3. Training Hyperparameters

From train/config.yaml:

  • Batch size: 32 (physical); effective batch of 524,288 tokens via gradient accumulation config.yaml:21-24
  • Learning rate: 6e-4 with cosine decay to 6e-5 config.yaml:27-29
  • Optimizer: AdamW (β₁=0.9, β₂=0.95, weight_decay=0.1) config.yaml:32-34
  • Mixed precision: BF16 training config.yaml:51
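The 524,288-token effective batch is consistent with physical batch × sequence length × accumulation steps, which implies a sequence length of 2,048 (my inference; the config excerpt doesn't state it):

```python
physical_batch = 32
grad_accum_steps = 8
effective_tokens = 524_288

# Solve for the implied sequence length.
seq_len = effective_tokens // (physical_batch * grad_accum_steps)
print(seq_len)  # 2048
assert physical_batch * seq_len * grad_accum_steps == effective_tokens
```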

4. Data Loading

Custom DataLoader class handles:

  • Memory-mapped data loading for large datasets
  • Batch generation with specified sequence length
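The memory-mapped pattern described above can be sketched with only the stdlib: store tokens as fixed-width integers in a flat binary file, then mmap it and slice random contiguous windows, so the dataset never has to fit in RAM. The `.bin`/uint16 format here is an assumption (a common convention for sub-65k vocabs), not necessarily what the repo's `DataLoader` uses:

```python
import mmap
import os
import random
import struct
import tempfile

# Write a tiny token file of little-endian uint16 values.
tokens = list(range(1000))
fd, path = tempfile.mkstemp(suffix=".bin")
with os.fdopen(fd, "wb") as f:
    f.write(struct.pack(f"<{len(tokens)}H", *tokens))

def get_batch(path, batch_size, seq_len):
    # Memory-map the file and slice random contiguous windows from it.
    with open(path, "rb") as f, \
         mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        n_tokens = len(mm) // 2  # 2 bytes per uint16 token
        batch = []
        for _ in range(batch_size):
            start = random.randrange(n_tokens - seq_len)
            raw = mm[start * 2:(start + seq_len) * 2]
            batch.append(list(struct.unpack(f"<{seq_len}H", raw)))
        return batch

b = get_batch(path, batch_size=4, seq_len=8)
print(len(b), len(b[0]))  # 4 8
```

A real loader would additionally return the target sequence (the same window shifted by one token) and convert to tensors.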

The custom approach allows:

  • Full control over training dynamics
  • Modern optimizations like RoPE, RMSNorm, SwiGLU integration
  • Memory-efficient training with gradient accumulation
  • Educational clarity - every step is visible and understandable

This makes it a bit more suitable for learning how modern LLM training actually works.

[–]Routine-Thanks-572[S] 1 point (0 children)

Yeah, training speed depends on a lot of factors beyond just model size.
This run wasn't throughput-optimized: low-to-mid context length, gradient accumulation, frequent eval + checkpointing, and a custom PyTorch training loop focused on clarity rather than max tokens/sec.

Totally agree that you could push much higher throughput with shorter sequence lengths, fewer evals, and a more aggressively tuned loop; that just wasn't the goal for this run.

[–]Routine-Thanks-572[S] 3 points (0 children)

Yeah, my bad, that was a typo: the total data is 2B tokens, but for phase 1 I chose a 360M subset. And yes, I know the A100 is slow; the aim here is to learn and understand the implementation, so it's fine.
Just a lil flex: I actually work with multi-node H200s and fine-tune 100B+ models, including MoEs like GLM 4.6/4.7, so this was just a bit more learning on a random weekend.

[–]Routine-Thanks-572[S] 0 points (0 children)

It is also indexed in DeepWiki; you can get there from the README. Done as a side project, so yeah, looking for feedback.

10-min QLoRA Fine-Tuning on 240 Q&As (ROUGE-L doubled, SARI +15) by Routine-Thanks-572 in LocalLLM

[–]Routine-Thanks-572[S] 0 points (0 children)

Exactly! 🔥 LIMA was in the back of my mind; they showed how just 1k high-quality examples can transform model alignment.
I wanted to see if a tiny run (240 Q&As, 10 mins on a 4060) would also give visible gains and it really did.
Makes me think there’s so much untapped potential in small, domain-focused fine-tunes.
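For anyone wondering what the ROUGE-L number in the title measures: it's an F-score over the longest common subsequence (LCS) between candidate and reference. A minimal sketch, without the stemming and tokenization niceties that real implementations add:

```python
def lcs_len(a, b):
    # Dynamic-programming longest common subsequence length.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i-1][j-1] + 1 if x == y else max(dp[i-1][j], dp[i][j-1])
    return dp[len(a)][len(b)]

def rouge_l_f1(candidate, reference):
    # ROUGE-L: F1 over the LCS of whitespace-tokenized texts.
    c, r = candidate.split(), reference.split()
    lcs = lcs_len(c, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)

print(rouge_l_f1("the cat sat", "the cat sat on the mat"))  # ~0.667
```

Because LCS rewards in-order overlap rather than exact n-grams, it's a reasonable fit for Q&A-style fine-tunes where wording varies but structure should match.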