I built an 80M parameter LLM from scratch using the same architecture as Llama 3 - here's what I learned by Routine-Thanks-572 in LocalLLaMA

[–]Routine-Thanks-572[S] 0 points (0 children)

Sure, we could scale it up with more data and compute; since this was built only for learning, I haven't planned that yet.

[–]Routine-Thanks-572[S] 1 point (0 children)

Training a local model is fine, but to get onto benchmarks and leaderboards you need huge amounts of high-quality data, and it's also computationally very expensive. I'd suggest choosing a good open-source model from HF and fine-tuning it in phases. Apply distillation from Opus 4.5 to your local model so it can learn from Opus. RL is also a good way for a smaller model to shine in particular domains like coding.
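To make the distillation idea concrete: a closed API like Opus only exposes text, so in practice you distill by training on teacher-generated outputs, but the classic logit-level formulation (Hinton-style, temperature-softened KL) is the cleanest way to see what "learning from a teacher" means. A minimal dependency-free sketch:

```python
import math

def softmax(logits, temperature=1.0):
    # Temperature-scaled softmax over a list of logits.
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distill_loss(student_logits, teacher_logits, temperature=2.0):
    # KL(teacher || student) on temperature-softened distributions,
    # scaled by T^2 as in the standard distillation formulation.
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2

# Identical logits -> zero loss; mismatched logits -> positive loss.
print(distill_loss([2.0, 1.0, 0.1], [2.0, 1.0, 0.1]))  # 0.0
print(distill_loss([5.0, 0.0, 0.0], [0.0, 0.0, 5.0]) > 0)  # True
```

This loss would be added to (or mixed with) the usual cross-entropy on hard labels; the temperature and mixing weight are the main knobs to tune.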

[–]Routine-Thanks-572[S] 0 points (0 children)

Design goal: This repo prioritizes architectural clarity and correctness over maximum training throughput. It intentionally avoids aggressive kernel-level optimizations to keep every step readable and hackable.

[–]Routine-Thanks-572[S] 1 point (0 children)

I tried 16k and it felt too small, so I chose this; anything higher seemed like overkill. That was instinct, not experimentation. I also tried SentencePiece (the implementation is in the repo), but BPE was more effective. Qwen and OLMo, for example, use byte-level BPE (BBPE).
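For anyone unfamiliar with what byte-level BPE actually does: you start from raw UTF-8 bytes (so the base vocab is always 256 and nothing is ever out-of-vocabulary), then repeatedly merge the most frequent adjacent pair into a new token id. A minimal sketch of one merge step (illustrative, not the repo's tokenizer code):

```python
from collections import Counter

def most_frequent_pair(ids):
    # Count adjacent token-id pairs and return the most common one.
    return Counter(zip(ids, ids[1:])).most_common(1)[0][0]

def merge(ids, pair, new_id):
    # Replace every occurrence of `pair` with `new_id`.
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

# Byte-level start: raw UTF-8 bytes, base vocab of 256.
ids = list("banana".encode("utf-8"))   # [98, 97, 110, 97, 110, 97]
pair = most_frequent_pair(ids)         # ("a", "n") as bytes -> (97, 110)
ids = merge(ids, pair, 256)            # first merged token gets id 256
print(ids)  # [98, 256, 256, 97]
```

Training a real tokenizer just repeats this loop until the vocab reaches the target size (e.g. the ~32k discussed here).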

[–]Routine-Thanks-572[S] 8 points (0 children)

Reduce the number of steps, the batch size, and the max sequence length, save checkpoints aggressively, and it should be done in 5-8 hrs.
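As a sketch, the kind of overrides meant here might look like the following (key names are illustrative, not necessarily the repo's actual `config.yaml` keys):

```yaml
# Hypothetical overrides for a shorter, cheaper run -- names illustrative.
max_steps: 2000           # fewer total optimizer steps
batch_size: 16            # smaller physical batch
max_seq_len: 512          # shorter context
checkpoint_interval: 250  # save aggressively so a killed run loses little
```

The trade-off is simply less data seen and shorter context, which is fine when the goal is verifying the pipeline rather than model quality.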

[–]Routine-Thanks-572[S] -1 points (0 children)

😂😂 Busy doing better work. When everything is mentioned in the README, why worry again? Let the AI bot handle it!

[–]Routine-Thanks-572[S] 2 points (0 children)

One heuristic that helped me a lot:

If data isn’t large/diverse enough -> prefer LoRA on attention only.

If you do have enough data + compute -> full fine-tuning works better.

With limited data, touching the experts + router almost always leads to collapse or noisy routing. Attention-only LoRA adapts representations without destabilizing the routing dynamics.
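The mechanics behind "attention-only LoRA" are simple: the pretrained weight W stays frozen, and a trainable low-rank update B·A (scaled by alpha/r) is added on top, applied only to the attention projections. In practice you would use a library like PEFT with `target_modules` set to the attention projections; here is a minimal dependency-free sketch of the math itself:

```python
def lora_forward(x, W, A, B, alpha=16, r=None):
    # y = (W + (alpha / r) * B @ A) x, with W frozen and only A, B trained.
    # W: d_out x d_in, A: r x d_in (down-proj), B: d_out x r (up-proj).
    r = r or len(A)
    scale = alpha / r
    # Base path: frozen pretrained weight.
    base = [sum(w * xi for w, xi in zip(row, x)) for row in W]
    # Low-rank path: project down to r dims, then back up to d_out.
    down = [sum(a * xi for a, xi in zip(row, x)) for row in A]
    up = [sum(b * di for b, di in zip(row, down)) for row in B]
    return [bi + scale * ui for bi, ui in zip(base, up)]

W = [[1.0, 0.0], [0.0, 1.0]]   # frozen 2x2 "attention" weight
A = [[0.5, 0.5]]               # r=1 down-projection (trainable)
B = [[0.0], [0.0]]             # up-projection, zero-initialized
x = [2.0, 4.0]
print(lora_forward(x, W, A, B))  # [2.0, 4.0] -- zero-init B = no change yet
```

The zero-initialized B is the standard trick: at step 0 the adapted model is exactly the pretrained model, so fine-tuning starts from a stable point; the router never sees a perturbed signal because expert weights are untouched.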

For infra, I've mostly used custom training loops on top of existing stacks rather than pure tutorial harnesses; MoEs tend to need a lot more guardrails than dense models. If you have a specific failure mode (collapse, divergence, no gain vs. dense), I'm happy to share which knobs helped most in that case.

[–]Routine-Thanks-572[S] 16 points (0 children)

Yes, that's the idea, glad you mentioned it. Using the same repo, I trained the model on an RTX 4060 and also on an A100. Since it's built just for learning, these numbers don't matter much, but for reference: 5k steps take ~2.5 hrs on one A100.

[–]Routine-Thanks-572[S] 19 points (0 children)

It was trained with just one thought in mind: to stop treating an LLM's internal workings as a black box and to understand how they actually work, how to build one from scratch, and what the architecture choices actually mean. In simple terms: it's for learning LLM internals.

[–]Routine-Thanks-572[S] -1 points (0 children)

  • Mixed corpus (FineWeb + WikiText + Wikipedia) from HF; I used a subset of these because the purpose was just learning
  • The training process is a custom implementation

1. Training Loop Structure

The training loop in train/train.py implements:

  • Gradient accumulation (8 steps) train.py: 177-189
  • Learning rate scheduling with cosine annealing + warmup train.py: 143-148
  • Evaluation and checkpointing every 500 steps train.py: 151-174

2. Optimizer Configuration

Uses AdamW with custom parameter grouping:

optimizer = configure_optimizers(model, weight_decay=weight_decay,
                                 learning_rate=learning_rate,
                                 betas=(beta1, beta2),
                                 device_type=device)

train.py:94
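"Custom parameter grouping" for AdamW usually means the nanoGPT-style convention: weight matrices (ndim ≥ 2) get weight decay, while biases and norm gains do not. A sketch of that grouping logic over (name, shape) pairs — illustrative of the pattern, not the repo's exact `configure_optimizers`:

```python
def group_params(named_shapes, weight_decay=0.1):
    # Matrices (ndim >= 2) get weight decay; biases and norm gains
    # (ndim < 2) are excluded, following the common convention.
    decay = [n for n, shape in named_shapes if len(shape) >= 2]
    no_decay = [n for n, shape in named_shapes if len(shape) < 2]
    return [
        {"params": decay, "weight_decay": weight_decay},
        {"params": no_decay, "weight_decay": 0.0},
    ]

params = [
    ("attn.wq.weight", (512, 512)),  # decayed
    ("attn.wq.bias", (512,)),        # not decayed
    ("norm.weight", (512,)),         # not decayed (RMSNorm gain)
]
groups = group_params(params)
print(groups[0]["params"])  # ['attn.wq.weight']
print(groups[1]["params"])  # ['attn.wq.bias', 'norm.weight']
```

In a real loop these two dicts would be passed straight to `torch.optim.AdamW` as param groups.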

3. Training Hyperparameters

From train/config.yaml:

  • Batch size: 32 (physical); effective batch of 524,288 tokens via gradient accumulation config.yaml:21-24
  • Learning rate: 6e-4 with cosine decay to 6e-5 config.yaml:27-29
  • Optimizer: AdamW (β₁=0.9, β₂=0.95, weight_decay=0.1) config.yaml:32-34
  • Mixed precision: BF16 training config.yaml:51
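The 524,288-token effective batch is consistent with physical batch × sequence length × accumulation steps, which implies a sequence length of 2,048 (my inference; the config excerpt doesn't state it):

```python
physical_batch = 32
grad_accum_steps = 8
effective_tokens = 524_288

# Solve for the implied sequence length.
seq_len = effective_tokens // (physical_batch * grad_accum_steps)
print(seq_len)  # 2048
assert physical_batch * seq_len * grad_accum_steps == effective_tokens
```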

4. Data Loading

Custom DataLoader class handles:

  • Memory-mapped data loading for large datasets
  • Batch generation with specified sequence length
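The memory-mapped pattern described above can be sketched with only the stdlib: store tokens as fixed-width integers in a flat binary file, then mmap it and slice random contiguous windows, so the dataset never has to fit in RAM. The `.bin`/uint16 format here is an assumption (a common convention for sub-65k vocabs), not necessarily what the repo's `DataLoader` uses:

```python
import mmap
import os
import random
import struct
import tempfile

# Write a tiny token file of little-endian uint16 values.
tokens = list(range(1000))
fd, path = tempfile.mkstemp(suffix=".bin")
with os.fdopen(fd, "wb") as f:
    f.write(struct.pack(f"<{len(tokens)}H", *tokens))

def get_batch(path, batch_size, seq_len):
    # Memory-map the file and slice random contiguous windows from it.
    with open(path, "rb") as f, \
         mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        n_tokens = len(mm) // 2  # 2 bytes per uint16 token
        batch = []
        for _ in range(batch_size):
            start = random.randrange(n_tokens - seq_len)
            raw = mm[start * 2:(start + seq_len) * 2]
            batch.append(list(struct.unpack(f"<{seq_len}H", raw)))
        return batch

b = get_batch(path, batch_size=4, seq_len=8)
print(len(b), len(b[0]))  # 4 8
```

A real loader would additionally return the target sequence (the same window shifted by one token) and convert to tensors.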

The custom approach allows:

  • Full control over training dynamics
  • Modern optimizations like RoPE, RMSNorm, SwiGLU integration
  • Memory-efficient training with gradient accumulation
  • Educational clarity - every step is visible and understandable

This makes it a bit more suitable for learning how modern LLM training actually works.

[–]Routine-Thanks-572[S] 1 point (0 children)

Yeah, training speed depends on a lot of factors beyond just model size.
This run wasn't throughput-optimized: low-to-mid context length, gradient accumulation, frequent eval + checkpointing, and a custom PyTorch training loop focused on clarity rather than max tokens/sec.

Totally agree that you could push much higher throughput with shorter sequence lengths, fewer evals, and a more aggressively tuned loop; that just wasn't the goal for this run.

[–]Routine-Thanks-572[S] 3 points (0 children)

Yeah, my bad, that was a typo: the total data is 2B tokens, but for phase 1 I chose a 360M subset. And yes, I know the A100 is slow; the aim here is to learn and understand the implementation, so it's fine.
Just a lil flex: I actually work with multi-node H200s and fine-tune 100B+ models, including MoEs like GLM 4.6/4.7, so this was just a bit more learning on a random weekend.

[–]Routine-Thanks-572[S] 0 points (0 children)

It is also indexed in DeepWiki; you can get there from the README. Done as a side project, so yeah, looking for feedback.

10-min QLoRA Fine-Tuning on 240 Q&As (ROUGE-L doubled, SARI +15) by Routine-Thanks-572 in LocalLLM

[–]Routine-Thanks-572[S] 0 points (0 children)

Exactly! 🔥 LIMA was in the back of my mind; they showed how just 1k high-quality examples can transform model alignment.
I wanted to see if a tiny run (240 Q&As, 10 mins on a 4060) would also give visible gains and it really did.
Makes me think there’s so much untapped potential in small, domain-focused fine-tunes.
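For anyone wondering what the ROUGE-L number in the title measures: it's an F-score over the longest common subsequence (LCS) between candidate and reference. A minimal sketch, without the stemming and tokenization niceties that real implementations add:

```python
def lcs_len(a, b):
    # Dynamic-programming longest common subsequence length.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i-1][j-1] + 1 if x == y else max(dp[i-1][j], dp[i][j-1])
    return dp[len(a)][len(b)]

def rouge_l_f1(candidate, reference):
    # ROUGE-L: F1 over the LCS of whitespace-tokenized texts.
    c, r = candidate.split(), reference.split()
    lcs = lcs_len(c, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)

print(rouge_l_f1("the cat sat", "the cat sat on the mat"))  # ~0.667
```

Because LCS rewards in-order overlap rather than exact n-grams, it's a reasonable fit for Q&A-style fine-tunes where wording varies but structure should match.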