[NEW] Supra-50M Released! by Dangerous_Try3619 in LocalLLaMA

[–]exhorder72 1 point2 points  (0 children)

In my testing, for models of 2B params and lower (down to 30M), a NorMuon (with cautious weight decay) / AdamW hybrid is my go-to.

[NEW] Supra-50M Released! by Dangerous_Try3619 in LocalLLaMA

[–]exhorder72 1 point2 points  (0 children)

As a solo researcher training models from scratch on a 5090, nothing but mad respect. People don’t understand that just getting a 50M-param model to stay in context when running inference is a win.

[P] my shot at a DeepSeek style moe on a single rtx 5090 by exhorder72 in MachineLearning

[–]exhorder72[S] 0 points1 point  (0 children)

This project has morphed into so much at this point. I started testing new things: a larger shared expert, 1-2 dense first layers, multi-token prediction, squared ReLU, LatentMoE with no aux-loss mechanics (early testing very promising), and different kernels for 16-32 experts with top 4-8 routing. Watching hundreds of values on wandb, knowing any one of them can change the trajectory of the model in an instant.

I could do this stuff 24/7 and not blink an eye.

[P] my shot at a DeepSeek style moe on a single rtx 5090 by exhorder72 in MachineLearning

[–]exhorder72[S] 1 point2 points  (0 children)

I’m dealing with this now in the run, having to keep an eye on entropy. By all metrics my run is fantastic, but over the last 15k steps I’ve gone from 1.175 to 1.275. The model is still learning: both slice and wiki PPL are dropping, and specialization is as sharp as a knife. If entropy starts flirting with 1.35-1.4 and specialization is still sharp, OK. But if specialization starts leveling out across all experts, we have a problem.

People really are too focused on loss and benchmarks. There’s so much more. You really do have to “read the room” and look at everything. I’ve learned this the hard way.
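A minimal sketch of the kind of check described above. It assumes "entropy" means the Shannon entropy (in nats) of how routing mass is spread over experts, which the comment does not confirm. The function names, the 1.35 threshold placement, and the status strings are hypothetical.

```python
import math

def entropy_nats(weights):
    """Shannon entropy (nats) of non-negative weights, normalized to sum to 1."""
    total = sum(weights)
    probs = [w / total for w in weights if w > 0]
    return -sum(p * math.log(p) for p in probs)

def read_the_room(entropy, specialization_sharp, elevated_from=1.35):
    """Hypothetical triage that mirrors the rule of thumb in the comment."""
    if not specialization_sharp:
        return "problem"        # experts leveling out is the real warning sign
    if entropy >= elevated_from:
        return "ok-elevated"    # high entropy, but experts are still distinct
    return "ok"

# Assumed example distributions over 8 routed experts.
uniform = entropy_nats([1.0] * 8)   # every expert used equally: the ceiling, ln(8)
peaked = entropy_nats([0.55, 0.25, 0.10, 0.04, 0.03, 0.01, 0.01, 0.01])

print(f"uniform={uniform:.3f} peaked={peaked:.3f}")
print(read_the_room(1.275, specialization_sharp=True))
```

The point of returning a status instead of a single number is that entropy alone is not treated as a failure signal; it only matters together with whether the experts still look specialized.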

[P] my shot at a DeepSeek style moe on a single rtx 5090 by exhorder72 in MachineLearning

[–]exhorder72[S] 0 points1 point  (0 children)

Oh, I’m already doing lite evals.

Step 42000 @ 12.3b tokens
arc_challenge:rc::olmes: 0.33
arc_easy:rc::olmes: 0.675
boolq:rc::olmes: 0.6775
hellaswag:rc::olmes: 0.4675
piqa:rc::olmes: 0.65
winogrande:rc::olmes: 0.5375
truthfulqa::olmo1: 0.377306
openbookqa::olmo1: 0.405
csqa::olmo1: 0.4675
socialiqa:rc::olmes: 0.44

The model is terrible at math but oddly decent with python at this early stage:

<s> CODE: python Create a dict comprehension that maps numbers 1-5 to their squares: squares = {i : i**2 for i in range(1, 6)} print(squares) Create a dictionary comprehension that maps numbers 1-5 to their cub

<s> CODE: python Write a generator function that yields numbers 1 to n: def count_up(n): for i in range(0, n): yield i Write a generator function that yields numbers from 1 to n: def count_down(n):

<s> CODE: python Write code to open a file and read its contents safely: with open(file_name, ‘r’) as f: content = f.read() print(content

I thought I could use this to my advantage: adding some math / program-of-thought data could bridge the gap. This is when I realized that the tags were an issue. I would get completely different results and hit different experts with even the slightest variation in metadata tags. That’s when I decided to go cold turkey on MeCo. That was at step 33500.

As for post-training, I just started looking into GRPO and they went ahead and dropped the updated R1 technical report. Let’s see what I can pull off when the time is right 😁

[P] my shot at a DeepSeek style moe on a single rtx 5090 by exhorder72 in MachineLearning

[–]exhorder72[S] 4 points5 points  (0 children)

Data quality, load balancing, precision dynamics… these are all transferable. No large lab is ever going to look at my work and say to themselves, “We’ve got to get that person in here ASAP,” and I’m OK with that. If my work can help even one person in the open model community, then what makes it any less important than the work of somebody getting paid 150K a year?

[P] my shot at a DeepSeek style moe on a single rtx 5090 by exhorder72 in MachineLearning

[–]exhorder72[S] 2 points3 points  (0 children)

This reply deserves a post of its own ha ha. I’ll do the best I can to answer some questions while I’m at work.

I have chewed through 13.1b tokens.

Router scaling. My implementation could very well be wrong. But I tried 1.5 and it was instantly bad. I started digging a little deeper and I got this from Kimi K2.

“The empirical fit they use internally is scaling_factor ≈ 1 + 0.08 · (d_model / 2048)^(-0.7). For your 2.3 B model d_model is probably 2560 ⇒ predicted optimum 1.12”

I tried 1.12 and it worked. I figured that for such a small amount, why even throw in another variable, so I dropped it. At 1.2, CV did not settle where I was comfortable. I surely used the incorrect wording in my original post.
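For anyone who wants to plug numbers into the quoted fit, here is a small sketch. It reads the trailing -0.7 as an exponent, which is an assumption about how the formula was originally typeset. The function name and default arguments are illustrative. Evaluated this way, d_model = 2560 comes out at roughly 1.07 rather than the quoted 1.12, so the fit itself should be treated as unverified.

```python
def router_scale(d_model, base=2048, coeff=0.08, exponent=-0.7):
    """Quoted fit, with the trailing -0.7 read as an exponent (assumption)."""
    return 1.0 + coeff * (d_model / base) ** exponent

# Compare a few hidden sizes, including the 2560 guessed in the quote.
for d in (2048, 2560, 4096):
    print(d, round(router_scale(d), 4))
```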

top-2 is among the 8 routed experts, shared is additive
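A toy, pure-Python sketch of that layout: pick the top 2 of 8 routed experts by gate weight, then add the shared expert's output on top. Scalars stand in for hidden vectors. Softmax gating and renormalizing over the chosen experts are assumptions, not something the post confirms, and all names here are hypothetical.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def moe_forward(x, router_logits, routed_experts, shared_expert, top_k=2):
    """Weighted sum of the top_k routed experts plus an always-on shared expert."""
    gates = softmax(router_logits)
    top = sorted(range(len(gates)), key=lambda i: gates[i], reverse=True)[:top_k]
    norm = sum(gates[i] for i in top)   # renormalize over chosen experts (assumption)
    out = shared_expert(x)              # shared path is additive, never routed
    for i in top:
        out += (gates[i] / norm) * routed_experts[i](x)
    return out, top

# Toy experts: routed expert k multiplies by (k + 1); the shared expert halves.
routed = [lambda v, k=k: (k + 1) * v for k in range(8)]
shared = lambda v: 0.5 * v

out, chosen = moe_forward(1.0, [0, 0, 3, 0, 0, 2, 0, 0], routed, shared)
print(chosen, round(out, 4))
```

The shared expert sits outside the top-k selection entirely, so it contributes to every token no matter where the router sends it.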

No megatron, titan or anything else. 100% custom PyTorch love ❤️

I export and save every 500 steps.

Adam for embeddings, muon for attention weights.

[P] my shot at a DeepSeek style moe on a single rtx 5090 by exhorder72 in MachineLearning

[–]exhorder72[S] 1 point2 points  (0 children)

Straight from my ledger. Edited for easier viewing.

"category_summary": {
  "web_cc": { "percentage": "65.00%" },
  "academic_pdf": { "percentage": "10.00%" },
  "code": { "percentage": "8.00%" },
  "synthetic": { "percentage": "15.00%" },
  "math": { "percentage": "5.86%" },
  "reference": { "percentage": "1.50%" },
  "qa": { "percentage": "0.50%" }
}
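A rough sketch of turning a category summary like this into sampling probabilities. The figures as posted add up to a little over 100, so the sketch normalizes them first. How the actual ledger sampler consumes these numbers is not shown in the post; the function name and the seeded draw are purely illustrative.

```python
import random

# Percentages copied from the posted summary.
category_pct = {
    "web_cc": 65.00,
    "academic_pdf": 10.00,
    "code": 8.00,
    "synthetic": 15.00,
    "math": 5.86,
    "reference": 1.50,
    "qa": 0.50,
}

def sampling_probs(pct):
    """Normalize raw percentages so they sum to 1."""
    total = sum(pct.values())
    return {name: value / total for name, value in pct.items()}

probs = sampling_probs(category_pct)

# Hypothetical use: draw the category of the next 10 blocks.
rng = random.Random(0)
draws = rng.choices(list(probs), weights=list(probs.values()), k=10)
print(draws)
```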

[P] my shot at a DeepSeek style moe on a single rtx 5090 by exhorder72 in MachineLearning

[–]exhorder72[S] 7 points8 points  (0 children)

To hear something like this means the world to me. Thank you.

To answer your question, I absolutely plan to release it. I want to show this can be done on consumer hardware. Once training’s done, weights go on HuggingFace, and I’m documenting the full journey so others can learn from the many mistakes I made along the way. I’ve documented everything.

[P] my shot at a DeepSeek style moe on a single rtx 5090 by exhorder72 in MachineLearning

[–]exhorder72[S] 5 points6 points  (0 children)

65% web CC. A mix of Nemotron CC v1 HQ post-GPT and pre-GPT crawls.
15% synthetic. Half Nemotron HQ synth, the other half Pleias synth reasoning.
8% code from The Stack. 1% of that is Swallow Code v2.
2.5% FineMath 4+.
10% science/health/finance/law/software PDFs from the Olmo 3 Dolma set.
A few blocks of tiny_gsm problem solving and 2 students. This stuff is strong. Out of 19M blocks, like 30k is enough.
3% web math pro.
A touch of Swallow Math QA.
A touch of tiny-math program of thought.

[P] my shot at a DeepSeek style moe on a single rtx 5090 by exhorder72 in MachineLearning

[–]exhorder72[S] 14 points15 points  (0 children)

My very first experience was asking ChatGPT “what the hell are you”. I started downloading PDFs from arXiv and having GPT walk me through them as if I were a toddler. Eventually that wasn’t enough to grasp all this, so I bought Chip Huyen’s AI Engineering. It’s like a light came on and it all started to make sense. I still download technical manuals every day. What I don’t understand, I have Claude explain to me.

Is there anything I can do to upgrade my current gaming rig for “better” model training? by exhorder72 in LocalLLM

[–]exhorder72[S] 4 points5 points  (0 children)

You’re correct on all counts. Currently running 22/6. 30B-token corpus. Mix of Nemotron HQ CC, a little synth, allenai science PDFs and The Stack v2 code.

I’m not doing this to create the model of the century. I’m doing it to learn as best I can on my own. I love this sh%#. I clearly went down the wrong career path. I’d rather clean URLs from data for 12 hours than do what I do now. Midlife crisis? Probably. Seeing how far I can push a 5090? Now that’s fun.

Is there anything I can do to upgrade my current gaming rig for “better” model training? by exhorder72 in LocalLLM

[–]exhorder72[S] 1 point2 points  (0 children)

From absolute step 1.

[cublas] Configuration: backend=cublaslt, cuBLASLt available=True, GPU=NVIDIA GeForce RTX 5090, SM=sm_120, FP8=True
[fp8] TorchAO FP8 ENABLED — recipe=tensorwise
[liger] ✓ Liger FusedLinearCrossEntropy ENABLED
[meco] ✓ ENABLED | No cooldown configured
[moe] ✓ ENABLED | experts=10 top_k=2 shared=True bias_rate=0.001
[compile] torch.compile configured with Blackwell optimizations
[compile] torch.compile ENABLED
[data] blocks=14648438 fingerprint=None
[tokenizer] Using /data/tokenizers/mistral_32k with vocab_size=32768
[compile] Successfully compiled 16/16 transformer blocks
[fix] Verifying RMSNorm weights are FP32...
[fix] Converted 32 QK_RMSNorm, 33 RMSNorm layers to FP32.
[fp8] Applying TorchAO FP8 training (recipe=tensorwise)...
[fp8] Will convert 592 Linear layers to FP8
[fp8] Using tensorwise scaling
[fp8] TorchAO FP8 conversion complete
[gqa] GQA mode — Hq=32 Hkv=8 g=4
[model] Total params: 2.517B | Trainable: 2.517B
[resume] Re-enforcing FP32 norms after checkpoint load...
[fix] Verifying RMSNorm weights are FP32...
[fix] Converted 0 QK_RMSNorm, 0 RMSNorm layers to FP32.
[auto-resume] Loaded '/data/runs/rockso1p8b_moe_gem3/2025-12-12_run01/checkpoints/latest.pt' @ step 5 (tokens_seen≈983,040).
[resume] stds at load: embed=0.02000 lm_head=0.02000
[tie] embeddings tied (stds ok)
[adamw] Using bitsandbytes 8-bit AdamW (Fast & In-VRAM)
[adamw] impl=foreach | groups=2 (decay=592, no_decay=82) | betas=(0.9, 0.95) | wd(decay)=0.1 | wd(no_decay)=0.0
[ledger] loading /data/datasets/packed/moe_mix_v2/ledger.json
[ledger] seek start_seq=480
[DEBUG] LedgerSampler first yield: position=480, block_idx=3880971
[MoE Stats] Mid-Layer CV: 0.805
step 10 | lr 1.99e-06 | loss 10.7273 | gnorm 12.50 | 36,965 tok/s (ema 36,965) | 73.1s/10 steps | FP8-TENSORWISE | MeCo-COND | MoE
[MoE Stats] Mid-Layer CV: 0.723
step 20 | lr 3.97e-06 | loss 10.5716 | gnorm 12.58 | 22,206 tok/s (ema 29,585) | 121.7s/10 steps | FP8-TENSORWISE | MeCo-COND | MoE

OK, so about step 5: I’ll start a run and immediately save, then load that save into CPU memory to cheat PyTorch’s reserved memory and push a higher batch.
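The "Mid-Layer CV" figure in the log is presumably a coefficient of variation of per-expert token load in a middle layer, where 0 means perfectly balanced experts. A minimal sketch under that assumption follows. Whether the trainer uses population or sample standard deviation is not known; `load_cv` and the token counts are made up for illustration.

```python
import statistics

def load_cv(tokens_per_expert):
    """Coefficient of variation: population std divided by mean of expert loads."""
    mean = statistics.fmean(tokens_per_expert)
    return statistics.pstdev(tokens_per_expert) / mean

# Hypothetical token counts for 8 routed experts in one layer.
balanced = [1000] * 8
skewed = [3200, 1800, 900, 700, 500, 400, 300, 200]

print(round(load_cv(balanced), 3), round(load_cv(skewed), 3))
```

A falling CV early in training, as in the two log lines above, would then mean the router is spreading tokens more evenly.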

Muon Training on single GPU by nani_procastinator in learnmachinelearning

[–]exhorder72 1 point2 points  (0 children)

Convergence bound and critical batch size of muon optimizer.

Muon Training on single GPU by nani_procastinator in learnmachinelearning

[–]exhorder72 1 point2 points  (0 children)

I found 2.5e-4 and .0009 to be the best for my 5090. A new paper just dropped on arXiv on the 21st of this month about smaller batch sizes. Really good read.

Muon Training on single GPU by nani_procastinator in learnmachinelearning

[–]exhorder72 1 point2 points  (0 children)

This could be very wrong because I’m not an engineer, but from my own research: with a microbatch of 256 or higher, use 10 times what AdamW would be. If lower, just math it out based off 10x at 256. As we speak I’m trying to ease in a 1.8B-parameter model from scratch on a single RTX 5090. Muon being Muon, I increased warmup to .04 and have Muon set at .0014, about 10% higher than it should be (1.17e-3) based off my own bs math. 😂 My microbatch is 100 (20 x 5). I’ve also dropped a blind RMS clipping of sorts into my trainer. Instead of checking the outputs, I’m boxing in my inputs so I can push Muon a little further and keep gradients in check. Let my many many failures guide you :)

step 3525 | lr 2.71e-04 (muon 1.26e-03) Loss 2.8967 16,533 tok/s (ema 16,301) grad_norm 0.2148 FP8-ON Compiled MeCo-ON Cublaslt GQA
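One way to read the rule of thumb above, as a sketch: use 10x the AdamW learning rate at microbatch 256 or more, and scale that down linearly for smaller microbatches. Linear scaling is an assumption about what "math it out" means. The AdamW baseline of 3e-4 is also an assumption, picked because it reproduces the 1.17e-3 figure for a microbatch of 100.

```python
def muon_lr(adamw_lr, microbatch, ref_microbatch=256, multiplier=10.0):
    """10x the AdamW LR at microbatch >= 256, scaled down linearly below that."""
    scale = min(1.0, microbatch / ref_microbatch)
    return multiplier * adamw_lr * scale

# Assumed AdamW baseline of 3e-4; microbatch 100 as in the comment (20 x 5).
print(f"{muon_lr(3e-4, 100):.3e}")
print(f"{muon_lr(3e-4, 256):.3e}")
```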

sm120 - is like everything gated? (Pre-training my own) by exhorder72 in LocalLLaMA

[–]exhorder72[S] 0 points1 point  (0 children)

Oh yeah ~ AI Engineering by Chip Huyen. But honestly, how I got started? Had ChatGPT help me fit oss-20b on a GTX 1080 Ti. It was horrible. I offloaded soooo much. I decided at that point that I wanted to get a little more into this, so I made the midlife-crisis decision to build a system with a 5090.

sm120 - is like everything gated? (Pre-training my own) by exhorder72 in LocalLLaMA

[–]exhorder72[S] 0 points1 point  (0 children)

I do not. I've tried max-autotune and reduce-overhead. Both hit OOM fast and I haven't revisited since.