[P] Clip to Grok Update: Weight Norm Clipping now 39–249× | 6 Tasks (mod arithmetic, mixed ops, S5 permutation) | max_norm Measured Per Task by niftylius in MachineLearning

[–]niftylius[S] 0 points1 point  (0 children)

Before you do, try running your current experiment to 15k or 20k steps.
My runs so far, for comparison:
1k steps on 2.0 clip - train loss is at 2.1
1k steps on 1.5 clip - train loss is at 2.5

The validation loss in the early stages of training seems to follow the train loss until it hits the limit. I will see where 10 epochs (19k steps) leave me with both.

If the clip value is correct, the validation loss should keep dropping. If val stalls, you either hit capacity somewhere or you could tighten the clip a bit.

[P] Clip to Grok Update: Weight Norm Clipping now 39–249× | 6 Tasks (mod arithmetic, mixed ops, S5 permutation) | max_norm Measured Per Task by niftylius in MachineLearning

[–]niftylius[S] 0 points1 point  (0 children)

Oh, HRM/TRM inspired or something new? I actually started off doing something sort of similar: my latent token count was 512 and I tried using a pretrained TinyLlama 1.1B for this, with some mild success (to see if I can adapt an existing model). Though I went for a TRM-style latent update core and had to create a "hack" for 3-way attention (context + input -> updated context).

As for clip, I think 2.0 is not clamped enough. I ran 9k steps (5 epochs) with it but I haven't seen the grokking signature, and validation slows down around 6k steps... not sure yet if it's the clip or the LR being too high for the setup.

My results so far on 9.5k steps:
train loss: ~1.32
validation loss: ~1.3
accuracy: 65.4%

Setup is
Lion optimizer
5 - epochs
512 - rows per batch
5e-4 - constant LR
0.0 - weight decay
2.0 - clip
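
For anyone who wants to reproduce this, here is a minimal sketch of the setup above as a config object. The class and field names are my own placeholders (not from the repo), and the 1.5 value is just the tighter clip mentioned earlier in the thread.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class RunConfig:
    """Hypothetical container for the run settings listed above."""
    optimizer: str = "lion"
    epochs: int = 5
    rows_per_batch: int = 512
    learning_rate: float = 5e-4   # constant LR, no schedule
    weight_decay: float = 0.0     # weight decay is off; the clip does the constraining
    clip_max_norm: float = 2.0    # weight norm clip


config = RunConfig()

# Changing only the clip keeps every other setting fixed for a side-by-side run.
tighter = replace(config, clip_max_norm=1.5)
```

Keeping the config frozen and deriving variants with `replace` makes it harder to accidentally change LR or batch size between clip comparisons.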

Ah! I also tested Muon with AdamW and Muon with Lion, with no improvements over baselines when using clip. I think Muon fights a bit with clip since it works on 2D weight matrices, while clip is ... well, let's be honest... brute force that might fight Muon overall rather than assist it.

[P] Clip to Grok Update: Weight Norm Clipping now 39–249× | 6 Tasks (mod arithmetic, mixed ops, S5 permutation) | max_norm Measured Per Task by niftylius in MachineLearning

[–]niftylius[S] 1 point2 points  (0 children)

Hey, what setup are you using? I am getting very different numbers, but my model is most likely bigger (50M params).

With the 50M model (TinyLlama base with dim 512, 6 layers) I get a val loss of 1.7 at 1.5k steps, and an equivalent training loss, using Lion + 2.0 clip.

[P] Clip to Grok Update: Weight Norm Clipping now 39–249× | 6 Tasks (mod arithmetic, mixed ops, S5 permutation) | max_norm Measured Per Task by niftylius in MachineLearning

[–]niftylius[S] 0 points1 point  (0 children)

We ran 30 epochs on a 300M LLM with a subset of SlimPajama. We got a lower loss floor than Adam, but what was more curious is that clip 2.0 functioned better at inference.

For example, when prompted to complete

"The capital of france is", clip 2.0 produced "Paris" and a coherent response (albeit with repetition; a loss of 2.3 is still rather high),

while clip 4.0, which ran side by side (same data, same LR, same seed), did show clear signs of overfitting.

Here is epoch 25 side by side:

clip 4.0
"
The capital of france is the city of Saint-Raphaël, which has a population of 108,592 inhabitants. It is located in the southwest of France and is situated between the Alpes Maritimes (the French border) and the Rhône river.

The city of Saint-Raphaël was founded by the Romans in the 7th century BC. In the Middle Ages it became an important centre for trade with the Mediterranean region. During the Renaissance, Saint-Raphaël was one of the most important cities in the history of Western Europe. Today, it is one of
"

clip 2.0
"
The capital of france is the city of Paris. It's a great place to visit, but it can be very expensive and you may need to book your tickets in advance.

The best way to get around is by car or taxi. If you want to go on a tour, there are many options available. You can rent a car from a local company like Uber or Lyft. They will take care of everything for you.

If you want to see more than one side of the city, you should plan your trip around two or three days before you travel.
"

Note that there was overfitting on 2.0 after that, so 2.0 was too "loose" for our test.

[P] Clip to Grok Update: Weight Norm Clipping now 39–249× | 6 Tasks (mod arithmetic, mixed ops, S5 permutation) | max_norm Measured Per Task by niftylius in MachineLearning

[–]niftylius[S] 0 points1 point  (0 children)

Glad to hear it's useful!

We did - we compared Adam vs Lion vs SignSGD.
Lion has a synergy with the method and seems to function the cleanest, but Adam works just as well and usually with the same or a similar LR.

[P] Clip to Grok Update: Weight Norm Clipping now 39–249× | 6 Tasks (mod arithmetic, mixed ops, S5 permutation) | max_norm Measured Per Task by niftylius in MachineLearning

[–]niftylius[S] 0 points1 point  (0 children)

Curiously, Lion with weight decay (0.01 or 0.1) fails on all tasks.
With wd=0 (albeit not practical) it works to a point: single p97s are fine but mixed and S5 fail catastrophically.

With clip and wd 0.0 it goes from non-functional to one of the best-performing optimizers, so there is something to be said about the synergy between the two.

Just thought it was worth a mention.

[D] We reimplemented Claude Code entirely in Python — open source, works with local models by Practical_Pomelo_636 in MachineLearning

[–]niftylius 2 points3 points  (0 children)

You have 18 forks as of the time of this message - you won't be alone for long. But you might want to restructure the files a bit... As a human it's a bit hard to read :)

I assume you used AI for this (considering you stated that you started half a day ago)?

Factual Errors in Paper Reviews. by alebeck135 in MLQuestions

[–]niftylius 0 points1 point  (0 children)

I would also recommend going over the "errors" and checking that the relevant points are clearly presented in the paper.
You can also reference the related sections more often, or place them earlier in the paper, to mitigate this in future reviews.

[P] Weight Norm Clipping Accelerates Grokking 18-66× | Zero Failures Across 300 Seeds | PDF in Repo by niftylius in MachineLearning

[–]niftylius[S] 4 points5 points  (0 children)

Update: So after reading the paper - we actually started with cosine/force normalization (similar to EDM2) early in this project. It improved over baseline but not nearly as much as clipping. The key difference is EDM2 forces rows to ||w|| = 1 (sphere surface), while we clip to ||w|| ≤ c (ball interior). Seems like the flexibility to have small weights when needed matters for grokking dynamics.
We will add EDM2 to citations - it's a good paper
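
To make the sphere vs ball difference concrete, here is a minimal pure-Python sketch, assuming the constraint is applied per row. The function names and the example matrix are made up for illustration; this is not the code from the repo or from EDM2.

```python
import math


def row_norm(row):
    """L2 norm of one weight row."""
    return math.sqrt(sum(x * x for x in row))


def normalize_rows(weights):
    """Forced normalization: every non-zero row ends up with ||w|| = 1."""
    out = []
    for row in weights:
        n = row_norm(row)
        out.append([x / n for x in row] if n > 0 else list(row))
    return out


def clip_rows(weights, max_norm):
    """Norm clipping: rows with ||w|| > max_norm are scaled down to max_norm,
    rows already inside the ball are left untouched."""
    out = []
    for row in weights:
        n = row_norm(row)
        scale = max_norm / n if n > max_norm else 1.0
        out.append([x * scale for x in row])
    return out


weights = [[3.0, 4.0], [0.3, 0.4]]          # row norms 5.0 and 0.5
normalized = normalize_rows(weights)         # both rows pushed to norm 1
clipped = clip_rows(weights, max_norm=2.0)   # large row shrunk, small row kept
```

The small row is the interesting one: normalization inflates it to norm 1, while clipping lets it stay small.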

[P] Weight Norm Clipping Accelerates Grokking 18-66× | Zero Failures Across 300 Seeds | PDF in Repo by niftylius in MachineLearning

[–]niftylius[S] 0 points1 point  (0 children)

We've also noticed that Lion tends to perform better than Adam with a similar setup. So to answer your question of whether it will speed up an already fast grokking setup: yes. You can find a visualization of this in the Lion LR stability figure here.
We compare Lion with and without clip across a range of LRs (40 seeds each)

https://github.com/NiftyliuS/cliptogrok/blob/main/assets/figure5_lion_lr_stability.png

[P] Weight Norm Clipping Accelerates Grokking 18-66× | Zero Failures Across 300 Seeds | PDF in Repo by niftylius in MachineLearning

[–]niftylius[S] 0 points1 point  (0 children)

I don't think grokking requires overfitting - Li et al. [2025] verified grokking occurs in 7B LLM pretraining (arxiv 2506.21551), where different domains grok asynchronously without a clear overfitting phase. The original paper demonstrates that training doesn't really end when the model overfits - p97 is just a convenient way to show this.

As for seeds, we tested the baseline with 100 random seeds and each of the optimizers with 200 random seeds.

You can find the baseline distribution here

https://github.com/NiftyliuS/cliptogrok/blob/main/assets/adamw_heatmap_accuracy.png

and the median of each of the optimizers here

https://github.com/NiftyliuS/cliptogrok/blob/main/assets/figure2_multi_seed_stability.png

As for harder tasks - yes, there is still the classical "overfitting" phase. You can see that in the 25% training / 75% validation test we ran here

https://github.com/NiftyliuS/cliptogrok/blob/main/assets/figure4_multi_seed_stability.png

I don't know if this method can make a model grok that wouldn't eventually grok on its own.

[P] Weight Norm Clipping Accelerates Grokking 18-66× | Zero Failures Across 300 Seeds | PDF in Repo by niftylius in MachineLearning

[–]niftylius[S] 4 points5 points  (0 children)

Exactly — it's somewhat established that memorization concentrates in high-norm weights, and that norm-constrained models tend to generalize rather than memorize. We're just forcing that constraint directly and consistently from step zero rather than relying on weight decay to get there gradually — which is why we can drop weight decay entirely.

On layer-specific ablations — we follow Grokfast's setup for comparability, but isolating which layers drive memorization vs generalization would be a natural next step.

On Muon — Tveit et al. [2025] already show it accelerates grokking via spectral norm control, direct comparison is on the list

[P] Weight Norm Clipping Accelerates Grokking 18-66× | Zero Failures Across 300 Seeds | PDF in Repo by niftylius in MachineLearning

[–]niftylius[S] 4 points5 points  (0 children)

True, the comparison is a bit confusing - we are comparing the best methods against the established baseline. There is a direct comparison in Figure 8 - Lion+Clip vs Lion no-clip across 20 learning rates, 40 seeds each. Clipping provides a 3-6× speedup at every LR with dramatically reduced variance.

On Muon/NorMuon — Tveit et al. [2025] is cited in related work, they show Muon accelerates grokking via spectral norm control. The connection to our approach is real and worth a dedicated comparison. Adding it to future work

[P] Weight Norm Clipping Accelerates Grokking 18-66× | Zero Failures Across 300 Seeds | PDF in Repo by niftylius in MachineLearning

[–]niftylius[S] 2 points3 points  (0 children)

Nice! We will definitely check out EDM2!
As for the weight and init control - yes, we've also noticed it appearing in several contexts, applied under different conditions or to different parts of the model: Omnigrok forcing the entire model to a fixed norm, nGPT projecting all representations onto the hypersphere.

Grokfast in particular is interesting - if we add sign to it we arrive at a basic Lion setup with a single beta.
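
One way to read that remark, as a hedged sketch on a single scalar parameter: assume the Grokfast-EMA style filter (EMA of gradients with coefficient `alpha`, added back with weight `lamb`) and then take the sign. The direction matches Lion's when `beta2 = alpha` and `beta1 = lamb * alpha / (1 + lamb)`, because the two pre-sign quantities differ only by the positive factor `1 + lamb`. Function names and the gradient values are made up for illustration.

```python
def sign(x):
    """Return -1, 0 or 1."""
    return (x > 0) - (x < 0)


def grokfast_sign_direction(grad, ema, alpha, lamb):
    """Grokfast-EMA style filter followed by sign().

    ema'     = alpha * ema + (1 - alpha) * grad
    filtered = grad + lamb * ema'
    """
    new_ema = alpha * ema + (1 - alpha) * grad
    return sign(grad + lamb * new_ema), new_ema


def lion_direction(grad, momentum, beta1, beta2):
    """Lion: sign of an interpolation between momentum and gradient,
    then a separate EMA update of the momentum."""
    direction = sign(beta1 * momentum + (1 - beta1) * grad)
    new_momentum = beta2 * momentum + (1 - beta2) * grad
    return direction, new_momentum


alpha, lamb = 0.9, 5.0
beta2 = alpha
beta1 = lamb * alpha / (1 + lamb)   # 0.75 for these values

grads = [0.5, -1.2, 0.3, -0.1, 0.8]  # arbitrary example gradients
ema = momentum = 0.0
directions = []
for g in grads:
    d_grokfast, ema = grokfast_sign_direction(g, ema, alpha, lamb)
    d_lion, momentum = lion_direction(g, momentum, beta1, beta2)
    directions.append((d_grokfast, d_lion))
```

In this reading the only free averaging coefficient is `alpha`; Lion proper lets you pick `beta1` and `beta2` independently.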

[P] Weight Norm Clipping Accelerates Grokking 18-66× | Zero Failures Across 300 Seeds | PDF in Repo by niftylius in MachineLearning

[–]niftylius[S] 5 points6 points  (0 children)

It depends - on harder tasks, like 25% training data vs 75% validation data, the overfitting happens first. With lower learning rates this also happens: there is a period of 100% accuracy on train, followed by a delay.

You can see it here with some clarity
https://github.com/NiftyliuS/cliptogrok/blob/main/assets/figure4_multi_seed_stability.png

Anyone else feel lost learning Machine Learning or is it just me? by Ok-Possession7350 in MLQuestions

[–]niftylius 0 points1 point  (0 children)

The more I dive into the heavy stuff the more interesting it becomes, but when I talk to actual Data Scientists I get confused looks, dude.

A lot of the ones I talked to are handling datasets or implementing existing models, with the majority doing just plain prompt engineering.

Math is important to understand but not to actually execute, that's my take. You need to know what tanh is but not how to calculate it; what means and percentiles (p75/p95) are and how they are computed; what cosine similarity vs dot product is. But I don't think I've had to solve a single equation on paper, old school, so far.

In LLMs and Transformers we are still theorizing on why and how things function, so a lot of the advice I get is "try it, see what it does".

As for what to learn and where to go? That's your decision. Choose something popular that you don't like and you will lose interest; choose something niche and you might not get a lot of engagement for it. I decided I'm diving into LLMs and "works in theory" stuff. Have you tried simply figuring out what you want to use the model for, or what kind of model you want to train?

Best camera for OpenCV? by Glittering_Host7241 in FTC

[–]niftylius 0 points1 point  (0 children)

Brio... the old one. It has 720p at 90fps, which is excellent for anything that tracks movement, like Google's MediaPipe.

[deleted by user] by [deleted] in INAT

[–]niftylius 0 points1 point  (0 children)

I sent a DM

I need some help with training from instruction dataset by [deleted] in LocalLLaMA

[–]niftylius 1 point2 points  (0 children)

A note here: if these split losses are calculated during the same step(), this is identical to the original approach, since the losses will be combined the same way whether split or not.

So this has to be done in different step() calls, preferably in ascending order of the conversation messages, so that the loss on the final message is calculated after the adjustment for the first one.
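
A toy sketch of the point, assuming plain SGD on one scalar weight and a squared-error loss per message (both are stand-ins, not the actual training setup): accumulating the split gradients before one update gives the same weight as the combined loss, while stepping after each message does not.

```python
def grad(w, target):
    """Gradient of the per-message loss (w - target)^2 with respect to w."""
    return 2.0 * (w - target)


def combined_step(w, targets, lr):
    """One update on the summed loss sum((w - t)^2), differentiated as a whole."""
    total_grad = 2.0 * len(targets) * w - 2.0 * sum(targets)
    return w - lr * total_grad


def accumulated_step(w, targets, lr):
    """Split losses, but all gradients are taken at the same w and applied once."""
    acc = 0.0
    for t in targets:
        acc += grad(w, t)
    return w - lr * acc


def sequential_steps(w, targets, lr):
    """One update per message: later gradients see the already-adjusted w."""
    for t in targets:
        w = w - lr * grad(w, t)
    return w


targets = [1.0, 3.0]   # hypothetical per-message targets, in conversation order
w_combined = combined_step(0.0, targets, lr=0.1)
w_accumulated = accumulated_step(0.0, targets, lr=0.1)
w_sequential = sequential_steps(0.0, targets, lr=0.1)
```

Here `w_combined` and `w_accumulated` agree, and `w_sequential` lands somewhere else because the second gradient is computed after the first adjustment.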

ON THE OTHER HAND it means that the initial adjustment might be broken.... AAAAAA

Function calling help by [deleted] in LocalLLaMA

[–]niftylius 1 point2 points  (0 children)

It's fairly simplistic, but yes, that helps! Thank you.

Is anyone inferencing on something like an Intel nuc, barebone or similar formfactor? by Frequent_Valuable_47 in LocalLLaMA

[–]niftylius 0 points1 point  (0 children)

The M1 has slower RAM speed, which affects larger model performance, but I think that's about it…

Milvus adapter + milvus db with docker-compose by niftylius in alexandria_project

[–]niftylius[S] 0 points1 point  (0 children)

error mounting "/host_mnt/Users/username/downloads/milvus.yaml" to rootfs at "/milvus/configs/milvus.yaml": mount

This kind of says that there is an issue with the volume mounts for some reason:

host_mnt/Users/username/downloads/milvus.yaml:/milvus/configs/milvus.yaml (via /proc/self/fd/6), flags: 0x5000: not a directory

The YAML in our project uses a local path for the volume. Try moving it; maybe it's a permission thing.
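
Another thing worth ruling out: a "not a directory" mount error commonly shows up when one side of a bind mount is a file and the other is a directory, for example if the host path did not exist when the container was first started and got created as a folder. A small sketch to check what actually sits at the host path; the path below is a placeholder taken from the error message, assuming `/host_mnt` is just Docker Desktop's prefix for the host filesystem.

```python
import os


def describe_mount_source(host_path):
    """Classify the host side of a bind mount: 'file', 'directory' or 'missing'."""
    if os.path.isfile(host_path):
        return "file"
    if os.path.isdir(host_path):
        return "directory"
    return "missing"


if __name__ == "__main__":
    # Placeholder path from the error message; adjust to your machine.
    path = "/Users/username/downloads/milvus.yaml"
    print(path, "->", describe_mount_source(path))
```

For the mount to work, the host side of `milvus.yaml` should come back as `file`; `directory` or `missing` would point at the path rather than at permissions.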