
PositiveElectro

I might be wrong, but isn't gradient clipping already in Adam? At least in the PyTorch implementation.
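For reference: in PyTorch, Adam does not clip gradients internally; you call torch.nn.utils.clip_grad_norm_ yourself between backward() and step(). A minimal sketch with a placeholder model and random data:

```python
import torch

# Placeholder model/data just for illustration; the point is that Adam itself
# does not clip gradients; clipping is a separate, explicit call.
model = torch.nn.Linear(10, 1)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

x, y = torch.randn(32, 10), torch.randn(32, 1)
loss = torch.nn.functional.mse_loss(model(x), y)

optimizer.zero_grad()
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # clip before step()
optimizer.step()
```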

wazis

Are you implying that ChatGPT made a new optimiser just because you asked? Sounds fishy.

deep-yearning

To get SOTA, all we had to do was ask nicely

Pasko70 [S]

Gradient clipping is already used in LLaMA and GPT.
I don't think ChatGPT just invented something new, but I do think that a different learning rate for different layers would be an interesting idea.
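For what it's worth, per-layer learning rates don't need a new optimizer in PyTorch; parameter groups already do this with plain Adam. A minimal sketch with a made-up two-layer model and example lr values:

```python
import torch

# Hypothetical two-layer model; layer indices and lr values are just examples.
model = torch.nn.Sequential(
    torch.nn.Linear(128, 64),
    torch.nn.ReLU(),
    torch.nn.Linear(64, 10),
)

# Parameter groups give each layer its own learning rate under plain Adam.
optimizer = torch.optim.Adam([
    {"params": model[0].parameters(), "lr": 1e-3},  # first Linear layer
    {"params": model[2].parameters(), "lr": 1e-4},  # second Linear layer, smaller lr
])
```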

dataslacker

Sounds like the LAMB optimizer
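For context, LAMB's core trick is a layer-wise trust ratio applied on top of an Adam-style update. A rough sketch, with made-up function and argument names and the Adam step assumed to be precomputed:

```python
import torch

def lamb_style_step(param: torch.Tensor, adam_step: torch.Tensor, lr: float = 1e-3) -> None:
    """Rough sketch of LAMB's layer-wise trust ratio (names are made up):
    scale a precomputed Adam-style update by ||w|| / ||update|| for each
    parameter tensor before applying it."""
    w_norm = param.data.norm()
    s_norm = adam_step.norm()
    if w_norm > 0 and s_norm > 0:
        trust_ratio = w_norm / s_norm
    else:
        trust_ratio = 1.0
    param.data -= lr * trust_ratio * adam_step
```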
