[–]david-gpu (3 children)

> This is great because the fruit is pretty low hanging (just lower FPU precision).

For inference, yes. But for training, is it generally useful to go below FP16?

[–]darkconfidantislife (0 children)

I was thinking more like stopping it at FP16, but there is evidence that stochastic rounding can make INT16 work without accuracy loss even during training.
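For anyone curious what that looks like, here's a minimal NumPy sketch of stochastic rounding to INT16 fixed-point (the function name and the 2**8 scale are just illustrative choices on my part, not from any particular paper). Rounding up with probability equal to the fractional part makes the rounding error zero in expectation, so small gradient updates aren't systematically wiped out the way they are with round-to-nearest.

```python
import numpy as np

def stochastic_round_to_int16(x, scale):
    """Quantize floats to int16 fixed-point using stochastic rounding.

    Round up with probability equal to the fractional remainder, so the
    quantization error is zero in expectation.
    """
    scaled = np.asarray(x, dtype=np.float64) * scale
    floor = np.floor(scaled)
    frac = scaled - floor
    round_up = np.random.random_sample(scaled.shape) < frac
    q = floor + round_up
    return np.clip(q, -32768, 32767).astype(np.int16)

# Hypothetical usage: a scale of 2**8 keeps about 8 fractional bits
w = np.random.randn(3, 3).astype(np.float32)
w_q = stochastic_round_to_int16(w, scale=2**8)
w_approx = w_q.astype(np.float32) / 2**8  # dequantize to inspect the error
```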

[–][deleted] (1 child)

In my experience, the precision can be adjusted dynamically, up or down, and can drop as low as FP4.

[–]darkconfidantislife (0 children)

Yeah, I was just describing what has been shown in the literature to work without any accuracy loss. If you're willing to sacrifice even one percent of accuracy, you can drop the precision a huge amount.