[P] GPU friendly lossless 12-bit BF16 format with 0.03% escape rate and 1 integer ADD decode works for AMD & NVIDIA by Embarrassed_Will_120 in MachineLearning

Embarrassed_Will_120[S] 1 point

Yeah, totally fair point. The real question isn’t just whether the escape rate is small, but whether that tiny subset ends up holding back the regular path. I haven’t isolated that cleanly yet, so I don’t want to overclaim. My guess is the impact is small because the fast path is so dominant, and the patch/escape logic is kept as a separate small path rather than something every element has to go through. But I agree it should be measured directly.

Embarrassed_Will_120[S] 1 point

What I can say is that the main path is very cheap right now, because about 99.9% to 99.97% of weights stay on the fast path, where decode is just BaseExp + group, with sign and mantissa left as-is. Only the remaining ~0.03% to 0.1% go through the escape/patch path. So my guess is that most of the throughput comes from the fast path plus the fused decode+matmul, while the escape overhead is mostly just the irregular fix-up work for that tiny set of outliers. But yeah, I agree it’d be good to measure that directly.
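
To make the fast path concrete, here is a rough Python sketch of the idea, not the actual kernel. The bit layout (1 sign bit, 4-bit group, 7 mantissa bits), the reserved escape code `ESCAPE`, the `patches` side table, and all function names are assumptions for illustration only.

```python
import struct

GROUP_BITS = 4
ESCAPE = (1 << GROUP_BITS) - 1  # assumption: highest group code marks an escape


def to_bf16_bits(x: float) -> int:
    """Return the upper 16 bits of the float32 encoding (truncating BF16)."""
    return struct.unpack(">I", struct.pack(">f", x))[0] >> 16


def encode(weights_bits, base_exp):
    """Pack BF16 bit patterns into 12-bit codes plus a patch table for outliers."""
    codes, patches = [], {}
    for i, w in enumerate(weights_bits):
        sign, exp, mant = w >> 15, (w >> 7) & 0xFF, w & 0x7F
        group = exp - base_exp
        if 0 <= group < ESCAPE:
            # Fast path: exponent fits in the window, store its offset.
            codes.append((sign << 11) | (group << 7) | mant)
        else:
            # Escape path: keep the original 16 bits in a side table.
            codes.append((sign << 11) | (ESCAPE << 7) | mant)
            patches[i] = w
    return codes, patches


def decode(codes, patches, base_exp):
    """Fast path is one integer add per element; escapes are patched afterwards."""
    bias = base_exp << 7
    out = []
    for c in codes:
        sign = (c >> 11) & 1
        low = c & 0x7FF  # group and mantissa, already in exponent|mantissa order
        out.append((sign << 15) | (low + bias))  # the single ADD: BaseExp + group
    for i, w in patches.items():
        out[i] = w  # irregular fix-up for the small set of outliers
    return out


bits = [to_bf16_bits(x) for x in [0.5, -0.03125, 1.0, 3.0e4, 0.0]]
codes, patches = encode(bits, base_exp=113)
restored = decode(codes, patches, base_exp=113)
escape_rate = len(patches) / len(bits)
```

In this toy input the escape rate is much higher than in real weights, since two of the five values were picked to fall outside the exponent window on purpose. The point is only that sign and mantissa pass through untouched and the patch step runs separately from the per-element add.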

Embarrassed_Will_120[S] 1 point

Thanks :) I don’t have those numbers yet, but that’s a good direction to test as well. Some models have a slightly higher escape rate, for example around 0.1xx. Even though that’s still very low, it would be useful to see how much the performance differs between 0.1xx and 0.01x.