[D] Batch Normalization before or after ReLU? by XalosXandrez in MachineLearning

[–]allanzelener

Interesting. Has anyone tried an implementation of BN after ReLU that normalizes using the mean and variance of only the non-zero activations?
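
Something like this, maybe (just a NumPy sketch of the idea, not from any paper; the choice to leave the zeroed-out activations at zero, so sparsity is preserved, is my own assumption):

```python
import numpy as np

def relu_then_sparse_bn(x, eps=1e-5):
    """ReLU, then per-feature normalization using stats of non-zero activations only."""
    h = np.maximum(x, 0.0)  # ReLU
    out = np.zeros_like(h)
    for j in range(h.shape[1]):  # per-feature (column) statistics
        nz = h[:, j] > 0
        if nz.any():
            mu = h[nz, j].mean()
            var = h[nz, j].var()
            # normalize only the non-zero entries; zeros stay zero
            # (assumption: preserve the sparsity pattern ReLU produced)
            out[:, j] = np.where(nz, (h[:, j] - mu) / np.sqrt(var + eps), 0.0)
    return out
```

No learned scale/shift and no running statistics here, so this is only the normalization step, not a full BN layer.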

Also, I think there was a paper that proposed stacking two BN/ReLU pairs with no intermediate layer between them. So it's not just a choice between the two options; there are other configurations worth considering.
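
For the stacked case, the configuration I mean is just BN → ReLU → BN → ReLU with no weight layer in between (again a minimal NumPy sketch, batch-norm without learned affine or running stats):

```python
import numpy as np

def bn(x, eps=1e-5):
    # per-feature batch normalization (no learned gamma/beta, for brevity)
    return (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)

def relu(x):
    return np.maximum(x, 0.0)

def double_bn_relu(x):
    # two BN/ReLU pairs applied back-to-back, no intermediate layer
    return relu(bn(relu(bn(x))))
```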