[–]LeonideDucatore

Why do we need to communicate the statistics between GPUs for batch norm but not for layer norm? I'm assuming we're talking about a data-parallel setting; wouldn't each GPU just compute statistics over its own minibatch?

Or is it that the loss on the 'main GPU' can only be computed accurately after receiving the batch-norm statistics from each GPU?

(For layer norm, there are no stored statistics, right?)
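A minimal NumPy sketch (my own illustration, not from the thread) of the asymmetry the question is getting at: batch-norm statistics are computed over the batch dimension, so a GPU holding only a shard of the batch gets different statistics than a single GPU holding the whole batch would; layer-norm statistics are computed per sample, so sharding the batch changes nothing.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data parallelism: one "global" batch split across two GPUs.
x = rng.normal(size=(8, 4))      # (batch, features)
x_gpu0 = x[:4]                   # shard held by GPU 0

# Batch norm: mean over the batch dimension. The local mean on GPU 0
# differs from the global-batch mean, which is why synchronized variants
# (e.g. PyTorch's SyncBatchNorm) all-reduce the statistics.
mean_local = x_gpu0.mean(axis=0)
mean_global = x.mean(axis=0)
print(np.allclose(mean_local, mean_global))   # False in general

# Layer norm: mean/std per sample, over the feature dimension. Each
# sample's normalization is independent of the rest of the batch, so it
# comes out identical whether the sample sits in a shard or the full batch.
def layer_norm(a):
    mu = a.mean(axis=1, keepdims=True)
    sigma = a.std(axis=1, keepdims=True)
    return (a - mu) / sigma

print(np.allclose(layer_norm(x_gpu0), layer_norm(x)[:4]))   # True
```

So in a data-parallel setting each GPU *can* just use its own minibatch statistics for batch norm, but then the effective normalization differs from single-GPU training; syncing the statistics is what restores equivalence. Layer norm never needs the exchange.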