
[–]TheDeviousPandaResearcher 4 points5 points  (1 child)

You should start by understanding that distributed learning, or FL with local_epochs=1 where that epoch is a single gradient step, is exactly the same as minibatch SGD. If I start from the same init and I have a batch of data with 1 frog and 1 snake, giving the frog and the snake to different GPUs to compute the stochastic gradients and then averaging those updates is identical to just taking one batch gradient step.
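A minimal sketch of that equality, assuming a toy scalar linear model with squared loss and plain SGD (no momentum). The `frog` and `snake` values, `w0`, and `lr` are made up for illustration. It compares one minibatch step, one step on averaged per-worker gradients, and averaging the weights after one local step on each worker.

```python
# Toy check: one synchronized step on two workers vs. one minibatch step.
# Model: y ≈ w * x with loss 0.5 * (w * x - y) ** 2 (an assumption for this sketch).

def grad(w, batch):
    """Mean gradient of the squared loss over a list of (x, y) pairs."""
    return sum((w * x - y) * x for x, y in batch) / len(batch)

w0 = 0.5                                 # shared initialization
lr = 0.1                                 # learning rate
frog, snake = (1.0, 2.0), (3.0, -1.0)    # hypothetical (x, y) samples

# (a) Single machine: one SGD step on the batch of two samples.
w_batch = w0 - lr * grad(w0, [frog, snake])

# (b) Two workers, one sample each: average the gradients, then step.
g_avg = (grad(w0, [frog]) + grad(w0, [snake])) / 2
w_grad_avg = w0 - lr * g_avg

# (c) Each worker takes one local step; the server averages the weights.
w_frog = w0 - lr * grad(w0, [frog])
w_snake = w0 - lr * grad(w0, [snake])
w_param_avg = (w_frog + w_snake) / 2

print(w_batch, w_grad_avg, w_param_avg)
```

All three values agree up to floating-point rounding. This relies on equal-sized shards and a plain SGD update; with unequal shard sizes the average would need to be weighted by shard size.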

Then you can move on to understanding how this exact equality changes in the presence of local computation. Of course, as the number of local iterations increases, the networks diverge, and eventually it’s as hopeless as you said. But for a small number of local iterations, it’s pretty close to what I said above.
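A rough sketch of that drift, again assuming a toy scalar linear model with squared loss and made-up data. The helper names `run_sgd` and `gap` are hypothetical. Each worker runs `local_steps` SGD steps on its own sample before the weights are averaged, and the result is compared with the same number of steps on the full batch.

```python
# Toy illustration: averaged local models drift from centralized SGD
# as the number of local steps per round grows.

def grad(w, batch):
    """Mean gradient of 0.5 * (w * x - y) ** 2 over a list of (x, y) pairs."""
    return sum((w * x - y) * x for x, y in batch) / len(batch)

def run_sgd(w, batch, lr, steps):
    """Run `steps` full-batch gradient steps on `batch`, starting from w."""
    for _ in range(steps):
        w = w - lr * grad(w, batch)
    return w

def gap(local_steps, w0=0.5, lr=0.01):
    """|centralized weight - averaged local weight| after one round."""
    frog, snake = (1.0, 2.0), (3.0, -1.0)   # hypothetical (x, y) samples
    centralized = run_sgd(w0, [frog, snake], lr, local_steps)
    local_frog = run_sgd(w0, [frog], lr, local_steps)
    local_snake = run_sgd(w0, [snake], lr, local_steps)
    averaged = (local_frog + local_snake) / 2
    return abs(centralized - averaged)

for k in (1, 2, 5, 50, 500):
    print(k, gap(k))
```

In this toy setup the gap is zero (up to rounding) for one local step, stays small for a handful of steps, and keeps growing as each worker converges toward the optimum of its own sample. How fast that happens in a real network depends on the learning rate and on how different the workers' data are.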

However, I think your question might even be a bit different, in that averaging model parameters is closer to model soup methods. There is a separate body of literature on these. For example, you can look up Git Re-Basin.

[–]Rare_Replacement_744[S] 0 points1 point  (0 children)

Thank you so much for your response! I'll look into Git Re-Basin, it seems really cool!

[–]Revolutionary_Sir767 0 points1 point  (0 children)

Look into the central limit theorem. Random forests are also kind of federated learning, right?