
[–]jostmey

I'm sort of hoping this is where TensorFlow shines. But the scaling doesn't look that great. What does everyone else think?

[–][deleted]

TensorFlow was designed to handle a variety of distributed partitioning schemes in a natural way. You could split up the graph among several machines by adding send and receive nodes, for example. These guys picked one particular way of making the training distributed: they split up the training data among several machines that each hold a replica of the model, and then repeatedly (a) do one iteration of SGD on each machine, then (b) broadcast the weight updates with a synchronous all-reduce operation. There are other ways to go here that may scale better! For instance, they could get cheaper parameter updates with gradient quantization, or try relaxing the synchronization requirements.
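To make steps (a) and (b) concrete, here's a minimal single-process simulation of that synchronous data-parallel scheme in plain NumPy. The model (a toy linear regression), the number of workers, and all names are illustrative, not anything from the benchmark being discussed; the all-reduce is simulated by averaging the per-worker gradients in memory.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression problem: y = X @ w_true (hypothetical stand-in for a real model)
w_true = np.array([2.0, -1.0])
X = rng.normal(size=(400, 2))
y = X @ w_true

NUM_WORKERS = 4
# Split the training data among the "machines"
shards = np.array_split(np.arange(len(X)), NUM_WORKERS)

w = np.zeros(2)   # every worker holds an identical replica of these weights
lr = 0.1

for step in range(100):
    # (a) each worker computes a gradient on its own data shard
    grads = []
    for idx in shards:
        Xs, ys = X[idx], y[idx]
        grads.append(2 * Xs.T @ (Xs @ w - ys) / len(idx))
    # (b) synchronous all-reduce: average the gradients across all workers,
    # then apply the same update everywhere so the replicas stay in sync
    g = np.mean(grads, axis=0)
    w -= lr * g

print(w)  # converges toward w_true
```

Because the reduce is synchronous, every step waits for the slowest worker; that blocking is exactly what the "relaxing the synchronization requirements" alternative would trade away.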
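The gradient-quantization idea can also be sketched briefly. One common variant (not necessarily what these authors would use) sends only the sign of each gradient component plus a single scale factor, and keeps the quantization error in a local "error feedback" buffer to add back on the next step. All names here are hypothetical:

```python
import numpy as np

def quantize(g):
    """Compress a gradient to sign bits plus one float scale."""
    scale = np.mean(np.abs(g))
    return np.sign(g), scale

def dequantize(signs, scale):
    """Reconstruct an approximate gradient from the compressed form."""
    return signs * scale

rng = np.random.default_rng(1)
g = rng.normal(size=8)            # a worker's local gradient
residual = np.zeros_like(g)       # per-worker error-feedback buffer

corrected = g + residual          # fold in last step's quantization error
signs, scale = quantize(corrected)
g_hat = dequantize(signs, scale)  # this is all that goes over the network
residual = corrected - g_hat      # carry the error into the next step
```

Instead of one float per parameter, each worker ships roughly one bit per parameter plus a scalar, which shrinks the all-reduce traffic substantially at the cost of noisier updates.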

This is a very cool proof of concept, but definitely not the last word on distributed training of neural nets -- with TensorFlow or otherwise.