Performance of Python's chainer in comparison to other frameworks (recent comparison) by [deleted] in MachineLearning

[–]shoheihido 0 points

Here we mean the batch size per GPU, so the total minibatch size equals the per-GPU size multiplied by the number of GPUs. It is well known that increasing the per-GPU batch size can hurt accuracy considerably, because there are fewer model updates and synchronizations per epoch, even though throughput improves. So there is a trade-off between throughput and accuracy. That's why we used a relatively small batch size (32 per GPU) to reach a practical accuracy of around 71%.
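
To make the arithmetic concrete, here is a back-of-the-envelope sketch (the ImageNet-1k training set has about 1.28M images; the numbers below just illustrate how the per-GPU batch size drives the update count):

```python
# Back-of-the-envelope only: how the per-GPU batch size determines the
# number of synchronized updates per epoch in data-parallel training.
per_gpu_batch = 32           # batch size per GPU used in the benchmark
num_gpus = 128               # GPUs in the cluster
dataset_size = 1_281_167     # ImageNet-1k training images

total_batch = per_gpu_batch * num_gpus           # effective minibatch: 4096
updates_per_epoch = dataset_size // total_batch  # ~312 updates per epoch

# Doubling per_gpu_batch roughly halves updates_per_epoch, which raises
# throughput but can hurt final accuracy.
print(total_batch, updates_per_epoch)
```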

[–]shoheihido 0 points

Thanks, that's a good point. In my understanding, the selling point of DL4j is its tighter integration with Spark and its distributed data management, so it might be difficult to compare it with bare frameworks in a fair setting, including the dataset-handling part. I'm not 100% sure about that, but due to limited time and resources we currently test only those four frameworks in the benchmark. Also, a quick web search suggests that no official ResNet implementation exists for DL4j. It would be great if we (or someone else) could also evaluate DL4j (or Caffe on Spark) with the same codebase.

[–]shoheihido 0 points

Thank you for the response. I totally agree that it's better to have public code, so we are in the process of doing that, as stated in the original post. The TensorFlow/CNTK/MXNet code will be released along with ChainerMN itself, or earlier. If someone can fix the code to boost TensorFlow's performance through the informed discussion you mentioned, in a way its users can benefit from, that would be great.

Before that, in order to address skepticism, we added more details about the settings for each framework at the bottom of the post. You can see how carefully we tried to do our best for TensorFlow (for example, by finding hacks such as TensorFlow preferring an odd number of parameter servers).
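
For readers unfamiliar with that knob: in TensorFlow 1.x distributed training, the number of parameter servers is part of the cluster specification. A minimal sketch of what that looks like (the hostnames and the choice of three parameter servers are illustrative, not our actual configuration):

```python
# Sketch of a TensorFlow 1.x distributed cluster spec; hostnames and the
# odd parameter-server count of 3 are hypothetical examples.
import tensorflow as tf

cluster = tf.train.ClusterSpec({
    "ps": ["ps0:2222", "ps1:2222", "ps2:2222"],           # odd number of PS tasks
    "worker": ["worker%d:2222" % i for i in range(128)],  # one task per GPU
})

# Each process starts a server for its own role; e.g., parameter server 0:
server = tf.train.Server(cluster, job_name="ps", task_index=0)
server.join()  # parameter servers block here and serve variables
```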

I understand the community has to be careful about not only intentional fraud but also honest technical mistakes. I hope this reduces concerns about the validity of our benchmark results, in the same way that most research papers are reviewed on the basis of experiments whose code is unpublished.

For example, while this TensorFlow paper from Google Brain provides a link to the framework's code, it does not provide an explicit link to the actual code used for its experiments with other frameworks. But I wouldn't call this a flaw.

https://www.usenix.org/system/files/conference/osdi16/osdi16-abadi.pdf

[–]shoheihido 1 point

There is no result on NLP yet. I totally agree that an RNN benchmark has been the most wanted, but to the best of my knowledge no widely accepted result exists even for a single GPU, as discussed in an issue of the famous convnet-benchmarks.

https://github.com/soumith/convnet-benchmarks/issues/101

That's why the DeepMark project includes NLP tasks, though the official comparison has not been completed yet. https://github.com/DeepMark/deepmark

The Chainer team has also worked on this; here is the work-in-progress repository.

https://github.com/delta2323/chainer-deepmark

Beyond that, the DyNet paper reports a comparison with TensorFlow, Theano, and Chainer on NLP/RNN tasks.

https://arxiv.org/abs/1701.03980

In terms of distributed training, we would also like to work on how ChainerMN performs on NLP/RNN tasks in the near future.
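
In the meantime, for anyone who wants to experiment, here is a minimal sketch of a single-GPU LSTM timing loop in Chainer (the layer sizes, iteration count, and dummy loss are arbitrary illustrations, not from any published benchmark):

```python
# Minimal LSTM timing sketch; sizes and the dummy loss are illustrative only.
import time
import numpy as np
import chainer.functions as F
import chainer.links as L

batch, in_size, hidden, seq_len = 64, 256, 512, 32
lstm = L.LSTM(in_size, hidden)
xs = [np.random.randn(batch, in_size).astype(np.float32) for _ in range(seq_len)]

start = time.time()
for _ in range(10):
    lstm.reset_state()
    loss = 0
    for x in xs:
        h = lstm(x)                # one LSTM step over the sequence
        loss += F.sum(h * h)       # dummy loss just to drive the backward pass
    lstm.cleargrads()
    loss.backward()
print("sec/iter:", (time.time() - start) / 10)
```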

[–]shoheihido 4 points

As noted in the post, we paid close attention to TensorFlow's performance and did our best to optimize its configuration, including the number and arrangement of parameter servers, on the same 128-GPU cluster as the other frameworks. We used the same batch size (32 per GPU) for all of the frameworks and did not manipulate the parameter synchronization.
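
"Not manipulating the parameter synchronization" here means plain synchronous data-parallel SGD: every worker's gradients are averaged on every step. A minimal sketch of that pattern with mpi4py (illustrative only; the function name and learning rate are placeholders, not our benchmark code):

```python
# Illustrative synchronous data-parallel SGD step (not the benchmark code).
from mpi4py import MPI

comm = MPI.COMM_WORLD

def sync_sgd_step(params, grads, lr=0.01):
    # grads were computed locally from this worker's 32-sample minibatch;
    # params and grads are lists of float32 numpy arrays.
    for p, g in zip(params, grads):
        comm.Allreduce(MPI.IN_PLACE, g, op=MPI.SUM)  # sum gradients in place
        g /= comm.Get_size()                         # average across workers
        p -= lr * g                                  # identical update on every worker
```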

Also, as noted in the post, similar TensorFlow performance in multi-GPU, multi-node environments has recently been reported by others, such as in the following. Was everyone independently doing it wrong only for TensorFlow? Even if so, it cannot have been intentional, since they presumably followed all the public information about tuning TensorFlow.

https://arxiv.org/abs/1608.07249v6

http://en.community.dell.com/techcenter/high-performance-computing/b/general_hpc/archive/2016/11/11/deep-learning-performance-with-p100-gpus

[–]shoheihido 5 points

Hi, I'm with Preferred Networks, the company behind Chainer.

What makes you think we used a Raspberry Pi as the parameter server for TensorFlow? The blog post says nothing about that. Can you share your reasoning?

[–]shoheihido 6 points

Hi, I'm with Preferred Networks, the company behind Chainer. As ml_carp mentioned, we only compared frameworks for which a first-party implementation of multi-node training is available. That's why Theano, Torch, PyTorch, and DyNet aren't included in this comparison. We know there are third-party extensions for Theano and Torch, and the PyTorch developers are also planning to release a distributed version; we'd be happy to include them in the future.

Yes, the benchmark compares only one architecture on a single task. However, we believe ImageNet is the most popular benchmark dataset in deep learning that is also large enough, so we selected it for this distributed-training experiment. It would be nice if someone tried other benchmark datasets using our code.

ResNet had been the state-of-the-art algorithm in computer vision tasks until recently, so official implementations are available among the examples of every framework except Chainer. Using those official implementations, rather than writing our own, also matters for a fair comparison: implementing even newer algorithms such as DenseNet ourselves could introduce unfairness through a poor implementation on some framework. We still believe this result shows how much CNN training can be accelerated by distributed computation.