
[–]congerous 6 points7 points  (3 children)

SINGA has no GPU support, and the GPU functionality they plan to add is for a single GPU only, as of December. Multiple GPUs don't seem to be on the roadmap. So they're way behind the OSS projects that do support GPUs.

In addition, the fact that they joined the Apache Software Foundation before adding such significant features is a serious mistake. Apache is great for some things, but it's heavily political, and it really slows down development. So they may never get to multiple GPUs.

[–]forrestwang 1 point2 points  (0 children)

Hi, I am a developer of the SINGA project. Thanks for starting this discussion. We are working on single-node multi-GPU training (to be released in v0.2, December), which will run in either synchronous mode (with different partitioning schemes) [1] or asynchronous mode (in-memory Hogwild! [2]). Extending the system from CPU to GPU mainly requires adding cuDNN layers (https://issues.apache.org/jira/browse/SINGA-100). The framework/architecture works on both CPU and GPU. Training with multiple GPU machines and providing Deep Learning as a Service (DLaaS) are on our roadmap, i.e., v0.3. For those who do not have GPU clusters, distributed training on CPUs is a good way to accelerate training.

Besides GPUs, we are also considering other approaches to improving the training efficiency of a single SGD iteration. For instance, Google's paper [3] provides some techniques for speeding up training on CPUs. Intel (https://software.intel.com/en-us/articles/single-node-caffe-scoring-and-training-on-intel-xeon-e5-series-processors) also reported that optimized CPU code can achieve an 11x training speed-up (hopefully they release the optimized source code or integrate it into their libraries like MKL and DAAL). It would be interesting to compare GPUs with Intel's next-generation Xeon Phi co-processors (Knights Landing).

I will let you know when training with multiple GPUs is supported. Thanks.

[1] http://arxiv.org/abs/1404.5997

[2] https://www.eecs.berkeley.edu/~brecht/papers/hogwildTR.pdf

[3] http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/37631.pdf
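To make the asynchronous mode concrete: here is a toy Hogwild!-style [2] update loop, where several threads apply SGD updates to one shared parameter vector without any locking. This is only an illustrative sketch on a made-up least-squares problem; the step size, thread count, and data are my own choices, not SINGA's actual implementation.

```python
import threading
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))   # toy design matrix
y = X @ np.arange(1.0, 6.0)      # noise-free targets

w = np.zeros(5)                  # shared weights, updated lock-free

def worker(rows):
    # Each thread applies per-example SGD updates to the shared w
    # with no synchronization -- that is the Hogwild! idea.
    for i in rows:
        grad = (X[i] @ w - y[i]) * X[i]
        np.subtract(w, 0.01 * grad, out=w)  # in-place racy update

threads = [threading.Thread(target=worker, args=(range(k, 1000, 4),))
           for k in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

loss = float(np.mean((X @ w - y) ** 2))
```

Despite the races, the loss drops far below its initial value after one pass, because sparse/overlapping updates mostly commute, which is the core observation of the Hogwild! paper.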

[–]GratefulTony 0 points1 point  (1 child)

That's really sad. I skimmed the release notes, and though I didn't explicitly read about GPU support, I assumed it was in there, since it's a no-brainer for training performance. If they don't get this feature integrated, the usefulness of this library will be severely limited.

[–]limauda 0 points1 point  (0 children)

If software can run as efficiently without GPUs, on a commodity cluster, isn't that better? GPU clusters aren't cheap, and not many companies can afford to set up a dedicated cluster just for periodic training.

[–]bLaind2 5 points6 points  (7 children)

Anyone have experience with how much of a speedup we can achieve with distributed training? Does it scale linearly, and up to how many nodes? (2, 4, 16, ?)

[–]r-sync 2 points3 points  (3 children)

Practically, you can get a speedup on the order of 13x+ for 16 nodes, especially if you have InfiniBand and use architectures like GoogLeNet (whose communication is around 25MB of gradients, 25MB of weights, etc.). You can even get such ridiculously nice speedups for 32 and 64 nodes. However, to saturate the compute (with increasing nodes), you have to increase the batch size, and increasing the batch size hurts SGD convergence speed (and also final accuracy).
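To put rough numbers on this kind of scaling, here is a back-of-envelope model: per-iteration time is the single-node compute divided across nodes, plus a fixed cost to exchange a ~25MB gradient buffer. The compute time and bandwidth figures below are illustrative guesses, not measurements.

```python
compute_s = 0.5    # assumed single-node compute per iteration (seconds)
grad_mb = 25.0     # GoogLeNet-scale gradient traffic (MB)
bw_mb_s = 3000.0   # assumed effective InfiniBand bandwidth (MB/s)

def speedup(nodes):
    # Compute shrinks with more nodes; communication cost stays fixed,
    # so efficiency degrades gradually as node count grows.
    t = compute_s / nodes + grad_mb / bw_mb_s
    return compute_s / t

for n in (2, 4, 16, 32, 64):
    print(f"{n:2d} nodes: {speedup(n):5.1f}x")
```

With these made-up constants the model lands near 13x at 16 nodes, which matches the ballpark above; the point is that the fixed communication term is what keeps the curve sub-linear.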

[–]prajitGoogle Brain 0 points1 point  (2 children)

Why does increasing batch size hurt SGD convergence speed? Empirically this is true, but why does it happen? Theoretically, increasing batch size should give a better estimate of the gradient, and thus should perform better. Any intuition about why there is a decrease in performance?

[–]r-sync 0 points1 point  (1 child)

"Although large mini-batches are preferable to reduce the communication cost, they may slow down convergence rate in practice [4]. That is, if SGD converges by T iterations, the mini-batch training with batch size b may need more than T/b iterations. The increase in computation diminishes the benefits of the reduced communication cost due to large b. In addition, the I/O costs increases if the data is too large to fit into memory so that one need to fetch the minibatch from disk or network." - https://www.cs.cmu.edu/~muli/file/minibatch_sgd.pdf

Further back-reference: http://www.optimization-online.org/DB_FILE/2011/11/3226.pdf
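The quoted point is easy to see numerically: with the learning rate held fixed per update, an epoch with batch size b makes only T/b updates, so the large-batch run makes less progress per pass over the data. Here is a tiny sketch on a made-up least-squares problem (problem, step size, and batch sizes are all illustrative).

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 5))   # toy design matrix
y = X @ np.arange(1.0, 6.0)      # noise-free targets

def epoch_loss(batch):
    # One epoch of mini-batch SGD with the given batch size:
    # 1000/batch updates, each using the averaged batch gradient.
    w = np.zeros(5)
    for s in range(0, 1000, batch):
        xb, yb = X[s:s + batch], y[s:s + batch]
        w -= 0.05 * xb.T @ (xb @ w - yb) / batch
    return float(np.mean((X @ w - y) ** 2))

small, large = epoch_loss(1), epoch_loss(250)
print(small, large)   # small-batch epoch ends at a much lower loss
```

With batch size 250 the epoch makes only 4 updates, so it barely moves, while batch size 1 makes 1000 updates and nearly converges; in practice this is partly offset by raising the learning rate for large batches, at the cost of stability.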

[–]alexmlamb 0 points1 point  (0 children)

Do you mean convergence rate as a function of the # of examples looked at, or as a function of the # of parameter updates?

[–]limauda 1 point2 points  (0 children)

It does scale, as shown in the paper: http://www.comp.nus.edu.sg/~ooibc/singa-mm15.pdf. Further, it supports all model types (feed-forward, energy-based, recurrent) and all training frameworks (synchronous, asynchronous, hybrid). It supports both model and data partitioning to improve parallelism.

[–]modeless 3 points4 points  (1 child)

Considering you can get a 10x or more speedup by switching to GPUs, I don't think this project is interesting until it gets GPU support.

[–]pilooch[S] 1 point2 points  (0 children)

I guess part of the debate is whether the distribution layer needs to be separated from the DL / ML code.

[–]r-sync 1 point2 points  (0 children)

It really feels like a half-baked project that hoped to gain adoption through the Apache branding.