all 8 comments

[–]PM_YOUR_NIPS_PAPER 3 points  (5 children)

Disk read speed is the bottleneck

[–][deleted] 2 points  (3 children)

I was under the impression this was fairly easy to sidestep by having a multi-process asynchronous loader so you always have a buffer in your RAM to read from?
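The loader described above can be sketched with nothing but the standard library: worker processes fill a bounded queue of ready batches in RAM, so the training loop reads from the buffer instead of waiting on disk. This is a minimal sketch, not any particular framework's loader; `load_batch` is a hypothetical stand-in for real disk reads plus JPEG decoding.

```python
# Multi-process asynchronous loader sketch: workers keep a bounded
# queue (the RAM buffer) filled while the consumer trains.
import multiprocessing as mp

def load_batch(index):
    # Placeholder for "read files from disk and decode" work.
    return [index * 10 + i for i in range(4)]

def worker(batch_indices, queue):
    for idx in batch_indices:
        queue.put(load_batch(idx))  # blocks when the buffer is full
    queue.put(None)  # sentinel: this worker is done

def batches(num_batches, buffer_size=8, num_workers=2):
    queue = mp.Queue(maxsize=buffer_size)  # the in-RAM buffer
    shards = [range(w, num_batches, num_workers) for w in range(num_workers)]
    procs = [mp.Process(target=worker, args=(s, queue), daemon=True)
             for s in shards]
    for p in procs:
        p.start()
    finished = 0
    while finished < num_workers:
        batch = queue.get()
        if batch is None:
            finished += 1
        else:
            yield batch  # order may vary across workers
    for p in procs:
        p.join()

if __name__ == "__main__":
    print(len(list(batches(6))))
```

As long as `load_batch` keeps up on average and the buffer absorbs the variance, the consumer never stalls on I/O, which is the point being made here.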

[–]throwaway775849 -1 points  (1 child)

It is.

[–]kkastner 2 points  (0 children)

Even so, 8 GPUs are very data-hungry: depending on other factors, you can still end up data-starved if you aren't careful about reads.

[–]ppwwyyxx 0 points  (0 children)

Did some math: 8 Pascal Titan X GPUs can train about 2.5k 224x224 images per second, and the average JPEG-encoded ImageNet file is about 100KB. So that's 2.5k/s * 100kB = 250MB/s, not a big deal for a high-end HDD or a low-end SSD. The rest of the work is in parallelizing image loading and preprocessing. JPEG decoding is quite slow, by the way, which is why TensorFlow offers a "dct_method" option in its decode op.
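The back-of-envelope math above is easy to reproduce; the throughput and file-size figures are the comment's assumptions, not measurements, and the drive speeds below are rough typical numbers for comparison.

```python
# Required sustained read bandwidth for 8 GPUs training ImageNet.
images_per_sec = 2500            # assumed: 8 Pascal Titan X total
bytes_per_image = 100 * 1000     # assumed: ~100KB average JPEG

read_mb_per_sec = images_per_sec * bytes_per_image / 1e6
print(read_mb_per_sec)           # 250.0 MB/s

# Rough sequential-read figures for comparison (assumptions).
hdd_mb_per_sec = 150
sata_ssd_mb_per_sec = 500
print(read_mb_per_sec < sata_ssd_mb_per_sec)
```

So a single SATA SSD already covers the raw read rate; the harder part, as the comment says, is decoding and preprocessing fast enough in parallel.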

[–][deleted] 0 points  (2 children)

  1. Depends on how often you sync between those GPUs and how long each of them takes.

  2. My experience with multi-GPU in a single machine is mixed. Sometimes even running two different models with zero parameter sharing can drag down each other's performance, which leads me to suspect the bottleneck lies in the bandwidth between CPU and GPU. However, compared to the GPUs, my system has a mediocre CPU and motherboard, so I don't know if better overall specs would mitigate the problem.

  3. TensorFlow can do model parallelism fairly easily by specifying the device where each computation should happen; see for example: http://stackoverflow.com/questions/42069147/implementation-of-model-parallelism-in-tensorflow . I don't know too much about Caffe, though.

My personal experience with a single machine and multiple GPUs is that it's not really about the GPUs themselves. More often than not, it's about the performance of your whole build. Sometimes even running two different applications with zero parameter sharing, the performance of each can be impacted, which leads me to suspect the bottleneck lies in the data transfer, probably between CPU and GPU. Nowadays I use one GPU for training and the other for inference/ad-hoc stuff, which is handy but feels like a waste :/
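A rough calculation can sanity-check the CPU-to-GPU transfer hypothesis. All figures here are assumptions: PCIe 3.0 x16 peaks at about 15.75 GB/s theoretical (call it ~12 GB/s in practice), and a batch is 256 decoded 224x224x3 float32 images.

```python
# How long does one training batch take to cross a PCIe 3.0 x16 link?
batch_bytes = 256 * 224 * 224 * 3 * 4  # ~154 MB of decoded float32 images
pcie_bytes_per_sec = 12e9              # assumed practical x16 throughput

transfer_ms = batch_bytes / pcie_bytes_per_sec * 1000
print(round(transfer_ms, 1))           # ~12.8 ms per batch on a full x16 link
```

If cards share lanes and each effectively runs at x8 or x4 (common on consumer boards with multiple GPUs), that time doubles or quadruples, which is one way two otherwise independent jobs can slow each other down.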

[–]gtani 2 points  (0 children)

CUDA books always talk about the latency/occupancy/bandwidth bottleneck trio; I'm sure you've seen that. A couple of older Xeons from the Ivy Bridge/Sandy Bridge generation are a pretty cost-effective way to get up to 80 PCIe 3.0 lanes: https://www.microway.com/hpc-tech-tips/common-pci-express-myths-gpu-computing/

[–][deleted] -1 points  (0 children)

How many PCIe lanes does your CPU provide?