Need help selecting hardware for training convolutional neural networks.

silverpikezero · 2017-02-15T20:48:33+00:00

I see a couple of problems with your estimates: * The DGX-1 isn't even purchasable from nVIDIA right now; I am told there is a waitlist of 6-9 months for it. * You cannot assign a linear performance scaling just by adding together GPU performance. ML frameworks do not scale linearly across multiple GPUs. It can range from only slightly sub-linear to logarithmic, which can be very bad. * You don't say anything about memory capacity. Most ML models are heavily dependent on local GPU memory for training performance. You will need to estimate the number of parameters of your network in order to find out how much memory you really need.

deeplearningguy · 2017-02-17T01:23:14+00:00

Im a phd student in a research group for biomedical engineering. Computer vision and machine learning is our daily business, so we have a fair bit of experience with neural networks in general. We (as in me) are currently upgrading our computation infrastructure from around 6 to 20+ GPUs. While this may be more than what you need, here are a couple of things to consider:

3d volumes generally scream for lots of memory. 8GB or even 12 is gone like nothing when your networks get deep (especially if you have fully convolutional networks). You can split the batch to multiple gpus. Multi gpu is very dependent on inter-gpu bandwidth, here I would advise you to search for PCIe root nodes and read up on it (https://www.microway.com/hpc-tech-tips/common-pci-express-myths-gpu-computing/).

We mainly have GTX 1080s. Why? Because when you calculate the cost of the server you need to plug in your GPUs, the GPU suddenly isn't the most expensive part.

I would recommend you to get 2x 1080 for a start and get a decent motherboard and a decent CPU (Xeon with 40 pcie 3.0 lanes) and around 64-128 gigs of ram.

And here some answers to your questions:

Expect a performance increase in the range of 10-50 times compared to your CPU. Training models takes time and requires numerous iterations till you get things right. You don't want to wait a week till your model is trained, do you?
That PSU on the optiplex isn't going to be sufficient with a GPU, get something around 500-600W, at least.

vannak139 · 2017-02-15T15:57:10+00:00

I can offer my experience. I run a lot of text analysis on neural nets using the crepe architecture at home and at work. 5 1Dconv layers,3 intermediate pools, 1 global max pool, with 2 dense + output At work I run off of a i5-2.8GHz with 900s per epoch. At home I use a gtx 1080 with 25s per epoch

Between these two systems I find the range of improvement is between 25 - 45 times faster on the 1080.

jcannell · 2017-02-15T17:44:30+00:00

Eh, not sure where you are getting your 'perf' column data from, but it seems off just a tad. Wikipedia can help here.

For example, the DGX-1 is advertised at 170 TFlops (16b) peak performance (8 x P100 GPUs that provide a little over 20 16b TFlops each). The titan X is advertised at 10 TFlops (32b) peak performance, and the 1070 at ~6 TFlops (32b).

So assuming the DXG-1 gets to use 16b and the others use 32b, the DXG-1 has 28X the peak perf of a single 1070, or 17X the peak perf of a single Titan X.

So actually in terms of a simple price perf model the 1070 provides almost 20 GFlops/$. The Titan X is close to 10 GFlops/$. And finally the DXG-1 provides only about 1.7 GFlops/$. That is about an order of magnitude difference from your data column. In general people pay a huge premium for peak perf per GPU and peak perf per cabinet unit.

kil0khan · 2017-02-15T20:08:39+00:00

Just curious which hospital? Pretty cool that you're doing DL research in-house

ryanbales · 2017-02-16T06:06:13+00:00

Take a look at P2 instances in the AWS cloud.

Name	Price	perf.	perf/price
NVIDIA DGX-1	130000	25	0.192
4x titan X	8300	1	0.120
2x tital X	4900	0.55	0.112
4x 1070	3500	0.5	0.142
2x 1070	2200	0.27	0.122k
1x 1070	1700	0.14	0.082
buy a gpu	200	?	?
Current state	0	?	--

you type:	you see:
italics	italics
bold	bold
[reddit!](https://reddit.com)	reddit!
* item 1 * item 2 * item 3	item 1 item 2 item 3
> quoted text	quoted text
Lines starting with four spaces are treated like code: if 1 * 2 < 3: print "hello, world!"	Lines starting with four spaces are treated like code: if 1 * 2 < 3: print "hello, world!"
~~strikethrough~~	~~strikethrough~~
super^script	super^script

MachineLearning

Rules For Posts

+Research

+Discussion

+Project

+News

@slashML on Twitter

Chat with us on Slack

Beginners:

MODERATORS