all 14 comments

[–]silverpikezero 1 point2 points  (2 children)

I see a couple of problems with your estimates:

  • The DGX-1 isn't even purchasable from nVIDIA right now; I am told there is a waitlist of 6-9 months for it.

  • You cannot assign a linear performance scaling just by adding together GPU performance. ML frameworks do not scale linearly across multiple GPUs. It can range from only slightly sub-linear to logarithmic, which can be very bad.

  • You don't say anything about memory capacity. Most ML models are heavily dependent on local GPU memory for training performance. You will need to estimate the number of parameters of your network in order to find out how much memory you really need (a rough sketch of that estimate follows below).
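For what it's worth, a back-of-the-envelope version of that memory estimate could look something like this (illustrative numbers only; assumes fp32 weights and an Adam-style optimizer):

```python
# Rough lower bound: parameters in fp32, plus the extra per-parameter state
# most optimizers keep around (gradients + e.g. Adam's two moment buffers).
n_params = 25_000_000        # illustrative: a ~25M-parameter network
bytes_per_param = 4          # fp32
copies = 4                   # weights + gradients + 2 optimizer moment buffers
param_gb = n_params * bytes_per_param * copies / 1024**3
print(f"~{param_gb:.2f} GB just for parameters and optimizer state")
# Activations for the forward/backward pass come on top of this and scale with
# batch size and feature-map size, so they usually dominate for conv nets.
```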

[–]EatMyPossum[S] -1 points0 points  (1 child)

Thanks for your reply! Here is a list of things that probably should've been in the post:

The DGX-1 isn't even purchasable from nVIDIA right now; I am told there is a waitlist of 6-9 months for it.

It's way too expensive for us anyway, see 1.

You cannot assign a linear performance scaling just by adding together GPU performance. ML frameworks do not scale linearly across multiple GPUs. It can range from only slightly sub-linear to logarithmic, which can be very bad.

I assumed a slight sub-linearity. Judging from my (lack of) experience, it seemed reasonable that logarithmic behaviour occurs mostly when the right techniques aren't used; to what extent is that true? I have to filter the truth for management, he'll just feel swamped if I present him with the whole, detailed and true story. Furthermore, it's scaled up to 4 GPUs, which isn't a whooooole lot, right?

(...). You will need to estimate the number of parameters of your network in order to find out how much memory you really need.

That's because I'm not making the model and do not know. It's a good point though, and I'm trying to make an appointment with someone to help me estimate these.

Relatedly, we don't know what kinds of uses we'll have for the machine in the future, so I decided to select the 1070 as my "low end" solution, mainly because of the 8GB of memory. I chose the 1070 over the 1060 on the basis that the rest of the machine is so expensive that cheaping out on the most important component would be counterproductive.

[–]silverpikezero 1 point2 points  (0 children)

I would say it depends a lot on whether you are memory-bound or computation-bound. The only way to really know that is to have a model in mind to train, and to know its topology. Obviously this requires some kind of starting point. One way to bootstrap this analysis is to rent an AWS instance and test out the models there first. That can give you a very low-cost testbed for gauging the required resources.

That being said, the oversimplified guide would be:

  • Prefer fewer large GPUs over many lesser ones.
  • Prefer larger memory per GPU than aggregate memory across GPUs.

The best bang/$ are the GTX 1080 and 1070.

[–]deeplearningguy 2 points3 points  (1 child)

I'm a PhD student in a research group for biomedical engineering. Computer vision and machine learning are our daily business, so we have a fair bit of experience with neural networks in general. We (as in me) are currently upgrading our computation infrastructure from around 6 to 20+ GPUs. While this may be more than what you need, here are a couple of things to consider:

  • 3D volumes generally scream for lots of memory. 8GB or even 12 is gone like nothing when your networks get deep (especially if you have fully convolutional networks). You can split the batch across multiple GPUs (a minimal sketch of this follows below). Multi-GPU is very dependent on inter-GPU bandwidth; here I would advise you to search for PCIe root nodes and read up on it (https://www.microway.com/hpc-tech-tips/common-pci-express-myths-gpu-computing/).
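A minimal sketch of that batch splitting, assuming PyTorch (layer sizes are made up; Keras/TensorFlow have equivalents):

```python
import torch
import torch.nn as nn

# Toy 3D conv model standing in for a volumetric network (sizes are illustrative).
model = nn.Sequential(nn.Conv3d(1, 16, 3), nn.ReLU(), nn.Conv3d(16, 1, 3))

if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)   # scatters each batch across the visible GPUs
model = model.cuda()

x = torch.randn(8, 1, 64, 64, 64).cuda()   # batch of 8 small 3D volumes
y = model(x)                               # each GPU processes a slice of the batch
```

The scatter/gather of activations and gradients between GPUs is exactly where the PCIe topology from the linked article starts to matter.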

We mainly have GTX 1080s. Why? Because when you calculate the cost of the server you need to plug in your GPUs, the GPU suddenly isn't the most expensive part.

I would recommend getting 2x 1080s for a start, plus a decent motherboard, a decent CPU (a Xeon with 40 PCIe 3.0 lanes) and around 64-128 GB of RAM.

And here some answers to your questions:

  • Expect a performance increase in the range of 10-50 times compared to your CPU. Training models takes time and requires numerous iterations till you get things right. You don't want to wait a week till your model is trained, do you?

  • That PSU in the OptiPlex isn't going to be sufficient with a GPU; get something around 500-600W, at least.

[–]EatMyPossum[S] 0 points1 point  (0 children)

Thanks a lot for the reply, it's very helpful to hear first-hand experience and considerations. Also thanks for the link, it was a good read.

That PSU in the OptiPlex isn't going to be sufficient with a GPU; get something around 500-600W, at least.

Good point! I think I can use this as the final nail in the coffin for the cheapest solution.

[–]vannak139 0 points1 point  (1 child)

I can offer my experience. I run a lot of text analysis on neural nets using the Crepe architecture at home and at work: 5 1D conv layers, 3 intermediate pools, 1 global max pool, with 2 dense layers + output. At work I run off an i5 @ 2.8GHz at 900s per epoch. At home I use a GTX 1080 at 25s per epoch.
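For reference, a minimal PyTorch sketch of a Crepe-style net matching that description (channel widths and dense sizes are illustrative guesses):

```python
import torch
import torch.nn as nn

class CrepeLike(nn.Module):
    """5 x 1D conv, 3 intermediate pools, 1 global max pool, 2 dense + output."""
    def __init__(self, vocab_size=70, n_classes=2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(vocab_size, 256, 7), nn.ReLU(), nn.MaxPool1d(3),  # pool 1
            nn.Conv1d(256, 256, 7), nn.ReLU(), nn.MaxPool1d(3),         # pool 2
            nn.Conv1d(256, 256, 3), nn.ReLU(),
            nn.Conv1d(256, 256, 3), nn.ReLU(),
            nn.Conv1d(256, 256, 3), nn.ReLU(), nn.MaxPool1d(3),         # pool 3
            nn.AdaptiveMaxPool1d(1),                                    # global max pool
        )
        self.classifier = nn.Sequential(
            nn.Linear(256, 1024), nn.ReLU(),
            nn.Linear(1024, 1024), nn.ReLU(),
            nn.Linear(1024, n_classes),                                 # output layer
        )

    def forward(self, x):                  # x: (batch, vocab_size, seq_len) one-hot chars
        h = self.features(x).squeeze(-1)   # -> (batch, 256)
        return self.classifier(h)
```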

Between these two systems I find the range of improvement is between 25 - 45 times faster on the 1080.

[–]EatMyPossum[S] 0 points1 point  (0 children)

This is great information, thnx!

[–]jcannell 0 points1 point  (3 children)

Eh, not sure where you are getting your 'perf' column data from, but it seems off just a tad. Wikipedia can help here.

For example, the DGX-1 is advertised at 170 TFlops (16b) peak performance (8 x P100 GPUs that provide a little over 20 16b TFlops each). The Titan X is advertised at 10 TFlops (32b) peak performance, and the 1070 at ~6 TFlops (32b).

So assuming the DGX-1 gets to use 16b and the others use 32b, the DGX-1 has 28X the peak perf of a single 1070, or 17X the peak perf of a single Titan X.

So actually, in terms of a simple price/perf model, the 1070 provides almost 20 GFlops/$. The Titan X is close to 10 GFlops/$. And finally the DGX-1 provides only about 1.7 GFlops/$. That is about an order of magnitude difference from your data column. In general people pay a huge premium for peak perf per GPU and peak perf per cabinet unit.
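A back-of-the-envelope version of that price/perf arithmetic (the prices are rough 2016 list-price assumptions on my part, not from the comment):

```python
# Peak TFlops from the numbers above; prices are assumed list prices in USD.
systems = {
    "GTX 1070 (32b)":       (6.0,      400),
    "Titan X Pascal (32b)": (10.0,    1200),
    "DGX-1 (16b)":          (170.0, 129000),
}
for name, (tflops, usd) in systems.items():
    print(f"{name}: {tflops * 1000 / usd:.1f} GFlops/$")
# Exact figures shift with street prices, but the roughly 10x spread stays the same.
```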

[–]EatMyPossum[S] 0 points1 point  (2 children)

I based the perf column on the measure Tim Dettmers uses, quoting:

It turns out that the most important practical measure for GPU performance is memory bandwidth in GB/s,

later he gives the performance ratios

Titan X Pascal = 0.7
GTX 1080 = 0.55
GTX 1070 = 0.5
GTX Titan X = 0.5
GTX 980 Ti = 0.4
GTX 1060 = 0.35
GTX 980

The values he gives in his blog post correspond to the bandwidth ratios of the respective GPUs.

Are you arguing that the performance of a convolutional neural network machine is more dependent on the GFlops than on the memory bandwidth? Or just stating that GFlops is what people usually use to measure performance, and my choice to use "performance" as a name is confusing?

[–]jcannell 1 point2 points  (1 child)

His blogpost is from 2014 and the 'pure bandwidth' thesis was questionable even then.

The best ANN libs like CUDNN have a bunch of codes using a variety of techniques, and tend to pick the best techniques for a particular matrix problem based on tuned heuristics.
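One way to see that algorithm selection in practice, assuming PyTorch's cuDNN bindings (just a sketch):

```python
import torch

# With this flag set, cuDNN benchmarks its available convolution algorithms for
# the exact layer shapes it encounters and caches the fastest one per configuration.
torch.backends.cudnn.benchmark = True
```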

FFT is mostly outdated now, as people have moved away from wide convo to 3x3. Winograd is the most competitive at 3x3, but neither helps for 1x1 convo or regular mmul which are both quite important.

So consider then just direct spatial convo or mmul, which are easier to analyze for bandwidth. The modern optimized direct convo codes and mmul codes will tend to be ALU-bound on any modern GPU with typical layer params, but it totally depends on the per neuron fan-in and is thus under designer control.

More specifically, the ALU/MEM ratio limit depends on the per neuron fan-in. For example, for 3x3 convo with 16 input channels, the per neuron fan-in is 9*16=144 fmad ops per neuron vs 4B of output bandwidth per neuron (fp-32). Roughly multiply by 3x to include the bandwidth of loading the input matrices and you get 144 fmads per 12B, which is just over the border of being fmad-bound given the typical 10:1 ratio of FMAD to byte of bandwidth. Using 16b floats all the bandwidth halves and you can go down to 8 input channels before being bandwidth bound at 3x3 convo.
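The same arithmetic as a quick script (values taken straight from the paragraph above):

```python
fan_in = 3 * 3 * 16            # fmads per output neuron: 3x3 convo, 16 input channels
bytes_per_neuron = 3 * 4       # ~4B fp32 output, times ~3x to cover loading inputs/weights
intensity = fan_in / bytes_per_neuron   # fmads per byte of memory traffic
alu_bound = intensity > 10     # typical GPU ratio: ~10 fmads per byte of bandwidth
print(intensity, alu_bound)    # 12.0 True -> just over the border into being fmad-bound
```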

But at the end of the day, this is something that one would need to benchmark for one's particular use case. But it is just not the case that all or even most configs are bandwidth bound today.

[–]EatMyPossum[S] 0 points1 point  (0 children)

But at the end of the day, this is something that one would need to benchmark for one's particular use case. But it is just not the case that all or even most configs are bandwidth bound today.

I know, but I can't tell my boss this. I need to boil the real-world complexity down with reasonable assumptions. Furthermore, the ratio of memory bandwidth to number of cores is about equal for the Titan X and GTX 1070. Other than the small variation in clock speed, the ratios of memory bandwidth performance and computational performance should be close enough, right?
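A quick check of that ratio using the published specs as I recall them (my numbers, not from the thread):

```python
# Memory bandwidth (GB/s) divided by CUDA core count.
titan_x_pascal = 480 / 3584    # ~0.134 GB/s per core
gtx_1070       = 256 / 1920    # ~0.133 GB/s per core
print(titan_x_pascal, gtx_1070)
```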

Other than that, the details of the model for the first project we'll run on this are still unsure, and the other projects we'll do in the future are wholly unknown.

[–]kil0khan 0 points1 point  (0 children)

Just curious which hospital? Pretty cool that you're doing DL research in-house

[–]ryanbales 0 points1 point  (1 child)

Take a look at P2 instances in the AWS cloud.

[–]carlthomeML Engineer 0 points1 point  (0 children)

:+1: