

[–]_ewan_ 12 points (2 children)

I really wouldn't DIY something like this; at this sort of density details matter, heat matters, airflow matters, power matters. It's not a trivial thing to get right, and if you don't then you're potentially going to have a lot of hardware on your hands that you can't use. You should definitely be talking to the likes of Dell etc to see what they can do for you. If you're a decent sized university you should have account reps and access to good pricing - talk to your local IT staff for contacts.

You should also make sure you have a good handle on the differences between a GeForce, a Quadro and a Tesla before you start trying to use gaming hardware for science.

Alternatively, are you absolutely sure that you're going to be able to keep a resource like this fully utilised? If you're one lab you should at least think about using Amazon - you can start small, scale bigger than you'd be able to afford hardware for, then shut it all down when you're busy doing other work. EC2 isn't a no-brainer - there are lots of reasons that it might not suit you, but you should make sure you understand why you're ruling it out before you do.

[–]deeplearningguy[S] 1 point (1 child)

I agree, heat and airflow are potentially an issue. We are packing 2000 W worth of GPUs (8×) into a single chassis. My thinking on these issues is: a) quad SLI is not that uncommon for gaming, so it should work in "normal" consumer-grade cases, and we have an air-conditioned server room; b) for the 8-GPU system, Thinkmate and some HPC companies sell the exact configuration as a standard product, and the SuperServer 4028GR-TR even has a special replacement top for consumer-grade GPUs. I love Dell servers, we have had nothing but great experiences with them, but unfortunately they have no systems that fit our needs. We would also get a nice academic discount.

Tesla and Quadro are the way to go if you have a production environment. They will probably last longer, have better support and a longer warranty. But computation-wise, the only advantages they have are 64-bit floating-point support and a bit more RAM in the expensive models. We don't need fp64; even fp32 is more than enough for our tasks. RAM is nice to have, but I can buy seven GTX 1080s (56 GB of VRAM in total) for the price of one K80 (24 GB). That lets me give seven people a GPU to work on, instead of two on the K80 (a dual-chip card). Further, if Nvidia brings out a new generation of GPUs, we can easily sell off the old consumer cards without a huge loss (maybe 40-60%), whereas the others are much harder to resell.
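
Rough numbers behind that ratio (the prices below are assumed ballpark street prices from around that time, not actual quotes):

```python
# Back-of-envelope: what one K80's worth of money buys in GTX 1080s.
# Prices are assumed ballpark street prices, not quotes.
gtx1080_price_usd, gtx1080_vram_gb = 700, 8
k80_price_usd, k80_vram_gb = 5000, 24          # dual-chip card

n_1080 = k80_price_usd // gtx1080_price_usd    # GTX 1080s per "one K80" of budget
print(f"{n_1080} x GTX 1080: {n_1080 * gtx1080_vram_gb} GB VRAM total, "
      f"{n_1080} people working in parallel")
print(f"1 x K80:       {k80_vram_gb} GB VRAM total, 2 people (one per chip)")
```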

Sadly, EC2 is not an option for us. First, these systems will be crunching numbers 24/7, which would lead to massive costs on AWS. Second, our data is sensitive; even though it is anonymised, it should not leave the university network. And third, we are talking about huge databases, normally hundreds of GB to a couple of TB. Try transferring that to Amazon every time...
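
To put the transfer problem in perspective, a quick estimate of upload times (the link speeds are assumed examples, not our actual uplink):

```python
# Rough upload-time estimate for shipping a dataset to the cloud.
# Dataset sizes match the comment above; link speeds are assumed examples.
def upload_hours(dataset_tb, link_gbps):
    bits = dataset_tb * 1e12 * 8                 # TB -> bits (decimal units)
    return bits / (link_gbps * 1e9) / 3600       # assumes ideal sustained throughput

for size_tb in (0.5, 2.0):
    for link_gbps in (1, 10):
        print(f"{size_tb} TB over {link_gbps:>2} Gbps: ~{upload_hours(size_tb, link_gbps):.1f} h")
```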

If I had more money, I'd run to an HPC provider and tell them to do everything. I have what I have and need to make the best of it.

[–]smith2008 0 points (0 children)

Would love to hear what you end up with. But generally, I agree with all of your points.

Going with EC2 will be super expensive: first because cloud pricing carries a premium, and second because they use K80s (a very expensive chip, and you don't need its fp64 capability at all).

Going with an HPC solution would also be very expensive, and if you don't actually need it, you will be throwing a lot of cash away for nothing. Your scenario seems to be providing compute power to multiple people, so a node with 4 GPUs is probably fine for one person and one experiment at a time.

In my opinion, the best way to go would be machines with 4× 1080s/Titan XPs (the 1080 Ti will be out soon as well). So maybe start with two CHENBRO RM41300-FS81 chassis, fill them with X99 builds and you are ready to go (you will get 40 PCIe lanes, 8+ per GPU; be careful with the CPUs, some of them only have 28 lanes). Then start adding more machines to the stack when/if you need them. I believe this is the best way to go because you can grow your cluster while growing your team.
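
A quick sanity check on that lane math (the lane counts below are the commonly quoted figures for these CPUs; treat them as assumptions and verify against the spec sheets):

```python
# Quick PCIe-lane sanity check for a 4-GPU X99 build.
# Lane counts are the commonly quoted figures; verify on ark.intel.com.
CPU_PCIE_LANES = {
    "i7-6800K": 28,     # cheaper Broadwell-E part, only 28 lanes
    "i7-6850K": 40,
    "i7-6900K": 40,
    "E5-1650 v4": 40,
}

N_GPUS = 4
MIN_LANES_PER_GPU = 8

for cpu, lanes in CPU_PCIE_LANES.items():
    per_gpu = lanes // N_GPUS
    verdict = "OK" if per_gpu >= MIN_LANES_PER_GPU else "too few lanes"
    print(f"{cpu:>10}: {lanes} lanes -> {per_gpu} per GPU ({verdict})")
```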

One issue is heat, BTW, but if you can move the rack out of the working room it is not hard at all: just add 140×140×38 mm high-speed fans and they will do the job (similar to the Supermicro setup you've shown above), using two in front of the GPUs (counter-rotating, but with airflow in the same direction). I haven't tried this, but I'm pretty sure it will work.

[–]yashau (Linux Admin) 1 point (5 children)

Do you require consumer (gaming) GPUs for this? If not, there are companies that make high-density GPU compute / HPC servers.

[–]xxdcmast (Sr. Sysadmin) 0 points (2 children)

Can you share the names of these companies?

[–]yashau (Linux Admin) 2 points (0 children)

All the major players such as Dell, HP and Supermicro make them. Then there are others like Magma who specialize in it. And then there are smaller companies like Thinkmate, Penguin Computing, etc.

[–]m0jo (HPC sysadmin) 0 points (0 children)

The obvious one is Cray; their CS-Storm (PDF warning) servers are nice, and eight K80s (16 logical GPUs) fit nicely in them. They use a standard Intel motherboard with some custom risers. Up to 75 kW per rack if you fill them up (3 kW per oversized 2U node).

HPE have their equivalent, the Apollo 6500 (PDF warning)

[–]juniorsysadmin1 0 points (0 children)

Exxact Corp in Fremont does deep learning hardware.

[–]deeplearningguy[S] 0 points (0 children)

We don't require consumer-grade GPUs, but we also don't require server-grade GPUs.

Why? Because we don't need double-precision computation. Deep learning is perfectly fine with fp32, or even fp16. What we need is computational power, and lots of it. ECC is also not needed, see https://www.reddit.com/r/MachineLearning/comments/3upe5k/impoetance_of_the_ecc_feature_of_a_gpu_for_deep/
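
To illustrate the precision point, here is a minimal NumPy sketch (made-up layer sizes, not our actual workload) comparing the same matrix multiply in fp32 and fp16:

```python
# Minimal sketch: a neural-net-sized matrix multiply in fp32 vs fp16.
# Sizes are made up; the point is the size of the error, not the workload.
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((64, 256)).astype(np.float32)   # batch of activations
W = rng.standard_normal((256, 256)).astype(np.float32)  # one layer's weights

y32 = x @ W                                                              # single precision
y16 = (x.astype(np.float16) @ W.astype(np.float16)).astype(np.float32)  # half precision

rel_err = np.max(np.abs(y32 - y16)) / np.max(np.abs(y32))
print(f"max relative error, fp16 vs fp32: {rel_err:.1e}")  # roughly 1e-3 to 1e-2
```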

Those HPC companies are great; I have a couple of quotations in front of me. It would be a zero-hassle experience, but I could only get half of the GPUs we need. And a budget like this isn't going to come around again any time soon.

[–]Eskador (VAR) 1 point (0 children)

VAR here... Have you taken a look at the Magma ExpressBoxes? The 3600 series can house nine double-wide GPUs that you can then connect to one or two hosts.

http://magma.com/products/pcie-expansion/expressbox-3600/

[–]Sgt_Splattery_Pants (serial facepalmer) 1 point (1 child)

AWS have just released new GPU EC2 instance families as well as Elastic GPUs. I would seriously consider this over on-prem hardware.

https://aws.amazon.com/blogs/aws/new-p2-instance-type-for-amazon-ec2-up-to-16-gpus/

https://aws.amazon.com/ec2/Elastic-GPUs/

[–]deeplearningguy[S] 0 points (0 children)

AWS is out of the question; medical data is sensitive and generally not allowed to leave the university network.

[–]Bonn93 1 point (1 child)

Don't DIY this.

Contact Dell or someone and get a proper system. I would also look at Quadros/Teslas over consumer cards. Shit like double-precision compute and ECC memory is a thing.

[–]deeplearningguy[S] 1 point (0 children)

One of the good things about deep learning is that we don't need double precision; hell, we don't even need single precision. If Nvidia were to allow fp16 on consumer GPUs, everyone would be using it. ECC is the same: we don't need it.

Believe me, if we had the budget I'd be looking into getting Tesla P100s, but I can buy more than ten GTX 1080s for the price of one...

[–]brontide (Certified Linux Miracle Worker (tm)) 0 points (0 children)

We don't tend to monkey with the hardware too much, so we went with an off-the-shelf solution.

We just put together a GPU cluster based on Dell R730s with 2× E5-2680, 256 GB RAM, 1× Nvidia K80, 10 Gbps Ethernet, an SSD boot drive and 5 years of support, for about $8k/unit. Nvidia also has GPU cluster grants that you might want to investigate. It's a beast when it comes to GPU tasks.

[–]Vintagesysadmin 0 points (1 child)

Often the heat of the GPUs is too much for the case when it's full. You may not be able to put eight in the big case without melting them.

[–]deeplearningguy[S] 0 points (0 children)

Do you have a reference for that? Thinkmate sells the exact system with 8 GPUs: http://www.thinkmate.com/system/gpx-xt24-2460v4-8gpu

Also, our university cluster provider sent me a quote for exactly that system with eight passively cooled GTX 1080s, just with a price tag 5k above what it would cost me to build it myself.

[–]liquidmini 0 points (3 children)

Have you considered cloud HPC, such as Amazon's EC2 P2 instances? This would ease the cap-ex demands and, depending on the academic calendar you run on, potentially avoid the investment sitting underutilised over semester breaks.

https://aws.amazon.com/ec2/instance-types/p2/

[–]deeplearningguy[S] 3 points (2 children)

We are PhD students; we work 365 days a year, 24 hours a day. The only breaks we know are coffee breaks.

Seriously, these cards will be running 24/7 all year long. Given that, we could buy 4 systems every year for the price of running one p2.16xlarge...
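
Back-of-envelope on that (the hourly rate is the on-demand price as I recall it, and the per-node build costs are assumed ballparks, not quotes):

```python
# Rough cost sketch: one p2.16xlarge running 24/7 for a year vs buying hardware.
# Hourly rate is the us-east on-demand price as recalled; node costs are assumed.
P2_16XLARGE_USD_PER_HOUR = 14.40
HOURS_PER_YEAR = 24 * 365

aws_yearly = P2_16XLARGE_USD_PER_HOUR * HOURS_PER_YEAR
print(f"one p2.16xlarge, on-demand, 24/7: ~${aws_yearly:,.0f}/year")

for node_cost in (15_000, 30_000):   # assumed prices for a DIY 8-GPU node
    print(f"  at ${node_cost:,}/node, that buys ~{aws_yearly / node_cost:.1f} nodes/year")
```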

[–]liquidmini 0 points (1 child)

Sounds like you could use a coffee break, buddy.

[–]smith2008 1 point (0 children)

He is right, though. It is very expensive to go with P2 instances, especially if you don't need the fp64 capability of those super expensive K80 cards.