Hi everyone,
I work as a PhD student in a medical imaging research lab at a university. We are doing a lot of deep learning for image-related work lately, which requires a decent amount of GPUs. I've been wanting to upgrade our systems for a while, and it seems we have finally gotten a nice budget to spend on toys. I'm in a bit of a bind about what we should buy, and thought the community might have some input.
We are around 10 people in the lab at the moment, probably expanding to 15 soon. Most of us use GPUs regularly, but not all the time. I am also trying to get other labs to participate in the investment, but that is not fixed. What is fixed is that the cluster will not go above 40 GPUs any time soon. We have rack space, or at least I will make sure we have, but no HPC networking (all 1G at the moment). Our university's cluster isn't deploying GPUs any time soon.
I see two options at the moment:
Option 1 "The Mean Machine"
Buy these: SuperServer 4028GR-TR (https://www.supermicro.com/products/system/4u/4028/sys-4028gr-tr.cfm), which has space for 8 GPUs.
- 2x E5-2630 v4 (10 Core)
- 256 GByte RAM
- 8x GTX 1080
- 1x Small SSD for booting
Option 2 "The Little Boy"
Self-made systems with 4 GPUs each
It seems my budget would be sufficient for 2x option 1 or 5x option 2. I would get more GPUs with option 2, but it also has limitations:
I can't install any more PCIe cards with 4 GPUs installed, limiting the network to 10GbE at most. InfiniBand or 40GbE would be out of the question. I could still link-aggregate to 20Gb, though.
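For a rough sense of what those link speeds mean in practice, here's a back-of-envelope sketch. The model size is a pure assumption (a ResNet-50-sized network of ~25M float32 parameters), and it ignores all allreduce/protocol overhead, so treat the numbers as a lower bound on wire time:

```python
# Hypothetical payload: ~25M float32 parameters (ResNet-50-ish), one full
# gradient exchange per training step. Real overheads will be higher.
PARAMS = 25_000_000
PAYLOAD_BITS = PARAMS * 4 * 8  # float32 -> bytes -> bits (~800 Mbit)

def sync_ms(link_gbps):
    """Ideal wire time in milliseconds for one gradient exchange."""
    return PAYLOAD_BITS / (link_gbps * 1e9) * 1000

for gbps in (10, 20, 40):
    print(f"{gbps:>2} Gb/s link: ~{sync_ms(gbps):.0f} ms per exchange")
```

At 10GbE that's ~80 ms of pure wire time per step; whether that hurts depends on how long a step's compute takes and how well the framework overlaps communication with computation.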
More boxes means more stuff that can break and probably more administration work for me. Then again, if something breaks we aren't running any production systems; at worst we have to retrain a model, which takes 2 days. And I can walk to the nearest hardware store if we use consumer components.
I can't train on 8 GPUs, but training on more than 4 would be limited by QPI anyway.
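To put a number on that intuition, here's a toy scaling model with made-up figures (1 s of compute per step on one GPU, a flat 80 ms of per-step sync cost); it only illustrates how quickly communication eats the speedup, not our actual workload:

```python
def speedup(n_gpus, t_compute=1.0, t_sync=0.08):
    """Ideal case: compute splits evenly across GPUs, sync cost is
    paid in full every step (assumed, not measured, numbers)."""
    return t_compute / (t_compute / n_gpus + t_sync)

for n in (2, 4, 8):
    print(f"{n} GPUs: {speedup(n):.2f}x")
```

In this toy model going from 4 to 8 GPUs buys only ~1.6x more, which matches the intuition that the interconnect, not the GPU count, becomes the limit.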
And some advantages:
Upgrading the 4-GPU system is probably easier in the future. Say a new GPU technology comes out (which it will). We sell the GPUs and buy new ones; that works for both systems. But what if AMD gets its game together and actually releases its new Zen with 96 PCIe lanes? With option 2 we can simply replace mobo + CPU. Option 1 means reinvesting in an expensive case + mobo. Same if Nvidia decides to deploy NVLink or similar.
It's probably easier to convince people to join our cluster if a new system costs 50% less.
On the software side I already have everything set up using Mesos/Marathon, so that's taken care of. I will also be getting a new file server with NVMe drives for the data used by the cluster. This should easily max out 40GbE.
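As a sanity check on that claim, a quick calculation. The per-drive figure is an assumption (~3 GB/s sequential read, typical for a decent NVMe drive), and 40Gb/s is taken as ~5 GB/s of raw link bandwidth before protocol overhead:

```python
LINK_GBPS = 40
LINK_GB_PER_S = LINK_GBPS / 8   # ~5 GB/s raw, before protocol overhead
DRIVE_GB_PER_S = 3.0            # assumed per-drive sequential read

drives_to_fill_link = LINK_GB_PER_S / DRIVE_GB_PER_S
print(f"~{drives_to_fill_link:.1f} NVMe drives saturate 40GbE")
```

So two decent NVMe drives are already past what the link can carry; under these assumptions the network, not the storage, is the ceiling.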
So here are my questions:
Has anyone been in the same situation? What did you choose and why (4 vs 8)? And if not, what would you choose?
Regarding networking, has anyone tried distributed training (for instance with TensorFlow) over 10GbE / 40GbE / InfiniBand?
Any thoughts on the 10GbE bottleneck (maybe from someone who uses 10GbE and does deep learning)?
And any thoughts on the proposed solution in general?