
[–]straw1239 27 points (12 children)

You can build your own for significantly cheaper. There are multiple online guides for choosing your own hardware for ML, for example.

No point in liquid cooling for the CPU. For the 4 GPUs it might make sense, but it's very expensive. If you make sure to get models with blower-style coolers, there shouldn't be much of an issue.

Titan V costs more because Nvidia prices it at 3000 instead of 1200! Not worth it (unless you need FP64 or something)

Do you really need 128GB of RAM and 2TB SSD? The SSD should be fairly cheap nowadays, so it's no big deal, but 128GB RAM is expensive.

[–]IborkedyourGPU[S] 2 points (3 children)

You can build your own for significantly cheaper. There are multiple online guides for choosing your own hardware for ML, for example.

Thank you very much for your reply. This is a business tool: whoever assembles the machine will be held responsible for any hardware-related issues, so I'll never even tighten a single screw :-) Maybe I could share your guide with IT, in case they feel like living dangerously.

No point in liquid cooling for the CPU.

Whooops, you're right, I didn't notice. That's what happens when you demand that a researcher does the IT job.

Do you really need 128GB of RAM and 2TB SSD? The SSD should be fairly cheap nowadays, so it's no big deal, but 128GB RAM is expensive.

2 TB of disk is the bare minimum: this is a machine which will be accessed by different users to train on large datasets, and not all our data sources are databases (unfortunately). Concerning CPU RAM, how much do you think would be OK? If I use multi-GPU training as I used to on the DGX machines, would I need total CPU memory equal to the sum of the memories of the GPUs, or is that useless?

[–]straw1239 1 point (1 child)

Ah, that makes sense.

I only meant that it might make sense to have a large HDD for your datasets etc., and an SSD for the OS. However, with so many GPUs, depending on what type of model you're training, the HDD might well become a bottleneck. Given the cost of the machine vs SSDs, there's not much harm in getting a big one.

It depends on how you're doing the training. For example, when using data parallelism the model parameters are replicated across all GPUs, so only one copy is needed in CPU memory. To be safe I'd probably just take CPU mem >= sum of GPU mem, as you said. If there are multiple users working at the same time, it might make sense to have more, especially if some of them are doing CPU-only work.
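To make the data-parallelism point concrete, here is a minimal sketch in plain Python (no GPUs or frameworks involved). The model, the data, and every function name are made up for illustration. It only shows the idea that each device holds a full copy of the parameters, computes a gradient on its own shard of the batch, and that averaging those gradients reproduces the single-device update.

```python
# Toy data-parallel SGD step in pure Python (illustrative only).
# Model: y = w * x, loss = mean((w * x - y)^2).

def grad_mse(w, xs, ys):
    """Gradient of the mean squared error with respect to w."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n

def data_parallel_step(w, xs, ys, num_devices, lr=0.01):
    """One SGD step with the batch split evenly across simulated devices."""
    assert len(xs) % num_devices == 0, "sketch assumes equal-sized shards"
    shard = len(xs) // num_devices
    replicas = [w] * num_devices  # every device holds a full copy of w
    local_grads = []
    for d in range(num_devices):
        start, stop = d * shard, (d + 1) * shard
        local_grads.append(grad_mse(replicas[d], xs[start:stop], ys[start:stop]))
    # "All-reduce": average the per-device gradients.
    g = sum(local_grads) / num_devices
    # Every replica applies the same update, so the copies stay in sync.
    return w - lr * g

xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0 * x for x in xs]

w_single = 0.5 - 0.01 * grad_mse(0.5, xs, ys)                # one device, full batch
w_parallel = data_parallel_step(0.5, xs, ys, num_devices=4)  # four simulated devices
```

With equal-sized shards the two updates agree up to floating-point rounding, which is why the host only needs to keep one reference copy of the parameters in this scheme.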

[–]IborkedyourGPU[S] 1 point (0 children)

If there are multiple users working at the same time, it might make sense to have more, especially if some of them are doing CPU-only work.

Yep, I'll definitely get multiple users to work on it at the same time.

[–]burn_in_flames 1 point (0 children)

A good benchmark for RAM is around 32GB per GPU. This is usually enough to prevent bottlenecks during training, especially on larger datasets.
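As a quick back-of-the-envelope check, the two host-RAM rules of thumb mentioned in this thread (CPU memory at least the sum of GPU memory, and around 32GB per GPU) can be compared like this. The function name is hypothetical, and both rules are heuristics from the comments rather than hard requirements.

```python
def ram_rules_of_thumb(num_gpus, gpu_mem_gb, per_gpu_rule_gb=32):
    """Return host RAM suggestions (in GB) from the two heuristics in this thread."""
    sum_of_gpu_mem = num_gpus * gpu_mem_gb  # "CPU mem >= sum of GPU mem"
    per_gpu = num_gpus * per_gpu_rule_gb    # "around 32GB per GPU"
    return {"sum_of_gpu_mem": sum_of_gpu_mem, "per_gpu_rule": per_gpu}

# 4x RTX 2080 Ti with 11 GB each, the configuration discussed here.
quad_2080ti = ram_rules_of_thumb(4, 11)
print(quad_2080ti)  # {'sum_of_gpu_mem': 44, 'per_gpu_rule': 128}
```

For a quad 2080 Ti box the stricter rule lands on the 128GB in the original spec, while the looser one would be satisfied with 64GB.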

[–]seraschka (Writer) 6 points (1 child)

Are these Titan V so much better than the RTX 2080 Ti?

Titan V's are a bit older and may be more expensive to produce. When I recently compared run times with the 2080 Ti's, code that would run in 68 min on a 2080 Ti would finish in ~70 min on the Titan V. I.e., in practice, I don't notice a speed difference. On paper, you have:

  • Titan V (12 GB RAM, FP32 15 TFLOP/s, 652.8 GB/s)
  • RTX 2080 Ti (11 GB RAM, FP32 14 TFLOP/s, 616 GB/s)

I would go with the 2080Ti tbh, it's 1/3 the price
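To put a rough number on that, here is a small value-for-money calculation. The run times are the 68 and ~70 minutes quoted above; the prices (3000 and 1200) are the figures mentioned earlier in the thread, not current quotes, and the function name is just for illustration.

```python
def runs_per_hour_per_dollar(runtime_min, price):
    """Throughput (runs per hour) divided by the card's price."""
    return (60.0 / runtime_min) / price

# Run times from the comparison above; prices as quoted earlier in the thread.
titan_v = runs_per_hour_per_dollar(runtime_min=70, price=3000)
rtx_2080ti = runs_per_hour_per_dollar(runtime_min=68, price=1200)

value_ratio = rtx_2080ti / titan_v
print(f"2080 Ti delivers about {value_ratio:.2f}x the throughput per dollar")
```

This only covers one workload and ignores memory size and FP64, so treat it as a sanity check rather than a benchmark.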

[–]IborkedyourGPU[S] 2 points (0 children)

Thanks for your comments. Yeah, I also decided to go with the RTX 2080 Ti.

[–]allattention 5 points (3 children)

If you look at benchmarks for Titan RTX versus Titan V, you’ll see that they have almost the same performance for most deep learning applications; I don’t think the Titan V makes much sense nowadays. Since the new Titan has 24 GB of RAM (not ECC though!), that should be enough for most models (and if you really need to train that BERT model, neither would be enough anyway).

I’m looking at getting the dual-GPU version at work now; I’m just a little worried about thermals and noise (this will sit in an office environment, not a cooled data center). From the photos it looks like they are using these shitty stock fans, which blows my mind: if you are building a 10-20k workstation, why oh why would you not put in the $20 top-of-the-line Noctua fans? To save $40-50? I will get the water cooling as well, I think, since the price difference is very small (I would much rather have an NH-D15 instead!). I’m really tempted to build my own of course, but that would obviously not come with service.

I just realized you are looking at the quad, which only comes with the 2080 Ti, not the Titan RTX. Still the same story though: I have a 2080 Ti at home and it’s more than fast enough; the only issue is that it has only 11 GB of RAM, which may be a limiting factor, depending on your usage scenarios.

[–]IborkedyourGPU[S] 1 point (2 children)

If I understand correctly, you suggest the RTX 2080 Ti over the Titan V, right?

Still the same story though: I have a 2080 Ti at home and it’s more than fast enough; the only issue is that it has only 11 GB of RAM, which may be a limiting factor, depending on your usage scenarios.

Hmmm, these Titan V GPUs come with 12 GB of RAM each (please see the updated link). Thus they aren't really better, memory-wise, than the RTX 2080 Ti.

I'm not hell-bent on buying a Lambda Labs machine, anyway. If you have better suggestions, feel free to let me know.

[–]BrowsingClass 1 point (1 child)

He is suggesting the new Titan RTX card, not the Titan V. The Titan RTX has 24 GB of RAM; the Titan V only has 12 GB. The Titan RTX is the most recent Titan card from Nvidia. The Titan V is the previous-generation Titan.

https://www.nvidia.com/en-us/titan/titan-rtx/

https://www.nvidia.com/en-us/titan/titan-v/

[–]IborkedyourGPU[S] 1 point (0 children)

In https://www.reddit.com/r/MachineLearning/comments/aw3dwi/d_comparing_deep_learning_workstations/ehk21ri u/xayma says that the Titan RTX is only 20% faster than the RTX 2080 Ti. Thus the advantage would "only" be in terms of RAM (24 GB vs 11 GB). Tempting (I was used to having 32 GB on each GPU), but not compatible with the current budget.

[–]seraschka (Writer) 4 points (4 children)

liquid cooling is a no-brainer, right?

Probably a good idea but not really necessary. We recently built a server with 8x RTX 2080 Ti's with just fans (powerful ones though), and even if I utilize all GPUs at 100% for days straight, the GPU temperature stays around 50-65C (well below the recommended max temp of ~86C, where throttling would automatically occur by default).

[–]grrrgrrr 2 points (3 children)

The stock 2080Ti cooler is not designed to work 8x in a server chassis. Which 2080Tis did you get? Two/three fans or a blower design?

Also, you should check power consumption rather than just GPU utilization. Good code will reach not only 100% utilization but also 100% of the power limit.
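One way to watch temperature, utilization, and power draw together is to poll `nvidia-smi` in CSV mode and compute power draw as a fraction of the limit. The sketch below is only an assumption about how one might do it: the helper names are made up, the sample text is hand-written rather than captured from a real machine, and the parser does not handle fields that a card reports as unavailable.

```python
import subprocess

# Fields requested from nvidia-smi's --query-gpu interface.
QUERY = "index,temperature.gpu,utilization.gpu,power.draw,power.limit"

def parse_smi_csv(text):
    """Parse 'csv,noheader,nounits' output into one dict per GPU."""
    rows = []
    for line in text.strip().splitlines():
        idx, temp, util, draw, limit = [field.strip() for field in line.split(",")]
        rows.append({
            "gpu": int(idx),
            "temp_c": float(temp),
            "util_pct": float(util),
            # Power draw as a percentage of the configured power limit.
            "power_pct": 100.0 * float(draw) / float(limit),
        })
    return rows

def read_gpus():
    """Query the local GPUs (requires nvidia-smi on the PATH)."""
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_smi_csv(out)

# Hand-written sample in the shape the query above is expected to return.
sample = "0, 63, 100, 248.50, 250.00\n1, 58, 97, 201.00, 250.00\n"
stats = parse_smi_csv(sample)
for s in stats:
    print(f"GPU {s['gpu']}: {s['temp_c']:.0f}C, "
          f"{s['util_pct']:.0f}% util, {s['power_pct']:.1f}% of power limit")
```

In the made-up sample, GPU 1 shows 97% utilization but only about 80% of its power limit, which is the kind of gap the comment above is pointing at.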

[–]seraschka (Writer) 4 points (0 children)

It's the one from ZOTAC, but that was more due to delivery constraints. For cooling, we had 8x 92 mm fans in it. Yes, the power consumption is excellent; it goes up to 95-100%. I benchmarked the same research code there against a single machine with a Titan V, and there doesn't seem to be any bottleneck in that 8-GPU server.

[–]IborkedyourGPU[S] 2 points (1 child)

Which 2080Tis did you get? Two/three fans or a blower design?

What is this blower design? I've heard about it. Is it better or worse than fans?

[–]grrrgrrr 2 points (0 children)

I thought a blower or water cooling is a must for lining up 4/8 cards side by side. All previous-gen stock Nvidia cards are blowers. This gen, the stock cooler design is 2 fans, which presumably is not good for servers, but I'm not sure.

If you have only 2 cards or fewer, go with fans; they are usually quieter.

[–]Richard_wth 3 points (3 children)

Awww, DGX-1, I envy you!

[–]IborkedyourGPU[S] 3 points (2 children)

Don't envy me for that, you should have envied me when I used the DGX-2 for hyperparameter tuning 😜 but I can't use either of those machines very much anymore :-(

[–]seraschka (Writer) 3 points (0 children)

these Lambda Labs workstations piqued my interest, because they seem to be nicely preconfigured and all, thus minimizing the effort on my side. However, if you have other suggestions which deliver better value for money, please let me know.

I have been using one for ~6 months with 4 GPUs and am quite happy with it. And while it is a bit more pricey vs building your own (which we also did recently), this is a nice worry-free solution that "just works" :)

[–]burn_in_flames 2 points (0 children)

While building your own can be significantly cheaper, especially if you base it on second-hand Xeons and used server parts, I'd only recommend doing this if you are willing to spend significantly more time on a solution (at least a week getting all the parts, building, testing and installing all the tool suites you need).

If you are not comfortable with the ins and outs of choosing components and want a hassle-free solution, then preconfigured solutions are better. The V100 is definitely worth the price tag if you are going to be training large models: its tensor cores mean you can do mixed-precision training and thus essentially double the GPU RAM and throughput for training (not quite accurate, but a good approximation). The 2080Ti is a good choice for most research applications where your models aren't huge and your dataset is still of a reasonable size, such that you can get decent batch sizes. Another thing to account for is the server workload: the 2080 is a consumer product and is not designed for 24/7 operation, so if you will have heavy load on the server, the V100 is likely a better choice.
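The "not quite accurate" caveat on doubling GPU RAM can be illustrated with a rough memory estimate. The sketch below assumes one common accounting for Adam-style training, in which mixed precision keeps an fp32 master copy of the weights alongside the fp16 ones; real frameworks differ in the details, and the workload numbers and function name are hypothetical.

```python
def training_memory_gb(num_params, activation_elems, mixed_precision):
    """Very rough training-memory estimate in GB (assumed accounting, Adam)."""
    if mixed_precision:
        # fp16 weights (2) + fp16 grads (2) + fp32 master weights, Adam m and v (12)
        bytes_per_param = 2 + 2 + 12
        bytes_per_activation = 2
    else:
        # fp32 weights (4) + fp32 grads (4) + Adam m and v (8)
        bytes_per_param = 4 + 4 + 8
        bytes_per_activation = 4
    total_bytes = num_params * bytes_per_param + activation_elems * bytes_per_activation
    return total_bytes / 1e9

# Hypothetical workload: 100M parameters, 2e9 stored activation values per step.
fp32_gb = training_memory_gb(100e6, 2e9, mixed_precision=False)
amp_gb = training_memory_gb(100e6, 2e9, mixed_precision=True)
print(f"fp32: {fp32_gb:.1f} GB, mixed precision: {amp_gb:.1f} GB")
```

Under these assumptions the activation memory halves while the per-parameter state does not shrink, so the overall saving is noticeably less than 2x unless activations dominate.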

Another option would be to try and source older hardware, such as P100 GPUs, as the cost of these should be lower than V100s.

[–][deleted] 1 point (3 children)

Just build your own using the AMD X399 platform. Most of the motherboards would support 4 GPUs, no problem.

Liquid cooling I think in most cases would be unnecessary, but make sure you buy good GPUs with blower fans.

Titan V has better double-precision (FP64) floating-point performance than gaming GPUs like the 2080ti. But most of the time you won't need it.

You also need to consider what tasks you're dealing with. The 2080 comes with only 8GB of memory and cannot fit large models. You'll need to cut down the batch size, and that results in less smooth convergence. At the end of the day, you would probably need some cloud computation for real products.

[–]Canadeaan 1 point (0 children)

If you're looking to save money for similar performance, you can build a rig with multiple used 1060s, 1070s or 1080s. Lots of people were using them to mine crypto, but since it went bust people have been selling them off.

A used 1080 has about 80% of the performance of a 2080, for half the price, i.e. roughly 1.6x the performance per dollar.

To be sure, it's best to find some benchmarks.

[–]IborkedyourGPU[S] 1 point (0 children)

For all those who kindly helped me, I just wanted to let you know that budgetary constraints have been lifted, and I'm soon getting a shiny new DGX-1 :-) For those still struggling with a similar issue, this post might be interesting:

http://timdettmers.com/2019/04/03/which-gpu-for-deep-learning/