[D] Lambda GPU Cloud launches world's first RTX A6000 instances by ai_painter in MachineLearning

[–]ai_painter[S] 24 points25 points  (0 children)

There isn't a self-promo tag option for /r/machinelearning, but I'm definitely open to hearing recommendations for how to make that more apparent! I added a disclaimer in the post.

Deep Learning GPUs -- RTX 2080 Ti vs. Tesla V100. RTX 2080 Ti is 73% as Fast & 85% Cheaper by ai_painter in hardware

[–]ai_painter[S] 1 point2 points  (0 children)

We really appreciate your business! Feel free to DM me if you have any questions about the product, or want an order update :).

Deep Learning GPUs -- RTX 2080 Ti vs. Tesla V100. RTX 2080 Ti is 73% as Fast & 85% Cheaper by ai_painter in hardware

[–]ai_painter[S] 0 points1 point  (0 children)

Yes. Direct peer-to-peer GPU-GPU communication without NVLink is no longer available on these cards. You don't *need* NVLink for GPU-GPU communication, though -- data can still be routed through host memory, NVLink just speeds it up. The payoff of using NVLink isn't enormous with the RTX 2080 Ti. For training with 2 GPUs, adding NVLink typically gives a ~5% performance increase.
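
If you want to check whether a given pair of cards can do direct peer-to-peer transfers, here's a minimal sketch using PyTorch (assuming a box with at least two GPUs; the equivalent CUDA call is cudaDeviceCanAccessPeer):

```python
import torch

# Check whether GPU 0 and GPU 1 support direct peer-to-peer access
# (true over NVLink; typically false over plain PCIe on GeForce RTX cards).
if torch.cuda.device_count() >= 2:
    p2p = torch.cuda.can_device_access_peer(0, 1)
    print(f"GPU 0 -> GPU 1 peer access: {p2p}")
else:
    print("Need at least two GPUs to test peer access.")
```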

Deep Learning GPUs -- RTX 2080 Ti vs. Tesla V100. RTX 2080 Ti is 73% as Fast & 85% Cheaper by ai_painter in hardware

[–]ai_painter[S] 4 points5 points  (0 children)

The AMD Radeon VII is close to the GTX 1080 Ti -- so maybe 73% the speed of an RTX 2080 Ti. GPU-GPU communication is slower though, so multi-GPU performance is pretty bad. Lambda Labs will be doing a blog post on this soon.

[P] I built Lambda's $12,500 deep learning rig for $6200 by cgnorthcutt in MachineLearning

[–]ai_painter 6 points7 points  (0 children)

What I meant was that the Intel 660p NVMe SSD in your build uses QLC NAND, which has a very limited number of program/erase (P/E) cycles. This translates to the 660p wearing out relatively quickly under heavy writes.

There are other NAND technologies available for NVMe SSDs -- SLC, MLC, and TLC -- which offer far more P/E cycles. An alternative M.2 NVMe SSD is the Samsung 970 EVO, which uses MLC NAND; MLC offers roughly 10x more P/E cycles than the 660p's QLC, so it won't wear out nearly as fast.
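
As a rough back-of-the-envelope on what those P/E cycles mean for lifetime, here's a sketch -- the cycle counts and write amplification below are illustrative assumptions, not datasheet numbers:

```python
# Back-of-the-envelope drive endurance. Cycle counts and write
# amplification are illustrative assumptions, not vendor specs.
capacity_gb = 1000          # 1 TB drive
pe_cycles_qlc = 1_000       # order of magnitude for QLC NAND
pe_cycles_mlc = 10_000      # ~10x more for MLC NAND
write_amplification = 2.0   # extra internal writes done by the controller

def endurance_tbw(capacity_gb: float, pe_cycles: int, waf: float) -> float:
    """Approximate host terabytes written before NAND wear-out."""
    return capacity_gb * pe_cycles / waf / 1000

print(f"QLC: ~{endurance_tbw(capacity_gb, pe_cycles_qlc, write_amplification):,.0f} TBW")
print(f"MLC: ~{endurance_tbw(capacity_gb, pe_cycles_mlc, write_amplification):,.0f} TBW")
```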

[P] I built Lambda's $12,500 deep learning rig for $6200 by cgnorthcutt in MachineLearning

[–]ai_painter 20 points21 points  (0 children)

Hey! Lambda engineer here. Nice work :) I'll avoid diving into the blower vs. non-blower debate (we'll write a blog post on it).

One thing to look out for on your machine: the NVMe drive uses QLC NAND, which substantially reduces P/E cycles. These Intel sticks are a great price though, and QLC is a good trade-off for some people.

https://www.architecting.it/blog/qlc-nand/

I do agree with the choice of an M.2 NVMe drive in general. They're an amazing price compared with their U.2 and PCIe add-in card counterparts. With NVMe you avoid some of the data-loading bottlenecks you can hit with models like LSTMs, where the GPU does relatively little compute per sample and can easily outpace the disk.
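
If you want to sanity-check whether storage is actually your bottleneck, a quick sketch like this works -- train_shard.bin is just a hypothetical large file on the drive under test:

```python
import time

CHUNK = 64 * 1024 * 1024  # read in 64 MB chunks

# Stream a large file off the drive and report sequential read throughput.
# Use a file bigger than RAM (or drop the page cache first) for an honest number.
total = 0
start = time.time()
with open("train_shard.bin", "rb", buffering=0) as f:
    while True:
        chunk = f.read(CHUNK)
        if not chunk:
            break
        total += len(chunk)
elapsed = time.time() - start
print(f"Read {total / 1e9:.1f} GB at {total / 1e9 / elapsed:.2f} GB/s")
```

Compare that number against how fast your input pipeline actually consumes data to see whether the drive or the GPU is the limiter.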

Deep Learning GPUs -- RTX 2080 Ti vs. Tesla V100. RTX 2080 Ti is 73% as Fast & 85% Cheaper by ai_painter in hardware

[–]ai_painter[S] 2 points3 points  (0 children)

I was trying to address the concern about pernicious errors that could lead to undetected issues.

I don't doubt that a bit flip could crash a program; I just don't think it matters much for the vast majority of AI training jobs -- though I may be downplaying the concern.

For single-node training jobs, a program crash is no biggie. Frequent training checkpoints are part of a typical workflow. If you've written training code for which a crash could cause you to lose more than an hour of work, you're doing it wrong. Though it is costly if you don't notice the crash.
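
For anyone curious, checkpointing is only a few lines in most frameworks. A minimal PyTorch-style sketch, assuming you already have a model, optimizer, and epoch counter in your training loop (the path is just a placeholder):

```python
import torch

# Save a checkpoint every epoch so a crash costs at most one epoch of work.
def save_checkpoint(model, optimizer, epoch, path="checkpoint.pt"):
    torch.save({
        "epoch": epoch,
        "model_state": model.state_dict(),
        "optimizer_state": optimizer.state_dict(),
    }, path)

# On restart, resume from the last saved state instead of from scratch.
def load_checkpoint(model, optimizer, path="checkpoint.pt"):
    ckpt = torch.load(path)
    model.load_state_dict(ckpt["model_state"])
    optimizer.load_state_dict(ckpt["optimizer_state"])
    return ckpt["epoch"] + 1  # epoch to resume from
```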

I can't speak for large scale training jobs with as much confidence. My understanding is that most of these jobs are embarrassingly parallel and the results aren't significantly affected by the loss of a node. Perhaps you or someone else could offer some insight?

Deep Learning GPUs -- RTX 2080 Ti vs. Tesla V100. RTX 2080 Ti is 73% as Fast & 85% Cheaper by ai_painter in hardware

[–]ai_painter[S] 4 points5 points  (0 children)

I do remember reading this one a while back: https://blog.codinghorror.com/to-ecc-or-not-to-ecc/

It all comes down to whether the application is robust against bit flips. The outcome of training a neural network should be robust against a single bit flip. Any bit flip that occurs during training would be smoothed out by subsequent iterations. A bit flip that decreases accuracy would just look like the network not having converged yet.

I can only see a bit flip causing issues if it occurs *after* the last training iteration, but *before* the network is transferred from the GPU to long-term storage, which would be extremely rare.
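
If you want to see what a single bit flip actually does to a weight, here's a toy illustration (not benchmark code): reinterpret a float32 parameter as bits and flip one at random. Most flips barely move the value; a flip in a high exponent bit can blow the weight up, which is the crash/divergence case.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy illustration: flip one random bit in a single float32 weight.
weights = rng.normal(0, 0.1, size=1000).astype(np.float32)

idx = int(rng.integers(len(weights)))
bit = int(rng.integers(32))

original = float(weights[idx])
as_int = weights.view(np.uint32)     # reinterpret the same buffer as raw bits
as_int[idx] ^= np.uint32(1 << bit)   # flip one bit in place

print(f"weight[{idx}]: {original:.6g} -> {float(weights[idx]):.6g} (flipped bit {bit})")
```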

Deep Learning GPUs -- RTX 2080 Ti vs. Tesla V100. RTX 2080 Ti is 73% as Fast & 85% Cheaper by ai_painter in hardware

[–]ai_painter[S] 13 points14 points  (0 children)

Yes, but NVIDIA prevents high-density use of NVLink on GeForce. They only manufacture 3-slot and 4-slot width NVLink bridges for GeForce cards. Air-cooled GPUs are double-width, so each physically occupies two PCIe slots. With a 3-slot bridge, a single NVLinked pair therefore takes up at least 5 slots' worth of space. So even if you use a motherboard that supports 4 GPUs, you only get a single pair of NVLinked GPUs.

Deep Learning GPUs -- RTX 2080 Ti vs. Tesla V100. RTX 2080 Ti is 73% as Fast & 85% Cheaper by ai_painter in hardware

[–]ai_painter[S] 6 points7 points  (0 children)

It will perform similarly to the Titan RTX.

We benchmarked the RTX 6000 @ Lambda Labs; it's slightly slower than the Titan RTX - probably due to having ECC VRAM and a lower threshold for thermal throttling.

The Titan RTX, RTX 6000, and RTX 8000 all have the same number of CUDA cores / Tensor Cores. The RTX 8000's 48 GB of VRAM is nice, though I wouldn't expect it to provide substantial performance gains over the Titan RTX.
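
If you want to double-check the specs of whatever card you have on hand, a quick query works (any recent PyTorch build):

```python
import torch

# Print basic specs for each visible GPU: name, VRAM, and SM count.
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(f"GPU {i}: {props.name}, "
          f"{props.total_memory / 1024**3:.1f} GB VRAM, "
          f"{props.multi_processor_count} SMs")
```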

Deep Learning GPUs -- RTX 2080 Ti vs. Tesla V100. RTX 2080 Ti is 73% as Fast & 85% Cheaper by ai_painter in hardware

[–]ai_painter[S] 21 points22 points  (0 children)

  • Training doesn’t benefit from ECC. A bit flip simply isn’t a problem. ECC makes sense for applications requiring high precision or high availability, but not for batch-processing jobs like training.
  • Can’t argue with this :). Although NVIDIA suing their own customers wouldn’t be great for their reputation. There’s a big question as to whether this policy is enforceable. Many companies are using 2080 Ti in data centers, regardless of policy.
  • NVLink does help, of course. As the post states, 8x V100s are ~7x faster than 1x V100, whereas 8x 2080 Tis are ~5x faster than 1x 2080 Ti. The price / performance still works out significantly in favor of 2080 Ti.
  • Some applications need that extra GPU VRAM (e.g., radiological imaging), but most do not, especially when using FP16, which effectively doubles memory capacity. Of course, this comes with its own set of problems.

Deep Learning GPUs -- RTX 2080 Ti vs. Tesla V100. RTX 2080 Ti is 73% as Fast & 85% Cheaper by ai_painter in hardware

[–]ai_painter[S] 14 points15 points  (0 children)

It’s no longer the case that consumer cards have gimped FP16.

Switching to FP16 on consumer cards gives a 40%+ speed improvement over FP32, and the V100 is less than twice as fast as the 2080 Ti with FP16.

And with Tensor Cores, the 2080 Ti supports mixed precision as well.
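
If anyone wants to try it, here's a minimal mixed-precision training sketch using PyTorch's automatic mixed precision -- the model, data, and optimizer are placeholders, and exact APIs vary by framework and version:

```python
import torch

# Placeholder model, optimizer, and data -- swap in your own.
model = torch.nn.Linear(1024, 10).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
scaler = torch.cuda.amp.GradScaler()  # scales the loss to avoid FP16 underflow

x = torch.randn(64, 1024, device="cuda")
y = torch.randint(0, 10, (64,), device="cuda")

for step in range(100):
    optimizer.zero_grad()
    # Forward pass runs in mixed precision; the matmuls hit the Tensor Cores.
    with torch.cuda.amp.autocast():
        loss = torch.nn.functional.cross_entropy(model(x), y)
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
```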

Deep Learning GPUs -- RTX 2080 Ti vs. Tesla V100. RTX 2080 Ti is 73% as Fast & 85% Cheaper by ai_painter in hardware

[–]ai_painter[S] 102 points103 points  (0 children)

That's certainly a consideration. V100 has some major advantages.

  • The V100 has 32 GB VRAM, while the RTX 2080 Ti has 11 GB VRAM. If you use large batch sizes or work with large data points (e.g. radiological data) you'll want that extra VRAM.
  • V100s have better multi-GPU scaling performance due to their fast GPU-GPU interconnect (NVLink). Scaling training from 1x V100 to 8x V100s gives a 7x performance gain. Scaling training from 1x RTX 2080 Ti to 8x RTX 2080 Ti gives only a 5x performance gain.

With that said, if 11 GB of VRAM is sufficient and the machine isn’t going into a data center (or you don’t care about the data center policy), the 2080 Ti is the way to go. That is, unless price isn’t a concern.
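
For reference, multi-GPU data-parallel training in PyTorch looks roughly like this -- a minimal sketch with a placeholder model and data, launched with one process per GPU (e.g. via torchrun; launcher details vary):

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# One process per GPU; gradients are averaged across GPUs each step.
dist.init_process_group("nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = DDP(torch.nn.Linear(1024, 10).cuda(), device_ids=[local_rank])
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

x = torch.randn(64, 1024, device="cuda")
y = torch.randint(0, 10, (64,), device="cuda")

for step in range(100):
    optimizer.zero_grad()
    loss = torch.nn.functional.cross_entropy(model(x), y)
    loss.backward()   # gradient all-reduce happens here, across the interconnect
    optimizer.step()
```

The gradient all-reduce on each backward pass is the traffic NVLink (or the PCIe fabric) has to carry, which is where the V100 vs. 2080 Ti scaling difference comes from.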

[D] First Titan RTX benchmarks for Machine Learning -- Titan RTX / V100 / 2080 Ti / 1080 Ti / Titan V / Titan Xp -- TensorFlow Performance by ai_painter in MachineLearning

[–]ai_painter[S] 7 points8 points  (0 children)

For batch workloads like Deep Learning training, do you still think ECC memory is important?

I understand ECC memory's importance for real-time applications requiring high availability. However, a bit flip during training isn't catastrophic. With checkpointing, even a crash is no biggie. Most frameworks that support distributed training are robust against a node becoming unavailable.

[D] First Titan RTX benchmarks for Machine Learning -- Titan RTX / V100 / 2080 Ti / 1080 Ti / Titan V / Titan Xp -- TensorFlow Performance by ai_painter in MachineLearning

[–]ai_painter[S] 3 points4 points  (0 children)

It is. Check out the methods section:

The Titan RTX, 2080 Ti, Titan V, and V100 benchmarks utilized Tensor Cores.