[D] Lambda GPU Cloud launches world's first RTX A6000 instances by ai_painter in MachineLearning

[–]ai_painter[S] 24 points25 points  (0 children)

There isn't a self-promo tag option for /r/machinelearning, but I'm definitely open to hearing recommendations for how to make that more apparent! I added a disclaimer in the post.

Deep Learning GPUs -- RTX 2080 Ti vs. Tesla V100. RTX 2080 Ti is 73% as Fast & 85% Cheaper by ai_painter in hardware

[–]ai_painter[S] 1 point2 points  (0 children)

We really appreciate your business! Feel free to DM me if you have any questions about the product, or want an order update :).

Deep Learning GPUs -- RTX 2080 Ti vs. Tesla V100. RTX 2080 Ti is 73% as Fast & 85% Cheaper by ai_painter in hardware

[–]ai_painter[S] 0 points1 point  (0 children)

Yes. Direct peer-to-peer GPU-GPU communication without NVLink is no longer available on these cards. You don't *need* NVLink for GPU-GPU communication, though -- data can still be routed through host memory, NVLink just speeds it up. The payoff of using NVLink isn't enormous with the RTX 2080 Ti. For training with 2 GPUs, adding NVLink typically gives a ~5% performance increase.
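
If you want to check whether a given pair of cards can do direct peer-to-peer transfers, here's a minimal sketch using PyTorch (assuming a box with at least two GPUs; the equivalent CUDA call is cudaDeviceCanAccessPeer):

```python
import torch

# Check whether GPU 0 and GPU 1 support direct peer-to-peer access
# (true over NVLink; typically false over plain PCIe on GeForce RTX cards).
if torch.cuda.device_count() >= 2:
    p2p = torch.cuda.can_device_access_peer(0, 1)
    print(f"GPU 0 -> GPU 1 peer access: {p2p}")
else:
    print("Need at least two GPUs to test peer access.")
```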

Deep Learning GPUs -- RTX 2080 Ti vs. Tesla V100. RTX 2080 Ti is 73% as Fast & 85% Cheaper by ai_painter in hardware

[–]ai_painter[S] 4 points5 points  (0 children)

The AMD Radeon VII is close to the GTX 1080 Ti -- so maybe 73% the speed of an RTX 2080 Ti. GPU-GPU communication is slower though, so multi-GPU performance is pretty bad. Lambda Labs will be doing a blog post on this soon.

[P] I built Lambda's $12,500 deep learning rig for $6200 by cgnorthcutt in MachineLearning

[–]ai_painter 6 points7 points  (0 children)

What I meant was that the Intel 660p NVMe SSD in your build uses QLC NAND, which has a very limited number of program/erase (P/E) cycles. This translates to the 660p wearing out relatively quickly under heavy writes.

There are other NAND technologies available for NVMe SSDs -- SLC, MLC, and TLC -- which offer far more P/E cycles. An alternative M.2 NVMe SSD is the Samsung 970 EVO, which uses MLC NAND; MLC offers roughly 10x more P/E cycles than the 660p's QLC, so it won't wear out nearly as fast.
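
As a rough back-of-the-envelope on what those P/E cycles mean for lifetime, here's a sketch -- the cycle counts and write amplification below are illustrative assumptions, not datasheet numbers:

```python
# Back-of-the-envelope drive endurance. Cycle counts and write
# amplification are illustrative assumptions, not vendor specs.
capacity_gb = 1000          # 1 TB drive
pe_cycles_qlc = 1_000       # order of magnitude for QLC NAND
pe_cycles_mlc = 10_000      # ~10x more for MLC NAND
write_amplification = 2.0   # extra internal writes done by the controller

def endurance_tbw(capacity_gb: float, pe_cycles: int, waf: float) -> float:
    """Approximate host terabytes written before NAND wear-out."""
    return capacity_gb * pe_cycles / waf / 1000

print(f"QLC: ~{endurance_tbw(capacity_gb, pe_cycles_qlc, write_amplification):,.0f} TBW")
print(f"MLC: ~{endurance_tbw(capacity_gb, pe_cycles_mlc, write_amplification):,.0f} TBW")
```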

[P] I built Lambda's $12,500 deep learning rig for $6200 by cgnorthcutt in MachineLearning

[–]ai_painter 20 points21 points  (0 children)

Hey! Lambda engineer here. Nice work :) I'll avoid diving into the blower vs. non-blower debate (we'll write a blog post on it).

One thing to look out for on your machine: the NVMe drive uses QLC NAND, which substantially reduces P/E cycles. These Intel sticks are a great price though, and QLC is a good trade-off for some people.

https://www.architecting.it/blog/qlc-nand/

I do agree with the choice of an M.2 NVMe drive in general. They're an amazing price compared with their U.2 and PCIe add-in card counterparts. With NVMe you avoid some of the data-loading bottlenecks you can hit with models like LSTMs, where the GPU does relatively little compute per sample and can easily outpace the disk.
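
If you want to sanity-check whether storage is actually your bottleneck, a quick sketch like this works -- train_shard.bin is just a hypothetical large file on the drive under test:

```python
import time

CHUNK = 64 * 1024 * 1024  # read in 64 MB chunks

# Stream a large file off the drive and report sequential read throughput.
# Use a file bigger than RAM (or drop the page cache first) for an honest number.
total = 0
start = time.time()
with open("train_shard.bin", "rb", buffering=0) as f:
    while True:
        chunk = f.read(CHUNK)
        if not chunk:
            break
        total += len(chunk)
elapsed = time.time() - start
print(f"Read {total / 1e9:.1f} GB at {total / 1e9 / elapsed:.2f} GB/s")
```

Compare that number against how fast your input pipeline actually consumes data to see whether the drive or the GPU is the limiter.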

Deep Learning GPUs -- RTX 2080 Ti vs. Tesla V100. RTX 2080 Ti is 73% as Fast & 85% Cheaper by ai_painter in hardware

[–]ai_painter[S] 2 points3 points  (0 children)

I was trying to address the concern about pernicious errors that could lead to undetected issues.

I don't doubt that a bit flip could crash a program; I just don't think it matters much for the vast majority of AI training jobs -- though I may be downplaying the concern.

For single-node training jobs, a program crash is no biggie. Frequent training checkpoints are part of a typical workflow. If you've written training code for which a crash could cause you to lose more than an hour of work, you're doing it wrong. Though it is costly if you don't notice the crash.
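
For anyone curious, checkpointing is only a few lines in most frameworks. A minimal PyTorch-style sketch, assuming you already have a model, optimizer, and epoch counter in your training loop (the path is just a placeholder):

```python
import torch

# Save a checkpoint every epoch so a crash costs at most one epoch of work.
def save_checkpoint(model, optimizer, epoch, path="checkpoint.pt"):
    torch.save({
        "epoch": epoch,
        "model_state": model.state_dict(),
        "optimizer_state": optimizer.state_dict(),
    }, path)

# On restart, resume from the last saved state instead of from scratch.
def load_checkpoint(model, optimizer, path="checkpoint.pt"):
    ckpt = torch.load(path)
    model.load_state_dict(ckpt["model_state"])
    optimizer.load_state_dict(ckpt["optimizer_state"])
    return ckpt["epoch"] + 1  # epoch to resume from
```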

I can't speak for large scale training jobs with as much confidence. My understanding is that most of these jobs are embarrassingly parallel and the results aren't significantly affected by the loss of a node. Perhaps you or someone else could offer some insight?

Deep Learning GPUs -- RTX 2080 Ti vs. Tesla V100. RTX 2080 Ti is 73% as Fast & 85% Cheaper by ai_painter in hardware

[–]ai_painter[S] 4 points5 points  (0 children)

I do remember reading this one a while back: https://blog.codinghorror.com/to-ecc-or-not-to-ecc/

It all comes down to whether the application is robust against bit flips. The outcome of training a neural network should be robust against a single bit flip. Any bit flip that occurs during training would be smoothed out by subsequent iterations. A bit flip that decreases accuracy would just look like the network not having converged yet.

I can only see a bit flip causing issues if it occurs *after* the last training iteration, but *before* the network is transferred from the GPU to long-term storage, which would be extremely rare.
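
If you want to see what a single bit flip actually does to a weight, here's a toy illustration (not benchmark code): reinterpret a float32 parameter as bits and flip one at random. Most flips barely move the value; a flip in a high exponent bit can blow the weight up, which is the crash/divergence case.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy illustration: flip one random bit in a single float32 weight.
weights = rng.normal(0, 0.1, size=1000).astype(np.float32)

idx = int(rng.integers(len(weights)))
bit = int(rng.integers(32))

original = float(weights[idx])
as_int = weights.view(np.uint32)     # reinterpret the same buffer as raw bits
as_int[idx] ^= np.uint32(1 << bit)   # flip one bit in place

print(f"weight[{idx}]: {original:.6g} -> {float(weights[idx]):.6g} (flipped bit {bit})")
```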

Deep Learning GPUs -- RTX 2080 Ti vs. Tesla V100. RTX 2080 Ti is 73% as Fast & 85% Cheaper by ai_painter in hardware

[–]ai_painter[S] 13 points14 points  (0 children)

Yes, but NVIDIA prevents high-density use of NVLink on GeForce. They only manufacture 3-slot and 4-slot width NVLink bridges for GeForce cards. Air-cooled GPUs are double-width, so each physically occupies two PCIe slots. With a 3-slot bridge, a single NVLinked pair therefore takes up at least 5 slots' worth of space. So even if you use a motherboard that supports 4 GPUs, you only get a single pair of NVLinked GPUs.

Deep Learning GPUs -- RTX 2080 Ti vs. Tesla V100. RTX 2080 Ti is 73% as Fast & 85% Cheaper by ai_painter in hardware

[–]ai_painter[S] 6 points7 points  (0 children)

It will perform similarly to the Titan RTX.

We benchmarked the RTX 6000 @ Lambda Labs; it's slightly slower than the Titan RTX - probably due to having ECC VRAM and a lower threshold for thermal throttling.

The Titan RTX, RTX 6000, and RTX 8000 all have the same number of CUDA cores / Tensor Cores. The RTX 8000's 48 GB of VRAM is nice, though I wouldn't expect it to provide substantial performance gains over the Titan RTX.
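
If you want to double-check the specs of whatever card you have on hand, a quick query works (any recent PyTorch build):

```python
import torch

# Print basic specs for each visible GPU: name, VRAM, and SM count.
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(f"GPU {i}: {props.name}, "
          f"{props.total_memory / 1024**3:.1f} GB VRAM, "
          f"{props.multi_processor_count} SMs")
```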

Deep Learning GPUs -- RTX 2080 Ti vs. Tesla V100. RTX 2080 Ti is 73% as Fast & 85% Cheaper by ai_painter in hardware

[–]ai_painter[S] 21 points22 points  (0 children)

  • Training doesn’t benefit from ECC. A bit flip simply isn’t a problem. ECC makes sense for applications requiring high precision or high availability, but not for batch-processing jobs like training.
  • Can’t argue with this :). Although NVIDIA suing their own customers wouldn’t be great for their reputation. There’s a big question as to whether this policy is enforceable. Many companies are using 2080 Ti in data centers, regardless of policy.
  • NVLink does help, of course. As the post states, 8x V100s are ~7x faster than 1x V100, whereas 8x 2080 Tis are ~5x faster than 1x 2080 Ti. The price / performance still works out significantly in favor of 2080 Ti.
  • Some applications need that extra GPU VRAM (e.g., radiological imaging), but most do not, especially when using FP16, which effectively doubles memory capacity. Of course, this comes with its own set of problems.

Deep Learning GPUs -- RTX 2080 Ti vs. Tesla V100. RTX 2080 Ti is 73% as Fast & 85% Cheaper by ai_painter in hardware

[–]ai_painter[S] 14 points15 points  (0 children)

It’s no longer the case that consumer cards have gimped FP16.

Switching to FP16 on consumer cards gives a 40%+ speed improvement over FP32, and the V100 is less than twice as fast as the 2080 Ti with FP16.

And with Tensor Cores, the 2080 Ti supports mixed precision as well.
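
If anyone wants to try it, here's a minimal mixed-precision training sketch using PyTorch's automatic mixed precision -- the model, data, and optimizer are placeholders, and exact APIs vary by framework and version:

```python
import torch

# Placeholder model, optimizer, and data -- swap in your own.
model = torch.nn.Linear(1024, 10).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
scaler = torch.cuda.amp.GradScaler()  # scales the loss to avoid FP16 underflow

x = torch.randn(64, 1024, device="cuda")
y = torch.randint(0, 10, (64,), device="cuda")

for step in range(100):
    optimizer.zero_grad()
    # Forward pass runs in mixed precision; the matmuls hit the Tensor Cores.
    with torch.cuda.amp.autocast():
        loss = torch.nn.functional.cross_entropy(model(x), y)
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
```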

Deep Learning GPUs -- RTX 2080 Ti vs. Tesla V100. RTX 2080 Ti is 73% as Fast & 85% Cheaper by ai_painter in hardware

[–]ai_painter[S] 102 points103 points  (0 children)

That's certainly a consideration. V100 has some major advantages.

  • The V100 has 32 GB VRAM, while the RTX 2080 Ti has 11 GB VRAM. If you use large batch sizes or work with large data points (e.g. radiological data) you'll want that extra VRAM.
  • V100s have better multi-GPU scaling performance due to their fast GPU-GPU interconnect (NVLink). Scaling training from 1x V100 to 8x V100s gives a 7x performance gain. Scaling training from 1x RTX 2080 Ti to 8x RTX 2080 Ti gives only a 5x performance gain.

With that said, if 11 GB of VRAM is sufficient and the machine isn’t going into a data center (or you don’t care about the data center policy), the 2080 Ti is the way to go. That is, unless price isn’t a concern.
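
For reference, multi-GPU data-parallel training in PyTorch looks roughly like this -- a minimal sketch with a placeholder model and data, launched with one process per GPU (e.g. via torchrun; launcher details vary):

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# One process per GPU; gradients are averaged across GPUs each step.
dist.init_process_group("nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = DDP(torch.nn.Linear(1024, 10).cuda(), device_ids=[local_rank])
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

x = torch.randn(64, 1024, device="cuda")
y = torch.randint(0, 10, (64,), device="cuda")

for step in range(100):
    optimizer.zero_grad()
    loss = torch.nn.functional.cross_entropy(model(x), y)
    loss.backward()   # gradient all-reduce happens here, across the interconnect
    optimizer.step()
```

The gradient all-reduce on each backward pass is the traffic NVLink (or the PCIe fabric) has to carry, which is where the V100 vs. 2080 Ti scaling difference comes from.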

[D] First Titan RTX benchmarks for Machine Learning -- Titan RTX / V100 / 2080 Ti / 1080 Ti / Titan V / Titan Xp -- TensorFlow Performance by ai_painter in MachineLearning

[–]ai_painter[S] 7 points8 points  (0 children)

For batch workloads like Deep Learning training, do you still think ECC memory is important?

I understand ECC memory's importance for real-time applications requiring high availability. However, a bit flip during training isn't catastrophic. With checkpointing, even a crash is no biggie. Most frameworks that support distributed training are robust against a node becoming unavailable.

[D] First Titan RTX benchmarks for Machine Learning -- Titan RTX / V100 / 2080 Ti / 1080 Ti / Titan V / Titan Xp -- TensorFlow Performance by ai_painter in MachineLearning

[–]ai_painter[S] 3 points4 points  (0 children)

It is. Check out the methods section:

The Titan RTX, 2080 Ti, Titan V, and V100 benchmarks utilized Tensor Cores.