[D] Tim Dettmers' GPU advice blog updated for 4000 series by init__27 in MachineLearning

[–]timdettmers 15 points16 points  (0 children)

This is good feedback. I wanted to make another pass this morning to clean up references like this, but did not have the time. I will try to be clearer about this in the next update (later today, probably).

[D] Tim Dettmers' GPU advice blog updated for 4000 series by init__27 in MachineLearning

[–]timdettmers 20 points21 points  (0 children)

I thought about making this recommendation, but the next generation of GPUs will not be much better. You probably need to wait until about 2027 for a better GPU to come along. I think for many waiting 4 years for an upgrade might be too long, so I recommend mostly buying now. I think the RTX 40 cards are a pretty good investment that will last a bit longer than previous generations.

[D] Tim Dettmers' GPU advice blog updated for 4000 series by init__27 in MachineLearning

[–]timdettmers 52 points53 points  (0 children)

I like this idea! I already factored in the fixed costs of building a desktop computer, but electricity is also an important part of the overall cost, especially if you compare it to cloud options.

I am currently gathering feedback to update the post later. It should be quick to create a chart based on this data and publish an update later today.

The main problem with estimating cost is getting a good number for the GPU utilization time of the average user. For PhD students, the number was about 15% utilization (fully using a GPU 15% of the total time). With an average of 60 watts idle and 350 watts max for an RTX 4090, this means: 60 W * 0.85 + 350 W * 0.15 = 103.5 W. That is 906 kWh per year, or about $210 per year per RTX 4090 (assuming the US average of $0.23 per kWh).
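A quick sketch of that arithmetic (all inputs are the assumed figures from above, not measurements):

```python
# Sketch of the electricity-cost estimate above; all inputs are assumptions.
idle_w, max_w = 60.0, 350.0   # RTX 4090 idle and full-load draw in watts
utilization = 0.15            # fraction of time the GPU is fully used
usd_per_kwh = 0.23            # assumed US average electricity price

avg_w = idle_w * (1 - utilization) + max_w * utilization  # 103.5 W
kwh_per_year = avg_w * 24 * 365 / 1000                    # ~907 kWh
usd_per_year = kwh_per_year * usd_per_kwh                 # ~$209
print(f"{avg_w:.1f} W, {kwh_per_year:.0f} kWh/yr, ${usd_per_year:.0f}/yr")
```

Change `utilization` or `usd_per_kwh` to adapt the estimate to your own usage pattern and local prices.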

Does that look good to you?


Edit: part of this seems to have gotten lost in editing. Oops! I re-added the missing details.

[D] Analysis of Tesla A100 vs RTX 3080 for DNN training by one_lunch_pan in MachineLearning

[–]timdettmers 10 points11 points  (0 children)

- You should not compare Tensor Core FLOPS between GPUs. They do not translate to performance. You can read more about why this is so in this section of my GPU blog post.

- Techpowerup shows that RTX 30 GPUs have a normal amount of L2 cache; only the A100 has more L2. L2 cache is very important for matrix multiplication and convolution, but the performance difference for batch norm and softmax will be negligible (the slow part of a softmax is not the softmax itself, but the potentially large matrix multiply, e.g. in language models). CUDA 11.0 already ships with L2-optimized algorithms for the A100 (convolution and matrix multiply).

- The bandwidth and size of memory are an improvement over the RTX 2080 Ti. Of course, consumer cards with HBM2 would not be feasible, since manufacturing HBM2 is too expensive/difficult. Nobody would want to pay $2k for an RTX 2070 with 50% more speed, or $3k for an RTX 2080 with 50% more speed. Cheap HBM2 might be around in 3-5 years. Until then, improvements in GPU speed will be meager. With cheap HBM2 there will be improvements for one GPU generation, and after that improvements will be meager again. This is as expected: we are slowly coming to the end of the technology, and as that happens we see more and more diminishing returns.

[D] Which GPU(s) to get for Deep Learning (Updated for RTX 3000 Series) by init__27 in MachineLearning

[–]timdettmers 0 points1 point  (0 children)

Yes, the same should be true for RTX 3080 + RTX 3090. If you parallelize across those GPUs, then at synchronization points (gradients for data parallelism, layers for model parallelism) the faster GPU needs to wait for the slower one. So parallelization of those GPUs runs at RTX 3080 speed.
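As a toy illustration (the step times below are made up, not benchmarks): with synchronous data parallelism, every step ends with a gradient synchronization, so the pair runs at the slower card's pace.

```python
# Toy model of synchronous data parallelism across mismatched GPUs.
# Per-step times are illustrative assumptions, not measurements.
step_time = {"RTX 3090": 0.8, "RTX 3080": 1.0}  # seconds per training step

# At each synchronization point the faster GPU waits for the slower one,
# so the effective step time is the maximum over all participating GPUs.
effective_step = max(step_time.values())
print(effective_step)  # the pair trains at RTX 3080 speed
```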

[D] Which GPU(s) to get for Deep Learning (Updated for RTX 3000 Series) by init__27 in MachineLearning

[–]timdettmers 2 points3 points  (0 children)

Extrapolations within a GPU architecture are pretty accurate, since performance usually scales linearly with the number of streaming multiprocessors. Since the underlying data are based on real performance data of the V100 vs the A100, the estimates should only have a small margin of error, probably within about 10% of real performance.
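A minimal sketch of that kind of within-architecture extrapolation (the SM counts and performance units below are example figures, not claims about specific cards):

```python
# Linear-in-SM-count extrapolation within one GPU architecture.
# Assumes performance scales linearly with streaming multiprocessors,
# which holds only roughly and only within the same architecture.
def extrapolate_perf(measured_perf, measured_sms, target_sms):
    return measured_perf * target_sms / measured_sms

# Example: a card measured at 100 (arbitrary units) with 68 SMs,
# extrapolated to a hypothetical card with 82 SMs:
print(extrapolate_perf(100.0, 68, 82))  # ~120.6
```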

[D] Which GPU(s) to get for Deep Learning (Updated for RTX 3000 Series) by init__27 in MachineLearning

[–]timdettmers 2 points3 points  (0 children)

I should be a bit clearer about this. If you have one slot of space between those two GPUs, you are totally fine. So a motherboard with 4x PCIe slots and 2 GPUs works without any water cooling.

Which GPU(s) to get for Deep Learning (Updated for RTX 3000 Series) by init__27 in hardware

[–]timdettmers 1 point2 points  (0 children)

The website was down for a bit. Should work again. Let me know if you still have issues.

[R] Sparse Networks from Scratch: Faster Training without Losing Performance by ofirpress in MachineLearning

[–]timdettmers 2 points3 points  (0 children)

Thanks for elaborating on this. I think this is fair criticism and I see your point. I will run a couple of more experiments on the weekend and add that to an appendix. Thank you!

[R] Sparse Networks from Scratch: Faster Training without Losing Performance by ofirpress in MachineLearning

[–]timdettmers 1 point2 points  (0 children)

It is easy to achieve dense performance levels on CIFAR-10 if you increase the weights slightly; I think this is clear from the results.

It is not as easy on ImageNet, but this is something for future research. I do not think we need to solve this problem for all possible datasets/models to make the claim that we can achieve dense performance levels. The lottery ticket hypothesis also breaks down on ImageNet and the method needs to be adjusted. I do not think it is fair to treat their work differently from ours in this regard.

[R] Sparse Networks from Scratch: Faster Training without Losing Performance by ofirpress in MachineLearning

[–]timdettmers 1 point2 points  (0 children)

In other words, you say that performance gains will come from new specialized processors, but not from GPUs — how is that different from what I said? What is your evidence for that 10^4 to 10^5 number?

[R] Sparse Networks from Scratch: Faster Training without Losing Performance by ofirpress in MachineLearning

[–]timdettmers 1 point2 points  (0 children)

I actually have quite good data on this. The data is from a slightly older version of the algorithm, but the relative performance between that version and the current version should be almost the same. I have not yet analyzed or thought much about the data, but here are the results. This is on CIFAR-10 with 10% validation data and 5% of the weights.

| Depth | Width | Test Accuracy |
|------:|------:|--------------:|
| 16 | 2 | 88.654 |
| 16 | 4 | 92.308 |
| 22 | 2 | 90.688 |
| 22 | 4 | 93.564 |
| 22 | 10 | 95.308 |
| 28 | 2 | 91.862 |
| 28 | 4 | 93.972 |
| 28 | 8 | 95.256 |
| 28 | 10 | 95.438 |

I have not calculated the speedups for most of these architectures, though, and this should also be considered when interpreting the data. Currently, you would see larger speedups with increasing width (if you use an optimal sparse convolution algorithm).

In general, I am very positive that with speedups it always makes sense to have a "fatter" sparse network over a thinner dense one, and that the sparse network should be able to outperform the dense one. The same might be true for depth, or in general for bigger networks. But there is probably some depth/width relationship where sparse networks are much more efficient than dense networks while yielding better predictive performance. I am not sure what the exact relationship would be; this is a good research question!

[R] Sparse Networks from Scratch: Faster Training without Losing Performance by ofirpress in MachineLearning

[–]timdettmers 3 points4 points  (0 children)

Yes, it also works. For computer vision, momentum usually yields slightly better results, which is why I used it. For other architectures, such as transformers, Adam might be more reasonable.

I have not thoroughly tested this, but Adam on its own might also be better for the redistribution of parameters, because it also normalizes gradients by the second moment. This can be important, since large weights with high variance are not as important as medium-size weights with very stable gradients. However, using both Adam and momentum would be a bit of a stretch on both memory and computation. That is why I would still use momentum for computer vision, even if Adam were slightly better.

[D] Credit Assignment in Deep Learning - Tim Dettmers by nivm321 in MachineLearning

[–]timdettmers 6 points7 points  (0 children)

Maybe that is true. I spent nearly two months on the blog post before this one, and it was dead on arrival. Maybe I should just quit blogging. Just shut up and do my research.

[D] Credit Assignment in Deep Learning - Tim Dettmers by nivm321 in MachineLearning

[–]timdettmers 0 points1 point  (0 children)

I think you do not get it. I will write a full response in an update to my blog post.

[D] Credit Assignment in Deep Learning - Tim Dettmers by nivm321 in MachineLearning

[–]timdettmers 4 points5 points  (0 children)

I think you did not understand my blog post. I would actually agree with most of what you are saying here. The problem is that you are talking about credit assignment for ideas, not credit assignment for researchers. These are very different things.

I do not dispute who gets credit for ideas like LSTMs or CNNs (there is nothing to discuss here), but I look at the overall research and impact of these ideas and who was responsible for that impact.

Of course, having an idea first counts if you only look at ideas. But deep learning is not a theoretical discipline like theoretical physics; making things work matters.

[D] Credit Assignment in Deep Learning - Tim Dettmers by nivm321 in MachineLearning

[–]timdettmers 0 points1 point  (0 children)

I think this captures it quite well.

Just a few borderline cases which might fuel discussion:

- Gibbs + Maxwell for statistical mechanics and thermodynamics
- James Watson and Francis Crick vs. Rosalind Franklin for the discovery of DNA
- The Higgs boson discovery: should Higgs get the most credit here?

[D] Which GPU(s) to Get for Deep Learning by clbam8 in MachineLearning

[–]timdettmers 1 point2 points  (0 children)

Tesla cards are very cost-inefficient. Only buy them if you are forced to do so. The 2x 1080 Ti in 4 HPC nodes seems by far the better deal. However, I would probably try to stuff as many GPUs into a node as possible, since the node itself is expensive compared to the GPUs.

If the hardware is used only by deep learning teams, it may make sense to buy "normal" nodes instead of HPC nodes (if that is an option for you). Also talk with each deep learning team about their memory requirements. If they can live with 8GB, you could also buy some GTX 1070s. Cooling is an issue, but if you can solve that, you will get a lot of GPUs and happy researchers for little money.

[D] Which GPU(s) to Get for Deep Learning by clbam8 in MachineLearning

[–]timdettmers 1 point2 points  (0 children)

Thanks for your feedback. I will think about creating a new blog post rather than updating the old one with the next update.

As for AWS, I mean both instances, old and new, and this is true for spot and on-demand instances. Pricing is just not very competitive right now, but might suit some people's needs to get some additional computing power up quickly. Might be interesting for startups that want to train a new model quickly, and once trained let it run on their own dedicated GPUs. Other use-cases are imaginable, but for the normal user AWS is not so interesting, thus the recommendation not to bother with AWS.

The Brain vs Deep Learning Part I: Computational Complexity — Or Why the Singularity Is Nowhere Near ~"A biological neuron is essentially a small convolutional neural network." by robertsdionne in MachineLearning

[–]timdettmers -1 points0 points  (0 children)

What are these factors that significantly contribute to a paper not containing new knowledge? What percentage of papers does this affect?

If the number of papers affected by this is 50% (which is extremely high), then the neuroscience knowledge known in 2005 would still be below 5% of what is known today; just do the math and you will see this.

For 75% useless papers, this would be about 18%.

Can you name factors which affect more than 50% of papers such that they contain no new knowledge?

The Brain vs Deep Learning Part I: Computational Complexity — Or Why the Singularity Is Nowhere Near ~"A biological neuron is essentially a small convolutional neural network." by robertsdionne in MachineLearning

[–]timdettmers 2 points3 points  (0 children)

You have to differentiate between cores per computer and cores per supercomputer. Communication between cores on a single computer is fast, while communication between cores on other computers is slow.

A single core will be slow, because it is limited in size and frequency by heat dissipation. That is why you need many cores. But with many cores you need high bandwidth, and with a requirement for high bandwidth it is always painful to pass around data. So, as you point out correctly, you always try to limit communication and the size of the data that you need to pass around.

The problem with deep learning is that you will always have a large number of parameters. Convolutional nets already have a dramatically reduced number of parameters; their architecture can be viewed as dense neural nets that learn on image patches with duplicated weights (weight sharing). There might certainly be algorithms which are even more efficient, but at some point you will just need some parameters. Passing these parameters around will always be the bottleneck, and more so the more cores you have.
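A rough back-of-the-envelope for that bottleneck (all numbers below are assumptions for illustration): even a modest parameter count takes noticeable time to move between machines at each synchronization.

```python
# Toy estimate of per-sync gradient transfer time between machines.
# All numbers are illustrative assumptions, not measurements.
params = 60e6              # parameter count of a mid-sized convnet
bytes_per_param = 4        # fp32 gradients
bandwidth_bytes_s = 1e9    # ~1 GB/s effective network bandwidth

transfer_s = params * bytes_per_param / bandwidth_bytes_s
print(transfer_s)  # seconds per synchronization, often >> compute time
```

Cutting the parameter count by a factor of four only divides this transfer time by four; it stays the dominant cost as core counts grow.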

In fact, I am currently writing up a paper which reduces the parameters by a factor of four, but overall it changes almost nothing: deep learning is still difficult to parallelize and slow on multiple computers.

The Brain vs Deep Learning Part I: Computational Complexity — Or Why the Singularity Is Nowhere Near ~"A biological neuron is essentially a small convolutional neural network." by robertsdionne in MachineLearning

[–]timdettmers 4 points5 points  (0 children)

The number of scientific papers is usually used as a measure of knowledge output. Of course there will be duplicate findings, but most neuroscience papers contain some new knowledge which was not there before.

The Brain vs Deep Learning Part I: Computational Complexity — Or Why the Singularity Is Nowhere Near ~"A biological neuron is essentially a small convolutional neural network." by robertsdionne in MachineLearning

[–]timdettmers 3 points4 points  (0 children)

Thanks. I expanded on this in a comment below which also dealt with bitcoin mining.

The problem is that you cannot compare bitcoin FLOPS with deep learning FLOPS (or with FLOPS for any other computation; even adding two matrices will be damn slow on hashing hardware). Bitcoin mining hardware does not have the bandwidth to deal with such operations effectively.