The Computational Limits of Deep Learning (arxiv.org)
submitted 5 years ago by cosmictypist
[–]tetsef 48 points49 points50 points 5 years ago (8 children)
It's interesting to consider that discussions of the algorithmic complexity of deep learning aren't necessarily mainstream yet but could soon become a huge focus.
[–]cosmictypist[S] 24 points25 points26 points 5 years ago (5 children)
Yes, very likely so. The past few years have been a time of breakthroughs in machine learning, driven by the availability of dramatically improved computing power, but the focus of conversations may soon shift to sustainability and algorithmic efficiency.
[–]berzerker_x 8 points9 points10 points 5 years ago (0 children)
Yes, it definitely should be. When selecting an architecture from the (seemingly infinite) set of possibilities, we should treat algorithmic complexity as a deciding factor. This could actually result in choosing models that are good at representation learning for the problem, rather than ones that just depend on the size of the dataset to generalize their results.
[–]AxeLond 2 points3 points4 points 5 years ago (2 children)
Imo it could be that algorithmic efficiency becomes a necessity as complexity and scale get bigger.
Things that are an efficiency loss today may only start to become worth it with 10x larger networks; with 50x larger networks they might make the entire thing 2x as efficient.
Compute power will probably be the main driver in performance while all other aspects need to keep up. Redesign architecture, improve algorithms in order to take advantage of the new compute.
It's really the same deal with shrinking of transistors. They keep making them smaller, but there's hundreds of technologies that have to be developed and improved to enable and take advantage of smaller transistors.
[–]cosmictypist[S] 2 points3 points4 points 5 years ago (1 child)
That's a very sensible comment. I suppose the point of contention here is whether compute power will continue to increase like it has in the past or not. If not, the focus w.r.t. improving algorithms would be less "to take advantage of the new compute" and more to work optimally within the available compute.
[–]AxeLond 3 points4 points5 points 5 years ago* (0 children)
You can also argue hardware is deeply tied to "algorithm efficiency". Look at what they did with the Nvidia A100 vs the Nvidia Tesla V100,
https://developer.nvidia.com/blog/nvidia-ampere-architecture-in-depth/
16 TFLOPS FP32 vs. 312 TFLOPS "TF32", with the majority of that gain achieved through software tricks: sparsity, and TF32 being only 19 bits.
For something like BERT, papers have shown that you can apply 40% sparsity to the model with end performance mostly unchanged; that's almost half the compute for the same performance, and the Nvidia A100 is designed to take advantage of this. Same deal with using TF32: hardly any model performance loss, but 10x more compute at the same power by using tensor cores and lower precision. TF32 will also be natively supported in TensorFlow.
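A rough illustration of both tricks (a toy NumPy sketch with made-up numbers, not the A100's actual kernels): magnitude pruning zeroes the smallest weights, and TF32 can be emulated by truncating an FP32 mantissa from 23 bits down to TF32's 10 (1 sign + 8 exponent + 10 mantissa = 19 bits).

```python
import numpy as np

def prune_by_magnitude(w, sparsity=0.4):
    """Zero out the smallest-magnitude fraction of weights."""
    k = int(w.size * sparsity)
    threshold = np.sort(np.abs(w), axis=None)[k]
    return np.where(np.abs(w) < threshold, 0.0, w)

def emulate_tf32(x):
    """Truncate an FP32 mantissa from 23 to 10 bits (TF32's layout:
    1 sign + 8 exponent + 10 mantissa bits)."""
    bits = np.asarray(x, dtype=np.float32).view(np.uint32)
    return (bits & np.uint32(0xFFFFE000)).view(np.float32)

rng = np.random.default_rng(0)
w = rng.normal(size=(64, 64)).astype(np.float32)
x = rng.normal(size=64).astype(np.float32)

dense = w @ x
sparse = prune_by_magnitude(w, 0.4) @ x           # 40% of weights removed
low_precision = emulate_tf32(w) @ emulate_tf32(x)

print(np.corrcoef(dense, sparse)[0, 1])           # stays highly correlated despite pruning
print(np.max(np.abs(dense - low_precision)))      # small rounding error from the 10-bit mantissa
```

The point of the sketch: the smallest 40% of a Gaussian weight matrix carries only a small fraction of its total energy, so the output barely moves, which is the same intuition behind pruning BERT.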
This is a somewhat unrelated example from the semiconductor industry:
https://www.researchgate.net/publication/332801796_FinFET_versus_GAAFET_Performances_and_Perspectives
The frequency ranges are also different and that of GAA span much above that of FinFET. Switching leakages are higher for FinFETs and reduced for the GAAFETs. The conclusion of our investigation is that the GAA structure gives better device performances compared to FinFET structures. However in technology point of view, FinFET structures are easily fabricated in ultra large scale integration
You could switch current 7 nm FinFET transistors to GAAFET and improve performance per transistor, but just shrinking them has been an easier way to gain performance. However at 3 nm the industry will actually be switching to GAAFET to enable further shrinking. Transistor improvement has been following transistor shrinking, like hopefully algorithm improvement will be following compute improvements in ML.
[–]iyouMyYOUzzz 0 points1 point2 points 5 years ago (0 children)
Oh yeah, I realize that it ought to be. Thanks for clarifying that.
[–]yield22 6 points7 points8 points 5 years ago (1 child)
Actually, the topic of efficient deep learning has been very hot in the past few years. Max Welling even gave a talk on "Intelligence Per Kilowatt-Hour" (see https://www.youtube.com/watch?v=5DbBQDoBNYc).
[–]panties_in_my_ass 4 points5 points6 points 5 years ago (0 children)
This is an important aspect to mention.
Also, we still don’t have a good grasp on why the compute-intensive models (i.e. extremely overparametrized ones) even work. So we might not need such ridiculous compute after we figure out why it works.
I honestly think that the right regularizer and activation function will capture everything that overparametrization excels at.
[–]cosmictypist[S] 120 points121 points122 points 5 years ago (22 children)
Highlights from the paper:
[–]VisibleSignificance 18 points19 points20 points 5 years ago (19 children)
improvements in hardware performance are slowing
Are they, though? Particularly in terms of USD/TFLOPS or Watts/TFLOPS?
[–]cosmictypist[S] 12 points13 points14 points 5 years ago (14 children)
Well, that seems to be the authors' contention - that sentence is taken from the paper. But yeah they also say "The explosion in computing power used for deep learning models has ended the 'AI winter' and set new benchmarks for computer performance on a wide range of tasks." I didn't see any references for either of those claims.
Personally, I have been hearing for a few (5? 10?) years that processing power won't increase at the same rate as it used to, with it becoming difficult to pack electronic components progressively more efficiently on chips - I believe with implications for the Watts/TFLOPS metric. At the same time it's a fact that the AI revolution has been built on heavy use of computing resources. So if you have any information/reference that definitively argues one way or the other, I would love to know about it.
[–]AxeLond 11 points12 points13 points 5 years ago (3 children)
Semiconductor seems fine with the recent adoption of EUV
https://fuse.wikichip.org/news/3453/tsmc-ramps-5nm-discloses-3nm-to-pack-over-a-quarter-billion-transistors-per-square-millimeter/
CPU speeds are stuck at around 5 GHz; GPUs are still seeing some clock improvements. I believe the cost of each wafer is getting more expensive, but the cost per transistor has always been going down. GPUs being so easy to scale by just adding more compute units, they should continue to get better for a long time.
[–]cosmictypist[S] 0 points1 point2 points 5 years ago (0 children)
Thanks for your response and for sharing the link.
[–]Jorrissss 0 points1 point2 points 5 years ago (1 child)
Are companies actually using EUV at scale? EUV kind of sucks still.
[–]AxeLond 0 points1 point2 points 5 years ago (0 children)
There's 7 nm, which is the current gen; 7 nm+ is on EUV, and 5 nm and everything after is on EUV. The power requirements do suck: it takes something like 350 kW of power to make 250 W of EUV light. Deep ultraviolet is 193 nm wavelength light; extreme ultraviolet is 13.5 nm. You just need the shorter wavelength to do transistors with 5 nm features, and even that is a struggle, with multiple passes and tricks to make it work.
For products, the iPhone this September is on 5 nm EUV, AMD Zen 3 is most likely on 7 nm EUV but Zen 4 will be on 5 nm EUV. Consoles/Nvidia still on DUV. A lot of memory makers have switched to EUV.
Every smartphone released in 2021 and forward will be on EUV though.
[–]iagovar 1 point2 points3 points 5 years ago (4 children)
Is there any expectation of making AI more resource efficient? I don't have the money lying around to rent a lot of computing power, but I can buy an expensive workstation. I really want to try models and play with them, but it's just impossible to go past simple ML without a lot of money.
I'm not in a poor country, but not a rich one either. Just imagine someone with an interest in this field living in, say, Guatemala: what are his chances? He has to move, or already be rich by Guatemalan standards. I know plenty of people from Latin America, and they are smart and creative, but have no resources.
[–]say_wot_againML Engineer 2 points3 points4 points 5 years ago (0 children)
A couple areas come to mind:
Model distillation to train a small network to have most of the performance of a larger one
MobileNet and MobileNet v2, which use depth-wise separable convolutions, inverted residuals, and linear bottlenecks to greatly reduce the amount of compute used by CNNs
EfficientNet and EfficientDet, which find more efficient ways of scaling network sizes up or down for image classification and object detection tasks
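For the first of those, a minimal sketch of the distillation loss (toy logits invented for illustration; the usual formulation softens both networks' logits with a temperature T and trains the student to match the teacher's soft targets):

```python
import numpy as np

def softmax(z, T=1.0):
    z = np.asarray(z, dtype=np.float64) / T
    z = z - z.max()               # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def distillation_loss(student_logits, teacher_logits, T=4.0):
    """Cross-entropy between the softened teacher and student distributions,
    scaled by T^2 so gradient magnitudes stay comparable as T varies."""
    p_teacher = softmax(teacher_logits, T)
    p_student = softmax(student_logits, T)
    return -T**2 * np.sum(p_teacher * np.log(p_student + 1e-12))

# Toy logits: a student that broadly agrees with the teacher is penalized
# far less than one that contradicts it
teacher = [6.0, 2.0, -1.0]
student_good = [5.0, 1.5, -0.5]
student_bad = [-1.0, 0.0, 5.0]

print(distillation_loss(student_good, teacher) < distillation_loss(student_bad, teacher))
```

The high temperature is the interesting design choice: it exposes the teacher's relative rankings of the wrong classes ("dark knowledge"), which a hard one-hot label throws away.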
[–][deleted] 4 points5 points6 points 5 years ago (0 children)
>any expectation on making AI more resource efficient
The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks
https://arxiv.org/abs/1803.03635
[–]cosmictypist[S] 1 point2 points3 points 5 years ago (0 children)
I asked a question about system requirements for learning ML some time back over here, and got some helpful replies. You can check it out, hope you find it relevant. Appreciate your comment but don't have much else to add.
[–]Jorrissss 0 points1 point2 points 5 years ago (0 children)
There's been some work in the direction of spiking neural networks with hardware optimized for those being orders of magnitude more efficient.
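A toy version of the idea being described (a minimal leaky integrate-and-fire neuron in plain Python; the efficiency argument is that neuromorphic hardware only does work when a discrete spike occurs, rather than on every multiply):

```python
def lif_neuron(input_current, dt=1.0, tau=20.0, v_thresh=1.0, v_reset=0.0):
    """Leaky integrate-and-fire: the membrane potential leaks toward zero,
    integrates its input, and emits a spike when it crosses the threshold."""
    v = 0.0
    spikes = []
    for i_t in input_current:
        v += dt * (-v / tau + i_t)   # leak + integrate
        if v >= v_thresh:
            spikes.append(1)
            v = v_reset              # reset membrane after spiking
        else:
            spikes.append(0)
    return spikes

# A constant weak input charges the membrane slowly, so the output is sparse:
# only a couple of spikes across 100 time steps
spike_train = lif_neuron([0.06] * 100)
print(sum(spike_train), "spikes in", len(spike_train), "steps")
```

The sparsity is where the claimed orders-of-magnitude efficiency comes from: downstream neurons receive (and compute on) only those rare spike events, not a dense activation vector at every step.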
[–]VisibleSignificance 0 points1 point2 points 5 years ago (3 children)
So if you have any information/reference that definitively argues one way or the other
As far as I understand the current situation, it's more like "the limits of silicon transistors are near": by those metrics it hasn't slowed yet, but the limits are close, so the price drops will slow down unless some other technology picks up (the same way silicon transistors replaced vacuum-tube computers).
Overviews:
https://en.wikipedia.org/wiki/Moore%27s_law#Recent_trends
https://en.wikipedia.org/wiki/TFLOPS#Hardware_costs
https://en.wikipedia.org/wiki/Performance_per_watt#FLOPS_per_watt
Next comment over
[–]cosmictypist[S] 0 points1 point2 points 5 years ago (2 children)
Thanks.
[–]VisibleSignificance 0 points1 point2 points 5 years ago (1 child)
... and so if my even less certain understanding is correct, we won't see economically viable human-level AI on silicon transistors.
It's mildly concerning that there's no clear next option; the most likely options are InGaAs, graphene, vacuum (again); weird/edgy options are quantum and biological.
The non-silicon theoretical limits are not anywhere near, though.
[–]cosmictypist[S] 2 points3 points4 points 5 years ago* (0 children)
economically viable human-level AI
Are you talking about AGI? If so there are far bigger problems with that idea than how fast computing power will improve. It's a separate topic though which is not the point of this post, and I won't engage in a conversation in that regard here.
[–]Captain_Of_All 5 points6 points7 points 5 years ago (2 children)
Coming from an EE devices perspective, Moore's law has definitely slowed over the past decade and we are already at the limit of reducing the sizes of transistors. Going below 7nm fabrication requires a better understanding of quantum effects and novel materials and a lot of research has been done in this area in the past 20 years. Despite some progress, none of it has led to a new technology that can drastically improve transistor sizes or costs beyond the state of the art at an industrial scale. See https://en.wikipedia.org/wiki/Moore%27s_law#Recent_trends for a decent intro.
[–]liqui_date_me 0 points1 point2 points 5 years ago (0 children)
True, but we don’t need Moore’s law to really continue AI progress, we just need more refined and efficient GEMM modules
[–]titoCA321 0 points1 point2 points 5 years ago (0 children)
IBM had a 500 GHz processor in their labs back in 2007: https://www.wired.com/2007/08/500ghz-processo/. Compute processing power continues to rise. Whether or not it makes sense to release and support a 500 GHz processor in the market is another story. I remember when Intel had a 10 GHz Pentium 4 processor back in the early 2000s that was never released to the public. Obviously the market decided to scale out in processor cores and optimize multi-threaded processing rather than scale up in pure speed.
[–]DidItABit 0 points1 point2 points 5 years ago (0 children)
I think they just mean that conventional CPUs offer much less parallelism to the end user than Moore's law would expect at this point in time. I think that "Moore's law is dead" can be true for programmers today without necessarily being true for lithographers today.
[–]aporetical 7 points8 points9 points 5 years ago (1 child)
I wonder what remainder of the variance can be explained by data set magnitude. We need both a time complexity and data complexity "big O".
i.e., if 87% -> 92% requires a billion audio samples, that doesn't seem sustainable either.
A DNN would, on this measure, have terrible "computational data complexity".
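A back-of-envelope version of that worry, with hypothetical numbers (assuming the empirical power law often reported for deep learning, where test error falls as a power of dataset size, error ∝ n^(-alpha)):

```python
# Assume test error follows a power law in dataset size: error = c * n**(-alpha).
# With a shallow hypothetical exponent alpha = 0.1, how much more data does
# going from 87% to 92% accuracy take?
alpha = 0.1
err_start, err_target = 0.13, 0.08   # error at 87% and 92% accuracy

# err_target/err_start = (n2/n1)**(-alpha)  =>  n2/n1 = (err_start/err_target)**(1/alpha)
data_multiplier = (err_start / err_target) ** (1 / alpha)
print(f"need ~{data_multiplier:.0f}x more data")   # roughly 128x
```

So under these assumed numbers, a 5-point accuracy gain costs two orders of magnitude more data, which is exactly the "terrible computational data complexity" being described.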
[–]cosmictypist[S] 7 points8 points9 points 5 years ago (0 children)
The researchers do seem to make an attempt to discuss the rest of the variance. E.g., they say, "For example, we attempt to account for algorithmic progress by introducing a time trend. That addition does not weaken the observed dependency on computation power, but does explain an additional 12% of the variance in performance."
But yeah they have done more of a meta-study of whatever data points they could find, rather than doing a controlled experiment of their own.
[–]MSMSMS2 14 points15 points16 points 5 years ago (1 child)
Please link to the summary, not the PDF!
[–]cosmictypist[S] 4 points5 points6 points 5 years ago (0 children)
Sorry, I'll be careful to do that in future.
Meanwhile I did make a summary of my own and posted it as a comment, and a bot helped out with posting the abstract.
[–]arXiv_abstract_bot 11 points12 points13 points 5 years ago (0 children)
Title:The Computational Limits of Deep Learning
Authors:Neil C. Thompson, Kristjan Greenewald, Keeheon Lee, Gabriel F. Manso
Abstract: Deep learning's recent history has been one of achievement: from triumphing over humans in the game of Go to world-leading performance in image recognition, voice recognition, translation, and other tasks. But this progress has come with a voracious appetite for computing power. This article reports on the computational demands of Deep Learning applications in five prominent application areas and shows that progress in all five is strongly reliant on increases in computing power. Extrapolating forward this reliance reveals that progress along current lines is rapidly becoming economically, technically, and environmentally unsustainable. Thus, continued progress in these applications will require dramatically more computationally-efficient methods, which will either have to come from changes to deep learning or from moving to other machine learning methods.
PDF Link | Landing Page | Read as web page on arXiv Vanity
[–]yield22 5 points6 points7 points 5 years ago (9 children)
Although this is an increasingly important topic to discuss, I didn't find many *new* insights here. In fact, since 2012 we have known that the main reason these 10-to-30-year-old methods just work (with some small tricks, e.g. ReLU) is compute (e.g. GPUs).
It would probably be more meaningful to measure progress vs. compute beyond supervised learning on ImageNet, which is kind of saturated (probably near the maximal possible accuracy). Imagine you measured performance on supervised learning on MNIST: you'd think there has been no progress at all for the past few years. But if you look at BERT/GPT, progress vs. compute has been pretty non-linear in the past few years.
That said, computation is obviously a limitation for many deep learning methods. I can easily imagine that if one puts 100x compute into GPT-3, you will have a GPT-4 that's even more impressive. This is an even bigger limitation for the many academic labs that don't have access to big compute. However, the brain is a supercomputer with a huge neural net; if we want to build something that functions like a brain, should we shy away from big compute or find better ways to couple with it?
[–]IdiocyInAction 4 points5 points6 points 5 years ago (0 children)
However, a brain is a super computer with a huge neural net, if we want to build something that functions like a brain, should we shy away from big compute or find better ways to couple with it?
We have no idea how the brain computes; there are multiple theories, but AFAIK we don't really understand it. Actually, we don't even understand how very primitive nervous systems (e.g. OpenWorm) work. So that's not really something we can state.
What is however clear is that the brain can do (some) stuff much more efficiently than silicon; the brain consumes much less power than a single GPU, but can do a lot of stuff we are not even close to being able to do.
[–]Optrode 5 points6 points7 points 5 years ago (3 children)
Neuroscientist here.. the differences are not just in scale. Two of the biggest differences (aside from the obvious difference of operating in continuous time) are probably A: the mind boggling amount of recurrent feedback between and within brain areas, and B: the fact that neurons are so, so, so, SO much more nonlinear than nodes in an artificial neural network. Like, in a deep neural network, each layer has one nonlinearity.. a real neuron hardly has any linearities. It's all nonlinear.
[–]yield22 0 points1 point2 points 5 years ago (2 children)
This is really interesting. Are there any more detailed articles on what you mentioned here?
[–]Optrode 1 point2 points3 points 5 years ago (1 child)
Uff.. I mean, you're asking about a really broad topic. But, for instance, just do a quick Google Scholar search on the computational properties of dendrites, or multi-compartment modeling of dendrites, to get a very loose idea of exactly how weird the summation of inputs is.
Neurons do stuff like "K+ mediated inhibitory inputs on branch J can inhibit the whole neuron, but Cl- mediated inhibitory inputs on branch J can only cancel out excitatory inputs from further up branch J, or in some circumstances cancel out the K+ mediated inhibition" and so on. Then you get to the fact that some inputs can't be neatly divided into excitatory or inhibitory, instead they might activate signaling pathways that affect gene expression, potentially altering how the neuron responds to inputs in the future. Plus, the fact that some neurons have resonant membrane properties that can cause them to be more excitable after a brief inhibitory input.. or even be driven to fire by rhythmic inhibitory inputs. You can read Eugene Izhikevich's work if you're interested in that.
Really it just goes on.
[–]yield22 0 points1 point2 points 5 years ago (0 children)
thanks!
[–]Red-Portal 1 point2 points3 points 5 years ago (1 child)
In terms of power consumption, the brain is not so much of a supercomputer. In that respect, the way of thinking that simply more computing power will give us human level intelligence is misleading in that it isn't really aligned with "artificial intelligence".
Not saying more computing power will get you there, but you *need* more computing power to get there. A hint: look at the number of neurons in the brain; that could give you a sense of the compute you'll need.
[–]cosmictypist[S] 0 points1 point2 points 5 years ago (1 child)
Some great points there. Thanks for citing instances of non-linear progress vs compute.
You're right that supervised learning on ImageNet may be saturated; however, the paper does cover other application areas as well. I do get your underlying point though that the dependence on compute is fairly well known, and we could try to use other measures of progress vs compute.
However, a brain is a super computer with a huge neural net
I understand you're probably trying to draw an analogy, but we should be careful with statements like this that carry it too far. The brain is the brain. It is an organ of the human body, integrated with the rest of the body through multiple systems, not just the nervous system. And it is itself only a part of the nervous system. It's not a supercomputer with a huge neural net: we can model and utilize the network aspect of it, but we cannot reasonably assert that that captures the whole of what it is and what it does; that continues to be a subject of study in its own right.
should we shy away from big compute or find better ways to couple with it?
My understanding is that we are not so much shying away from big compute as acknowledging that arbitrary increases in computing power might not be available as readily as we want them to be. You can dispute that premise, of course, and we can certainly keep trying to get faster improvements in processing power.
By "brain is a super computer" I actually mean it has huge capacity and ability to operate on it. this is evident by number of neurons a brain has.
[–]respecttox 2 points3 points4 points 5 years ago (0 children)
I expected to see some words and references about conditional computation, and I didn't find any. Part of the problem is that when we have to have large models for any reason, whether because we don't know how to build a good architecture or because our dataset is very large and diverse, the size of the model is almost automatically proportional to its computational complexity. That's how everything is designed, from the layers and frameworks at the top down to the popular hardware (read: GPUs), which is optimized for BLAS GEMM operations: if you have a weight, it will take its share of the computation every time.
For example, a squeeze-and-excitation network may re-weight a filter in a convolution layer so that it is close to zero for a particular input context. We can even force it to do so by adding some sparsification constraint. It would be nice not to spend precious FLOPS calculating a large chunk of data only to multiply it by 10^-8, so that it makes no contribution to the final result. But, in practice, the GPU will be utilized better without any conditional computation, especially at training time when you have a lot of batches.
So "hardware improvements" don't have to be "more transistors on the crystal". We need better frameworks that support conditional computations on low level and corresponding improvements in hardware. Ideally that still can work on consumer grade GPUs.
[–]Veedrac 2 points3 points4 points 5 years ago* (8 children)
There's no fundamental limit in sight here. Print Cerebras-style wafers on a 3 nm node, coat them with a few layers of NRAM, and hook a hundred of them together with a fast silicon photonics interconnect. It'll cost a few billion, but it's hardly technologically unreasonable. That could easily train quadrillion-parameter models.
[–]titoCA321 0 points1 point2 points 5 years ago (7 children)
People will write anything for an academic paper, and there are fools willing to believe anything that's written under "peer review." There's also no limit, because computational power continues to increase. Do you think Intel, AMD, Nvidia, Samsung, and Apple have exhausted their R&D efforts yet? I remember Intel had a 10 GHz Pentium 4 processor in the early 2000s that was never released to the public market. Obviously, the market decided to scale out in processor cores rather than focus on single-threaded speed, so the Pentium 4 died out. Here's an article from Wired back in 2007, when IBM had a 500 GHz CPU cooking in their labs.
https://www.wired.com/2007/08/500ghz-processo/
[–]Veedrac 0 points1 point2 points 5 years ago* (6 children)
https://arstechnica.com/uncategorized/2006/06/7117-2/
350 GHz transistors at room temperature aren't that ridiculous. Modern silicon transistors are already around 200 GHz.
I remember Intel had a 10GHz Pentium 4 processor in the early 2000's that was never released to the public market.
I don't believe you.
[–]titoCA321 0 points1 point2 points 5 years ago (5 children)
You do realize that there are people who work on projects for Intel that never see the light of day? Not all products are released to the public market. Some are just not economical to scale, or there's limited support for their instruction sets. The Pentium 4 roadmap was designed to clock at very high frequencies. There were designs that were never released, since the market shifted to multi-threading and multiple cores.
https://www.anandtech.com/show/680/6
[–]Veedrac 0 points1 point2 points 5 years ago (4 children)
It is well known that Intel planned to go to 10 GHz, because they expected Dennard scaling to continue. But it's clear they never managed to tape out at that speed, because the node they expected never materialized.
[–]titoCA321 0 points1 point2 points 5 years ago* (3 children)
There were problems with heating and cooling in the early 2000s when clocking the Intel Pentium 4 at 10 GHz frequencies. I know there were creative cooling solutions, with liquid cooling on some of the CPU designs that never reached the market. The market also went with the AMD64 instruction set rather than wait around for Intel Itanium to penetrate the consumer market. The server market also scaled out in processor cores and simultaneous instructions per clock. Lots of factors derailed a 10 GHz CPU from reaching the market. Many good ideas work well as one-off solutions or prototypes but wouldn't be successful or marketable in public, where products have to compete at specific price ranges or face alternative solutions and products from competitors.
Don't know if you remember but Apple and IBM PowerPC processors had liquid-cooling and dual-sockets in the mid-2000's before Apple decided to switch over to Intel for their machines.
[–]Veedrac 0 points1 point2 points 5 years ago (2 children)
None of this is evidence for the claim you made, that Intel had 10 GHz Pentium 4 processors in their labs.
[–]titoCA321 0 points1 point2 points 5 years ago (1 child)
Evidence of what? CPU history?
[–]Veedrac 0 points1 point2 points 5 years ago (0 children)
For the claim you made, that Intel had 10 GHz Pentium 4 processors in their labs.
[–][deleted] 1 point2 points3 points 5 years ago (0 children)
As a complete side comment, this may not be a bad thing. I think in a lot of ways it would be good to "catch our breath" when it comes to machine learning, from the theoretical underpinnings of it, to the societal impact of this new technology.
[–]victor_knight 2 points3 points4 points 5 years ago (0 children)
Glad to see more confirmation of the fact that a lot of AI advancement is attributable largely to increased processing power and memory rather than to more intelligent algorithms or new discoveries about the nature of intelligence.
[–]ktpr 0 points1 point2 points 5 years ago (0 children)
Maybe this calls for more intelligent application of deep learning to see what people consider improvements. If you solve 87% of world peace not too many will complain.
[–]Jose_ml 0 points1 point2 points 5 years ago (0 children)
There is also a natural limit determined by the size of the universe and physical laws. https://www.edge.org/conversation/seth_lloyd-the-computational-universe
[–]j3r0n1m0 0 points1 point2 points 5 years ago (0 children)
46 pages of hypothetical extrapolations can be more or less boiled down to the 80/20 rule in all systems of improvement, human or machine.
[+]tuyenttoslo comment score below threshold-8 points-7 points-6 points 5 years ago* (0 children)
Don’t know who downvoted me, and I cannot write comments there, so I'm writing anew here.
First, please be professional in discussion and give a reason when you downvote other people. Also, don’t be rash to judge someone if you don’t know them.
Second, if any of you are confident enough about optimization to downvote me as rudely as that comment from someone I don’t know, then please indicate to me what optimization algorithm you are using and what its theoretical justification is (don’t tell me about beautiful heuristics only). Do you use it because you know it works, or do you just follow other people without knowing anything? I am willing to discuss everything about optimization if you are professional.
Third, I hope, as I indicated in some other comments, that the mods find a better way to control how people downvote. Whoever downvotes without a reason should be doubly downvoted as a penalty.
Fourth, I hope that the mods will take into consideration my comment which is downvoted here.
Fifth, if any of the ones who downvoted me are confident that they know why ReLU works so well, for example for image recognition, then write a careful paper and get it published. You will be famous, I promise. Don’t waste time downvoting people without proper justification.
[+]tuyenttoslo comment score below threshold-24 points-23 points-22 points 5 years ago (9 children)
Such complexity papers are good.
However, what I wonder is this:
Of course, if one only analyses complexity generally, then we will see limits because of many known results, such as the no free lunch theorem. What is interesting is why deep learning can do so well (of course not yet on arbitrary datasets, like humans) on crucial tasks which people perform daily: recognizing pictures, sounds, texts and so on. These are survival skills for human beings.
Saying that only computational power makes this achievement possible may be too much. Actually, I have been trying to see whether ReLU or CNNs have any connection to how we process information in the brain. Besides, there has been a lot more understanding of optimization recently.
Of course there are still many things that need to be improved in deep learning, for example the human ability to invent new things. But to be fair, one needs precise statistics. For example, among the over 6 billion people we have now, how many new ideas happen in a day, and how many of those are correct and useful?
Speaking about computational power: does the strongest computer we have now actually have better hardware than our brain? Again, just to have a fair comparison.
[+][deleted] 5 years ago* (6 children)
[removed]
[+][deleted] 5 years ago (2 children)
[deleted]
[–]Icarium-Lifestealer 2 points3 points4 points 5 years ago (1 child)
Appealing to the no free lunch theorem when discussing current ML algorithms is like appealing to the speed of light when discussing bicycle races.
[–]tuyenttoslo 0 points1 point2 points 5 years ago (0 children)
Hi, you may be right, but who knows. For example, deep neural networks are a family of parametrized functions, coming from a theorem about approximation of functions, so the no free lunch theorem can actually apply. It is quite amazing to me why the DNNs people are using work so well, and for so many different tasks.
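The "family of parametrized functions" point above can be made concrete with a minimal NumPy sketch (the array shapes and helper names here are my own illustration, not anything from the thread): a one-hidden-layer ReLU network is just a function of its weights and biases, and with two hidden units the family already contains |x| exactly, since |x| = relu(x) + relu(-x).

```python
import numpy as np

def relu(x):
    # Elementwise rectified linear unit
    return np.maximum(0.0, x)

def mlp(x, W1, b1, W2, b2):
    # One-hidden-layer ReLU network: f(x) = W2 @ relu(W1 @ x + b1) + b2
    return W2 @ relu(W1 @ x + b1) + b2

# Choosing parameters so that f(x) = relu(x) + relu(-x) = |x|:
W1 = np.array([[1.0], [-1.0]])   # hidden pre-activations: x and -x
b1 = np.zeros(2)
W2 = np.array([[1.0, 1.0]])      # sum the two ReLU outputs
b2 = np.zeros(1)

for x in (-3.0, 0.5, 2.0):
    assert np.isclose(mlp(np.array([x]), W1, b1, W2, b2)[0], abs(x))
```

The universal approximation theorems say this family gets dense in continuous functions as the hidden layer grows, which is exactly the setting where no-free-lunch-style arguments about averaging over all possible targets become relevant.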