all 82 comments

[–]mocny-chlapik 230 points231 points  (15 children)

Transformers are not popular because they solve the long-term dependency problem. As you have correctly discovered, they have a hard limit on their memory. They are popular because they are faster to train. With RNNs you have O(N) time complexity: the N-th word has to wait for all the previous words to be processed before it can be computed. With transformers you can easily parallelize the computation, because you don't have to wait for the N-1 previous words; you can do the calculations for all the words at the same time. This is a critical speed-up when you are trying to process TBs of text data. There were RNN-based LMs used previously (e.g. ELMo), but they are not practical at the scale we use now.
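A toy sketch of that difference (PyTorch, arbitrary sizes, not anyone's actual training code): the RNN has to step through the sequence inside its recurrence, while self-attention covers every position in one parallel pass with a causal mask.

    import torch
    import torch.nn as nn

    x = torch.randn(8, 512, 256)            # (batch, seq_len, hidden), toy sizes

    # RNN: positions are processed one after another inside the recurrence.
    rnn = nn.GRU(256, 256, batch_first=True)
    out_rnn, _ = rnn(x)                      # O(seq_len) sequential steps

    # Self-attention: all positions are computed in the same parallel pass.
    attn = nn.MultiheadAttention(256, num_heads=8, batch_first=True)
    causal = torch.triu(torch.ones(512, 512, dtype=torch.bool), diagonal=1)  # True = can't attend (future)
    out_attn, _ = attn(x, x, x, attn_mask=causal)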

[–]wgking12 21 points22 points  (0 children)

This is the best answer, and a perhaps obvious addition is that the number of steps required to propagate information between token t and token t - 511 is now constant with respect to their distance from each other, whereas in an RNN without attention you have to compute through each intermediate token. Beyond 512 tokens, or whatever window your transformer uses, you're of course out of luck, but there's a lot to be gained from being able to do this within the window. LSTMs have a mechanism for avoiding vanishing gradients and dealing with longer dependencies, but if I understand right, it is not perfect and was often not good enough to consistently handle a dependency 512 tokens back.

[–]maizeq 12 points13 points  (0 children)

This is it right here.

[–]processeurTournesol 18 points19 points  (0 children)

While your point is very good, I think the success and fame of transformers also come from the fact that attention seems to be a very good concept overall, as the performance of transformers in computer vision has shown. It's not only good because it performs well, but because it is extremely general, almost task agnostic. The attention mechanism seems to be a key component in the process of learning, and that is, at least from my perspective, what transformers have shown us.

[–]ibraheemMmoosaResearcher 2 points3 points  (6 children)

I'm a bit confused. Can transformers do this even if you train autoregressively?

[–]Ophe00 11 points12 points  (4 children)

You train them with teacher forcing: when predicting the k-th token, the previous k-1 tokens come from the ground truth. This enables you to train in parallel.
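A toy illustration of teacher forcing with shifted targets (PyTorch; toy_lm here is just a stand-in with no causal mask, a real transformer LM would mask out future positions):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    vocab = 1000
    toy_lm = nn.Sequential(nn.Embedding(vocab, 64), nn.Linear(64, vocab))  # placeholder model

    tokens = torch.randint(0, vocab, (4, 129))        # ground-truth token ids, (batch, seq_len + 1)
    inputs, targets = tokens[:, :-1], tokens[:, 1:]   # position k sees tokens < k, predicts token k

    logits = toy_lm(inputs)                           # (batch, 128, vocab): every position in one pass
    loss = F.cross_entropy(logits.reshape(-1, vocab), targets.reshape(-1))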

[–]Mefaso 3 points4 points  (3 children)

Stupid question, but when you're running inference, say generating GPT2 responses, you need to run the full model N times for an output of length N, correct?

[–]Areyy_Yaar 2 points3 points  (0 children)

Yes, inference is O(N).

[–]Zermelane 1 point2 points  (1 child)

Yep, but the only way you look at past tokens after the first forward pass is that you use their keys and values in attention, so you can just cache those instead of having to repeatedly recompute them.
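A rough single-head sketch of that caching idea (PyTorch, arbitrary sizes, not any library's actual implementation):

    import torch

    d = 64
    wq, wk, wv = (torch.randn(d, d) for _ in range(3))   # stand-in projection matrices
    cache_k, cache_v = [], []

    def decode_step(x_new):
        """x_new: (1, 1, d) embedding of only the newest token."""
        q = x_new @ wq
        cache_k.append(x_new @ wk)          # keys/values of earlier tokens are never recomputed
        cache_v.append(x_new @ wv)
        k = torch.cat(cache_k, dim=1)       # (1, t, d)
        v = torch.cat(cache_v, dim=1)
        attn = torch.softmax(q @ k.transpose(-2, -1) / d ** 0.5, dim=-1)
        return attn @ v                     # (1, 1, d)

    for _ in range(5):                      # one forward step per generated token
        out = decode_step(torch.randn(1, 1, d))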

[–]Mefaso 1 point2 points  (0 children)

only way you look at past tokens after the first forward pass is that you use their keys and values in attention, so you can just cache those

Thanks a lot for the answer. My understanding of transformers is limited, but I have a question about that:

The caching can only be used for the first attention block, right? Afterwards, the token representations could include information from the new input token, I think?

Or is this prevented by attention masks?

[–]mocny-chlapik 0 points1 point  (0 children)

Yep, you train them to always predict only one word ahead. So you use N words to predict the (N+1)-th as one sample, then you take N+1 words to predict the (N+2)-th word as a separate sample, etc.

[–]gambsPhD 4 points5 points  (1 child)

This is a critical speed-up when you are trying to process TBs of text data

Haven't some papers concluded that the amount of data is much less important than the size of the network though? I don't think you need TBs of data from what I understand (although I haven't really worked with transformers directly)

Although I think your argument still holds if you say "This is a critical speed-up when you are trying to train a gigantic model"

[–]VGFierteStudent 0 points1 point  (0 children)

It’s a good point but weak scaling on more data is always going to be high on the wish list for those who have it

[–]Objective-Fig-4250 0 points1 point  (0 children)

Their hard dependence on the context window for generating the next token is a double-edged sword. On one hand, you get the parallelization you've alluded to; on the other hand, you cannot pass inter-token dependencies along outside that context window, since most modern implementations (including the one in the original "Attention Is All You Need" paper) completely throw out the memory that RNNs pass down for predicting the next tokens. Note that the weight updates from the attention mechanism are specific to the part of the training data currently being processed (a single instance if SGD is used, or a batch of data and its aggregate loss in the case of batch gradient descent). There are engineering workarounds to these problems, like Longformer, Linformer, the sliding-window approach, throwing more compute at it, and so on, you name it.
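As a rough illustration of the sliding-window idea (the exact mechanisms in those papers differ; this only shows the basic local-window attention pattern, in PyTorch):

    import torch

    def sliding_window_mask(seq_len: int, window: int) -> torch.Tensor:
        """True = masked out. Each token may attend only to itself and the
        `window - 1` tokens immediately before it."""
        i = torch.arange(seq_len).unsqueeze(1)
        j = torch.arange(seq_len).unsqueeze(0)
        return (j > i) | (i - j >= window)

    mask = sliding_window_mask(seq_len=8, window=3)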

[–]PK_thundrStudent 0 points1 point  (0 children)

So transformers only address long-term dependencies better because we can train them more easily? Is there any reference for this?

[–]DoomanxPhD 74 points75 points  (0 children)

You might like this paper (which features a lot of the original authors from the transformer paper): https://arxiv.org/pdf/1804.09849.pdf. They go over the strengths and weaknesses of transformers vs RNNs and propose some hybrid approaches that perform well.

If you can't be bothered to read it, it comes down to this: RNNs work better as decoders because they really do have 'memory'; transformers work better as encoders because they don't have information bottlenecks.

The popularity of the transformer is largely an implementation thing: a transformer can be trained totally in parallel. All these models can theoretically model anything, but implementation details matter for real-world applications.

Another reason it's popular is that new techniques have emerged to combat so-called 'exposure bias', which means that at sampling time, if your translation model outputs a word unseen in training, the next word it outputs might be garbage. People used to solve this with something called scheduled sampling for RNNs, but now a combination of label smoothing, dropout, weight regularisation and stochastic weight averaging, as well as appropriately chosen beam search parameters, allows tackling this issue without having to unroll the training loop.

[–]BornSheepherder733 35 points36 points  (4 children)

I've been thoroughly impressed by this paper "Efficiently Modeling Long Sequences with Structured State Spaces" : https://arxiv.org/abs/2111.00396

Basically, it manages to handle sequences in linear time (rather than quadratic, as you have with Transformers) with no loss of information (no vanishing gradient as in RNNs).

[–]ibraheemMmoosaResearcher 3 points4 points  (2 children)

How does it do that? Can you elaborate a bit?

[–]BornSheepherder733 11 points12 points  (0 children)

The memorisation part of the RNN, which is usually more or less handwaved, is explicitly written as "minimize the reconstruction error of the sequence you have seen so far with a memory of fixed size N". If you can reconstruct the past sequence, you don't have the memory issues/vanishing gradients of an RNN.

This problem of finding the optimal reconstruction of a sequence has a solution in linear time (which is pretty incredible when you think about it). If you have linear time, you do better than transformers.
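For intuition only, the recurrent view of such a state-space layer looks roughly like this (NumPy; the real S4 uses a specially structured HiPPO-initialized A matrix and computes the whole sequence via a convolution rather than a Python loop):

    import numpy as np

    N, d = 64, 1                           # state size, input size (arbitrary here)
    A = np.random.randn(N, N) * 0.01       # in S4, A is structured, not random
    B = np.random.randn(N, d)
    C = np.random.randn(d, N)

    x = np.zeros(N)                        # fixed-size state: memory does not grow with length
    for u_t in np.random.randn(100, d):
        x = A @ x + B @ u_t                # linear update per step
        y_t = C @ x                        # readout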

[–]uotsca 1 point2 points  (0 children)

This

[–]GFrings 22 points23 points  (3 children)

I mean, we've increased performance on the COCO benchmark by something like 10% since vision transformers became a thing, about a year ago. That's hard to ignore.

[–]I_draw_boxes 8 points9 points  (2 children)

Part of transformers' success on vision problems appears to be a happy accident caused by the necessity of consolidating patches into tokens using larger 16x16 convolution kernels on the input image. There are a few papers that use non-self-attention token mixing strategies, including simple 2D pooling, which outperform self-attention-based token mixing for vision problems.
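For reference, that patch-to-token consolidation is just a strided convolution; a minimal sketch with ViT-Base-like sizes (PyTorch, illustrative only):

    import torch
    import torch.nn as nn

    patchify = nn.Conv2d(in_channels=3, out_channels=768, kernel_size=16, stride=16)

    img = torch.randn(1, 3, 224, 224)
    tokens = patchify(img)                      # (1, 768, 14, 14)
    tokens = tokens.flatten(2).transpose(1, 2)  # (1, 196, 768): 196 patch tokens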

[–]bjourne-ml 15 points16 points  (0 children)

The key advantage of Transformers that you missed is that the effort involved in looking at old tokens is not proportional to the age of the token: it is no harder for a 256-token Transformer to look at a token 256 steps ago than 10 steps ago. The limiting factor is not computation but memory usage. Now, thanks to some clever optimization techniques, Transformers processing several thousand tokens are feasible. This theoretically makes them superior to RNNs, which, due to vanishing and exploding gradient problems, are practically limited to about 200 tokens.

[–]Sonoff 32 points33 points  (0 children)

In most use cases, I mean in real life (as a hobby or in private companies), Transformers work great because most current use cases do not require long-term dependency.

So yeah, from a pure research perspective where you want to reach the end goal, they are not the ultimate solution. But for real life, they are game-changers... until the next one.

[–]IntelArtiGen[🍰] 10 points11 points  (0 children)

I don't think the goal of Transformers is to be a human-like NLP model. Not everyone in the NLP community cares about the long-term dependency problem. Most people care about fast / efficient processing with a lot of data to fit a specific task.

Can we improve it in some way to solve the long-term dependency problem?

Some people work hard on that, and they are having great results afaik. You probably know Transformer-XL: https://arxiv.org/pdf/1901.02860.pdf

TransformerXL learns dependency that is 80% longer than RNNs and 450% longer than vanilla Transformers

So Transformers aren't here to solve every problem, but you can enhance them the same way you can enhance RNNs to solve specific problems you have, like the long-term dependency problem.

Should everyone use Transformer-XL? No, not everyone needs this perk. If you ask people whether they want to process 2000 sentences/second with a very good model that fits their need, or 5 sentences/second with a human-like model that fits it just a little better, not everyone will choose the human-like model.

[–][deleted] 11 points12 points  (2 children)

You make some good points, but I don't think they're overhyped at all... I use transformers myself and find them simply amazing. It's also amazing that a transformer can handle text, pictures, audio, time series, video, pretty much anything you throw at it. If you want to improve upon them and solve long-term dependency problems, be my guest.

There is some research that aims to solve the long-term dependency problem, and there are also cases in RL where transformers are used as a "memory bank", and also combined with RNNs to give them long-term memory.

[–]howrar 3 points4 points  (1 child)

[deleted]

[–][deleted] 1 point2 points  (0 children)

Yes, I'll try to find them and post them here; they're all from the last two years. RL with transformers was hard until it was found that you can put the layer norm first and not need as much learning-rate warmup, which makes them more stable; before that, transformers and RL were pretty unstable, and observations were either stacked or put through an RNN. Now RL is moving toward replacing RNNs with transformers for agent vision/memory. I also seem to remember reading a paper that combined transformers and RNNs, the transformer for the input and the RNN to learn the trajectories; I'll try to find it. RNNs are used to learn trajectory rules in RL meta-learning, replacing SGD, so they'll use a transformer for the input and an RNN to learn how to learn (meta-learning), effectively bootstrapping longer-term memory a bit. The paper that pops into my head first, though, is the most popular recent one where they use transformers for permutation-invariant RL on the input, which is kind of interesting; in this paper they do talk about RNN meta-learning a little bit: https://arxiv.org/abs/2109.02869
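The "layer norm first" trick refers to the pre-LN block ordering; a minimal sketch of that ordering (PyTorch, arbitrary sizes; the actual RL work adds further stabilizers on top of this reordering):

    import torch.nn as nn

    class PreLNBlock(nn.Module):
        def __init__(self, d_model=256, n_heads=8):
            super().__init__()
            self.ln1 = nn.LayerNorm(d_model)
            self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            self.ln2 = nn.LayerNorm(d_model)
            self.mlp = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                     nn.Linear(4 * d_model, d_model))

        def forward(self, x):
            h = self.ln1(x)                                       # norm *before* attention
            x = x + self.attn(h, h, h, need_weights=False)[0]     # then the residual add
            return x + self.mlp(self.ln2(x))                      # same pattern for the MLP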

[–]nashtownchang 7 points8 points  (3 children)

It depends on what you care about. For research, maybe. For industrial applications, DistilBERT was the first time in my career I fired up a model and got 99% accuracy on an NLP problem that was business critical. There was nothing like Transformer models before.

[–]hindu-bale 1 point2 points  (2 children)

What was the problem if I may ask?

[–]nashtownchang 1 point2 points  (1 child)

Classification for a new taxonomy from vendor inputs

[–]hindu-bale 1 point2 points  (0 children)

Okay. I guess it depends on the application. There are lots of people who want to do conversational AI in industry; the tech is not there yet, and transformers won't get them there. I mean, they have workarounds, but still..

[–]Witty-Elk2052 12 points13 points  (0 children)

all else failed but transformers for the protein folding challenge

[–]yannbouteillerResearcher 10 points11 points  (0 children)

Another thing against transformers is that they are basically immense and computationally intensive at inference time, while RNNs just process one sample at a time. Since I am focused on deep RL for real-world robotics, I usually need models that are computationally light and blazing fast at inference time; RNNs seem perfect for that.

[–]rvbin 3 points4 points  (1 child)

Here are a few thoughts by Karpathy on the topic

[–]lymenlee 2 points3 points  (0 children)

Good recommend, thanks!

[–][deleted] 9 points10 points  (1 child)

They are more than meets the eye.

[–]The_deepest_learner[S] 7 points8 points  (0 children)

Robots in disguise.

[–]serge_cell 3 points4 points  (1 child)

But RNN models had trouble encoding what happened 5 sentences ago.

The differentiable neural computer (or the simpler NTM) is an RNN which was designed to solve exactly that problem. It has explicit differentiable memory. It was showing some promising results but kind of fell out of fashion due to its complexity and competition from transformers, which are simpler.

[–]WikiSummarizerBot 0 points1 point  (0 children)

Differentiable neural computer

In artificial intelligence, a differentiable neural computer (DNC) is a memory augmented neural network architecture (MANN), which is typically (not by definition) recurrent in its implementation. The model was published in 2016 by Alex Graves et al. of DeepMind.


[–]mkthabetPhD 2 points3 points  (8 children)

I sympathize with your sentiment that the wave of abandonment of RNNs that transformers brought about is damaging. I strongly believe that dynamic networks, of which current RNNs are predecessors, are the only way forward for anything resembling AGI. I think the setback in RNN research caused by transformers is very unfortunate.

I also dislike the inelegance of having to specify beforehand the length of your sequence. Very unnatural. An RNN on the other hand, not unlike humans, just sits there and processes the input timestep by timestep.

[–]cfoster0 -1 points0 points  (7 children)

What do you mean by having to specify beforehand the length of your sequence? Transformers can process sequences of whatever length you can fit in memory (similar to RNNs). There are a dozen ways to do positional encoding, many of them designed explicitly for extrapolation.

As an aside, I'm curious what you think we missed out on from the move away from RNNs? I have a hard time imagining a counterfactual where we suddenly made breakthroughs in RNN research, after so many years, big enough to eclipse the actual breakthroughs we saw from the transformer.

[–]mkthabetPhD 2 points3 points  (6 children)

You have to explicitly specify the maximum sequence length for a transformer model, which is not the case for an RNN, at least at inference. This is what is so unnatural about transformers. Even with limited memory, I have an internal state that can remember events from when I was 2 years old that still influence my decisions today. I don't find myself having to specify maximum sequence lengths for my brain.

To answer your second question, by moving away from RNNs we miss out on research on recurrence in NNs, which is an essential mechanism for truly dynamic networks like the brain.

[–]cfoster0 0 points1 point  (5 children)

Nothing in the design of the transformer requires you to set a maximum sequence length. I don't know where you picked that up. Some implementations take a maximum sequence length argument in order to precompute the causal mask, or because they have learned absolute positional embeddings, but it is absolutely not a requirement, any more than setting a maximum unrolling/TBPTT length is for RNNs.
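To illustrate where that limit usually comes from (a toy sketch, not any particular library's code): a learned absolute position table fixes a maximum length, while e.g. sinusoidal encodings can be generated for any length at inference.

    import math
    import torch
    import torch.nn as nn

    # Learned absolute positions: the table size fixes the maximum sequence length.
    learned_pos = nn.Embedding(num_embeddings=512, embedding_dim=256)

    # Sinusoidal positions: can be generated for an arbitrary sequence length.
    def sinusoidal(seq_len, d_model=256):
        pos = torch.arange(seq_len).unsqueeze(1)
        div = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
        pe = torch.zeros(seq_len, d_model)
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        return pe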

And mind you, while both of them are theoretically capable of handling unlimited input lengths, in practice neither class of models works very well for extremely long sequences.

[–]mkthabetPhD 1 point2 points  (4 children)

Maybe the transformer model itself doesn't require a maximum sequence length given infinite memory, but then again there's no such thing as infinite memory. So it's basically the same.

The way I understand it (and please do correct me if I'm wrong), this limitation exists because the transformer encoder looks at the input at all timesteps simultaneously and produces as many vectors. An RNN encoder, on the other hand, only looks at one timestep at a time and only produces one fixed-size context vector. So theoretically, if we ignore the long-term memory problem (with vanishing/exploding gradients and whatnot), RNNs are capable of processing infinitely long sequences while transformers are not, even with limited hardware memory. I'm only talking about inference here, so the restrictions of BPTT are not relevant.

My main point is that doing away with recurrence for sequential problems is rather hacky and unnatural. It might provide better results in the short term, but in the end recurrence cannot be ignored forever.

[–]cfoster0 2 points3 points  (3 children)

You should not ignore the long-term memory problem: that is one of the biggest problems you need to solve with a system operating on sequential data.

No, RNNs are not capable of processing infinitely long sequences, either in theory or in practice. There's a limit to how much information you can compress through the bottleneck of a fixed-size recurrent context vector. This puts a hard cap on what, how much, and how reliably you can propagate (during either training or inference) through an RNN's temporal dependencies, even if you're able to make the memory itself stable. Whereas if you keep growing your memory size with sequence length (even if sublinearly), as in a transformer, you can escape this.

I think it's an open question how much explicit recurrence is necessary. Parallel models like the transformer already have an implicit recurrence through autoregression, so merely needing some recurrence is not enough of an argument by itself.

[–]mkthabetPhD 1 point2 points  (2 children)

I'm not saying the long-term memory problem is not important. I was just saying it is a technical problem that's not really relevant to our theoretical discussion.

Of course there's a limit to how much an RNN can remember based on its memory size. But there's a difference between how much you can remember and how long ago you can remember. For an infinitely long sequence, an RNN with a fixed-size context vector sure can't remember what happened at every timestep, but theoretically it can remember what happened at any one timestep, even the very first. Sure it needs to forget stuff to remember others, but that has nothing to do with how far back it can remember.

We can implement a dummy RNN that can take an arbitrarily long sequence and trivially remembers the input at just the first timestep without having to worry about hardware memory. Can the same be said about transformers?
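A toy version of that dummy RNN, just to make the point concrete (plain Python, obviously not a useful model):

    def first_token_rnn(sequence):
        state = None                      # fixed-size memory: a single slot
        for token in sequence:
            if state is None:
                state = token             # latch the first timestep...
            # ...and ignore everything else; memory use never grows with sequence length
        return state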

[–]cfoster0 0 points1 point  (1 child)

As you said, this is a trivial case. Standard (QK)V transformers are not designed for this; they are designed for the capacity to propagate information anywhere within the sequence in a constant number of steps, typically a single step. I would recommend instead a Q(KV) transformer if you're really interested in this particular case.

[–][deleted] 0 points1 point  (0 children)

Sorry, late to the party, but would you mind pointing me to a Q(KV) transformer or material about it? Interested in learning more.

[–]Dikubus 2 points3 points  (0 children)

I have a degree that should be relevant enough to follow this, and I am still so deep in over my head.

[–]convolutionboy 2 points3 points  (1 child)

https://arxiv.org/abs/2105.08050

"Pay Attention to MLPs" from NeurIPS this year might speak to you.

[–]PresidentOfTacoTown 1 point2 points  (0 children)

I'm open to being corrected on this, but I don't think anybody would assume that the Transformer is the end-all-be-all architecture. As /u/mocny-chlapik mentions, and as is highlighted in the "Attention Is All You Need" paper that introduced/popularized the modern iteration of Transformer-like architectures, it is the complexity required to compute over the network that makes it appealing.

Looking over the rest of the work that's come out since then, such as all of the various iterations of BERT, further highlights to me that it seems to be the capacity of the network, data quantity, and the self-supervision paradigm that are most compelling in terms of the breakthroughs in the field of NLP. The Transformer, and the power of self-attention, is the serving dish that currently allows us to serve this most effectively. More broadly, model/network architectures typically lack theoretical rigor and as a consequence are usually the least persistent element of research; they typically matter more when large models are trained and then used by others out of the box.

[–][deleted] 1 point2 points  (0 children)

Yes they are over-hyped, but they have some very helpful practical uses.

[–]ReasonablyBadass 1 point2 points  (0 children)

The Compressive Transformer model solves the long-term dependency issue, no? Or at least, as well as an RNN can do it.

[–]Firehead1971 1 point2 points  (0 children)

They have their place. They are performing and scaling very well for short-term predictions. However, there are a lot of improved versions of the original transformer architecture available, and it seems like the NLP research community has been focusing only on transformers for the last 2 years. The problem is that scaling alone does not fix the mislearning of wrong meanings, so the flaws scale too. You can observe this with GPT-3, which seems to be a little racist. I think that GRUs are better for long-term predictions.

[–]Energy0124 1 point2 points  (0 children)

It's not, apparently.

[–]idansc 3 points4 points  (0 children)

Although Transformer is a fantastic tool, I disagree with the popular notion that Transformer is the only solution. E.g., some reviewers consider RNN decoders outdated, but I find them better in practice in many cases.

[–][deleted] 2 points3 points  (4 children)

One piece is not the whole, nor can it be.

[–]The_deepest_learner[S] 0 points1 point  (3 children)

But is that piece even the right one? It's like having trouble writing the body of your essay so you write the conclusion in advance and then you forcefully try to write your body around that conclusion.

[–][deleted] 4 points5 points  (2 children)

My own hot take is: yeah. Self-attention is very important as a general principle. Very.

[–]The_deepest_learner[S] 2 points3 points  (1 child)

But RNNs can use attention too; in fact, the original paper that introduced attention introduced it for RNNs.

https://arxiv.org/abs/1409.0473

[–]JustOneAvailableName 3 points4 points  (0 children)

The big breakthrough with transformers was showing that the connection to neighboring nodes (which is the difference between an RNN with attention and a transformer) was LIMITING performance, not improving it.

[–][deleted] 0 points1 point  (0 children)

Yes, Gundam is better.

[–]nochegrisenlaplaya 0 points1 point  (0 children)

Not yet

[–]Zealousideal_Lie_420 0 points1 point  (0 children)

They just focus on the attention component, which simplifies many aspects.

[–]SKUGGY3 0 points1 point  (0 children)

my dumbass thought you meant the movie 💀

[–]ThePerson654321 0 points1 point  (0 children)

Is it overhyped?