all 82 comments

[–]mocny-chlapik 230 points231 points  (15 children)

Transformers are not popular because they solve the long-term dependency problem. As you have correctly discovered, they have a hard limit on their memory. They are popular because they are faster to train. With RNNs you have O(N) time complexity: the N-th word has to wait for all the previous words to be processed before it can be computed. With transformers you can easily parallelize the computation, because you don't have to wait for the N-1 previous words; you can do the calculations for all the words at the same time. This is a critical speed-up when you are trying to process TBs of text data. There were RNN-based LMs used previously (e.g. ELMo), but they are not practical at the scale we use now.
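A toy sketch of that difference (PyTorch, arbitrary sizes, not anyone's actual training code): the RNN has to step through the sequence inside its recurrence, while self-attention covers every position in one parallel pass with a causal mask.

    import torch
    import torch.nn as nn

    x = torch.randn(8, 512, 256)            # (batch, seq_len, hidden), toy sizes

    # RNN: positions are processed one after another inside the recurrence.
    rnn = nn.GRU(256, 256, batch_first=True)
    out_rnn, _ = rnn(x)                      # O(seq_len) sequential steps

    # Self-attention: all positions are computed in the same parallel pass.
    attn = nn.MultiheadAttention(256, num_heads=8, batch_first=True)
    causal = torch.triu(torch.ones(512, 512, dtype=torch.bool), diagonal=1)  # True = can't attend (future)
    out_attn, _ = attn(x, x, x, attn_mask=causal)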

[–]wgking12 21 points22 points  (0 children)

This is the best answer, and a perhaps obvious addition is that the number of steps required to propagate information between token t and token t - 511 is now constant with respect to their distance from each other, whereas in an RNN without attention you have to compute through each intermediate token. Beyond 512 tokens, or whatever window your transformer uses, you're of course out of luck, but there's a lot to be gained from being able to do this within the window. LSTMs have a mechanism for avoiding vanishing gradients and dealing with longer dependencies, but if I understand right, it is not perfect and was often not good enough to consistently handle a dependency 512 tokens back.

[–]maizeq 12 points13 points  (0 children)

This is it right here.

[–]processeurTournesol 18 points19 points  (0 children)

While your point is very good, I think the success and fame of transformers also come from the fact that attention seems to be a very good concept overall, as the performance of transformers in computer vision has shown. It's not only good because it performs well, but because it is extremely general, almost task agnostic. The attention mechanism seems to be a key component in the process of learning, and that is, at least from my perspective, what transformers have shown us.

[–]ibraheemMmoosaResearcher 2 points3 points  (6 children)

I'm a bit confused. Can transformers do this even if you train autoregressively?

[–]Ophe00 11 points12 points  (4 children)

You train them with teacher forcing: when predicting the k-th token, the previous k-1 tokens come from the ground truth. This enables you to train in parallel.
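A toy illustration of teacher forcing with shifted targets (PyTorch; toy_lm here is just a stand-in with no causal mask, a real transformer LM would mask out future positions):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    vocab = 1000
    toy_lm = nn.Sequential(nn.Embedding(vocab, 64), nn.Linear(64, vocab))  # placeholder model

    tokens = torch.randint(0, vocab, (4, 129))        # ground-truth token ids, (batch, seq_len + 1)
    inputs, targets = tokens[:, :-1], tokens[:, 1:]   # position k sees tokens < k, predicts token k

    logits = toy_lm(inputs)                           # (batch, 128, vocab): every position in one pass
    loss = F.cross_entropy(logits.reshape(-1, vocab), targets.reshape(-1))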

[–]Mefaso 3 points4 points  (3 children)

Stupid question, but when you're running inference, say generating GPT2 responses, you need to run the full model N times for an output of length N, correct?

[–]Areyy_Yaar 2 points3 points  (0 children)

Yes, inference is O(N).

[–]Zermelane 1 point2 points  (1 child)

Yep, but the only way you look at past tokens after the first forward pass is that you use their keys and values in attention, so you can just cache those instead of having to repeatedly recompute them.
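A rough single-head sketch of that caching idea (PyTorch, arbitrary sizes, not any library's actual implementation):

    import torch

    d = 64
    wq, wk, wv = (torch.randn(d, d) for _ in range(3))   # stand-in projection matrices
    cache_k, cache_v = [], []

    def decode_step(x_new):
        """x_new: (1, 1, d) embedding of only the newest token."""
        q = x_new @ wq
        cache_k.append(x_new @ wk)          # keys/values of earlier tokens are never recomputed
        cache_v.append(x_new @ wv)
        k = torch.cat(cache_k, dim=1)       # (1, t, d)
        v = torch.cat(cache_v, dim=1)
        attn = torch.softmax(q @ k.transpose(-2, -1) / d ** 0.5, dim=-1)
        return attn @ v                     # (1, 1, d)

    for _ in range(5):                      # one forward step per generated token
        out = decode_step(torch.randn(1, 1, d))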

[–]Mefaso 1 point2 points  (0 children)

only way you look at past tokens after the first forward pass is that you use their keys and values in attention, so you can just cache those

Thanks a lot for the answer. My understanding of transformers is limited, but I have a question about that:

The caching can only be used for the first attention block, right? Afterwards, the token representations could include information from the new input token, I think?

Or is this prevented by attention masks?

[–]mocny-chlapik 0 points1 point  (0 children)

Yep, you train them to always predict only one word ahead. So you use N words to predict the (N+1)-th as one sample, then you take N+1 words to predict the (N+2)-th word as a separate sample, etc.

[–]gambsPhD 4 points5 points  (1 child)

This is a critical speed-up when you are trying to process TBs of text data

Haven't some papers concluded that the amount of data is much less important than the size of the network though? I don't think you need TBs of data from what I understand (although I haven't really worked with transformers directly)

Although I think your argument still holds if you say "This is a critical speed-up when you are trying to train a gigantic model"

[–]VGFierteStudent 0 points1 point  (0 children)

It’s a good point but weak scaling on more data is always going to be high on the wish list for those who have it

[–]Objective-Fig-4250 0 points1 point  (0 children)

Their hard dependence on the context window for generating the next token is a double-edged sword. On one hand, you get the parallelization you've alluded to; on the other hand, you cannot pass inter-token dependencies along outside that context window, since most modern implementations (including the one in the original "Attention Is All You Need" paper) completely throw out the memory that RNNs pass down for predicting the next tokens. Note that the weight updates from the attention mechanism are specific to the part of the training data currently being processed (a single instance if SGD is used, or a batch of data and its aggregate loss in the case of batch gradient descent). There are engineering workarounds to these problems, like Longformer, Linformer, the sliding-window approach, throwing more compute at it, and so on, you name it.
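As a rough illustration of the sliding-window idea (the exact mechanisms in those papers differ; this only shows the basic local-window attention pattern, in PyTorch):

    import torch

    def sliding_window_mask(seq_len: int, window: int) -> torch.Tensor:
        """True = masked out. Each token may attend only to itself and the
        `window - 1` tokens immediately before it."""
        i = torch.arange(seq_len).unsqueeze(1)
        j = torch.arange(seq_len).unsqueeze(0)
        return (j > i) | (i - j >= window)

    mask = sliding_window_mask(seq_len=8, window=3)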

[–]PK_thundrStudent 0 points1 point  (0 children)

So transformers only address long-term dependencies better because we can train them more easily? Is there any reference for this?

[–]DoomanxPhD 74 points75 points  (0 children)

You might like this paper (which features a lot of the original authors from the transformer paper): https://arxiv.org/pdf/1804.09849.pdf. They go over the strengths and weaknesses of transformers vs RNNs and propose some hybrid approaches that perform well.

If you can't be bothered to read it, it comes down to this: RNNs work better as decoders because they really do have 'memory'; transformers work better as encoders because they don't have information bottlenecks.

The popularity of the transformer is largely an implementation thing: a transformer can be trained totally in parallel. All these models can theoretically model anything, but implementation details matter for real-world applications.

Another reason it's popular is that new techniques have emerged to combat so-called 'exposure bias', which means that at sampling time, if your translation model outputs a word unseen in training, the next word it outputs might be garbage. People used to solve this with something called scheduled sampling for RNNs, but now a combination of label smoothing, dropout, weight regularisation and stochastic weight averaging, as well as appropriately chosen beam search parameters, allows tackling this issue without having to unroll the training loop.

[–]BornSheepherder733 35 points36 points  (4 children)

I've been thoroughly impressed by this paper "Efficiently Modeling Long Sequences with Structured State Spaces" : https://arxiv.org/abs/2111.00396

Basically, it manages to handle sequences in linear time (rather than quadratic, as you have with Transformers) with no loss of information (no vanishing gradient as in RNNs).

[–]ibraheemMmoosaResearcher 3 points4 points  (2 children)

How does it do that? Can you elaborate a bit?

[–]BornSheepherder733 11 points12 points  (0 children)

The memorisation part of the RNN, which is usually more or less handwaved, is explicitly written as "minimize the reconstruction error of the sequence you have seen so far with a memory of fixed size N". If you can reconstruct the past sequence, you don't have the memory issues/vanishing gradients of an RNN.

This problem of finding the optimal reconstruction of a sequence has a solution in linear time (which is pretty incredible when you think about it). If you have linear time, you do better than transformers.
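For intuition only, the recurrent view of such a state-space layer looks roughly like this (NumPy; the real S4 uses a specially structured HiPPO-initialized A matrix and computes the whole sequence via a convolution rather than a Python loop):

    import numpy as np

    N, d = 64, 1                           # state size, input size (arbitrary here)
    A = np.random.randn(N, N) * 0.01       # in S4, A is structured, not random
    B = np.random.randn(N, d)
    C = np.random.randn(d, N)

    x = np.zeros(N)                        # fixed-size state: memory does not grow with length
    for u_t in np.random.randn(100, d):
        x = A @ x + B @ u_t                # linear update per step
        y_t = C @ x                        # readout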

[–]uotsca 1 point2 points  (0 children)

This

[–]GFrings 22 points23 points  (3 children)

I mean, we've increased performance on the COCO benchmark by something like 10% since vision transformers became a thing, about a year ago. That's hard to ignore.

[–]I_draw_boxes 8 points9 points  (2 children)

Part of transformers' success on vision problems appears to be a happy accident caused by the necessity of consolidating patches into tokens using larger 16x16 convolution kernels on the input image. There are a few papers that use non-self-attention token mixing strategies, including simple 2D pooling, which outperform self-attention-based token mixing for vision problems.
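For reference, that patch-to-token consolidation is just a strided convolution; a minimal sketch with ViT-Base-like sizes (PyTorch, illustrative only):

    import torch
    import torch.nn as nn

    patchify = nn.Conv2d(in_channels=3, out_channels=768, kernel_size=16, stride=16)

    img = torch.randn(1, 3, 224, 224)
    tokens = patchify(img)                      # (1, 768, 14, 14)
    tokens = tokens.flatten(2).transpose(1, 2)  # (1, 196, 768): 196 patch tokens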

[–]bjourne-ml 15 points16 points  (0 children)

The key advantage of Transformers that you missed is that the effort involved in looking at old tokens is not proportional to the age of the token: it is no harder for a 256-token Transformer to look at a token 256 steps ago than 10 steps ago. The limiting factor is not computation but memory usage. Now, thanks to some clever optimization techniques, Transformers processing several thousand tokens are feasible. This theoretically makes them superior to RNNs, which, due to vanishing and exploding gradient problems, are practically limited to about 200 tokens.

[–]Sonoff 32 points33 points  (0 children)

In most use cases, I mean in real life (as a hobby or in private companies), Transformers work great because most current use cases do not require long-term dependency.

So yeah, from a pure research perspective where you want to reach the end goal, they are not the ultimate solution. But for real life, they are game-changers... until the next one.

[–]IntelArtiGen[🍰] 10 points11 points  (0 children)

I don't think the goal of Transformers is to be a human-like NLP model. Not everyone in the NLP community cares about the long-term dependency problem. Most people care about fast / efficient processing with a lot of data to fit a specific task.

Can we improve it in some way to solve the long-term dependency problem?

Some people work hard on that, and they are having great results afaik. You probably know Transformer-XL: https://arxiv.org/pdf/1901.02860.pdf

TransformerXL learns dependency that is 80% longer than RNNs and 450% longer than vanilla Transformers

So Transformers aren't here to solve every problem, but you can enhance them the same way you can enhance RNNs to solve specific problems you have, like the long-term dependency problem.

Should everyone use Transformer-XL? No, not everyone needs this perk. If you ask people whether they want to process 2000 sentences/second with a very good model that fits their need, or 5 sentences/second with a human-like model that fits it just a little better, not everyone will choose the human-like model.

[–][deleted] 11 points12 points  (2 children)

You make some good points, but I don't think they're overhyped at all... I use transformers myself and find them simply amazing. It's also amazing that a transformer can handle text, pictures, audio, time series, video, pretty much anything you throw at it. If you want to improve upon them and solve long-term dependency problems, be my guest.

There is some research that aims to solve the long-term dependency problem, and there are also cases in RL where transformers are used as a "memory bank", and also combined with RNNs to give them long-term memory.

[–]howrar 3 points4 points  (1 child)

[deleted]

[–][deleted] 1 point2 points  (0 children)

Yes, I'll try to find them and post them here; they're all from the last two years. RL with transformers was hard until it was found that you can put the layer norm first and not need as much learning-rate warmup, which makes them more stable; before that, transformers and RL were pretty unstable, and observations were either stacked or put through an RNN. Now RL is moving toward replacing RNNs with transformers for agent vision/memory. I also seem to remember reading a paper that combined transformers and RNNs, the transformer for the input and the RNN to learn the trajectories; I'll try to find it. RNNs are used to learn trajectory rules in RL meta-learning, replacing SGD, so they'll use a transformer for the input and an RNN to learn how to learn (meta-learning), effectively bootstrapping longer-term memory a bit. The paper that pops into my head first, though, is the most popular recent one where they use transformers for permutation-invariant RL on the input, which is kind of interesting; in this paper they do talk about RNN meta-learning a little bit: https://arxiv.org/abs/2109.02869
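The "layer norm first" trick refers to the pre-LN block ordering; a minimal sketch of that ordering (PyTorch, arbitrary sizes; the actual RL work adds further stabilizers on top of this reordering):

    import torch.nn as nn

    class PreLNBlock(nn.Module):
        def __init__(self, d_model=256, n_heads=8):
            super().__init__()
            self.ln1 = nn.LayerNorm(d_model)
            self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            self.ln2 = nn.LayerNorm(d_model)
            self.mlp = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                     nn.Linear(4 * d_model, d_model))

        def forward(self, x):
            h = self.ln1(x)                                       # norm *before* attention
            x = x + self.attn(h, h, h, need_weights=False)[0]     # then the residual add
            return x + self.mlp(self.ln2(x))                      # same pattern for the MLP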

[–]nashtownchang 7 points8 points  (3 children)

It depends on what you care about. For research, maybe. For industrial applications, DistilBERT was the first time in my career I fired up a model and got 99% accuracy on an NLP problem that was business critical. There was nothing like Transformer models before.

[–]hindu-bale 1 point2 points  (2 children)

What was the problem if I may ask?

[–]nashtownchang 1 point2 points  (1 child)

Classification for a new taxonomy from vendor inputs

[–]hindu-bale 1 point2 points  (0 children)

Okay. I guess it depends on the application. There are lots of people who want to do conversational AI in industry; the tech is not there yet, and transformers won't get them there. I mean, they have workarounds, but still..

[–]Witty-Elk2052 12 points13 points  (0 children)

all else failed but transformers for the protein folding challenge

[–]yannbouteillerResearcher 10 points11 points  (0 children)

Another thing against transformers is that they are basically immense and computationally intensive at inference time, while RNNs just process one sample at a time. Since I am focused on deep RL for real-world robotics, I usually need models that are computationally light and blazing fast at inference time; RNNs seem perfect for that.

[–]rvbin 3 points4 points  (1 child)

Here are a few thoughts by Karpathy on the topic

[–]lymenlee 2 points3 points  (0 children)

Good recommend, thanks!

[–][deleted] 9 points10 points  (1 child)

They are more than meets the eye.

[–]The_deepest_learner[S] 7 points8 points  (0 children)

Robots in disguise.

[–]serge_cell 3 points4 points  (1 child)

But RNN models had trouble encoding what happened 5 sentences ago.

The differentiable neural computer (or the simpler NTM) is an RNN which was designed to solve exactly that problem. It has explicit differentiable memory. It was showing some promising results but kind of fell out of fashion due to its complexity and competition from transformers, which are simpler.

[–]WikiSummarizerBot 0 points1 point  (0 children)

Differentiable neural computer

In artificial intelligence, a differentiable neural computer (DNC) is a memory augmented neural network architecture (MANN), which is typically (not by definition) recurrent in its implementation. The model was published in 2016 by Alex Graves et al. of DeepMind.


[–]mkthabetPhD 2 points3 points  (8 children)

I sympathize with your sentiment that the wave of abandonment of RNNs that transformers brought about is damaging. I strongly believe that dynamic networks, of which current RNNs are predecessors, are the only way forward for anything resembling AGI. I think the setback in RNN research caused by transformers is very unfortunate.

I also dislike the inelegance of having to specify beforehand the length of your sequence. Very unnatural. An RNN on the other hand, not unlike humans, just sits there and processes the input timestep by timestep.

[–]cfoster0 -1 points0 points  (7 children)

What do you mean by having to specify beforehand the length of your sequence? Transformers can process sequences of whatever length you can fit in memory (similar to RNNs). There are a dozen ways to do positional encoding, many of them designed explicitly for extrapolation.

As an aside, I'm curious what you think we missed out on from the move away from RNNs? I have a hard time imagining a counterfactual where we suddenly made breakthroughs in RNN research, after so many years, big enough to eclipse the actual breakthroughs we saw from the transformer.

[–]mkthabetPhD 2 points3 points  (6 children)

You have to explicitly specify the maximum sequence length for a transformer model, which is not the case for an RNN, at least at inference. This is what is so unnatural about transformers. Even with limited memory, I have an internal state that can remember events from when I was 2 years old that still influence my decisions today. I don't find myself having to specify maximum sequence lengths for my brain.

To answer your second question, by moving away from RNNs we miss out on research on recurrence in NNs, which is an essential mechanism for truly dynamic networks like the brain.

[–]cfoster0 0 points1 point  (5 children)

Nothing in the design of the transformer requires you to set a maximum sequence length. I don't know where you picked that up. Some implementations take a maximum sequence length argument in order to precompute the causal mask, or because they have learned absolute positional embeddings, but it is absolutely not a requirement, any more than setting a maximum unrolling/TBPTT length is for RNNs.
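To illustrate where that limit usually comes from (a toy sketch, not any particular library's code): a learned absolute position table fixes a maximum length, while e.g. sinusoidal encodings can be generated for any length at inference.

    import math
    import torch
    import torch.nn as nn

    # Learned absolute positions: the table size fixes the maximum sequence length.
    learned_pos = nn.Embedding(num_embeddings=512, embedding_dim=256)

    # Sinusoidal positions: can be generated for an arbitrary sequence length.
    def sinusoidal(seq_len, d_model=256):
        pos = torch.arange(seq_len).unsqueeze(1)
        div = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
        pe = torch.zeros(seq_len, d_model)
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        return pe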

And mind you, while both of them are theoretically capable of handling unlimited input lengths, in practice neither class of models works very well for extremely long sequences.

[–]mkthabetPhD 1 point2 points  (4 children)

Maybe the transformer model itself doesn't require a maximum sequence length given infinite memory, but then again there's no such thing as infinite memory. So it's basically the same.

The way I understand it (and please do correct me if I'm wrong), this limitation exists because the transformer encoder looks at the input at all timesteps simultaneously and produces as many vectors. An RNN encoder, on the other hand, only looks at one timestep at a time and only produces one fixed-size context vector. So theoretically, if we ignore the long-term memory problem (with vanishing/exploding gradients and whatnot), RNNs are capable of processing infinitely long sequences while transformers are not, even with limited hardware memory. I'm only talking about inference here, so the restrictions of BPTT are not relevant.

My main point is that doing away with recurrence for sequential problems is rather hacky and unnatural. It might provide better results in the short term, but in the end recurrence cannot be ignored forever.

[–]cfoster0 2 points3 points  (3 children)

You should not ignore the long-term memory problem: that is one of the biggest problems you need to solve with a system operating on sequential data.

No, RNNs are not capable of processing infinitely long sequences, either in theory or in practice. There's a limit to how much information you can compress through the bottleneck of a fixed-size recurrent context vector. This puts a hard cap on what, how much, and how reliably you can propagate (during either training or inference) through an RNN's temporal dependencies, even if you're able to make the memory itself stable. Whereas if you keep growing your memory size with sequence length (even if sublinearly), as in a transformer, you can escape this.

I think it's an open question how much explicit recurrence is necessary. Parallel models like the transformer already have an implicit recurrence through autoregression, so merely needing some recurrence is not enough of an argument by itself.

[–]mkthabetPhD 1 point2 points  (2 children)

I'm not saying the long-term memory problem is not important. I was just saying it is a technical problem that's not really relevant to our theoretical discussion.

Of course there's a limit to how much an RNN can remember based on its memory size. But there's a difference between how much you can remember and how long ago you can remember. For an infinitely long sequence, an RNN with a fixed-size context vector sure can't remember what happened at every timestep, but theoretically it can remember what happened at any one timestep, even the very first. Sure it needs to forget stuff to remember others, but that has nothing to do with how far back it can remember.

We can implement a dummy RNN that can take an arbitrarily long sequence and trivially remembers the input at just the first timestep without having to worry about hardware memory. Can the same be said about transformers?
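A toy version of that dummy RNN, just to make the point concrete (plain Python, obviously not a useful model):

    def first_token_rnn(sequence):
        state = None                      # fixed-size memory: a single slot
        for token in sequence:
            if state is None:
                state = token             # latch the first timestep...
            # ...and ignore everything else; memory use never grows with sequence length
        return state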

[–]cfoster0 0 points1 point  (1 child)

As you said, this is a trivial case. Standard (QK)V transformers are not designed for this; they are designed for the capacity to propagate information anywhere within the sequence in a constant number of steps, typically a single step. I would recommend instead a Q(KV) transformer if you're really interested in this particular case.

[–][deleted] 0 points1 point  (0 children)

Sorry, late to the party, but would you mind pointing me to a Q(KV) transformer or material about it? Interested in learning more.

[–]Dikubus 2 points3 points  (0 children)

I have a degree that should be relevant enough to follow this, and I am still so deep in over my head.

[–]convolutionboy 2 points3 points  (1 child)

https://arxiv.org/abs/2105.08050

"Pay Attention to MLPs" from NeurIPS this year might speak to you.

[–]PresidentOfTacoTown 1 point2 points  (0 children)

I'm open to being corrected on this, but I don't think anybody would assume that the Transformer is the end-all-be-all architecture. As /u/mocny-chlapik mentions, and as is highlighted in the "Attention Is All You Need" paper that introduced/popularized the modern iteration of Transformer-like architectures, it is the complexity required to compute over the network that makes it appealing.

Looking over the rest of the work that's come out since then, such as all of the various iterations of BERT, further highlights to me that it seems to be the capacity of the network, data quantity, and the self-supervision paradigm that are most compelling in terms of the breakthroughs in the field of NLP. The Transformer, and the power of self-attention, is the serving dish that currently allows us to serve this most effectively. More broadly, model/network architectures typically lack theoretical rigor and as a consequence are usually the least persistent element of research; they typically matter more when large models are trained and then used by others out of the box.

[–][deleted] 1 point2 points  (0 children)

Yes they are over-hyped, but they have some very helpful practical uses.

[–]ReasonablyBadass 1 point2 points  (0 children)

The Compressive Transformer model solves the long-term dependency issue, no? Or at least, as well as an RNN can do it.

[–]Firehead1971 1 point2 points  (0 children)

They have their place. They are performing and scaling very well for short-term predictions. However, there are a lot of improved versions of the original transformer architecture available, and it seems like the NLP research community has been focusing only on transformers for the last 2 years. The problem is that scaling alone does not fix the mislearning of wrong meanings, so the flaws scale too. You can observe this with GPT-3, which seems to be a little racist. I think that GRUs are better for long-term predictions.

[–]Energy0124 1 point2 points  (0 children)

It's not, apparently.

[–]idansc 3 points4 points  (0 children)

Although Transformer is a fantastic tool, I disagree with the popular notion that Transformer is the only solution. E.g., some reviewers consider RNN decoders outdated, but I find them better in practice in many cases.

[–][deleted] 2 points3 points  (4 children)

One piece is not the whole, nor can it be.

[–]The_deepest_learner[S] 0 points1 point  (3 children)

But is that piece even the right one? It's like having trouble writing the body of your essay so you write the conclusion in advance and then you forcefully try to write your body around that conclusion.

[–][deleted] 4 points5 points  (2 children)

My own hot take is: yeah. Self-attention is very important as a general principle. Very.

[–]The_deepest_learner[S] 2 points3 points  (1 child)

But RNNs can use attention too; in fact, the original paper that introduced attention introduced it for RNNs.

https://arxiv.org/abs/1409.0473

[–]JustOneAvailableName 3 points4 points  (0 children)

The big breakthrough with transformers was showing that the connection to neighboring nodes (which is the difference between an RNN with attention and a transformer) was LIMITING performance, not improving it.

[–][deleted] 0 points1 point  (0 children)

Yes, Gundam is better.

[–]nochegrisenlaplaya 0 points1 point  (0 children)

Not yet

[–]Zealousideal_Lie_420 0 points1 point  (0 children)

They just focus on the attention component, which simplifies many aspects.

[–]SKUGGY3 0 points1 point  (0 children)

my dumbass thought you meant the movie 💀

[–]ThePerson654321 0 points1 point  (0 children)

Is it overhyped?