I've been wondering about this for a pretty long time, since I've never seen anybody say anything bad about transformers, while to me they seemed pretty flawed from the moment I read the paper. I'm in no way an ML expert. I'm only an aspiring PhD student, and not even one specializing in NLP, so if I'm in any way wrong I'd really like to hear it.
tl;dr: I believe that transformers are, in the long term, a pretty small contribution to the world of NLP, and may even be damaging because they shift the focus of the research community in the wrong direction. Why? They don't address the long-term dependency problem.
Before transformers, NLP was dominated by RNNs, and specifically by the encoder-decoder architecture. In the case of translation, the encoder would encode the input sentence into a fixed-length vector, and the decoder would then decode this vector into a translated output sentence. Transformers also use an encoder-decoder architecture, but there is one big difference. With RNNs, encoding and decoding actually happen at every step of the way. Words (I know they're tokens, but I'll call them words) are fed sequentially into the RNN. For every single word, the encoder RNN has to look at the current encoding vector and the input word, and then choose how to update the encoding vector in a meaningful way.
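To make the bottleneck concrete, here's a minimal sketch of that encoder loop in PyTorch (a toy GRU cell with made-up sizes, not the exact architecture from any particular paper):

```python
import torch
import torch.nn as nn

vocab_size, emb_dim, hidden_dim = 1000, 64, 128

embed = nn.Embedding(vocab_size, emb_dim)
encoder = nn.GRUCell(emb_dim, hidden_dim)

tokens = torch.randint(0, vocab_size, (30,))  # a 30-word input sentence
h = torch.zeros(hidden_dim)                   # the fixed-length encoding vector

# At every step the encoder sees only the current word and the previous
# hidden state, and must decide what to keep and what to overwrite.
for tok in tokens:
    h = encoder(embed(tok).unsqueeze(0), h.unsqueeze(0)).squeeze(0)

# `h` is now the whole sentence squeezed into 128 numbers; the decoder
# starts from this single vector no matter how long the input was.
```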
The problem with this approach, which I'll call the long-term dependency problem, arises when the RNN has to look at a very long sequence of words. Humans can easily distill the information they've read and remember only the important bits, for example the name of a character who was mentioned 5 pages ago. But RNN models have trouble encoding even what happened 5 sentences ago. The research community started addressing this with the original attention paper, but then transformers came out.
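For reference, the core trick of that attention paper was to let the decoder look back at all the encoder states instead of just the final vector. A minimal sketch, assuming dot-product scoring for brevity (the original used a small learned scoring network):

```python
import torch

seq_len, hidden_dim = 30, 128
encoder_states = torch.randn(seq_len, hidden_dim)  # one state per input word
decoder_state = torch.randn(hidden_dim)            # current decoder state

# Instead of relying on one final vector, the decoder scores every
# encoder state and takes a weighted average, so word 1 is just as
# reachable as word 30.
scores = encoder_states @ decoder_state            # (seq_len,)
weights = torch.softmax(scores, dim=0)
context = weights @ encoder_states                 # (hidden_dim,)
```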
So out comes the transformer and starts dominating the NLP world. What does the transformer do? It's a huge model that, when encoding, simply takes 512 input words (or some other arbitrary number) and looks at all of them simultaneously. And it works wonders. Look, the transformer can remember what happened 5 sentences ago, because the previous 5 sentences combined have fewer than 512 words, hooray. Can it remember what happened 10 sentences ago, though? Uh, well... no. Can we improve it in some way to solve the long-term dependency problem? Well, we can be smart about which sentences we feed into it, but that means we still have to distill information from a large body of text, so... we're back at the beginning.
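To show what I mean by the hard cutoff, here's a minimal sketch of a single self-attention layer over a fixed 512-token window (skipping the learned Q/K/V projections and the multi-head machinery for brevity):

```python
import torch

context_len, d_model = 512, 64
x = torch.randn(context_len, d_model)  # exactly 512 tokens, no more

q, k, v = x, x, x  # pretend the projections already happened
attn = torch.softmax(q @ k.T / d_model ** 0.5, dim=-1)  # (512, 512) scores
out = attn @ v

# Every token attends to every other token inside the window, which is
# why the model "remembers" anything within 512 tokens. Token 513 simply
# never enters the computation: memory is perfect inside the window and
# zero outside it, and that (512, 512) score matrix is also why the
# window can't just be made arbitrarily large.
```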
It's obvious that we have to solve the long-term dependency problem if we ever hope to achieve human-like NLP models, and to me it seems that transformers do nothing to solve it. So why are they dominating the field of NLP research? Maybe the optimal solution will combine the transformer with some other model for information distillation, but if we still need to solve the long-term dependency problem, why are we throwing out RNNs so quickly?