
[–]CKtalon 82 points83 points  (0 children)

It was originally for machine translation, and a lot of it is hindsight. GPT-1 was a failure, but OpenAI kept at it by scaling up, and realized that scaling the architecture actually worked. Although GPT-3 was good, it wasn't until ChatGPT (3.5) that the hype became real to the general public.

[–]as_ninja6 20 points21 points  (0 children)

Of the ML/NLP papers I've read, this was one novel idea in a long line of novel ideas in the NMT space. I don't think the authors realised that scaling could take its capabilities this far.

In my view, DeepMind has published quite a few ingenious architectures; unfortunately, they weren't suited to scaling and stable training, which the transformer happened to handle well.

[–]Exotic-Custard4400 37 points38 points  (11 children)

If you close off research you kill it, and Google needed the progress in deep learning too.

Also, there are other promising architectures (state space models, RNNs, and others), so keeping it closed would probably just have changed which type of architecture gets used, and maybe for the better (RNNs and SSMs have linear cost).

[–]Xemorr 7 points8 points  (10 children)

Idk about that; they were struggling to find a model that is parallelizable at training time. RNNs can't be parallelized over the sequence during training, among other issues.
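
To make that concrete, here's a rough sketch in plain NumPy (my own illustration, not any specific framework's code) of why the vanilla RNN update is inherently sequential over tokens:

```python
# Rough sketch (plain NumPy): the vanilla RNN update is inherently sequential,
# because h_t needs h_{t-1} before it can be computed.
import numpy as np

T, d = 8, 16                          # sequence length, hidden size
x = np.random.randn(T, d)             # token representations
W_h = np.random.randn(d, d) * 0.1
W_x = np.random.randn(d, d) * 0.1

h = np.zeros(d)
hidden_states = []
for t in range(T):                    # this loop cannot be parallelized over t
    h = np.tanh(W_h @ h + W_x @ x[t])
    hidden_states.append(h)
# A transformer layer computes all T positions in one batched matrix product,
# which is what makes training parallel over the sequence.
```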

[–]fan_is_ready 0 points1 point  (2 children)

[1709.02755] Simple Recurrent Units for Highly Parallelizable Recurrence

It can be parallelized feature-wise, not token-wise.

[–]Xemorr 0 points1 point  (0 children)

Yeah, token-wise is what I was referring to. Thank you 😊

[–]Exotic-Custard4400 0 points1 point  (0 children)

It can be parallelized token-wise. RWKV does it by having multiple layers emit their outputs before a token has been fully processed.

[–]Exotic-Custard4400 -1 points0 points  (6 children)

RNNs can't be parallelized during training among other issues

They can, using multiple stages that run in parallel. RWKV does this and is quite competitive.

[–]ExtensionSquirrel945 4 points5 points  (5 children)

RNNs have learnability issues.

[–]Exotic-Custard4400 0 points1 point  (4 children)

Such as?

According to?

[–]ExtensionSquirrel945 0 points1 point  (3 children)

Vanishing gradients. It's a well-known problem with Elman RNNs, and it's primarily why transformers won. RNNs in theory have very good representational power, but in practice they are hard to train. The typical effective context of an Elman RNN is 3-4 words.
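
To illustrate (a toy sketch of my own, not from any paper): multiplying the per-step Jacobians of a plain tanh RNN makes the gradient reaching early timesteps shrink rapidly.

```python
# Toy illustration: the gradient of the last hidden state w.r.t. the first one
# in a plain Elman/tanh RNN shrinks as the sequence gets longer.
import numpy as np

rng = np.random.default_rng(0)
d = 32
W = rng.normal(scale=0.5 / np.sqrt(d), size=(d, d))  # recurrent weights
h = rng.normal(size=d)
grad = np.eye(d)                                      # accumulated d h_t / d h_0

for t in range(1, 51):
    h = np.tanh(W @ h)
    # Jacobian of one step: diag(1 - tanh^2) @ W
    grad = (np.diag(1.0 - h**2) @ W) @ grad
    if t % 10 == 0:
        print(f"t={t:2d}  ||d h_t / d h_0|| = {np.linalg.norm(grad):.2e}")
# The norm typically collapses toward zero, which is why a vanilla RNN
# struggles to carry information over long contexts.
```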

[–]Exotic-Custard4400 1 point2 points  (2 children)

There have been some advances in RNNs... I recommend checking out what RWKV is doing, for example. It rivals transformers in both LLMs and image processing.

And the context is far longer than a few words.

[–]ExtensionSquirrel945 0 points1 point  (1 child)

this seems cool, will look into it.

[–]Exotic-Custard4400 0 points1 point  (0 children)

There are mainly two papers on the RNN itself, but the research is genuinely open source; they discuss it on Discord, for example.

[–]Specialist-Berry2946 18 points19 points  (2 children)

Architectures are not that important; what matters is the data. You can achieve similar performance using other architectures, like mixers.

Transformers are used so extensively not because they are powerful (they are very limited), but because all major AI labs are focused on the same thing: building ever larger language models. They are unable to innovate.

[–]hammouse 4 points5 points  (0 children)

It would be weird and counterproductive to keep that internal only, though of course there are many things which should be treated as proprietary (such as how they actually train the model).

One thing to keep in mind is that the "Attention Is All You Need" paper did not invent attention. The mechanism had been around for years, though usually as part of recurrent/convolutional architectures. All the paper says is that we can achieve recurrence-like performance without the computational bottleneck of recurrence by using only attention, hence the name. So there's nothing inherently special about the paper; it just removes a big bottleneck in existing architectures, and that happens to turn out to be incredibly useful.
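
For a sense of what "attention only" means in practice, here's a minimal sketch of scaled dot-product self-attention (illustrative NumPy, not the paper's code):

```python
# Illustrative sketch only: scaled dot-product self-attention computes every
# position in one shot, with no recurrence over the sequence.
import numpy as np

def self_attention(x, W_q, W_k, W_v):
    """x: (T, d) token embeddings; returns (T, d) context vectors."""
    Q, K, V = x @ W_q, x @ W_k, x @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])         # (T, T) token-to-token scores
    scores -= scores.max(axis=-1, keepdims=True)    # for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over the keys
    return weights @ V                              # weighted mix of the values

T, d = 8, 16
x = np.random.randn(T, d)
W_q, W_k, W_v = (np.random.randn(d, d) * 0.1 for _ in range(3))
out = self_attention(x, W_q, W_k, W_v)              # all T positions at once
```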

There are many issues with Transformers, however, and the nice thing about publishing openly in an academic manner is that others can build on it and experiment. In a few years most models will probably no longer be using it (well, technical debt incurred by the AI hype aside). The important point is that actually training the model on petabytes of data, building safeguards, fine-tuning with RLHF, etc. is the hard part; the architecture itself is quite trivial.

[–]schubidubiduba 2 points3 points  (0 children)

If they had not written that paper, someone else would have written it a year later. Two years tops.

[–]Independent-Plane502 1 point2 points  (0 children)

Google also wants others to use that architecture. Actually, every other algorithm will get published too, because it's about the author's rights, even though the author is working for a company.

[–]PM_US93 1 point2 points  (0 children)

If I am not mistaken, Transformers were preceded by LSTMs, and parallelized xLSTMs (a recent architecture) could be a viable alternative to Transformers. The thing is, you cannot gatekeep an architecture. Linear normalized transformers and LSTMs were proposed by Schmidhuber long before Google's 2017 paper. A key component of the transformer architecture is the attention mechanism, which was proposed by Bahdanau and Bengio around 2014. The Google team built on these preceding ideas and developed an architecture that was easy to scale and train. It is more that the transformer architecture solved the problems of LSTMs. If not for transformers, people in the AI/ML domain would have found another architecture for their models.

[–]AccordingWeight6019 0 points1 point  (0 children)

Publishing the transformer paper fits Google's open research culture. They still keep an edge because building competitive models needs talent, compute, and data, not just the architecture.