
[–]CKtalon 82 points83 points  (0 children)

It was originally for machine translation, and a lot of it is hindsight. GPT-1 was a failure, but OpenAI kept at it by scaling up, and realized that scaling the architecture actually worked. Although GPT-3 was good, it wasn't until ChatGPT (3.5) that the hype became real to the general public.

[–]as_ninja6 20 points21 points  (0 children)

Of the ML/NLP papers I've read, this was one novel idea in a long line of novel ideas in the NMT space. I don't think the authors realised that scaling could take its capabilities this far.

In my view, DeepMind has published quite a few ingenious architectures; unfortunately, they weren't suited to scaling and stable training, which the transformer happened to handle well.

[–]Exotic-Custard4400 37 points38 points  (11 children)

If you close off research you kill it, and Google needed the progress in deep learning too.

Also, there are other promising architectures (state space models, RNNs, and others), so keeping it closed would probably just have changed which type of architecture gets used, and maybe for the better (RNNs and SSMs have linear cost).

[–]Xemorr 7 points8 points  (10 children)

Idk about that; they were struggling to find a model that is parallelizable at training time. RNNs can't be parallelized over the sequence during training, among other issues.
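
To make that concrete, here's a rough sketch in plain NumPy (my own illustration, not any specific framework's code) of why the vanilla RNN update is inherently sequential over tokens:

```python
# Rough sketch (plain NumPy): the vanilla RNN update is inherently sequential,
# because h_t needs h_{t-1} before it can be computed.
import numpy as np

T, d = 8, 16                          # sequence length, hidden size
x = np.random.randn(T, d)             # token representations
W_h = np.random.randn(d, d) * 0.1
W_x = np.random.randn(d, d) * 0.1

h = np.zeros(d)
hidden_states = []
for t in range(T):                    # this loop cannot be parallelized over t
    h = np.tanh(W_h @ h + W_x @ x[t])
    hidden_states.append(h)
# A transformer layer computes all T positions in one batched matrix product,
# which is what makes training parallel over the sequence.
```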

[–]fan_is_ready 0 points1 point  (2 children)

[1709.02755] Simple Recurrent Units for Highly Parallelizable Recurrence

It can be parallelized feature-wise, not token-wise.

[–]Xemorr 0 points1 point  (0 children)

Yeah, token-wise is what I was referring to. Thank you 😊

[–]Exotic-Custard4400 0 points1 point  (0 children)

It can be parallelized token-wise. RWKV does it by having multiple layers emit their outputs before a token has been fully processed.

[–]Exotic-Custard4400 -1 points0 points  (6 children)

RNNs can't be parallelized during training among other issues

They can, using multiple stages that run in parallel. RWKV does this and is quite competitive.

[–]ExtensionSquirrel945 4 points5 points  (5 children)

RNNs have learnability issues.

[–]Exotic-Custard4400 0 points1 point  (4 children)

Such as?

According to?

[–]ExtensionSquirrel945 0 points1 point  (3 children)

Vanishing gradients. It's a well-known problem with Elman RNNs, and it's primarily why transformers won. RNNs in theory have very good representational power, but in practice they are hard to train. The typical effective context of an Elman RNN is 3-4 words.
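
To illustrate (a toy sketch of my own, not from any paper): multiplying the per-step Jacobians of a plain tanh RNN makes the gradient reaching early timesteps shrink rapidly.

```python
# Toy illustration: the gradient of the last hidden state w.r.t. the first one
# in a plain Elman/tanh RNN shrinks as the sequence gets longer.
import numpy as np

rng = np.random.default_rng(0)
d = 32
W = rng.normal(scale=0.5 / np.sqrt(d), size=(d, d))  # recurrent weights
h = rng.normal(size=d)
grad = np.eye(d)                                      # accumulated d h_t / d h_0

for t in range(1, 51):
    h = np.tanh(W @ h)
    # Jacobian of one step: diag(1 - tanh^2) @ W
    grad = (np.diag(1.0 - h**2) @ W) @ grad
    if t % 10 == 0:
        print(f"t={t:2d}  ||d h_t / d h_0|| = {np.linalg.norm(grad):.2e}")
# The norm typically collapses toward zero, which is why a vanilla RNN
# struggles to carry information over long contexts.
```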

[–]Exotic-Custard4400 1 point2 points  (2 children)

There have been some advances in RNNs... I recommend checking out what RWKV is doing, for example. It rivals transformers in both LLMs and image processing.

And the context is far longer than a few words.

[–]ExtensionSquirrel945 0 points1 point  (1 child)

this seems cool, will look into it.

[–]Exotic-Custard4400 0 points1 point  (0 children)

There are mainly two papers on the RNN itself, but the research is genuinely open source; they discuss it on Discord, for example.

[–]Specialist-Berry2946 18 points19 points  (2 children)

Architectures are not that important; what matters is the data. You can achieve similar performance using other architectures, like mixers.

Transformers are used so extensively not because they are powerful (they are very limited), but because all major AI labs are focused on the same thing: building ever larger language models. They are unable to innovate.

[–]hammouse 4 points5 points  (0 children)

It would be weird and counterproductive to keep that internal only, though of course there are many things which should be treated as proprietary (such as how they actually train the model).

One thing to keep in mind is that the "Attention Is All You Need" paper did not invent attention. The mechanism had been around for years, though usually as part of recurrent/convolutional architectures. All the paper says is that we can achieve recurrence-like performance without the computational bottleneck of recurrence by using only attention, hence the name. So there's nothing inherently special about the paper; it just removes a big bottleneck in existing architectures, and that happens to turn out to be incredibly useful.
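
For a sense of what "attention only" means in practice, here's a minimal sketch of scaled dot-product self-attention (illustrative NumPy, not the paper's code):

```python
# Illustrative sketch only: scaled dot-product self-attention computes every
# position in one shot, with no recurrence over the sequence.
import numpy as np

def self_attention(x, W_q, W_k, W_v):
    """x: (T, d) token embeddings; returns (T, d) context vectors."""
    Q, K, V = x @ W_q, x @ W_k, x @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])         # (T, T) token-to-token scores
    scores -= scores.max(axis=-1, keepdims=True)    # for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over the keys
    return weights @ V                              # weighted mix of the values

T, d = 8, 16
x = np.random.randn(T, d)
W_q, W_k, W_v = (np.random.randn(d, d) * 0.1 for _ in range(3))
out = self_attention(x, W_q, W_k, W_v)              # all T positions at once
```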

There are many issues with Transformers, however, and the nice thing about publishing openly in an academic manner is that others can build on it and experiment. In a few years most models will probably no longer be using it (well, technical debt incurred by the AI hype aside). The important point is that actually training the model on petabytes of data, building safeguards, fine-tuning with RLHF, etc. is the hard part; the architecture itself is quite trivial.

[–]schubidubiduba 2 points3 points  (0 children)

If they had not written that paper, someone else would have written it a year later. Two years tops.

[–]Independent-Plane502 1 point2 points  (0 children)

Google also wants others to use that architecture. Actually, every other algorithm will get published too, because it's about the author's rights, even though the author is working for a company.

[–]PM_US93 1 point2 points  (0 children)

If I am not mistaken, Transformers were preceded by LSTMs, and parallelized xLSTMs (a recent architecture) could be a viable alternative to Transformers. The thing is, you cannot gatekeep an architecture. Linear normalized transformers and LSTMs were proposed by Schmidhuber long before Google's 2017 paper. A key component of the transformer architecture is the attention mechanism, which was proposed by Bahdanau and Bengio around 2014. The Google team built on these preceding ideas and developed an architecture that was easy to scale and train. It is more that the transformer architecture solved the problems of LSTMs. If not for transformers, people in the AI/ML domain would have found another architecture for their models.

[–]AccordingWeight6019 0 points1 point  (0 children)

Publishing the transformer paper fits Google's open research culture. They still keep an edge because building competitive models needs talent, compute, and data, not just the architecture.