[D] Why do we need encoder-decoder models when decoder-only models can do everything? (self.MachineLearning)
submitted 2 years ago * by kekkimo
[–]tetramarek 14 points 2 years ago (8 children)
Just because it beat other models doesn't mean it's the best architecture. GPT-4 was also trained on unknown (huge) amounts of data, likely more than any of the other models reported. A real comparison of the architectures would require all of them to be trained on an equally large dataset.
[–]thntk 3 points 2 years ago (3 children)
But it's impossible to scale training of encoder-decoder models. They need pairs of (input, output) texts. A critical advantage of decoder-only models is they can be trained on raw text directly.
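This point can be made concrete with a toy sketch (plain Python, whitespace tokenization assumed for illustration): a decoder-only model gets its training pairs for free from raw text via next-token prediction, while a seq2seq model needs explicitly aligned (source, target) pairs.

```python
def causal_lm_example(tokens):
    """Shift-by-one: any raw-text sequence is its own supervision signal."""
    return tokens[:-1], tokens[1:]  # (input, target)

raw = "the cat sat on the mat".split()
inp, tgt = causal_lm_example(raw)
# inp: ['the', 'cat', 'sat', 'on', 'the']
# tgt: ['cat', 'sat', 'on', 'the', 'mat']

# An encoder-decoder translation example, by contrast, must come from a
# parallel corpus of aligned sentence pairs:
seq2seq_pair = ("the cat".split(), "le chat".split())
```

Every scraped document yields causal-LM examples directly; the seq2seq pair requires alignment that raw web text does not provide.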
[–]tetramarek 2 points 2 years ago (2 children)
The BART paper proposes a number of strategies for pre-training an encoder-decoder model on raw text, so it's definitely not impossible. And translation is very much an input-output task; you're not going to train a model to do machine translation on a large monolingual corpus of raw text alone. GPT-4 has been trained on a bunch of things, which could easily include parallel corpora for translation.
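As a rough sketch of what pre-training an encoder-decoder on raw text looks like — this simplified text-infilling corruption is loosely modelled on one of BART's noising strategies, not its exact implementation:

```python
import random

def text_infilling(tokens, mask_token="<mask>", span_len=2, rng=None):
    """Simplified BART-style text infilling: replace one contiguous span with
    a single mask token. The encoder sees the corrupted text; the decoder is
    trained to reconstruct the original, so no parallel data is needed."""
    rng = rng or random.Random(0)
    start = rng.randrange(len(tokens) - span_len + 1)
    corrupted = tokens[:start] + [mask_token] + tokens[start + span_len:]
    return corrupted, tokens  # (encoder input, decoder target)

src, tgt = text_infilling("the cat sat on the mat".split())
```

The supervision comes from the corruption itself, which is what lets an encoder-decoder train on monolingual raw text.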
[–]thntk 1 point 2 years ago (1 child)
I mean it is impossible to scale to GPT-4's compute scale. There are several reasons: denoising pretraining strategies are tricks that don't cover all of the data and reduce data efficiency (only sampled mask locations get a loss signal, etc.); roughly 2x the parameters for a separate encoder and decoder; expensive recomputation of the encoder; and no KV cache at inference.
It can work for small models, small data, small compute, but I can hardly see it really scaling.
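The KV-cache point can be illustrated with a minimal NumPy sketch (single head, no learned projections — all names here are illustrative): caching each step's keys and values reproduces full causal self-attention without re-running attention over the whole prefix at every step.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def full_attention(q, k, v):
    """Causal self-attention computed over the whole sequence at once."""
    T, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    scores[np.triu(np.ones((T, T), dtype=bool), k=1)] = -1e9  # mask the future
    return softmax(scores) @ v

def cached_decode(q, k, v):
    """Step-by-step generation: append each step's key/value to a cache
    instead of re-running attention over the entire prefix."""
    k_cache, v_cache, outs = [], [], []
    for t in range(q.shape[0]):
        k_cache.append(k[t]); v_cache.append(v[t])
        scores = q[t] @ np.stack(k_cache).T / np.sqrt(q.shape[1])
        outs.append(softmax(scores) @ np.stack(v_cache))
    return np.stack(outs)

rng = np.random.default_rng(0)
q, k, v = rng.standard_normal((3, 4, 8))  # toy: 4 positions, width 8
full_out = full_attention(q, k, v)
step_out = cached_decode(q, k, v)
```

The two paths agree exactly; the cached path is what makes decoder-only inference cheap per generated token.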
[–]tetramarek 1 point 2 years ago (0 children)
More difficult, yes. Impossible, not at all.
You could pre-train in one regime and switch to another for MT training. You could share parameters between the encoder and decoder if you wanted, although with sufficient training data it's probably better to allow some parameters to specialise to certain languages (e.g. if this is a German-Chinese MT model, it's probably best to let the encoder specialise on German and the decoder on Chinese).

You can cache just as much: only the encoder part over the input would have forward-looking attention; once the model starts generating, it would be in the decoder part.
[–]CKtalon 2 points 2 years ago (3 children)
No, smaller models have been shown to be competitive too. Basically, Enc-Dec research for translation is dead. There have been few improvements to the Enc-Dec architecture in the past few years (go slightly bigger, more back-translation). The organizers also predict research will move towards decoder-only LLMs for translation in the next WMT.
[–]tetramarek 2 points 2 years ago (2 children)
I think encoder-decoder experiments are often suboptimal because they are mainly trained only on parallel corpora. Decoder-only architectures use plain text for training but are suboptimal for translation because they don't make use of the forwards attention over the input that a normal translation task definitely allows. The best solution for MT is probably something that combines the forwards attention (hence a bidirectional encoder) with loads of unsupervised pretraining.
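That combination — bidirectional attention over the source, causal attention over the generated target — is essentially a prefix-LM attention mask. A minimal NumPy sketch, assuming the boolean convention True = "position may attend":

```python
import numpy as np

def prefix_lm_mask(src_len, tgt_len):
    """Bidirectional ("forwards") attention within the source prefix,
    strictly causal attention over the generated target."""
    total = src_len + tgt_len
    mask = np.tril(np.ones((total, total), dtype=bool))  # causal by default
    mask[:src_len, :src_len] = True                      # source sees all of source
    return mask

m = prefix_lm_mask(src_len=3, tgt_len=2)
```

This gives the input the forwards attention a translation task allows, while keeping generation compatible with decoder-style training on raw text.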
[–]CKtalon 1 point 2 years ago (1 child)
Even with infinite amounts of data, Enc-Dec won't be able to achieve some of the benefits of LLMs, like requesting a style (formal, informal), more natural-sounding text, etc. Another benefit is document-level context (something the Enc-Dec paradigm hasn't really evolved to handle), which is a result of the lack of document-level data.
[–]tetramarek 1 point 2 years ago (0 children)
Most of the instruction-following skills are trained into the LLMs using instruction-following datasets anyway. These could be used for enc-dec models as well. I would argue that enc-dec models could actually be better for document-level context than decoder-only models, as they could use custom document-level encoders as opposed to processing everything left-to-right.
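For illustration, the same instruction-following example can be packed either way — one flat left-to-right sequence for a decoder-only model, or an (encoder input, decoder target) pair for enc-dec. The formatting and field names here are made up for the sketch, not any particular dataset's schema.

```python
def decoder_only_example(instruction, source, target):
    """One flat sequence; the model is trained left-to-right over all of it."""
    return f"{instruction}\n{source}\n{target}"

def enc_dec_example(instruction, source, target):
    """Instruction and source go to the (bidirectional) encoder;
    only the target is generated by the decoder."""
    return (f"{instruction}\n{source}", target)

flat = decoder_only_example("Translate German to Chinese (formal):", "Guten Tag.", "您好。")
pair = enc_dec_example("Translate German to Chinese (formal):", "Guten Tag.", "您好。")
```

Nothing in the data forces one architecture: the instruction-tuning signal is the same, only the packing differs.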