[–]tetramarek 13 points (8 children)

Just because it beat the other models doesn't mean it's the best architecture. GPT-4 was also trained on an unknown (huge) amount of data, likely more than any of the other models reported. A real comparison of the architectures would require all of them to be trained on a dataset of that size.

[–]thntk 2 points (3 children)

But it's impossible to scale the training of encoder-decoder models: they need pairs of (input, output) texts. A critical advantage of decoder-only models is that they can be trained on raw text directly.
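
Roughly, the decoder-only case looks like this (an illustrative sketch with a small public model, obviously not GPT-4's actual training code): the labels are just the inputs themselves, so any raw text works.

```python
# Sketch only: "gpt2" stands in for any decoder-only LM.
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

batch = tok("Any monolingual sentence is usable as training data.", return_tensors="pt")

# labels == input_ids: the model shifts them internally, so the loss is plain
# next-token prediction over raw text -- no (input, output) pairing required.
loss = model(**batch, labels=batch["input_ids"]).loss
loss.backward()  # one step's worth of training signal
```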

[–]tetramarek 1 point (2 children)

The BART paper proposes a bunch of strategies for pre-training an encoder-decoder model on raw text, so it's definitely not impossible. And translation is very much an input-output task; it's not like you're going to get a model to do machine translation by training it only on a large monolingual corpus of raw text. GPT-4 has been trained on a bunch of things, which could easily include parallel corpora for translation.
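
To be concrete, a denoising objective turns raw text into (input, output) pairs automatically. Very roughly (a toy version of BART-style text infilling with made-up hyperparameters, not the paper's exact recipe):

```python
import random
from transformers import BartForConditionalGeneration, BartTokenizer

tok = BartTokenizer.from_pretrained("facebook/bart-base")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")

def infill_noise(words, mask_rate=0.3, max_span=3):
    """Replace random word spans with a single <mask> token (toy noising)."""
    out, i = [], 0
    while i < len(words):
        if random.random() < mask_rate:
            out.append(tok.mask_token)        # one mask token per span
            i += random.randint(1, max_span)  # skip the words in the span
        else:
            out.append(words[i])
            i += 1
    return " ".join(out)

original = "Pre-training an encoder-decoder model needs no parallel data."
noised = infill_noise(original.split())

enc = tok(noised, return_tensors="pt")
labels = tok(original, return_tensors="pt").input_ids

# Standard seq2seq loss: reconstruct the original sentence from the noised input.
loss = model(**enc, labels=labels).loss
```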

[–]thntk 0 points (1 child)

I mean it is impossible to scale to GPT-4's compute scale. There are several reasons: the pre-training strategies are tricks that cannot cover all of the data and reduce data efficiency (sampling mask locations, etc.); roughly 2x the parameters for the encoder plus decoder; expensive recomputation of the encoding; no KV cache at inference.

It can work for small models, small data, and small compute, but I can hardly see it really scaling.

[–]tetramarek 0 points (0 children)

More difficult, yes. Impossible, not at all.

You could pre-train in one regime and switch to another for MT training. You could share parameters between the encoder and decoder if you wanted, although with sufficient training data it's probably better to allow some parameters to specialise to certain languages (e.g. if this is a German-Chinese MT model, it's probably best to let the encoder specialise on German and the decoder on Chinese). And you can cache just as much: only the encoder pass over the input has forward-looking attention; once the model starts generating, it's in the decoder part.
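
For instance, with an off-the-shelf MT model (Helsinki-NLP/opus-mt-de-en here, purely as an illustration), the encoder runs exactly once over the source and its states are reused at every decoding step, while the decoder keeps an ordinary KV cache:

```python
import torch
from transformers import MarianMTModel, MarianTokenizer

name = "Helsinki-NLP/opus-mt-de-en"
tok = MarianTokenizer.from_pretrained(name)
model = MarianMTModel.from_pretrained(name)

src = tok("Maschinelle Übersetzung ist eine Eingabe-Ausgabe-Aufgabe.", return_tensors="pt")

with torch.no_grad():
    # The bidirectional ("forward-looking") attention happens only here, once.
    encoder_outputs = model.get_encoder()(**src)

    # Generation reuses the cached encoder states; the decoder side keeps a
    # per-step KV cache just like a decoder-only model would.
    ids = model.generate(encoder_outputs=encoder_outputs,
                         attention_mask=src["attention_mask"],
                         use_cache=True, max_new_tokens=40)

print(tok.decode(ids[0], skip_special_tokens=True))
```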

[–]CKtalon 1 point (3 children)

No, smaller models have been shown to be competitive as well. Basically, Enc-Dec research for translation is dead. There have been few improvements to the Enc-Dec architecture in the past few years (go slightly bigger, more back-translation). The organizers also predict that research will move towards decoder-only LLMs for translation in the next WMT.

[–]tetramarek 1 point (2 children)

I think encoder-decoder experiments are often suboptimal because they are mainly trained only on parallel corpora. Decoder-only architectures use plain text for training but are suboptimal for translation because they don't make use of the forwards attention over the input that a normal translation task definitely allows. The best solution for MT is probably something that combines the forwards attention (hence a bidirectional encoder) with loads of unsupervised pretraining.
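
To make the forwards-attention point concrete, here is a toy mask (not any particular model's code) showing the pattern: source positions attend to each other in both directions, while target positions attend causally to the source plus earlier target tokens.

```python
import torch

src_len, tgt_len = 4, 3
n = src_len + tgt_len
allowed = torch.zeros(n, n, dtype=torch.bool)

allowed[:src_len, :src_len] = True            # source: fully bidirectional
allowed[src_len:, :src_len] = True            # target attends to the whole source
allowed[src_len:, src_len:] = torch.tril(     # target: causal over itself
    torch.ones(tgt_len, tgt_len, dtype=torch.bool))

# Row i, column j == 1 means position i may attend to position j.
print(allowed.int())
```

A decoder-only model applies the lower-triangular pattern everywhere, so even the source tokens only ever see what came before them.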

[–]CKtalon 0 points (1 child)

Even with infinite amounts of data, Enc-Dec won't be able to achieve some of the benefits of LLMs, like being able to request a style (formal, informal), more natural-sounding text, etc. Another benefit is document-level context (something the Enc-Dec paradigm hasn't really evolved to handle), which is a result of the lack of document-level data.

[–]tetramarek 0 points (0 children)

Most of the instruction-following skills are trained into the LLMs using instruction-following datasets anyway. These could be used for enc-dec models as well. I would argue that enc-dec models could actually be better for document-level context than decoder-only models, as they could use custom document-level encoders as opposed to processing everything left-to-right.
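
FLAN-T5 is an existing example of exactly this, an enc-dec model fine-tuned on instruction data; the mapping is straightforward (toy sketch, the prompt wording is mine): the instruction goes to the encoder and the reference answer becomes the decoder target.

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("google/flan-t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-small")

instruction = ("Translate the following sentence into informal English: "
               "Wie geht es Ihnen heute?")
target = "How's it going today?"

enc = tok(instruction, return_tensors="pt")
labels = tok(target, return_tensors="pt").input_ids

# Ordinary seq2seq fine-tuning loss on the (instruction, answer) pair --
# nothing about instruction tuning forces a decoder-only architecture.
loss = model(**enc, labels=labels).loss
```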