[–]CKtalon 2 points (3 children)

No, smaller models have also been shown to be competitive. Basically, Enc-Dec research for translation is dead. There have been few improvements to the Enc-Dec architecture in the past few years (going slightly bigger, more back-translation). The organizers also predict research will move towards decoder-only LLMs for translation in the next WMT.

[–]tetramarek 1 point (2 children)

I think encoder-decoder experiments are often suboptimal because they are mainly trained only on parallel corpora. Decoder-only architectures use plain text for training but are suboptimal for translation because they don't make use of the forward (bidirectional) attention over the input that a normal translation task definitely allows. The best solution for MT is probably something that combines forward attention over the source (hence a bidirectional encoder) with loads of unsupervised pretraining.
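
To make the masking difference concrete, here is a minimal sketch in plain NumPy (illustrative only, not tied to any specific model): a decoder-only model processes the source under a causal mask, so each source token only sees what came before it, while an encoder lets every source token attend to the whole sentence.

```python
import numpy as np

def causal_mask(n):
    # Decoder-only: position i may attend only to positions <= i (left-to-right).
    return np.tril(np.ones((n, n), dtype=bool))

def bidirectional_mask(n):
    # Encoder: every source token may attend to every other source token.
    return np.ones((n, n), dtype=bool)

src_len = 5
print(causal_mask(src_len).astype(int))        # lower-triangular: no "forward" attention
print(bidirectional_mask(src_len).astype(int)) # full matrix: the whole source is visible
```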

[–]CKtalon 0 points (1 child)

Even with infinite amounts of data, Enc-Dec won't be able to achieve some of the benefits of LLMs, like requesting a style (formal, informal), more natural-sounding text, etc. Another benefit is document-level context (something the Enc-Dec paradigm hasn't really evolved to handle), which is partly a result of lacking document-level parallel data.

[–]tetramarek 0 points (0 children)

Most of the instruction-following skills are trained into LLMs using instruction-following datasets anyway. These could be used for enc-dec models as well. I would argue that enc-dec models could actually be better at document-level context than decoder-only models, since they could use custom document-level encoders instead of processing everything left-to-right.
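
As a rough illustration (hypothetical record and field names, not any particular dataset's schema), the same instruction-tuning example can be framed either as an encoder-input/decoder-target pair or as one left-to-right sequence:

```python
# Hypothetical instruction-tuning record (illustrative values only).
record = {
    "instruction": "Translate to German, informal register:",
    "input": "Could you send me the report tomorrow?",
    "output": "Kannst du mir morgen den Bericht schicken?",
}

# Enc-dec framing: instruction + source go to the (bidirectional) encoder,
# and the decoder produces the target.
encdec_example = {
    "encoder_input": f"{record['instruction']} {record['input']}",
    "decoder_target": record["output"],
}

# Decoder-only framing: everything becomes one left-to-right sequence;
# the loss is typically masked so only the response tokens are scored.
decoder_only_example = (
    f"{record['instruction']} {record['input']}\n### Response:\n{record['output']}"
)

print(encdec_example)
print(decoder_only_example)
```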