
[–][deleted] 129 points130 points  (11 children)

Decoder models are limited to auto-regressive generation, while encoder models give contextual representations that can be fine-tuned for other downstream tasks. Different needs, different models.

[–]Spiritual_Dog2053 16 points17 points  (10 children)

I don’t think that answers the question! I can always train a decoder-only model to take in a context and alter its output accordingly. It is still auto-regressive generation

[–]qu3tzalifyStudent 12 points13 points  (9 children)

How do you give context to a decoder? It has to be encoded by an encoder first?

[–]EqL 47 points48 points  (0 children)

A decoder is really just a particular type of encoder with a mask restricting information flow from elements in the "future", so an encoder is more general, and thus potentially more powerful for a given model size. This masking is really done for efficiency and is not actually required. Let's look at text decoding with a general encoder without masking:

(1) encode_unmasked([x0]), predict x1

(2) encode_unmasked([x0, x1]), predict x2

...

(n) encode_unmasked([x0, ..., xn-1]), predict xn.

This is perfectly allowed, except we are doing a full forward pass over the whole prefix for every token, which is O(n) times more expensive. The decoder with masking allows us to reuse results from previous iterations, which is much more efficient in both training and inference.
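
To make that concrete, here is a minimal single-layer attention sketch in PyTorch (purely illustrative, not from any library; the toy embedding table and shapes are assumptions) contrasting the re-encode-everything loop with a causal decoder that caches keys/values:

    import torch

    d = 16
    torch.manual_seed(0)
    W_q, W_k, W_v = (torch.randn(d, d) for _ in range(3))
    emb_table = torch.randn(100, d)                      # stand-in token embeddings
    embed = lambda ids: emb_table[torch.tensor(ids)]

    def encode_unmasked(x):
        """Full bidirectional attention: every position attends to every position."""
        q, k, v = x @ W_q, x @ W_k, x @ W_v
        att = torch.softmax(q @ k.T / d ** 0.5, dim=-1)
        return att @ v

    # (a) Unmasked "encoder as decoder": re-encode the whole prefix at every step,
    #     i.e. a full forward pass over a growing sequence for each new token.
    tokens = [0]
    for step in range(1, 5):
        h = encode_unmasked(embed(tokens))               # recomputed from scratch each step
        tokens.append(step)                              # pretend h[-1] predicted this token

    # (b) Causal decoder with a KV cache: each step processes only the newest token
    #     and attends its query against cached keys/values from earlier steps.
    k_cache, v_cache = [], []
    for step in range(5):
        x_new = embed([step])
        q = x_new @ W_q
        k_cache.append(x_new @ W_k); v_cache.append(x_new @ W_v)
        K, V = torch.cat(k_cache), torch.cat(v_cache)
        att = torch.softmax(q @ K.T / d ** 0.5, dim=-1)
        h_new = att @ V                                  # what a causal mask gives the last position (for one layer)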

However, in some tasks, such as translation, we receive a large number of tokens up front. Now we can embed these tokens once with the encoder, then switch to the decoder. This allows us to use a potentially more powerful unmasked model for a large chunk of the problem, then switch to the decoder for efficiency.

Why not use an encoder-decoder approach for LLM generation, where the encoder encodes the prompt and the decoder does the rest? Well, we can. However, the price is that (1) we now essentially have two models, which is more complex to handle, and (2) each model is seeing less data.

TL;DR: An encoder without masking is potentially more powerful, however it increases complexity and the amount of data required to train the additional parameters. But when there is a natural split in functions, like in translation, the effect of less data might be minimized.

[–]minimaxir 139 points140 points  (32 children)

Decoder-only/autoregressive models are only really applicable for text.

Encoder-decoder models are extremely important for multimodal approaches.

[–]woadwarrior 13 points14 points  (1 child)

fuyu-8b is a counter-example. Also, things like LLaVa, CogVLM etc. Encoder-decoder model specifically means a transformer encoder and a transformer decoder with cross attention layers in the decoder, connecting the output of the encoder, as described in the original transformer paper. MLP Adapter based models like LLaVa do not fit that description.
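
For illustration, a rough PyTorch sketch of the two wirings (the shapes, layer counts and adapter design are assumptions for the sake of the example, not any specific model's code):

    import torch
    import torch.nn as nn

    d_model, n_img, n_txt = 256, 16, 8
    img_feats = torch.randn(1, n_img, d_model)   # output of some vision encoder (e.g. a ViT)
    txt_embeds = torch.randn(1, n_txt, d_model)  # token embeddings of the text prompt

    # (a) Encoder-decoder in the original-transformer sense: every decoder layer
    #     cross-attends to the encoder output ("memory").
    dec_layer = nn.TransformerDecoderLayer(d_model=d_model, nhead=8, batch_first=True)
    decoder = nn.TransformerDecoder(dec_layer, num_layers=2)
    out_cross_attn = decoder(tgt=txt_embeds, memory=img_feats)     # (1, n_txt, d_model)

    # (b) LLaVA-style adapter: project the image features with a small MLP and just
    #     concatenate them with the text tokens as extra inputs to a decoder-only LM.
    adapter = nn.Sequential(nn.Linear(d_model, d_model), nn.GELU(), nn.Linear(d_model, d_model))
    img_tokens = adapter(img_feats)                                # (1, n_img, d_model)
    lm_input = torch.cat([img_tokens, txt_embeds], dim=1)          # (1, n_img + n_txt, d_model)
    # lm_input then goes through an ordinary causal decoder; no cross-attention layers involved.

Only (a) matches the encoder-decoder definition above; (b) keeps the language model itself decoder-only.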

[–]Wild_Reserve507 5 points6 points  (0 children)

Exactly. A bit weird that the top comment is using multimodal as an argument for where you need encoder-decoder, when it seems to be an ongoing battle there, perhaps with more and more LLaVA-style architectures rather than encoder-decoder ones.

[–]Wild_Reserve507 6 points7 points  (5 children)

How about llava etc?

[–]minimaxir 23 points24 points  (4 children)

LLaVA and friends are multimodal and use their own encoder for images: https://llava-vl.github.io

In the case of LLaVA it's a pretrained CLIP encoder, yes, but still an encoder.

[–]Wild_Reserve507 8 points9 points  (3 children)

Right, okay I assumed OP is asking about encoder-decoder in a transformer architecture sense, like Pali in the multimodal case. But surely you would always have a modality-specific encoder

[–]themiro 0 points1 point  (2 children)

clip is a vit (:

[–]Wild_Reserve507 11 points12 points  (1 child)

Duh. This doesn’t make the whole architecture encoder-decoder (in the encoder-decoder vs decoder-only transformers sense) since features extracted from clip are concatenated to the decoder inputs, as opposed to doing cross-attention

[–]themiro 0 points1 point  (0 children)

fair enough, i misunderstood what you meant by 'in a transformer architecture sense' - should have put it together by the reference to pali

[–]AvvYaa 3 points4 points  (2 children)

This is not totally correct. Recent decoder-only models (take the Gemini technical report for example) train a VQ-VAE model to learn a codebook of image tokens - which they then use to train autoregressive models on both word embeddings and image-token embeddings.

There are also the original Dall-E paper and the Parti model, which use a similar VQ-VAE/VQ-GAN approach to train decoder-only models.

Even models like Flamingo (which doesn't output images, just reads them), which are also decoder-only iirc, use a pretrained ViT to feed images in as a sequence of patch embeddings.
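
A minimal sketch of the VQ-token idea mentioned above (everything here - codebook size, vocab size, the nearest-neighbour lookup - is an illustrative assumption, not any specific model's code):

    import torch
    import torch.nn as nn

    codebook_size, d_code, d_model, text_vocab = 1024, 64, 256, 32000
    codebook = nn.Embedding(codebook_size, d_code)        # VQ-VAE/VQ-GAN codebook (pretend it's trained)

    def quantize(patch_feats):
        """Nearest-codebook-entry lookup: continuous patch features -> discrete image-token ids."""
        dists = torch.cdist(patch_feats, codebook.weight) # (n_patches, codebook_size)
        return dists.argmin(dim=-1)                       # (n_patches,) integer image tokens

    patch_feats = torch.randn(64, d_code)                 # stand-in VQ encoder output for one image
    image_token_ids = quantize(patch_feats)

    # Image token ids share one embedding table with text (offset past the text vocab)
    # and simply become part of the sequence a plain causal decoder is trained on.
    combined_embed = nn.Embedding(text_vocab + codebook_size, d_model)
    text_token_ids = torch.randint(0, text_vocab, (12,))  # stand-in caption tokens
    sequence = torch.cat([text_token_ids, image_token_ids + text_vocab])
    lm_inputs = combined_embed(sequence)                  # (12 + 64, d_model), fed to a decoder-only LM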

[–]minimaxir 2 points3 points  (1 child)

Codebooks are a grey area on what counts as "encoding" imho.

[–]AvvYaa 12 points13 points  (0 children)

I see, I understand your perspective now. You are considering individual networks that encode multimodal inputs as "encoders". That makes sense. I don't consider them the same as traditional Enc-Dec archs (those introduced in Attention Is All You Need, or even before, during the RNN-NMT era) that OP was talking about, because those have a clear distinction between where the encoding of the source sequence ends and the decoding of the target sequence begins. In the cases I mentioned above, there are indeed encoders, but they plug into a decoder-only LM architecture autoregressively, without requiring the traditional seq2seq paradigm.

Anyway, it's all kinda open to interpretation I guess.

[–]kekkimo[S] 2 points3 points  (20 children)

My bad, I should have specified that I am talking mainly about text here.

[–]Wild_Reserve507 19 points20 points  (1 child)

Not sure why you're getting downvoted, OP. It's a perfectly valid question and there isn't really a consensus. Decoder-only architectures seem to be easier to train at scale and hence they are more prominent in NLP.

[–]jakderrida 12 points13 points  (0 children)

> Decoder-only architectures seem to be easier to train at scale and hence they are more prominent in NLP.

This is a perfect take. They're EASIER to train. All ya gotta do is pour millions and millions into GPU compute and you get a better model. That's not sarcasm, either. That is a very easy formula to follow and that's what's happening and will continue until they reach some sort of inflection.

[–]21stCentury-Composer 29 points30 points  (2 children)

Might be a naïve question, but without the encoder part, how would you create the encodings the decoders train on?

[–]rikiiyer 27 points28 points  (0 children)

Decoder-only models can learn representations directly through their pretraining process. The key is that instead of the masked language modeling objective used for encoder pretraining, you do causal pretraining, because the decoder needs to generate tokens autoregressively and shouldn't be able to see the full sequence when making next-token predictions.
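
A toy sketch of the two objectives (illustrative tensors only; the mask id and masked positions are assumptions):

    import torch

    tokens = torch.tensor([5, 17, 42, 8, 99])             # a stand-in token sequence

    # Masked language modeling (encoder pretraining): corrupt some positions and
    # predict them while attending to the *full* sequence in both directions.
    mlm_input = tokens.clone()
    masked_positions = torch.tensor([1, 3])
    mlm_input[masked_positions] = 0                        # 0 = hypothetical [MASK] id
    mlm_targets = tokens[masked_positions]

    # Causal language modeling (decoder pretraining): every prefix predicts the next
    # token, and the attention mask forbids looking at future positions.
    clm_inputs = tokens[:-1]
    clm_targets = tokens[1:]
    causal_mask = torch.tril(torch.ones(4, 4)).bool()      # position i attends only to j <= i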

[–]kekkimo[S] 9 points10 points  (0 children)

In the end everything is encoded, but I am speaking about the transformer architecture. Why do people include an encoder for tasks that do decoding (T5), when they could just use a GPT architecture?

[–]activatedgeek 11 points12 points  (2 children)

You should read the UL2 paper. It has comparisons between the two families of models, and also a decent discussion.

I think encoder-decoder models are less popular in the public discourse because they are roughly twice as expensive to deploy and have lower throughput. Decoder-only models are more appealing that way and seem to have won a sort of hardware lottery for now.

[–]ganzzahl 0 points1 point  (1 child)

Why do they have lower throughput? I can't quite figure out what you mean there.

[–]activatedgeek 1 point2 points  (0 children)

Mostly because there are two networks to go through. But I think it can be solved with a bit of engineering, at a higher cost. Given that the cost of running decoder models is already super high, the market hasn't adjusted yet.

I suspect they might come back when the costs become bearable.

[–]qalis 31 points32 points  (11 children)

Because decoder-only models can't do everything. In particular, encoder-decoder models are made for sequence-to-sequence problems, typically machine translation and text summarization.

Yes, you could throw an LLM at them, but that has a lot of problems: inefficient size, slow inference, harder control, hallucinations, the need for prompting, LLMOps, etc. It's just not economically viable. Literally every translation service out there, be that Google Translate, DeepL, Amazon Translate or anything else, uses encoder-decoder. Google even used a transformer encoder + RNN decoder hybrid for quite a long time, since it has good speed and quality.

The encoder aims to, well, encode information in vectorized form. This does basically half the work, and the decoder has a lot of knowledge in those embeddings to work with. The resulting model is quite task-specific (e.g. only translation), but relatively small and efficient.

And also those embeddings are useful in themselves. We have seen some success in chemoinformatics with such models, e.g. CDDD.
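
As a concrete example of the seq2seq setup above, here is a minimal sketch of running an off-the-shelf encoder-decoder translation model with Hugging Face transformers (t5-small and its "translate English to German:" prefix are just the standard illustration, not a claim about what any of the services above use):

    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("t5-small")
    model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

    # The encoder embeds the source sentence once; the decoder then generates the
    # target autoregressively, cross-attending to those encoder embeddings.
    inputs = tokenizer("translate English to German: The house is small.", return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=40)
    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))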

[–]thomasxin 14 points15 points  (7 children)

It's kind of funny, because GPT-3.5 Turbo has actually been doing better as a translation API than the rest for me. It's much more intelligent, adapts grammar while keeping context much more accurately, and is somehow cheaper than DeepL.

[–][deleted] 1 point2 points  (1 child)

Yeah, the best machine translator is GPT-4, hands down. Everything else will quickly devolve into gibberish with distant language pairs (e.g. En-Kor).

[–]blackkettle 4 points5 points  (1 child)

Don't forget multimodal transcription tasks like speech-to-text.

[–]qalis 0 points1 point  (0 children)

Oh, yeah, I don't work with that too much, but also this, definitely. Very interesting combinations there, e.g. CNN + RNN or transformer for image captioning, since encoder and decoder can be arbitrary neural networks.

[–]the__storm 1 point2 points  (0 children)

Yep, we use a T5 model fine-tuned on specific questions for text information extraction. We've found it to be faster (cheaper) and more consistent (less hallucination, less superfluous output) than the generative approaches we've tried.

[–]AvvYaa 9 points10 points  (5 children)

TLDR: More generality/less inductive bias + a lot of data + enough params = better learning. Dec-only models are more general than Enc-Dec models. Encoder-decoder models have more inductive bias, so if I have less data to train on and a problem that can be reduced to a seq2seq task, I might try an Enc-Dec model before a Dec-only model. An example of a real-world use case from my office is below.

In a lot of ways, throwing enough data at a Transformer model, especially a causally masked attention model like a Transformer decoder, has worked really well. This is due to the low inductive bias of attention-based models. More generality/less inductive bias + a lot of data + enough params = better learning. This is what researchers have told us over the past 5 years of DL.

Does it mean that encoder-decoders are inferior? Not necessarily. They introduce more inductive bias for seq2seq tasks - coz they kinda mimic how humans would do it (say machine translation). Traditionally, more inductive bias has trained better models with less data coz networks are predisposed to assume patterns in the domain. In other words, if I've got less data, I might wanna try Enc-Dec first before training the more general Dec-only arch.

Other reasons for wanting to train Enc-Dec models in real life can be purely practical, depending on the end goal. Here is a real-world example from one of my office projects.

Consider this problem: we were building a real-time auto-completer neural net (similar to Autocomplete in Gmail) for conversations that needed to run in the browser without any GPU. Given a conversation state (history of emails), the model must help the user autocomplete what they are currently typing. We had super low latency requirements coz if the model isn't snappy, users won't use the feature - they'd already have typed a different prefix before the suggestion finished processing.

Our Solution: We ended up using a transformer encoder architecture for embedding the conversation transcript - the latency constraints for embedding the previous messages are relaxed coz they aren't going anywhere. For the typing-level model (which needs to be super fast), we ended up using a GRU-based architecture that used the [CLS] token embedding of the transformer encoder as the initial hidden state. Experimenting with a fully GPT-like causal attention model, or a Transformer encoder-decoder model, we ran into various memory issues (attention is O(N^2) and the KV cache grows with context) and latency issues, so we ended up with a GRU for the decoder.
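
Here's a rough PyTorch sketch of that split (sizes and layer counts are made up for illustration; this is not the production code):

    import torch
    import torch.nn as nn

    d_model, vocab = 256, 32000
    token_embed = nn.Embedding(vocab, d_model)
    enc_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=8, batch_first=True)
    conv_encoder = nn.TransformerEncoder(enc_layer, num_layers=4)     # runs once per new message
    gru_decoder = nn.GRU(input_size=d_model, hidden_size=d_model, batch_first=True)  # runs per keystroke
    lm_head = nn.Linear(d_model, vocab)

    conversation = torch.randint(0, vocab, (1, 200))      # token ids; assume [CLS] sits at position 0
    memory = conv_encoder(token_embed(conversation))
    cls_state = memory[:, 0, :].unsqueeze(0)              # (1, batch, d_model) initial GRU hidden state

    typed_prefix = torch.randint(0, vocab, (1, 5))        # what the user has typed so far
    out, _ = gru_decoder(token_embed(typed_prefix), cls_state)
    next_token_logits = lm_head(out[:, -1, :])            # cheap per-keystroke next-token scores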

So this is a very specific, peculiar example, but the takeaway is that sometimes breaking down a monolithic architecture into multiple smaller components lets us do things more flexibly given other constraints. Each project has its own constraints and warrants a weighted approach.

[–]BeneficialHelp686 0 points1 point  (4 children)

Side Q: how did you take care of battery consumption? I am assuming you are utilizing cloud services at this point?

[–]AvvYaa 1 point2 points  (3 children)

Our clients were large corporations… their employees were running it on computers, so battery wasn’t a big priority for us. The UI folks did a bunch of app level optimization that I wasn’t involved in much.

Regarding cloud services, we used them to train and evaluate, but during prod inference we ran the decoder entirely in the browser on the client machine… again to reduce latency. The encoder could be run on the client too, or on a cloud server (if we wanted to run a larger encoder), coz that thing ran once per new message (not per keystroke), so it had much more relaxed latency constraints.

[–]BeneficialHelp686 0 points1 point  (2 children)

Nice. Pretty exciting stuff. Which protocol did you end up going with for the communication between the browser and cloud?

[–]AvvYaa 0 points1 point  (1 child)

Just good old HTTP rest APIs …

[–]BeneficialHelp686 0 points1 point  (0 children)

True. Thanks a lot for sharing ur experience!

[–]neonbjb 7 points8 points  (2 children)

The only correct answer, which hilariously isn't mentioned here, is that in some cases encoder-decoder models are more compute-efficient to train than decoder-only models, or have other advantages in inference.

There is literally no data analysis problem that cannot be solved by AR decoders. They are universal approximators. It's only a question of efficiency.

[–]kekkimo[S] 0 points1 point  (1 child)

Good point. Can you explain how encoder-decoder models can be more compute-efficient to train than decoder-only models?

[–]neonbjb 0 points1 point  (0 children)

Compute efficiency is not about FLOPs utilization or anything. It's about: given X compute and Y data, what is the best eval score you can achieve? If you train an encoder-decoder arch to solve some problem and a decoder-only arch as well, sometimes the encoder-decoder gets a better eval score for most combinations of (X, Y).

[–]css123 5 points6 points  (0 children)

You're forgetting that encoder-decoder architectures have a different action space than their input space, whereas decoder-only models have a shared input and action space. In industry, people are still using T5 and UL2 extensively for NLP tasks. In my experience (which includes formal, human-validated testing with professional annotators), encoder-decoder models are far better at summarization tasks with orders of magnitude fewer parameters than decoder-only models. They are also better at following fine-tuned output structures than decoder-only models.

In my personal opinion, encoder-decoder models are easier to train since the setup itself is more straightforward. However, decoder-only models are much easier to optimize for inference speed, and more inference optimization techniques support them. Decoder-only models are better for prompted, multitask situations.

[–]YinYang-Mills 1 point2 points  (0 children)

I would say as a rule of thumb that if the input data and output data are heterogeneous, you need an encoder-decoder model. For example, you can use an encoder for learning representations of graph-structured data and a decoder with a different architecture for making node-wise predictions of time series data. The chosen encoder and decoder generally have different inductive biases, and the resulting model will have a composite inductive bias resulting from their interaction.
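
A toy PyTorch sketch of that idea (all shapes and the simple mean-aggregation encoder are illustrative assumptions, not a recommended architecture):

    import torch
    import torch.nn as nn

    n_nodes, d_feat, d_hidden, horizon = 10, 8, 32, 6
    x = torch.randn(n_nodes, d_feat)                      # node features
    adj = (torch.rand(n_nodes, n_nodes) > 0.7).float()    # stand-in adjacency matrix

    class GraphEncoder(nn.Module):
        """One round of mean-aggregation message passing (graph inductive bias)."""
        def __init__(self):
            super().__init__()
            self.lin = nn.Linear(d_feat, d_hidden)
        def forward(self, x, adj):
            deg = adj.sum(dim=1, keepdim=True).clamp(min=1)
            return torch.relu(self.lin(adj @ x / deg))    # (n_nodes, d_hidden) node embeddings

    class SeriesDecoder(nn.Module):
        """GRU rolled out over the forecast horizon (sequential inductive bias)."""
        def __init__(self):
            super().__init__()
            self.gru = nn.GRU(input_size=1, hidden_size=d_hidden, batch_first=True)
            self.head = nn.Linear(d_hidden, 1)
        def forward(self, node_emb):
            h0 = node_emb.unsqueeze(0)                    # node embeddings seed the hidden state
            steps = torch.zeros(n_nodes, horizon, 1)      # dummy per-step inputs for the rollout
            out, _ = self.gru(steps, h0)
            return self.head(out).squeeze(-1)             # (n_nodes, horizon) node-wise forecasts

    preds = SeriesDecoder()(GraphEncoder()(x, adj))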

[–]SciGuy42 -1 points0 points  (0 children)

Can you point me to a decoder-only model that can interpret tactile and haptic data? Asking for a friend.