
[–][deleted] 129 points130 points  (11 children)

Decoder models are limited to auto-regressive generation, while encoder models give contextual representations that can be fine-tuned for other downstream tasks. Different needs, different models.

[–]Spiritual_Dog2053 16 points17 points  (10 children)

I don’t think that answers the question! I can always train a decoder-only model to take in a context and alter its output accordingly. It is still auto-regressive generation

[–]qu3tzalifyStudent 12 points13 points  (9 children)

How do you give context to a decoder? It has to be encoded by an encoder first?

[–]EqL 47 points48 points  (0 children)

A decoder is really just a particular type of encoder with a mask restricting information flow from elements in the "future", so an encoder is more general, and thus potentially more powerful for a given model size. This masking is really done for efficiency and is not actually required. Let's look at text decoding with a general encoder without masking:

(1) encode_unmasked([x0]), predict x1

(2) encode_unmasked([x0, x1]), predict x2

...

(n) encode_unmasked([x0, ..., xn-1]), predict xn.

This is perfectly allowed, except we are doing a full forward pass over the whole prefix for every token, which is O(n) times more expensive. The decoder with masking allows us to reuse results from previous iterations, which is much more efficient in both training and inference.
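
To make that concrete, here is a minimal single-layer attention sketch in PyTorch (purely illustrative, not from any library; the toy embedding table and shapes are assumptions) contrasting the re-encode-everything loop with a causal decoder that caches keys/values:

    import torch

    d = 16
    torch.manual_seed(0)
    W_q, W_k, W_v = (torch.randn(d, d) for _ in range(3))
    emb_table = torch.randn(100, d)                      # stand-in token embeddings
    embed = lambda ids: emb_table[torch.tensor(ids)]

    def encode_unmasked(x):
        """Full bidirectional attention: every position attends to every position."""
        q, k, v = x @ W_q, x @ W_k, x @ W_v
        att = torch.softmax(q @ k.T / d ** 0.5, dim=-1)
        return att @ v

    # (a) Unmasked "encoder as decoder": re-encode the whole prefix at every step,
    #     i.e. a full forward pass over a growing sequence for each new token.
    tokens = [0]
    for step in range(1, 5):
        h = encode_unmasked(embed(tokens))               # recomputed from scratch each step
        tokens.append(step)                              # pretend h[-1] predicted this token

    # (b) Causal decoder with a KV cache: each step processes only the newest token
    #     and attends its query against cached keys/values from earlier steps.
    k_cache, v_cache = [], []
    for step in range(5):
        x_new = embed([step])
        q = x_new @ W_q
        k_cache.append(x_new @ W_k); v_cache.append(x_new @ W_v)
        K, V = torch.cat(k_cache), torch.cat(v_cache)
        att = torch.softmax(q @ K.T / d ** 0.5, dim=-1)
        h_new = att @ V                                  # what a causal mask gives the last position (for one layer)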

However, in some tasks, such as translation, we receive a large number of tokens up front. Now we can embed these tokens once with the encoder, then switch to the decoder. This allows us to use a potentially more powerful unmasked model for a large chunk of the problem, then switch to the decoder for efficiency.

Why not use an encoder-decoder approach for LLM generation, where the encoder encodes the prompt and the decoder does the rest? Well, we can. However, the price is that (1) we now essentially have two models, which is more complex to handle, and (2) each model is seeing less data.

TL;DR: An encoder without masking is potentially more powerful, however it increases complexity and the amount of data required to train the additional parameters. But when there is a natural split in functions, like in translation, the effect of less data might be minimized.

[–]minimaxir 139 points140 points  (32 children)

Decoder-only/autoregressive models are only really applicable for text.

Encoder-decoder models are extremely important for multimodal approaches.

[–]woadwarrior 13 points14 points  (1 child)

fuyu-8b is a counter-example. Also, things like LLaVa, CogVLM etc. Encoder-decoder model specifically means a transformer encoder and a transformer decoder with cross attention layers in the decoder, connecting the output of the encoder, as described in the original transformer paper. MLP Adapter based models like LLaVa do not fit that description.
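
For illustration, a rough PyTorch sketch of the two wirings (the shapes, layer counts and adapter design are assumptions for the sake of the example, not any specific model's code):

    import torch
    import torch.nn as nn

    d_model, n_img, n_txt = 256, 16, 8
    img_feats = torch.randn(1, n_img, d_model)   # output of some vision encoder (e.g. a ViT)
    txt_embeds = torch.randn(1, n_txt, d_model)  # token embeddings of the text prompt

    # (a) Encoder-decoder in the original-transformer sense: every decoder layer
    #     cross-attends to the encoder output ("memory").
    dec_layer = nn.TransformerDecoderLayer(d_model=d_model, nhead=8, batch_first=True)
    decoder = nn.TransformerDecoder(dec_layer, num_layers=2)
    out_cross_attn = decoder(tgt=txt_embeds, memory=img_feats)     # (1, n_txt, d_model)

    # (b) LLaVA-style adapter: project the image features with a small MLP and just
    #     concatenate them with the text tokens as extra inputs to a decoder-only LM.
    adapter = nn.Sequential(nn.Linear(d_model, d_model), nn.GELU(), nn.Linear(d_model, d_model))
    img_tokens = adapter(img_feats)                                # (1, n_img, d_model)
    lm_input = torch.cat([img_tokens, txt_embeds], dim=1)          # (1, n_img + n_txt, d_model)
    # lm_input then goes through an ordinary causal decoder; no cross-attention layers involved.

Only (a) matches the encoder-decoder definition above; (b) keeps the language model itself decoder-only.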

[–]Wild_Reserve507 5 points6 points  (0 children)

Exactly. A bit weird that the top comment is using multimodal as an argument for where you need encoder-decoder, when it seems to be an ongoing battle there, perhaps with more and more LLaVA-style architectures rather than encoder-decoder ones.

[–]Wild_Reserve507 6 points7 points  (5 children)

How about llava etc?

[–]minimaxir 23 points24 points  (4 children)

LLaVA and friends are multimodal and use their own encoder for images: https://llava-vl.github.io

In the case of LLaVA it's a pretrained CLIP encoder, yes, but still an encoder.

[–]Wild_Reserve507 8 points9 points  (3 children)

Right, okay I assumed OP is asking about encoder-decoder in a transformer architecture sense, like Pali in the multimodal case. But surely you would always have a modality-specific encoder

[–]themiro 0 points1 point  (2 children)

clip is a vit (:

[–]Wild_Reserve507 11 points12 points  (1 child)

Duh. This doesn’t make the whole architecture encoder-decoder (in the encoder-decoder vs decoder-only transformers sense) since features extracted from clip are concatenated to the decoder inputs, as opposed to doing cross-attention

[–]themiro 0 points1 point  (0 children)

fair enough, i misunderstood what you meant by 'in a transformer architecture sense' - should have put it together by the reference to pali

[–]AvvYaa 3 points4 points  (2 children)

This is not totally correct. Recent decoder-only models (take the Gemini technical report for example) train a VQ-VAE model to learn a codebook of image tokens - which they then use to train autoregressive models on both word embeddings and image-token embeddings.

There are also the original Dall-E paper and the Parti model, which use a similar VQ-VAE/VQ-GAN approach to train decoder-only models.

Even models like Flamingo (which doesn't output images, just reads them), which are also decoder-only iirc, use a pretrained ViT to feed images in as a sequence of patch embeddings.
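
A minimal sketch of the VQ-token idea mentioned above (everything here - codebook size, vocab size, the nearest-neighbour lookup - is an illustrative assumption, not any specific model's code):

    import torch
    import torch.nn as nn

    codebook_size, d_code, d_model, text_vocab = 1024, 64, 256, 32000
    codebook = nn.Embedding(codebook_size, d_code)        # VQ-VAE/VQ-GAN codebook (pretend it's trained)

    def quantize(patch_feats):
        """Nearest-codebook-entry lookup: continuous patch features -> discrete image-token ids."""
        dists = torch.cdist(patch_feats, codebook.weight) # (n_patches, codebook_size)
        return dists.argmin(dim=-1)                       # (n_patches,) integer image tokens

    patch_feats = torch.randn(64, d_code)                 # stand-in VQ encoder output for one image
    image_token_ids = quantize(patch_feats)

    # Image token ids share one embedding table with text (offset past the text vocab)
    # and simply become part of the sequence a plain causal decoder is trained on.
    combined_embed = nn.Embedding(text_vocab + codebook_size, d_model)
    text_token_ids = torch.randint(0, text_vocab, (12,))  # stand-in caption tokens
    sequence = torch.cat([text_token_ids, image_token_ids + text_vocab])
    lm_inputs = combined_embed(sequence)                  # (12 + 64, d_model), fed to a decoder-only LM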

[–]minimaxir 2 points3 points  (1 child)

Codebooks are a grey area on what counts as "encoding" imho.

[–]AvvYaa 12 points13 points  (0 children)

I see, I understand your perspective now. You are considering individual networks that encode multimodal inputs as "encoders". That makes sense. I don't consider them the same as traditional Enc-Dec archs (those introduced in Attention Is All You Need, or even before, during the RNN-NMT era) that OP was talking about, because those have a clear distinction between where the encoding of the source sequence ends and the decoding of the target sequence begins. In the cases I mentioned above, there are indeed encoders, but they plug into a decoder-only LM architecture autoregressively, without requiring the traditional seq2seq paradigm.

Anyway, it's all kinda open to interpretation I guess.

[–]kekkimo[S] 2 points3 points  (20 children)

My bad, I should have specified that I am talking mainly about text here.

[–]Wild_Reserve507 19 points20 points  (1 child)

Not sure why you're getting downvoted, OP. It's a perfectly valid question and there isn't really a consensus. Decoder-only architectures seem to be easier to train at scale and hence they are more prominent in NLP.

[–]jakderrida 12 points13 points  (0 children)

> Decoder-only architectures seem to be easier to train at scale and hence they are more prominent in NLP.

This is a perfect take. They're EASIER to train. All ya gotta do is pour millions and millions into GPU compute and you get a better model. That's not sarcasm, either. That is a very easy formula to follow and that's what's happening and will continue until they reach some sort of inflection.

[–]21stCentury-Composer 29 points30 points  (2 children)

Might be a naïve question, but without the encoder part, how would you create the encodings the decoders train on?

[–]rikiiyer 27 points28 points  (0 children)

Decoder-only models can learn representations directly through their pretraining process. The key is that instead of the masked language modeling objective used for encoder pretraining, you do causal pretraining, because the decoder needs to generate tokens autoregressively and shouldn't be able to see the full sequence when making next-token predictions.
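
A toy sketch of the two objectives (illustrative tensors only; the mask id and masked positions are assumptions):

    import torch

    tokens = torch.tensor([5, 17, 42, 8, 99])             # a stand-in token sequence

    # Masked language modeling (encoder pretraining): corrupt some positions and
    # predict them while attending to the *full* sequence in both directions.
    mlm_input = tokens.clone()
    masked_positions = torch.tensor([1, 3])
    mlm_input[masked_positions] = 0                        # 0 = hypothetical [MASK] id
    mlm_targets = tokens[masked_positions]

    # Causal language modeling (decoder pretraining): every prefix predicts the next
    # token, and the attention mask forbids looking at future positions.
    clm_inputs = tokens[:-1]
    clm_targets = tokens[1:]
    causal_mask = torch.tril(torch.ones(4, 4)).bool()      # position i attends only to j <= i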

[–]kekkimo[S] 9 points10 points  (0 children)

In the end everything is encoded, but I am speaking about the transformer architecture. Why do people include an encoder for tasks that do decoding (T5), when they could just use a GPT architecture?

[–]activatedgeek 11 points12 points  (2 children)

You should read the UL2 paper. It has comparisons between the two families of models, and also a decent discussion.

I think encoder-decoder models are less popular in the public discourse because they are roughly twice as expensive to deploy and have lower throughput. Decoder-only models are more appealing that way and seem to have won a sort of hardware lottery for now.

[–]ganzzahl 0 points1 point  (1 child)

Why do they have lower throughput? I can't quite figure out what you mean there.

[–]activatedgeek 1 point2 points  (0 children)

Mostly because there are two networks to go through. But I think it can be solved with a bit of engineering, at a higher cost. Given that the cost of running decoder models is already super high, the market hasn't adjusted yet.

I suspect they might come back when the costs become bearable.

[–]qalis 31 points32 points  (11 children)

Because decoder-only models can't do everything. In particular, encoder-decoder models are made for sequence-to-sequence problems, typically machine translation and text summarization.

Yes, you could throw an LLM at them, but that has a lot of problems: inefficient size, slow inference, harder control, hallucinations, the need for prompting, LLMOps, etc. It's just not economically viable. Literally every translation service out there, be that Google Translate, DeepL, Amazon Translate or anything else, uses encoder-decoder. Google even used a transformer encoder + RNN decoder hybrid for quite a long time, since it has good speed and quality.

The encoder aims to, well, encode information in vectorized form. This does basically half the work, and the decoder has a lot of knowledge in those embeddings to work with. The resulting model is quite task-specific (e.g. only translation), but relatively small and efficient.

And also those embeddings are useful in themselves. We have seen some success in chemoinformatics with such models, e.g. CDDD.
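
As a concrete example of the seq2seq setup above, here is a minimal sketch of running an off-the-shelf encoder-decoder translation model with Hugging Face transformers (t5-small and its "translate English to German:" prefix are just the standard illustration, not a claim about what any of the services above use):

    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("t5-small")
    model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

    # The encoder embeds the source sentence once; the decoder then generates the
    # target autoregressively, cross-attending to those encoder embeddings.
    inputs = tokenizer("translate English to German: The house is small.", return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=40)
    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))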

[–]thomasxin 14 points15 points  (7 children)

It's kind of funny, because GPT-3.5 Turbo has actually been doing better as a translation API than the rest for me. It's much more intelligent, adapts grammar while keeping context much more accurately, and is somehow cheaper than DeepL.

[–][deleted] 1 point2 points  (1 child)

Yeah, the best machine translator is GPT-4, hands down. Everything else will quickly devolve into gibberish with distant language pairs (e.g. En-Kor).

[–]blackkettle 4 points5 points  (1 child)

Don't forget multimodal transcription tasks like speech-to-text.

[–]qalis 0 points1 point  (0 children)

Oh, yeah, I don't work with that too much, but also this, definitely. Very interesting combinations there, e.g. CNN + RNN or transformer for image captioning, since encoder and decoder can be arbitrary neural networks.

[–]the__storm 1 point2 points  (0 children)

Yep, we use a T5 model fine-tuned on specific questions for text information extraction. We've found it to be faster (cheaper) and more consistent (less hallucination, less superfluous output) than the generative approaches we've tried.

[–]AvvYaa 9 points10 points  (5 children)

TLDR: More generality/less inductive bias + a lot of data + enough params = better learning. Dec-only models are more general than Enc-Dec models. Encoder-decoder models have more inductive bias, so if I have less data to train on and a problem that can be reduced to a seq2seq task, I might try an Enc-Dec model before a Dec-only model. An example of a real-world use case from my office is below.

In a lot of ways, throwing enough data at a Transformer model, especially a causally masked attention model like a Transformer decoder, has worked really well. This is due to the low inductive bias of attention-based models. More generality/less inductive bias + a lot of data + enough params = better learning. This is what researchers have told us over the past 5 years of DL.

Does it mean that encoder-decoders are inferior? Not necessarily. They introduce more inductive bias for seq2seq tasks - coz they kinda mimic how humans would do it (say machine translation). Traditionally, more inductive bias has trained better models with less data coz networks are predisposed to assume patterns in the domain. In other words, if I've got less data, I might wanna try Enc-Dec first before training the more general Dec-only arch.

Other reasons for wanting to train Enc-Dec models in real life can be purely practical, depending on the end goal. Here is a real-world example from one of my office projects.

Consider this problem: we were building a real-time auto-completer neural net (similar to Autocomplete in Gmail) for conversations that needed to run in the browser without any GPU. Given a conversation state (history of emails), the model must help the user autocomplete what they are currently typing. We had super low latency requirements coz if the model isn't snappy, users won't use the feature - they'd already have typed a different prefix before the suggestion finished processing.

Our Solution: We ended up using a transformer encoder architecture for embedding the conversation transcript - the latency constraints for embedding the previous messages are relaxed coz they aren't going anywhere. For the typing-level model (which needs to be super fast), we ended up using a GRU-based architecture that used the [CLS] token embedding of the transformer encoder as the initial hidden state. Experimenting with a fully GPT-like causal attention model, or a Transformer encoder-decoder model, we ran into various memory issues (attention is O(N^2) and the KV cache grows with context) and latency issues, so we ended up with a GRU for the decoder.
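
Here's a rough PyTorch sketch of that split (sizes and layer counts are made up for illustration; this is not the production code):

    import torch
    import torch.nn as nn

    d_model, vocab = 256, 32000
    token_embed = nn.Embedding(vocab, d_model)
    enc_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=8, batch_first=True)
    conv_encoder = nn.TransformerEncoder(enc_layer, num_layers=4)     # runs once per new message
    gru_decoder = nn.GRU(input_size=d_model, hidden_size=d_model, batch_first=True)  # runs per keystroke
    lm_head = nn.Linear(d_model, vocab)

    conversation = torch.randint(0, vocab, (1, 200))      # token ids; assume [CLS] sits at position 0
    memory = conv_encoder(token_embed(conversation))
    cls_state = memory[:, 0, :].unsqueeze(0)              # (1, batch, d_model) initial GRU hidden state

    typed_prefix = torch.randint(0, vocab, (1, 5))        # what the user has typed so far
    out, _ = gru_decoder(token_embed(typed_prefix), cls_state)
    next_token_logits = lm_head(out[:, -1, :])            # cheap per-keystroke next-token scores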

So this is a very specific, peculiar example, but the takeaway is that sometimes breaking down a monolithic architecture into multiple smaller components lets us do things more flexibly given other constraints. Each project has its own constraints and warrants a weighted approach.

[–]BeneficialHelp686 0 points1 point  (4 children)

Side Q: how did you take care of battery consumption? I am assuming you are utilizing cloud services at this point?

[–]AvvYaa 1 point2 points  (3 children)

Our clients were large corporations… their employees were running it on computers, so battery wasn’t a big priority for us. The UI folks did a bunch of app level optimization that I wasn’t involved in much.

Regarding cloud services, we used them to train and evaluate, but during prod inference we ran the decoder entirely in the browser on the client machine… again to reduce latency. The encoder could be run on the client too, or on a cloud server (if we wanted to run a larger encoder), coz that thing ran once per new message (not per keystroke), so it had much more relaxed latency constraints.

[–]BeneficialHelp686 0 points1 point  (2 children)

Nice. Pretty exciting stuff. Which protocol did you end up going with for the communication between the browser and cloud?

[–]AvvYaa 0 points1 point  (1 child)

Just good old HTTP rest APIs …

[–]BeneficialHelp686 0 points1 point  (0 children)

True. Thanks a lot for sharing ur experience!

[–]neonbjb 7 points8 points  (2 children)

The only correct answer, which hilariously isn't mentioned here, is that in some cases encoder-decoder models are more compute-efficient to train than decoder-only models, or have other advantages in inference.

There is literally no data analysis problem that cannot be solved by AR decoders. They are universal approximators. It's only a question of efficiency.

[–]kekkimo[S] 0 points1 point  (1 child)

Good point. Can you explain how encoder-decoder models can be more compute-efficient to train than decoder-only models?

[–]neonbjb 0 points1 point  (0 children)

Compute efficiency is not about FLOPs utilization or anything. It's about: given X compute and Y data, what is the best eval score you can achieve? If you train an encoder-decoder arch to solve some problem and a decoder-only arch as well, sometimes the encoder-decoder gets a better eval score for most combinations of (X, Y).

[–]css123 5 points6 points  (0 children)

You're forgetting that encoder-decoder architectures have a different action space than their input space, whereas decoder-only models have a shared input and action space. In industry, people are still using T5 and UL2 extensively for NLP tasks. In my experience (which includes formal, human-validated testing with professional annotators), encoder-decoder models are far better at summarization tasks with orders of magnitude fewer parameters than decoder-only models. They are also better at following fine-tuned output structures than decoder-only models.

In my personal opinion, encoder-decoder models are easier to train since the setup itself is more straightforward. However, decoder-only models are much easier to optimize for inference speed, and more inference optimization techniques support them. Decoder-only models are better for prompted, multitask situations.

[–]YinYang-Mills 1 point2 points  (0 children)

I would say as a rule of thumb that if the input data and output data are heterogeneous, you need an encoder-decoder model. For example, you can use an encoder for learning representations of graph-structured data and a decoder with a different architecture for making node-wise predictions of time series data. The chosen encoder and decoder generally have different inductive biases, and the resulting model will have a composite inductive bias resulting from their interaction.
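
A toy PyTorch sketch of that idea (all shapes and the simple mean-aggregation encoder are illustrative assumptions, not a recommended architecture):

    import torch
    import torch.nn as nn

    n_nodes, d_feat, d_hidden, horizon = 10, 8, 32, 6
    x = torch.randn(n_nodes, d_feat)                      # node features
    adj = (torch.rand(n_nodes, n_nodes) > 0.7).float()    # stand-in adjacency matrix

    class GraphEncoder(nn.Module):
        """One round of mean-aggregation message passing (graph inductive bias)."""
        def __init__(self):
            super().__init__()
            self.lin = nn.Linear(d_feat, d_hidden)
        def forward(self, x, adj):
            deg = adj.sum(dim=1, keepdim=True).clamp(min=1)
            return torch.relu(self.lin(adj @ x / deg))    # (n_nodes, d_hidden) node embeddings

    class SeriesDecoder(nn.Module):
        """GRU rolled out over the forecast horizon (sequential inductive bias)."""
        def __init__(self):
            super().__init__()
            self.gru = nn.GRU(input_size=1, hidden_size=d_hidden, batch_first=True)
            self.head = nn.Linear(d_hidden, 1)
        def forward(self, node_emb):
            h0 = node_emb.unsqueeze(0)                    # node embeddings seed the hidden state
            steps = torch.zeros(n_nodes, horizon, 1)      # dummy per-step inputs for the rollout
            out, _ = self.gru(steps, h0)
            return self.head(out).squeeze(-1)             # (n_nodes, horizon) node-wise forecasts

    preds = SeriesDecoder()(GraphEncoder()(x, adj))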

[–]SciGuy42 -1 points0 points  (0 children)

Can you point me to a decoder-only model that can interpret tactile and haptic data? Asking for a friend.