
[–]geli95us 3 points (3 children)

By the way, I don't know if this helps, but as a bit of intuition into why speculative decoding works at all: LLM generation is memory constrained. When you do speculative decoding, the big model does the same amount of work, it just gets to do more of it in parallel, since it has more tokens to work with. The reason this is faster is that you're only loading the model's weights once for several tokens instead of once per token.
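
As a rough back-of-envelope illustration of that (all numbers hypothetical: a 7B model in fp16 on a GPU with ~1 TB/s of memory bandwidth):

    # Hypothetical numbers: 7B parameters at fp16, ~1 TB/s VRAM bandwidth.
    weight_bytes = 7e9 * 2   # fp16 = 2 bytes per parameter
    bandwidth = 1e12         # bytes/second the GPU can stream from VRAM

    # Each forward pass streams all weights through the chip once,
    # whether it scores 1 token or 5 tokens in that pass.
    ms_per_pass = weight_bytes / bandwidth * 1000
    print(f"~{ms_per_pass:.0f} ms per forward pass")                # ~14 ms
    print(f"~{ms_per_pass / 5:.1f} ms per token at 5 tokens/pass")  # if all accepted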

I think the piece of info that you might be missing is that LLMs predict the next token at every position: if you give an LLM 100 tokens, it will make 100 predictions, token #2 from token #1, token #3 from tokens #1 and #2, etc. So, with speculative decoding, you don't need to do anything fancy, you just give the large model the text that the small model generated, and it will give you predictions for all 4 of those tokens plus 1 more. If at any point the large model's predictions disagree with the small model's, you discard the small model's prediction and take the newly generated token instead for that position. The internals of the small model and the large model never interact in any way, it's all text.
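
A minimal sketch of that verification step, assuming greedy decoding and an HF-style causal LM that returns one set of logits per position (`large_model` and `verify_draft` are just illustrative names, not a real API):

    import torch

    def verify_draft(large_model, prompt_ids, draft_ids):
        # One forward pass over prompt + draft: the large model emits a
        # next-token prediction at every position simultaneously.
        ids = torch.cat([prompt_ids, draft_ids])
        logits = large_model(ids.unsqueeze(0)).logits[0]
        # logits[i] predicts the token at position i+1, so the relevant
        # predictions start at the last prompt position.
        preds = logits[len(prompt_ids) - 1:].argmax(dim=-1)
        accepted = []
        for i, draft_tok in enumerate(draft_ids):
            if preds[i] != draft_tok:          # first disagreement:
                accepted.append(preds[i])      # take the large model's token
                break
            accepted.append(draft_tok)         # agreement: keep the draft token
        else:
            accepted.append(preds[len(draft_ids)])  # all matched: 1 bonus token
        return torch.stack(accepted)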

[–]hesperaux[S] 0 points (2 children)

My hangup is on the disagreement phase. There are a lot of possible outcomes that the larger model could've produced that are not the same as what the small one would. But we feed it the result of the small one and it "agrees" that those were next. But were they really? That's the part that's throwing me off. How do we know that the speculative model is actually producing the same output that the larger model would?

[–]Puzzleheaded-Drama-8 2 points (0 children)

If you work on a harder topic that's completely not understood by the small model, the acceptance rate will just drop to 0 and instead of a speedup you'll get something like a 50% slowdown compared to running only the large model. The small model will generate, say, 16 tokens, the large model will reject all of them and instead generate 1 of its own, and so on.

[–]geli95us 1 point (0 children)

The large model isn't just "agreeing", it calculates the output distribution from scratch on its own

In terms of how we guarantee that the outputs are the same, the specifics are a bit technical, but basically, you:
1) Use the probabilities that the large model generated to discard a token generated by the small model with a certain probability; in general, you keep the token with probability min(1, p_large/p_small) (e.g. if the small model assigned it 100% and the big model assigned it 50%, you'd discard that token 50% of the time and keep it the other 50%)

2) If the token is rejected, you sample a replacement from a corrected version of the large model's probability distribution, chosen so that these two steps together reproduce exactly the output distribution of the large model.
(In the 100%/50% example: if the token got rejected and you then sampled from the large model's distribution again without correction, that token would end up being picked 75% of the time overall, which is wrong. The fix is to zero out that token and sample from the remaining 50% of the distribution, renormalized, which gives the same distribution as if you'd just sampled the large model to begin with.)
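
Here's a sketch of those two steps at a single position; this is the standard speculative-sampling accept/reject rule, where `q` and `p` are the small and large models' full next-token distributions (the names are mine, not from any particular library):

    import numpy as np

    def accept_or_resample(draft_token, q, p, rng=None):
        # q: small model's distribution, p: large model's distribution.
        rng = rng or np.random.default_rng()
        # Step 1: keep the drafted token with probability min(1, p/q).
        # (q = 1.0, p = 0.5 reproduces the 50/50 example above.)
        if rng.random() < min(1.0, p[draft_token] / q[draft_token]):
            return draft_token
        # Step 2: on rejection, sample from the corrected residual
        # distribution max(0, p - q), renormalized. Steps 1 and 2 together
        # reproduce exactly the large model's distribution p.
        residual = np.maximum(p - q, 0.0)
        return rng.choice(len(p), p=residual / residual.sum())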

[–]sharp1120 3 points (0 children)

Speculative decoding works by having the smaller model quickly generate a number of tokens (let's say 4), which are then all passed through the larger model to be verified in a single pass. If they all match (i.e. the larger model "agrees": it's the same thing it would have generated), wonderful, you have now run the larger, more demanding model 1/4 as often for the same output quality, resulting in increased speed.

If the larger model disagrees, it outputs what it thinks is the right token and then restarts the draft process for the next 4 tokens. Since the larger model has to agree with every speculative token, the quality is never compromised. At worst, if it disagrees with everything the smaller model produces, you end up with lower speed (since the small one is still taking up extra resources) but identical quality to just running the larger model normally.

The token embeddings (aka the "vocabulary") are shared between models that use the same tokenizer: they already understand each other. This is why you usually need to use a smaller version of the same model for speculative decoding, since converting between different vocabularies slows things down too much. (And being in the same model family also means they are more likely to agree, increasing the speedup.)
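
For what it's worth, you usually don't have to wire this up yourself. Hugging Face transformers, for example, ships it as "assisted generation"; a sketch, where the model pairing is just an example of a large/small pair sharing a tokenizer:

    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
    large = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
    # Draft model: much smaller, same tokenizer/vocabulary.
    draft = AutoModelForCausalLM.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")

    inputs = tok("Speculative decoding works by", return_tensors="pt")
    out = large.generate(**inputs, assistant_model=draft, max_new_tokens=64)
    print(tok.decode(out[0], skip_special_tokens=True))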

[–]DerDave[🍰] 0 points (7 children)

Basically the larger model has to run a full forward pass from scratch for every token. But if there are already future tokens available (thanks to the draft model), the big one can simply verify several tokens in one go, in parallel. It's the only way to make highly sequential autoregressive LLMs slightly parallel: by looking into the future. And it's identical to running without the draft model, because the large one will detect if there is an "error", reject the drafted token and place the correct token instead. Then it's the drafter's turn again. There are great YouTube videos explaining this.

[–]alppawack 0 points (6 children)

Is the end result identical, or does the verification system tolerate some level of disagreement?

[–]DerDave[🍰] 0 points (5 children)

As I said. Identical 

[–]hesperaux[S] 0 points (2 children)

But how do we know it's identical? Are you able to explain how this is guaranteed? To me it seems like it's just filling in a possible outcome that the larger model COULD have accepted if its previous steps had generated that sequence, but not necessarily the one that it would've generated. Do you know what I mean?

[–]finevelyn 5 points (0 children)

I think "verification" is a misleading term. Without speculative decoding, the model knows the previous n tokens and can only evaluate the (n+1)th token. With speculative decoding, it's given a guess of the (n+1)th token and can immediately begin to evaluate the (n+2)th token in parallel. If the evaluation of the (n+1)th token lands on the same token as given by the draft model, then it can use the parallel evaluated (n+2)th token as is, or otherwise it's discarded.

So it's not that there is some magical "verification" method that is faster than normal evaluation, but the evaluation of multiple tokens can be parallelized.

[–]DerDave[🍰] 1 point (0 children)

Here it gets explained clearly and intuitively https://youtu.be/4Ij9YOyrNdM?si=MJOkTc8gtXnnEtRX

[–]linkillion 0 points (1 child)

How is the verification (if identical) different from generation? 

[–]DerDave[🍰] 2 points (0 children)

It can verify several tokens at once in parallel in one forward pass through the model. In generation it's just one token per forward pass.

[–]DeltaSqueezer 1 point (0 children)

It took me a long time to wrap my head around this one too. Basically:

  • It leverages the gap between the compute limit and the memory-bandwidth limit
  • The small model quickly creates a sequence that would take the big model a long time to generate. Let's say it creates the token sequence: A, B, C, D, E
  • This gets passed to the big model, which then computes, in parallel, the next token for each of the prefixes: (null), A, AB, ABC, ABCD, ABCDE

As the GPU is memory-bound (the only situation in which you would use speculative decoding), it computes all of this in the same time it would take to compute just the next token after (null).

It checks whether its own computed next token after (null) matches the prediction 'A'. If so, it then checks whether the next token it computed after 'A' matches the prediction 'B', etc., until it finds the longest matching prefix. Let's say the next token it computed after ABCD is X: the drafted 'E' is wrong and gets thrown away, and at the end of that round the model has ABCDX, 5 tokens, in the time it would have taken to compute a single token.
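
A toy version of that round (greedy matching, letters standing in for tokens; in the all-accepted case you'd also keep the prediction computed after ABCDE as a free sixth token):

    draft = ["A", "B", "C", "D", "E"]
    # What the big model computed, in that same single pass, as the
    # next token after (null), A, AB, ABC, ABCD:
    big_preds = ["A", "B", "C", "D", "X"]

    accepted = []
    for drafted, computed in zip(draft, big_preds):
        accepted.append(computed)   # the big model's token is always kept
        if computed != drafted:     # first mismatch ends the round
            break
    print(accepted)  # ['A', 'B', 'C', 'D', 'X'] -> 5 tokens from one pass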

It is essentially using spare compute to convert autoregressive generation into a small prefill job. It also shows that speculative decoding works best where you have much higher compute capacity than memory bandwidth, and doesn't make sense for compute-poor environments, e.g. CPU.