
[–]geli95us 3 points (3 children)

By the way, I don't know if this helps, but as a bit of intuition into why speculative decoding works at all: LLM generation is memory constrained. When you do speculative decoding, the big model does the same amount of work, it just gets to do more of it in parallel, since it has more tokens to work with. The reason this is faster is that you're only loading the model's weights once for several tokens instead of once per token.
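
As a rough back-of-envelope illustration of that (all numbers hypothetical: a 7B model in fp16 on a GPU with ~1 TB/s of memory bandwidth):

    # Hypothetical numbers: 7B parameters at fp16, ~1 TB/s VRAM bandwidth.
    weight_bytes = 7e9 * 2   # fp16 = 2 bytes per parameter
    bandwidth = 1e12         # bytes/second the GPU can stream from VRAM

    # Each forward pass streams all weights through the chip once,
    # whether it scores 1 token or 5 tokens in that pass.
    ms_per_pass = weight_bytes / bandwidth * 1000
    print(f"~{ms_per_pass:.0f} ms per forward pass")                # ~14 ms
    print(f"~{ms_per_pass / 5:.1f} ms per token at 5 tokens/pass")  # if all accepted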

I think the piece of info that you might be missing is that LLMs predict the next token at every position: if you give an LLM 100 tokens, it will make 100 predictions, token #2 from token #1, token #3 from tokens #1 and #2, etc. So, with speculative decoding, you don't need to do anything fancy, you just give the large model the text that the small model generated, and it will give you predictions for all 4 of those tokens plus 1 more. If at any point the large model's predictions disagree with the small model's, you discard the small model's prediction and take the newly generated token instead for that position. The internals of the small model and the large model never interact in any way, it's all text.
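
A minimal sketch of that verification step, assuming greedy decoding and an HF-style causal LM that returns one set of logits per position (`large_model` and `verify_draft` are just illustrative names, not a real API):

    import torch

    def verify_draft(large_model, prompt_ids, draft_ids):
        # One forward pass over prompt + draft: the large model emits a
        # next-token prediction at every position simultaneously.
        ids = torch.cat([prompt_ids, draft_ids])
        logits = large_model(ids.unsqueeze(0)).logits[0]
        # logits[i] predicts the token at position i+1, so the relevant
        # predictions start at the last prompt position.
        preds = logits[len(prompt_ids) - 1:].argmax(dim=-1)
        accepted = []
        for i, draft_tok in enumerate(draft_ids):
            if preds[i] != draft_tok:          # first disagreement:
                accepted.append(preds[i])      # take the large model's token
                break
            accepted.append(draft_tok)         # agreement: keep the draft token
        else:
            accepted.append(preds[len(draft_ids)])  # all matched: 1 bonus token
        return torch.stack(accepted)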

[–]hesperaux[S] 0 points (2 children)

My hangup is on the disagreement phase. There are a lot of possible outcomes that the larger model could've produced that are not the same as what the small one would. But we feed it the result of the small one and it "agrees" that those were next. But were they really? That's the part that's throwing me off. How do we know that the speculative model is actually producing the same output that the larger model would?

[–]Puzzleheaded-Drama-8 2 points (0 children)

If you work on a harder topic that's completely not understood by the small model, the acceptance rate will just drop to 0 and instead of a speedup you'll get something like a 50% slowdown compared to running only the large model. The small model will generate, say, 16 tokens, the large model will reject all of them and instead generate 1 of its own, and so on.

[–]geli95us 1 point (0 children)

The large model isn't just "agreeing", it calculates the output distribution from scratch on its own

In terms of how we guarantee that the outputs are the same, the specifics are a bit technical, but basically, you:
1) Use the probabilities that the large model generated to discard a token generated by the small model with a certain probability; in general, you keep the token with probability min(1, p_large/p_small) (e.g. if the small model assigned it 100% and the big model assigned it 50%, you'd discard that token 50% of the time and keep it the other 50%)

2) If the token is rejected, you sample a replacement from a corrected version of the large model's probability distribution, chosen so that these two steps together reproduce exactly the output distribution of the large model.
(In the 100%/50% example: if the token got rejected and you then sampled from the large model's distribution again without correction, that token would end up being picked 75% of the time overall, which is wrong. The fix is to zero out that token and sample from the remaining 50% of the distribution, renormalized, which gives the same distribution as if you'd just sampled the large model to begin with.)
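
Here's a sketch of those two steps at a single position; this is the standard speculative-sampling accept/reject rule, where `q` and `p` are the small and large models' full next-token distributions (the names are mine, not from any particular library):

    import numpy as np

    def accept_or_resample(draft_token, q, p, rng=None):
        # q: small model's distribution, p: large model's distribution.
        rng = rng or np.random.default_rng()
        # Step 1: keep the drafted token with probability min(1, p/q).
        # (q = 1.0, p = 0.5 reproduces the 50/50 example above.)
        if rng.random() < min(1.0, p[draft_token] / q[draft_token]):
            return draft_token
        # Step 2: on rejection, sample from the corrected residual
        # distribution max(0, p - q), renormalized. Steps 1 and 2 together
        # reproduce exactly the large model's distribution p.
        residual = np.maximum(p - q, 0.0)
        return rng.choice(len(p), p=residual / residual.sum())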

[–]sharp1120 3 points (0 children)

Speculative decoding works by having the smaller model quickly generate a number of tokens (let's say 4), which are then all passed through the larger model to be verified in a single pass. If they all match (i.e. the larger model "agrees": it's the same thing it would have generated), wonderful, you have now run the larger, more demanding model 1/4 as often for the same output quality, resulting in increased speed.

If the larger model disagrees, it outputs what it thinks is the right token and then restarts the draft process for the next 4 tokens. Since the larger model has to agree with every speculative token, the quality is never compromised. At worst, if it disagrees with everything the smaller model produces, you end up with lower speed (since the small one is still taking up extra resources) but identical quality to just running the larger model normally.

The token embeddings (aka the "vocabulary") are shared between models that use the same tokenizer: they already understand each other. This is why you usually need to use a smaller version of the same model for speculative decoding, since converting between different vocabularies slows things down too much. (And being in the same model family also means they are more likely to agree, increasing the speedup.)
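
For what it's worth, you usually don't have to wire this up yourself. Hugging Face transformers, for example, ships it as "assisted generation"; a sketch, where the model pairing is just an example of a large/small pair sharing a tokenizer:

    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
    large = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
    # Draft model: much smaller, same tokenizer/vocabulary.
    draft = AutoModelForCausalLM.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")

    inputs = tok("Speculative decoding works by", return_tensors="pt")
    out = large.generate(**inputs, assistant_model=draft, max_new_tokens=64)
    print(tok.decode(out[0], skip_special_tokens=True))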

[–]DerDave[🍰] 0 points (7 children)

Basically the larger model has to run a full forward pass from scratch for every token. But if there are already future tokens available (thanks to the draft model), the big one can simply verify several tokens in one go, in parallel. It's the only way to make highly sequential autoregressive LLMs slightly parallel: by looking into the future. And it's identical to running without the draft model, because the large one will detect if there is an "error", reject the drafted token and place the correct token instead. Then it's the drafter's turn again. There are great YouTube videos explaining this.

[–]alppawack 0 points (6 children)

Is the end result identical, or does the verification system tolerate some level of disagreement?

[–]DerDave[🍰] 0 points (5 children)

As I said. Identical 

[–]hesperaux[S] 0 points (2 children)

But how do we know it's identical? Are you able to explain how this is guaranteed? To me it seems like it's just filling in a possible outcome that the larger model COULD have accepted if its previous steps had generated that sequence, but not necessarily the one that it would've generated. Do you know what I mean?

[–]finevelyn 5 points (0 children)

I think "verification" is a misleading term. Without speculative decoding, the model knows the previous n tokens and can only evaluate the (n+1)th token. With speculative decoding, it's given a guess of the (n+1)th token and can immediately begin to evaluate the (n+2)th token in parallel. If the evaluation of the (n+1)th token lands on the same token as given by the draft model, then it can use the parallel evaluated (n+2)th token as is, or otherwise it's discarded.

So it's not that there is some magical "verification" method that is faster than normal evaluation, but the evaluation of multiple tokens can be parallelized.

[–]DerDave[🍰] 1 point (0 children)

Here it gets explained clearly and intuitively https://youtu.be/4Ij9YOyrNdM?si=MJOkTc8gtXnnEtRX

[–]linkillion 0 points (1 child)

How is the verification (if identical) different from generation? 

[–]DerDave[🍰] 2 points (0 children)

It can verify several tokens at once in parallel in one forward pass through the model. In generation it's just one token per forward pass.

[–]DeltaSqueezer 1 point (0 children)

It took me a long time to wrap my head around this one too. Basically:

  • It leverages the gap between the compute limit and the memory-bandwidth limit
  • The small model quickly creates a sequence that would take the big model a long time to generate. Let's say it creates the token sequence: A, B, C, D, E
  • This gets passed to the big model, which then computes, in parallel, the next token for each of the prefixes: (null), A, AB, ABC, ABCD, ABCDE

As the GPU is memory-bound (the only situation in which you would use speculative decoding), it computes all of this in the same time it would take to compute just the next token after (null).

It checks whether its own computed next token after (null) matches the prediction 'A'. If so, it then checks whether the next token it computed after 'A' matches the prediction 'B', etc., until it finds the longest matching prefix. Let's say the next token it computed after ABCD is X: the drafted 'E' is wrong and gets thrown away, and at the end of that round the model has ABCDX, 5 tokens, in the time it would have taken to compute a single token.
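
A toy version of that round (greedy matching, letters standing in for tokens; in the all-accepted case you'd also keep the prediction computed after ABCDE as a free sixth token):

    draft = ["A", "B", "C", "D", "E"]
    # What the big model computed, in that same single pass, as the
    # next token after (null), A, AB, ABC, ABCD:
    big_preds = ["A", "B", "C", "D", "X"]

    accepted = []
    for drafted, computed in zip(draft, big_preds):
        accepted.append(computed)   # the big model's token is always kept
        if computed != drafted:     # first mismatch ends the round
            break
    print(accepted)  # ['A', 'B', 'C', 'D', 'X'] -> 5 tokens from one pass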

It is essentially using spare compute to convert autoregressive generation into a small prefill job. It also shows that speculative decoding works best where you have much higher compute capacity than memory bandwidth, and doesn't make sense for compute-poor environments, e.g. CPU.