Speculative Decoding Is Not a Heuristic

Increase LLM Speed Without Sacrificing Quality

Speculative decoding is a technique which uses a fast draft model to accelerate the inference of a slow target model. With the right verification procedure, speculative decoding can be lossless (i.e. it reproduces the quality of the target model). For token-by-token verification procedures, the criteria for lossless decoding can be written as a system of linear equations and inequalities. The acceptance rate, which directly controls model throughput, is also linear in the same variables. Thus, the optimal verification procedure is obtained by solving a linear program. This linear program is equivalent to an optimal transport problem from the draft distribution to the target distribution.
LLMs
speculative decoding
optimal transport
linear programming
Author

Reed Meyerson

Published

January 21, 2026

The Basic Idea

Speculative decoding[1],[2] is an inference technique for increasing the throughput of an LLM. In its most basic form, there is a large/slow target model and a small/fast draft model. The draft model is used to quickly generate a draft sequence of the next \(N\) tokens1. Then, a single pass of the target model is used to obtain \(N+1\) next-token distributions2. A verification procedure takes the above information into consideration and decides how much of the draft sequence to accept (i.e. determine some \(M\le N\), and keep the first \(M\) draft tokens). Finally, an \((M+1)^{\text{st}}\) token, which we will call the continuation token3, is sampled either directly from the target model’s next-token distribution for that position or from some modified distribution.

1 as well as the \(N\) next-token distributions from the draft model that each draft token was sampled from

2 one for each position in the draft sequence, and one that will be used to sample an additional token at the very end if the entire draft sequence is accepted

3 this is not a standard term, but it will be useful for this token to have a name

4 the accepted tokens and the continuation token

When the verification step accepts at least one token from the draft sequence, we will be generating multiple tokens from a single forward pass of the target model4. Thus, when the cost of running the draft model is low and the average number of tokens accepted per round is high, we can expect a meaningful increase in the overall throughput of the target model.
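To make the shape of one round concrete, here is a minimal Python sketch. The `draft_model`, `target_model`, and `verify` callables are hypothetical interfaces invented for illustration (batching, KV caching, and tokenization are omitted); this is not the API of any particular library or of the original papers.

```python
import numpy as np

def speculative_round(context, draft_model, target_model, N, verify, rng):
    """One round of speculative decoding with hypothetical model interfaces.

    context              : list of token ids seen so far
    draft_model(tokens)  : returns the next-token distribution (length-V array)
    target_model(tokens) : returns the N + 1 next-token distributions for the
                           draft positions, from a single forward pass
    verify(p, q, x, rng) : returns (accepted, token) for one position
    """
    # 1. Autoregressively sample N draft tokens, recording their distributions.
    draft_tokens, draft_dists = [], []
    for _ in range(N):
        p = draft_model(context + draft_tokens)
        draft_tokens.append(int(rng.choice(len(p), p=p)))
        draft_dists.append(p)

    # 2. A single target-model pass yields N + 1 next-token distributions.
    target_dists = target_model(context + draft_tokens)

    # 3. Verify the draft tokens one position at a time.
    accepted = []
    for p, q, x in zip(draft_dists, target_dists, draft_tokens):
        ok, token = verify(p, q, x, rng)
        if not ok:
            return accepted + [token]   # keep M accepted tokens + continuation token
        accepted.append(token)

    # 4. Entire draft accepted: sample the continuation token from the extra
    #    target distribution at the final position.
    q_last = target_dists[N]
    return accepted + [int(rng.choice(len(q_last), p=q_last))]
```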

Is This a Heuristic?

When I first read about speculative decoding, I thought, “What an interesting heuristic”. It was obvious that it could improve the inference-time throughput of a model, but it was not obvious that it could do so without affecting output quality.

Clearly, speculative decoding can be a heuristic. For instance, one idea could be to look at the probability of a draft token under the target distribution and accept it if this probability exceeds some threshold. Take a minute, and I’m sure you can come up with a few variants of ‘accept the tokens that are good enough and reject the rest’. Many of these heuristics will affect the output quality of the model.

Since we have established that speculative decoding can be a heuristic, perhaps a better question is: “Is speculative decoding necessarily a heuristic?”

To answer this, we should define what it means to not be a heuristic. In simple terms, we would like the result of speculative decoding to have the same behavior as the target model. For deterministic functions, ‘same behavior’ is easy to define: the same input always produces identical outputs. However, LLMs are probabilistic models. The target model can generate multiple different outputs from the same input. For a given context, there is some probability of the target model generating one output sequence, and another probability of it generating some other output sequence.

Thus, what we mean by ‘same behavior’ is that speculative decoding should generate the same distribution of outputs. That is, given a particular context, the probability of continuing it with a particular sequence of tokens should be the same for the speculative decoding algorithm and the target model. If this is the case for all valid contexts and continuation sequences, we say that the speculative decoding algorithm is lossless.

Naive Rejection Sampling

Consider the following verification procedure: “always reject the entire draft sequence, and sample the continuation token from the target distribution”. This trivial example clearly offers no improvement to inference speed. However, it is also clearly lossless, so it merits some extra attention5.

5 some say that’s all you need

Consider the case where the first draft token happens to match the continuation token. In this case, we might as well accept the draft token and sample a new continuation token in the second position. Again, if this new continuation token happens to match the second draft token, we might as well accept the first and second draft tokens and sample the continuation token in the third position. We can continue this process until either the draft token does not match the continuation token, or we have accepted all draft tokens. Take some time to convince yourself that this policy is lossless. We will call this modification of the trivial example naive rejection sampling.
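Here is a minimal sketch of the naive rejection sampling step for a single position, matching the hypothetical `verify` interface from the earlier sketch: `p` and `q` are the draft and target next-token distributions, and `x` is the draft token.

```python
import numpy as np

def naive_verify(p, q, x, rng):
    """Naive rejection sampling for a single position.

    Sample a candidate token from the target distribution q, independently of
    the draft token x. Accept x only if the candidate happens to equal x;
    otherwise the candidate is returned as the continuation token.
    """
    candidate = int(rng.choice(len(q), p=q))
    return candidate == x, candidate

# Example: both models uniform over a 100-token vocabulary.
rng = np.random.default_rng(0)
p = q = np.full(100, 0.01)
x = int(rng.choice(100, p=p))
print(naive_verify(p, q, x, rng))   # accepted only about 1% of the time
```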

For naive rejection sampling, the acceptance rate at any given position is the probability that the draft and target distributions will yield the same token when sampled independently. In general, the acceptance rate will be positive6, but potentially small7.

6 unless the draft and target distributions have disjoint support, but this would not be typical

7 unless both models are highly confident about the same single output token

8 i.e. each of the 100 tokens has a probability of 1/100 of being selected

Even in the scenario that the draft and target next-token distributions are identical, the acceptance rate of naive rejection sampling can be quite small. Consider the case where the vocabulary has 100 tokens, and both distributions are the uniform distribution8. The acceptance rate of naive rejection sampling will be 1/100. However, in this situation we could have accepted the draft token with a probability of 1. This is because the draft and target distributions are identical, so the draft token is already sampled from the target distribution.

Of course, we would not expect the draft and target distributions to be identical. However, this does provide hope that we can substantially beat naive rejection sampling when the draft and target distributions are similar.
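A quick numerical check of the comparison above: sampling the draft and target tokens independently makes them coincide with probability \(\sum_i p_iq_i\), while the optimal lossless acceptance rate derived later in this post is \(\beta(p,q)=\sum_i\min\{p_i,q_i\}\). For the uniform 100-token example these come out to 1/100 and 1.

```python
import numpy as np

V = 100
p = np.full(V, 1.0 / V)      # draft next-token distribution (uniform)
q = np.full(V, 1.0 / V)      # target next-token distribution (uniform)

naive_rate = float(np.sum(p * q))         # P(independent samples coincide)
beta = float(np.sum(np.minimum(p, q)))    # optimal lossless acceptance rate

print(naive_rate)   # 0.01 -> 1/100, as in the example above
print(beta)         # 1.0  -> identical distributions can always be accepted
```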

Token-by-Token Verification Procedures

In principle, verification can be any procedure that stochastically accepts some portion of the draft sequence and samples a continuation token after considering the entire sequence of draft tokens, draft next-token distributions, target next-token distributions, and context. For example, “If the draft sequence contains the word ‘apple’, then accept the entire sequence. Otherwise, reject the entire sequence. In either case, sample the continuation token from the uniform distribution.” is a valid verification procedure9.

9 though, obviously not lossless

To simplify things, we restrict our attention to verification procedures which consider only one token at a time, and accept/resample based only on the following information:

  1. The draft next-token distribution in that position
  2. The target next-token distribution in that position
  3. The draft token in that position

Furthermore, when the verification procedure does not accept the entire draft sequence, we assume that the continuation token is distinct from the draft token in that position10.

10 otherwise, we should have accepted that draft token

Any verification procedure with these properties is called a token-by-token verification procedure.

It is worth noting that there are verification procedures which do holistically consider the entire sequence of draft/target distributions and draft tokens[3]. Indeed, these block-level verification procedures outperform token-by-token verification. However, they build on the ideas from the token-by-token literature and are better suited for a future blog post.

To analyze a single step of a token-by-token verification procedure, we need some notation. Let \(\mathcal{V}=\{a_i\}_{i=1}^V\) be the shared vocabulary11 for the draft and target models. For a fixed step of a chosen verification procedure, and any pair of tokens \(a_i,a_j\in\mathcal{V}\), there is some probability of \(a_i\) being the draft token and \(a_j\) being the resulting token. Thus, we define

11 i.e. the set of possible output tokens

\[ \Pi_{ij}=\text{the probability that }a_i\text{ is the draft token and }a_j\text{ is the output token}. \]

Then, acceptance occurs when \(a_i=a_j\), and the probability of acceptance is obtained by summing along the diagonal:

\[ \sum_{i=1}^V\Pi_{ii} \]

Let \(p\) be the next-token distribution for the draft model and \(q\) be the next-token distribution for the target model12. Let \(X\) be the random variable for the draft token (i.e. \(X\sim p\)). Let \(Y\) be the random variable for the result of the verification procedure. Then, the pair of random variables \((X,Y)\) is dependent, and their joint distribution is described by \(\Pi\). Finally, let \(Z\) be a random variable obtained by sampling independently from the target distribution \(q\).

12 for the single, fixed position in question

We define \[ \begin{align*} p_i &= P(X=a_i)\\ q_i &= P(Z=a_i)\\ \end{align*} \]

With the notation above, we can expand \[ \begin{align*} \Pi_{ij}&=P(X=a_i\land Y=a_j)\\ &=P(Y=a_j|X=a_i)P(X=a_i)\\ &=P(Y=a_j|X=a_i)p_i \end{align*} \]

It follows that \(\Pi\) satisfies the following constraints13

13 the sum is just formally stating that the draft token needs to follow the draft distribution

\[ \begin{align*} \sum_{j=1}^V\Pi_{ij}&=p_i\\ \Pi_{ij}&\ge 0 \end{align*} \]

In fact, the converse is also true. Any \(V\times V\) matrix \(\Pi\) satisfying the above constraints can be used to carry out a single step of the token-by-token verification procedure14.

14 once you have \(X=a_i\), select \(Y\) with probabilities \(P(Y=a_j|X=a_i)=\frac{\Pi_{ij}}{p_i}\)
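As a sketch of the procedure in the footnote, the following hypothetical helper carries out one step from a given matrix \(\Pi\): draw the draft token from the row sums, then draw the output token from the corresponding conditional distribution.

```python
import numpy as np

def step_from_coupling(Pi, rng):
    """Carry out one verification step defined by a coupling matrix Pi.

    Pi[i, j] = P(draft token is a_i and output token is a_j), so the row sums
    of Pi recover the draft distribution p, and Pi[i, :] / p[i] is the
    conditional distribution of the output token given draft token a_i.
    """
    p = Pi.sum(axis=1)                       # draft marginal (row sums)
    i = int(rng.choice(len(p), p=p))         # draft token X ~ p
    cond = Pi[i] / p[i]                      # P(Y = a_j | X = a_i)
    j = int(rng.choice(len(cond), p=cond))   # output token Y
    return i, j, i == j                      # (draft, output, accepted?)
```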

The output distribution is obtained by summing along the first index

\[ \begin{align*} P(Y=a_j)&=\sum_{i=1}^VP(Y=a_j|X=a_i)P(X=a_i)\\ &=\sum_{i=1}^V\Pi_{ij} \end{align*} \]

Thus, the step is lossless if and only if \[ \sum_{i=1}^V\Pi_{ij}=q_j \]

An optimal lossless verification step in a token-by-token verification procedure corresponds to finding a matrix \(\Pi\) satisfying the above constraints which maximizes the acceptance rate:

\[ \begin{align*} \textbf{maximize}\\ \sum_{i=1}^V \Pi_{ii}\\ \textbf{subject to}\\ \sum_{j=1}^V\Pi_{ij} &= p_i\\ \sum_{i=1}^V \Pi_{ij} &= q_j\\ \Pi_{ij} &\ge 0 \end{align*} \]

Maximizing the acceptance rate is equivalent to minimizing the rejection rate15: \[ \begin{align*} \textbf{minimize}\\ \sum_{i\neq j} \Pi_{ij}\\ \textbf{subject to}\\ \sum_{j=1}^V\Pi_{ij} &= p_i\\ \sum_{i=1}^V \Pi_{ij} &= q_j\\ \Pi_{ij} &\ge 0 \end{align*} \]

15 the rejection rate is just the sum of the off-diagonal terms of \(\Pi\)

Thus, we have a linear program (a small numerical sketch follows the list below):

  • A utility function, token acceptance rate (or cost function, token rejection rate), which is linear in the coefficients of \(\Pi\)
  • A set of equality and inequality constraints that are linear in the coefficients of \(\Pi\)
  • An objective to maximize the utility function (or minimize the cost function)
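Here is a small sketch that sets this linear program up numerically with `scipy.optimize.linprog`; the helper name and the row-major flattening convention are my own. For a realistic vocabulary of tens of thousands of tokens this dense formulation is impractical, which is one reason the closed-form solution below matters.

```python
import numpy as np
from scipy.optimize import linprog

def optimal_coupling_lp(p, q):
    """Maximize the trace of Pi subject to row sums p, column sums q, Pi >= 0.

    The V*V entries of Pi are flattened row-major into one variable vector.
    """
    V = len(p)
    c = -np.eye(V).flatten()                 # maximize trace <=> minimize -trace

    # Row-sum constraints: sum_j Pi[i, j] = p[i]
    A_rows = np.kron(np.eye(V), np.ones(V))
    # Column-sum constraints: sum_i Pi[i, j] = q[j]
    A_cols = np.kron(np.ones(V), np.eye(V))

    A_eq = np.vstack([A_rows, A_cols])
    b_eq = np.concatenate([p, q])

    res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
    return res.x.reshape(V, V), -res.fun     # (Pi, acceptance rate)
```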

There are a variety of methods for finding exact and approximate solutions to linear programs. However, this one is relatively simple and admits a closed-form solution16:

16 showing that this is, in fact, a solution to the linear program is a good exercise to check that you understand the constraints and objective

\[ \Pi^{\text{optimal}}_{ij} = \begin{cases} \min\{p_i,q_i\}&\text{if }i = j\\ \frac{(p_i-\min\{p_i,q_i\})(q_j-\min\{p_j,q_j\})}{1-\beta(p,q)}&\text{else} \end{cases} \]

where

\[ \beta(p,q)=\sum_{i=1}^V\min\{p_i,q_i\} \]

For this solution, \(\beta(p,q)\) is the acceptance rate, and it is easy to see that \(\beta(p,q)\) approaches 1 as \(p\) approaches \(q\).
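The following sketch implements the closed-form solution above and checks it against the constraints and objective of the linear program (the helper name is hypothetical).

```python
import numpy as np

def optimal_coupling_closed_form(p, q):
    """Build the optimal coupling Pi from the closed-form formula above."""
    m = np.minimum(p, q)          # min{p_i, q_i}
    beta = float(m.sum())         # acceptance rate beta(p, q)

    if beta >= 1.0:               # p == q: every draft token is accepted
        Pi = np.diag(m)
    else:
        # Off-diagonal: leftover draft mass spread over leftover target mass.
        Pi = np.outer(p - m, q - m) / (1.0 - beta)
        np.fill_diagonal(Pi, m)   # diagonal: min{p_i, q_i}
    return Pi, beta

# Quick check against the constraints of the linear program.
rng = np.random.default_rng(0)
p = rng.dirichlet(np.ones(8))            # random draft distribution
q = rng.dirichlet(np.ones(8))            # random target distribution
Pi, beta = optimal_coupling_closed_form(p, q)

assert np.all(Pi >= 0.0)                 # nonnegativity
assert np.allclose(Pi.sum(axis=1), p)    # row sums: draft marginal
assert np.allclose(Pi.sum(axis=0), q)    # column sums: target marginal (lossless)
assert np.isclose(np.trace(Pi), beta)    # acceptance rate sits on the diagonal
```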

Optimal Transport

The matrix \(\Pi\) defines a probability distribution on the product space \(\mathcal{V}\times\mathcal{V}\) 17. This distribution has two marginal distributions. One marginal, obtained by summing on the second index, is always the draft distribution, \(p\). In the lossless case, the other marginal is the target distribution, \(q\). Such a distribution is called a transport plan from \(p\) to \(q\). Intuitively, starting with a distribution \(p\), a transport plan describes how to redistribute the mass from \(p\) to obtain \(q\).

17 recall, it is the probability of getting a pair \((a_i,a_j)\) of draft/output tokens

Given a cost function, \(C(i,j)\) which describes the cost of moving mass from \(i\) to \(j\), the total cost of a transport plan from \(p\) to \(q\) is defined by

\[ \sum_{i,j}C(i,j)\Pi(i,j) \]

We say that a transport plan from \(p\) to \(q\) is optimal18 if it has minimal total transport cost amongst all transport plans from \(p\) to \(q\).

18 with respect to \(C\)

Returning to token-by-token verification, we can let \(C(i,j)=\begin{cases}0&\text{if }i=j\\1&\text{if } i \neq j\end{cases}\). This assigns a cost of zero when accepting a token and a cost of 1 when rejecting a token. With this cost function, the total cost of a transport plan is just the overall probability of rejection.

It is straightforward to see that the linear program described at the end of the previous section is equivalent to the following optimal transport problem: find a transport plan from \(p\) to \(q\) which is optimal with respect to the cost, \(C\), defined above.
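A quick numerical check of this equivalence, rebuilding the closed-form plan from the previous section: with the zero-one cost, the total transport cost of a plan is its off-diagonal mass, which for the optimal plan is \(1-\beta(p,q)\).

```python
import numpy as np

V = 8
rng = np.random.default_rng(0)
p = rng.dirichlet(np.ones(V))            # draft distribution
q = rng.dirichlet(np.ones(V))            # target distribution

# Zero-one cost: free to keep a token (i == j), unit cost to replace it.
C = 1.0 - np.eye(V)

# Closed-form optimal transport plan from the previous section.
m = np.minimum(p, q)
beta = float(m.sum())
Pi = np.outer(p - m, q - m) / (1.0 - beta)
np.fill_diagonal(Pi, m)

# Total transport cost = off-diagonal mass of Pi = rejection rate = 1 - beta.
total_cost = float(np.sum(C * Pi))
assert np.isclose(total_cost, 1.0 - beta)
```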

As stated earlier, this optimal transport problem is relatively simple and admits a closed-form solution. Thus, in the simple setting of single-draft, token-by-token speculative decoding, rephrasing the problem in terms of optimal transport mostly adds a new conceptual perspective. However, in more complex situations, such as multi-draft speculative decoding[4], optimal transport provides a powerful toolbox for solving the problem. Perhaps we can cover that in a future post.

References

1. Leviathan, Y., Kalman, M., and Matias, Y. (2023). Fast inference from transformers via speculative decoding.
2. Chen, C., Borgeaud, S., Irving, G., Lespiau, J.-B., Sifre, L., and Jumper, J. (2023). Accelerating large language model decoding with speculative sampling.
3. Sun, Z., Mendlovic, U., Leviathan, Y., Aharoni, A., Ro, J.H., Beirami, A., and Suresh, A.T. (2025). Block verification accelerates speculative decoding.
4. Sun, Z., Suresh, A.T., Ro, J.H., Beirami, A., Jain, H., and Yu, F. (2024). SpecTr: Fast speculative decoding via optimal transport.