Two Definitions of Perplexity

Perplexity Definition 1

Given a sequence $X = x_1, x_2, \dots, x_T$ of words and a language model $p$, the perplexity of the sequence is defined as follows:

$$ \text{PPL}(X) = p(x_1, x_2, \dots, x_T)^{-\frac{1}{T}} $$

where $p(x_1, x_2, \dots, x_T)$ is the probability of the sequence $X$ under the language model $p$ (also called the likelihood of the sequence).

With language models, the probability of a sequence is often computed as the product of the conditional probabilities of each word given the previous words:

$$ p(x_1, x_2, \dots, x_T) = \prod_{t=1}^T p(x_t|x_1, \dots, x_{t-1}) $$

So the perplexity can be rewritten as:

$$ \text{PPL}(X) = \left(\prod_{t=1}^T p(x_t|x_1, \dots, x_{t-1})\right)^{-\frac{1}{T}} $$
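As a concrete illustration, here is a minimal Python sketch of Definition 1, assuming we already have the per-token conditional probabilities from some language model; the probability values below are made up.

```python
import math

# Hypothetical per-token conditional probabilities p(x_t | x_1, ..., x_{t-1})
# for a 4-token sequence; in practice these come from a language model.
cond_probs = [0.2, 0.5, 0.1, 0.4]

T = len(cond_probs)
seq_prob = math.prod(cond_probs)   # p(x_1, ..., x_T) via the chain rule
ppl_def1 = seq_prob ** (-1.0 / T)  # PPL(X) = p(X)^(-1/T)
print(ppl_def1)                    # ~3.98
```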

Perplexity Definition 2

A more common definition of perplexity (as used by Hugging Face):

$$ PPL(X) = \exp\left(-\frac{1}{T}\sum_{i=1}^T \log p(x_i|x_1, \dots, x_{i-1})\right) $$
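Under the same assumptions (hypothetical per-token probabilities), Definition 2 can be sketched as the exponential of the average negative log-probability:

```python
import math

# Same hypothetical per-token conditional probabilities as in the sketch above.
cond_probs = [0.2, 0.5, 0.1, 0.4]
T = len(cond_probs)

avg_nll = -sum(math.log(p) for p in cond_probs) / T  # average negative log-likelihood
ppl_def2 = math.exp(avg_nll)                          # exp of the average NLL
print(ppl_def2)                                       # ~3.98
```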

The two definitions are equivalent. We can show this by taking the log of the second definition, using the chain rule of probability to turn the product of conditional probabilities into the joint probability, and then exponentiating:

$$ \begin{align} \log \text{PPL}(X) &= \log \exp\left(-\frac{1}{T}\sum_{i=1}^T \log p(x_i|x_1, \dots, x_{i-1})\right) \\ &= -\frac{1}{T}\sum_{i=1}^T \log p(x_i|x_1, \dots, x_{i-1}) \\ &= -\frac{1}{T}\log \prod_{i=1}^T p(x_i|x_1, \dots, x_{i-1}) \\ &= -\frac{1}{T}\log p(x_1, x_2, \dots, x_T) \\ &= \log \frac{1}{\sqrt[T]{p(x_1, x_2, \dots, x_T)}} \\ &= \log p(x_1, x_2, \dots, x_T)^{-\frac{1}{T}} \\ \end{align} $$

Exponentiating both sides, we recover the first definition.
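The equivalence can also be checked numerically; a small self-contained sketch with the same hypothetical probabilities:

```python
import math

cond_probs = [0.2, 0.5, 0.1, 0.4]  # hypothetical p(x_t | x_<t)
T = len(cond_probs)

ppl_def1 = math.prod(cond_probs) ** (-1.0 / T)
ppl_def2 = math.exp(-sum(math.log(p) for p in cond_probs) / T)
assert math.isclose(ppl_def1, ppl_def2)  # the two definitions agree
```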

When to use Definition 1?

From this form, we can see that the perplexity is exactly the inverse of the geometric mean of the conditional probabilities of the words in the sequence under the language model $p$. This makes the first definition quite intuitive: it shows directly that perplexity measures how well the language model assigns probability to the words of the sequence.

For more details about the geometric mean, see the section below.

When to use Definition 2?

The second definition is more common because it can be computed directly from the loss of the language model.

Recall that the language modeling loss is the cross-entropy between the target distribution $q$ and the model distribution $p$, which is exactly the negative log-likelihood of the sequence under the model:

$$ \text{Loss}(X) = \text{CE}(q, p) = -\sum_{i=1}^T \log p(x_i|x_1, \dots, x_{i-1}) $$

where $q$ is the target distribution at each position: it assigns probability 1 to the target token and 0 to all other tokens, so the cross-entropy collapses to the negative log-probability of the target token (and, because $q$ is one-hot, it also equals $KL(q \,\|\, p)$).
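To see why the one-hot target makes the cross-entropy collapse to the negative log-likelihood, here is a tiny sketch with a made-up next-token distribution over a hypothetical 4-word vocabulary:

```python
import math

# Hypothetical model distribution p over a 4-word vocabulary,
# and the one-hot target distribution q for the observed token (index 2).
p = [0.1, 0.2, 0.6, 0.1]
q = [0.0, 0.0, 1.0, 0.0]

# Cross-entropy -sum_j q_j * log p_j keeps only the target term because q is one-hot.
cross_entropy = -sum(qj * math.log(pj) for qj, pj in zip(q, p) if qj > 0)
nll = -math.log(p[2])
assert math.isclose(cross_entropy, nll)
```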

So the perplexity under the second definition is the exponential of the loss divided by the length of the sequence:

$$ \text{PPL}(X) = \exp\left(\frac{\text{Loss}(X)}{T}\right) $$

This is convenient in practice: the loss is already computed during training and validation, so the perplexity can be obtained directly from it at essentially no extra cost.
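As an illustration, here is a minimal PyTorch sketch (a framework assumption, not prescribed by the definitions above) that computes perplexity from a summed cross-entropy loss over hypothetical logits and targets:

```python
import torch
import torch.nn.functional as F

# Hypothetical logits from a language model for a sequence of T = 5 tokens
# over a vocabulary of size 10, plus the corresponding target token ids.
T, vocab_size = 5, 10
logits = torch.randn(T, vocab_size)
targets = torch.randint(vocab_size, (T,))

# Loss(X): negative log-likelihood summed over the sequence.
loss = F.cross_entropy(logits, targets, reduction="sum")

# Perplexity from the second definition: exp(Loss(X) / T).
ppl = torch.exp(loss / T)
print(ppl.item())
```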

Perplexity of a uniform model

Consider a model assigning uniform probabilities to all words in the vocabulary.

The perplexity of any sequence under this model is the size of the vocabulary.

$$ \text{PPL}(x_1, x_2, \dots, x_T) = \left(\left(\frac{1}{|V|}\right)^T\right)^{-\frac{1}{T}} = |V| $$

If the model is a fair coin flip, the perplexity of any sequence is 2 (regardless of the length of the sequence). If the model is a fair die roll, the perplexity of any sequence is 6. If the model is uniform over 26 letters, the perplexity of any sequence is 26.
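A quick sanity check, assuming a uniform model over a hypothetical vocabulary:

```python
import math

V = 26   # uniform model over 26 letters
T = 100  # any sequence length

cond_probs = [1.0 / V] * T                 # every token has probability 1/|V|
ppl = math.prod(cond_probs) ** (-1.0 / T)
print(ppl)                                 # ~26.0, independent of T
```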

Geometric mean of inverse probabilities

Recall that the geometric mean of a sequence $a_1, a_2, \dots, a_n$ is defined as follows:

$$ \text{GM}(a_1, a_2, \dots, a_n) = \sqrt[n]{a_1 a_2 \dots a_n} = \prod_{i=1}^n a_i^{\frac{1}{n}} $$

Applying this to the conditional probabilities of the words in the sequence $X$ under the language model $p$, we get:

$$ \text{GM}(p(X)) = \prod_{t=1}^T p(x_t|x_1, \dots, x_{t-1})^{\frac{1}{T}} $$

and the perplexity is the inverse of this geometric mean:

$$ \text{PPL}(X) = \frac{1}{\text{GM}(p(X))} $$

This inverse-geometric-mean formulation is also aligned with the definition given on Wikipedia.
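A small sketch of this relationship, again with hypothetical per-token probabilities:

```python
import math

cond_probs = [0.2, 0.5, 0.1, 0.4]  # hypothetical p(x_t | x_<t)
T = len(cond_probs)

geo_mean = math.prod(cond_probs) ** (1.0 / T)  # geometric mean of the conditional probabilities
ppl = 1.0 / geo_mean                           # perplexity is its inverse
assert math.isclose(ppl, math.prod(cond_probs) ** (-1.0 / T))
```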
