Probabilistic Interpretation of WordPiece

WordPiece

WordPiece is a subword tokenization algorithm. It is used in BERT, which was introduced in 2018.

Example:

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
tokenizer.tokenize("I have a new GPU!")
["i", "have", "a", "new", "gp", "##u", "!"]

The tokenizer splits “gpu” into the known subwords “gp” and “##u”. The “##” prefix means that the rest of the token should be attached to the previous one, without a space, during decoding. (Without it, the decoding would produce “gp u”, which is not a valid word.)
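For instance, convert_tokens_to_string reattaches the “##” pieces (a quick check, reusing the tokenizer loaded above):

tokenizer.convert_tokens_to_string(["i", "have", "a", "new", "gp", "##u", "!"])
"i have a new gpu !"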

Training algorithm of WordPiece

  • At the start of training, each word is split into characters, with the “##” prefix added to every character except the first. So, for instance, “word” gets split like this: w ##o ##r ##d
  • Like BPE, WordPiece then learns merge rules; successive merges can, for example, turn u ##n ##it into unit

But how does WordPiece learn these merge rules, i.e. how does it decide which subwords to merge?

According to this post by HuggingFace, WordPiece is a greedy algorithm that learns merge rules based on the frequency of subwords. When learning the subword units, WordPiece computes a score for each pair of characters using the formula: score = freq_of_pair / (freq_of_first_element * freq_of_second_element)
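As a minimal sketch of how such scores could be computed (a toy illustration, not the actual Hugging Face training code; the corpus and variable names are made up):

from collections import Counter

# Toy corpus, already split into characters with the "##" prefix
words = [["i"], ["h", "##a", "##v", "##e"], ["a"], ["n", "##e", "##w"], ["g", "##p", "##u"]]

# Count every subword symbol and every adjacent pair within a word
symbol_freq = Counter()
pair_freq = Counter()
for word in words:
    symbol_freq.update(word)
    pair_freq.update(zip(word, word[1:]))

# score = freq_of_pair / (freq_of_first_element * freq_of_second_element)
scores = {
    pair: freq / (symbol_freq[pair[0]] * symbol_freq[pair[1]])
    for pair, freq in pair_freq.items()
}

best_pair = max(scores, key=scores.get)  # the pair that would be merged next
print(best_pair, scores[best_pair])

In the actual training, pair counts would additionally be weighted by how often each word appears in the corpus; in this sketch every word occurs once.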

What does this formula mean?

Probabilistic Interpretation of WordPiece

Here I give a probabilistic interpretation of this formula. Without loss of generality, we consider the corpus as a sequence of words whose subword units we want to learn. Let’s start with the initial alphabet, which is the set of all characters in the corpus. We tokenize the corpus with this initial alphabet, i.e. we split each word into characters, and obtain a sequence of characters S that represents the corpus.

Now we try to build a probability space from this sequence of characters S:

  • We split S into adjacent pairs of characters, and we count the frequency of each pair.

For example, if S is i h a v a n e w g p u i h a v a n e w c p u, then we have the following pairs of characters and their frequencies:

i h, h a, a v, v a, a n, n e, e w, w g, g p, p u, u i, i h, h a, a v, v a, a n, n e, e w, w c, c p, p u

Total number of pairs: $C = 21$

Our probability space $(\Omega, \mathcal{F}, P)$ is defined as follows:

  • $\Omega$ is the set of all pairs of characters in S.
  • $\mathcal{F}$ is the set of all subsets of $\Omega$. In this case, $\mathcal{F}$ is the power set of $\Omega$.
  • $P$ is the probability measure on $\Omega$ that assigns to each pair its frequency divided by the total number of pairs $C$.

This is a well-defined probability space.

We can derive a marginal probability $P_1(a)$ for each character $a$ in the alphabet by summing over all pairs of characters that start with $a$. For example, $P_1(a) = P(a\,v) + P(a\,n) = \frac{2}{21} + \frac{2}{21} = \frac{4}{21}$.

Similarly, we can derive a second marginal probability $P_2(b)$ for each character $b$ in the alphabet by summing over all pairs of characters that end with $b$. For example, $P_2(p) = P(g\,p) + P(c\,p) = \frac{1}{21} + \frac{1}{21} = \frac{2}{21}$.
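These counts and marginals can be checked with a few lines of Python (a toy sketch; the variable names are only for illustration):

from collections import Counter

S = "i h a v a n e w g p u i h a v a n e w c p u".split()

# All adjacent pairs in S and the probability of each pair
pairs = list(zip(S, S[1:]))
C = len(pairs)  # total number of pairs, here 21
P = {pair: freq / C for pair, freq in Counter(pairs).items()}

def P1(a):
    # marginal probability of the first character of a pair
    return sum(prob for (x, _), prob in P.items() if x == a)

def P2(b):
    # marginal probability of the second character of a pair
    return sum(prob for (_, y), prob in P.items() if y == b)

print(C, P1("a"), P2("p"))  # 21, 4/21 ≈ 0.19, 2/21 ≈ 0.10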

Now we go back to the formula above: $$ \text{score} = \frac{\text{freq\_of\_pair}}{\text{freq\_of\_first\_element} \times \text{freq\_of\_second\_element}} $$

Dividing the numerator and each factor of the denominator by C, we can equivalently use the following score (it differs from the original only by the constant factor C, so the ranking of pairs is unchanged):

$$ \text{score} = \frac{\text{freq\_of\_pair}}{C} \div \left(\frac{\text{freq\_of\_first\_element}}{C} \times \frac{\text{freq\_of\_second\_element}}{C}\right) $$

, where $\frac{\text{freq\_of\_first\_element}}{C}$ is the marginal probability $P_1(a)$ of the first character in the pair, $\frac{\text{freq\_of\_second\_element}}{C}$ is the marginal probability $P_2(b)$ of the second character, and $\frac{\text{freq\_of\_pair}}{C}$ is the probability $P(a\,b)$ of the pair.

And we know that the event that a pair equals $a\,b$ is the intersection of the event that its first character is $a$ and the event that its second character is $b$.

So, writing $P(a)$ for $P_1(a)$ and $P(b)$ for $P_2(b)$, the formula can be rewritten as: $$ \text{score} = \frac{P(a\,b)}{P(a)\,P(b)} = \frac{P(b \mid a)}{P(b)} $$ , where $P(b \mid a) = \frac{P(a\,b)}{P(a)}$ is the conditional probability of the second character $b$ given the first character $a$.
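For example, on the toy sequence S above, the pair g p has $P(g\,p) = \frac{1}{21}$, $P(g) = \frac{1}{21}$ and $P(p) = \frac{2}{21}$, so $$ \text{score}(g, p) = \frac{1/21}{(1/21) \times (2/21)} = \frac{21}{2} = 10.5 , $$ i.e. g and p co-occur about ten times more often than they would if they were independent.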

Lift ratio

The quantity $ P(b | a) / P(b) $ is called the “lift ratio” or simply “lift”. It is a measure of how much more likely it is to observe the co-occurrence of characters a and b than one would expect if a and b were independent.

  • A lift ratio greater than 1 indicates that the occurrence of a and b is positively correlated, while a lift ratio less than 1 indicates a negative correlation.
  • A lift ratio of 1 indicates that a and b are independent.
  • The lift ratio is not bounded above.

This notion of lift is widely used in data mining.

Pointwise Mutual Information

The PMI of two events $a$ and $b$ is defined as: $$PMI(a, b) = \log_2 \frac{P(a, b)}{P(a)P(b)} = \log_2 \frac{P(b \mid a)}{P(b)} = \log_2 L(a,b),$$ where $L(a,b)$ is the lift ratio.

The measure is symmetric, i.e. $PMI(a, b) = PMI(b, a)$.
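Continuing the toy example, $PMI(g, p) = \log_2 \frac{1/21}{(1/21) \times (2/21)} = \log_2 10.5 \approx 3.39$ bits.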

cf. Pointwise Mutual Information

It is said in the Wikipedia article that:

PMI has “a well-known tendency to give higher scores to low-frequency events”, but in applications such as measuring word similarity, it is preferable to have “a higher score for pairs of words whose relatedness is supported by more evidence”.

Can we improve WordPiece?

The lift value can become very large when $P(b) \approx 0$, i.e. when the second character $b$ has very little support (it is rare in the corpus). This is a problem because such a pair gets a high score on very little evidence, and it is not very useful to learn a subword unit for a rare character.

Smoothing

So we can improve WordPiece by adding a smoothing term to the denominator of the lift ratio: $$ \text{score} = \frac{P(a\,b)}{P(a)\,(P(b) + \epsilon)} = \frac{P(b \mid a)}{P(b) + \epsilon} $$ , where $\epsilon$ is a small constant.
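A minimal sketch of this smoothed score, reusing P, P1 and P2 from the snippet above (the value of epsilon is an arbitrary illustrative choice):

EPSILON = 1e-3  # small smoothing constant (arbitrary value for illustration)

def smoothed_score(a, b):
    # lift ratio with a smoothed denominator, damping pairs whose second element is rare
    return P.get((a, b), 0.0) / (P1(a) * (P2(b) + EPSILON))

print(smoothed_score("g", "p"))  # ≈ 10.4, slightly below the unsmoothed 10.5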

PMI_k measure

Or we could use the PMI_k measure (for k = 2, 3, etc.) instead of the lift ratio. PMI_k is intended to correct the bias of PMI towards low-frequency events by boosting the scores of frequent pairs.
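For reference, this family of measures is commonly defined as $$ PMI_k(a, b) = \log_2 \frac{P(a, b)^k}{P(a)\,P(b)} , $$ which equals $PMI(a, b) + (k - 1)\log_2 P(a, b)$; since $\log_2 P(a, b)$ is very negative for rare pairs, larger k penalizes them more.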

A 2011 case study demonstrated the success of PMI_3 in correcting this bias on a corpus drawn from English Wikipedia. Taking x to be the word “football”, its most strongly associated words y according to the PMI measure (i.e. those maximizing PMI(x; y)) were domain-specific (“midfielder”, “cornerbacks”, “goalkeepers”), whereas the terms ranked most highly by PMI_3 were much more general (“league”, “clubs”, “england”).

Saibo Geng
PhD student at EPFL

My research interest is improving LLM performance through decoding methods.