Skip to main content

Section 8.5 Frequency Analysis

Suppose that the eavesdropper Eve intercepts the cipher text from Alice to Bob. In order to decrypt the message, Eve would need to know the decryption function for the substitution cipher. A simple exhaustive attack on a Caesar cipher would be for Eve to try out all 27 possible decryption functions (the 27 possible shifts) until she obtains a readable message.

We describe another method, called frequency analysis, that enables Eve to decrypt messages encrypted with a substitution cipher. This attack is based on the observation that in an English text, not all letters occur with the same frequency.

In Table 8.5.2 we give the frequency of letters and space in two classic novels. In a substitution cipher, if a letter or space is replaced by another symbol, the replacement symbol occurs with the same frequency as the original letter or space it is replacing. Counting the frequency of the symbols in the cipher text and comparing it with the expected frequency of the letters and space (as communicated in tables such as the one in Table 8.5.2) gives an indication of which letter or space could have been replaced by which symbol.

Use Table 8.5.2 to find the most frequently occurring characters from the set \(\A\) consisting of lower case letters and \(\cspace\) for space.

Let

\begin{equation*} \mathbb{A}=\lbrace \mathtt{-},\mathtt{a},\mathtt{b},\mathtt{c},...,\mathtt{z}\rbrace \end{equation*}

where \(\mathtt{-}\) represents the character space.

Which four characters from \(\mathbb{A}\) occur most frequently in English language texts (written in lower case only)?

Most frequently:

2nd most frequently:

3rd most frequently:

4th most frequently:

For Caesar ciphers, this attack is particularly easy. We only have to find the plain text letter that corresponds to one cipher text letter. Doing so will yield the key \(n\) that provides the decryption function. The following problem demonstrates Eve's approach to deciphering the intercepted message that was encrypted using a Caesar cipher.

Table 8.5.2. Frequency of letters and space in Alice in Wonderland by Lewis Carroll (1865) and in The Time Machine by H. G. Wells (1898). Before counting the number of occurrences all letter were converted to lower case and all characters not in \(\A=\{\cspace,\mathtt{a},\mathtt{b},\mathtt{c},\dots,\mathtt{z}\}\) were removed.
Alice in Wonderland The Time Machine
character count frequency count frequency
\(\cspace\) (space) 27305 20.22% 32679 18.85%
\(\mathtt{a}\) 8791 6.51% 11704 6.75%
\(\mathtt{b}\) 1475 1.09% 1897 1.09%
\(\mathtt{c}\) 2399 1.78% 3424 1.98%
\(\mathtt{d}\) 4931 3.65% 6337 3.66%
\(\mathtt{e}\) 13576 10.05% 17838 10.29%
\(\mathtt{f}\) 2001 1.48% 3354 1.94%
\(\mathtt{g}\) 2531 1.87% 3075 1.77%
\(\mathtt{h}\) 7375 5.46% 8257 4.76%
\(\mathtt{i}\) 7515 5.57% 10138 5.85%
\(\mathtt{j}\) 146 0.11% 97 0.06%
\(\mathtt{k}\) 1158 0.86% 1087 0.63%
\(\mathtt{l}\) 4716 3.49% 6146 3.55%
\(\mathtt{m}\) 2107 1.56% 4043 2.33%
\(\mathtt{n}\) 7016 5.20% 9917 5.72%
\(\mathtt{o}\) 8145 6.03% 9758 5.63%
\(\mathtt{p}\) 1524 1.13% 2427 1.40%
\(\mathtt{q}\) 209 0.15% 95 0.05%
\(\mathtt{r}\) 5438 4.03% 7674 4.43%
\(\mathtt{s}\) 6501 4.81% 8486 4.90%
\(\mathtt{t}\) 10689 7.92% 13515 7.80%
\(\mathtt{u}\) 3468 2.57% 3805 2.20%
\(\mathtt{v}\) 846 0.63% 1295 0.75%
\(\mathtt{w}\) 2676 1.98% 3225 1.86%
\(\mathtt{x}\) 148 0.11% 236 0.14%
\(\mathtt{y}\) 2262 1.68% 2679 1.55%
\(\mathtt{z}\) 78 0.06% 144 0.08%
sum 135026 173332

In the video in Figure 8.5.3 we recap the motivation for frequency analysis and give first examples. Further examples are give below.

Figure 8.5.3. Frequency Analysis by Matt Farmer and Stephen Steward

The following encrypted message from Alice to Bob is intercepted by Eve.

\begin{align*} \amp\mathtt{tddsztmdsakswanawxwsaflgsl}\cspace\mathtt{jxxshtjlksgfxsgyso}\cspace\\ \amp\mathtt{av}\cspace\mathtt{sl}\cspace\mathtt{xsuxdztxsaf}\cspace\mathtt{tualsl}\cspace\mathtt{xstimaltfastfgl}\cspace\mathtt{xjsl}\cspace\\ \amp\mathtt{gkxso}\cspace\mathtt{gsafsl}\cspace\mathtt{xajsgofsdtfzmtzxstjxsvtddxwsvxdlk}\\ \amp\mathtt{safsgmjksztmdksl}\cspace\mathtt{xsl}\cspace\mathtt{ajwstddsl}\cspace\mathtt{xkxswayyxjsyjge}\\ \amp\mathtt{sxtv}\cspace\mathtt{sgl}\cspace\mathtt{xjsafsdtfzmtzxsvmklgekstfwsdtok} \end{align*}

Eve knows that Alice and Bob are using a Caesar cipher. Decipher the message for Eve.

Solution.

Counting the number of occurrences of the characters in the cipher text we get:

\(\cspace\) \(\mathtt{a}\) \(\mathtt{d}\) \(\mathtt{e}\) \(\mathtt{f}\) \(\mathtt{g}\) \(\mathtt{h}\) \(\mathtt{i}\) \(\mathtt{j}\) \(\mathtt{k}\) \(\mathtt{l}\) \(\mathtt{m}\) \(\mathtt{n}\) \(\mathtt{o}\) \(\mathtt{s}\) \(\mathtt{t}\) \(\mathtt{u}\) \(\mathtt{v}\) \(\mathtt{w}\) \(\mathtt{x}\) \(\mathtt{y}\) \(\mathtt{z}\)
\({15}\) \({ 15 }\) \({ 13 }\) \({2}\) \({ 12 }\) \({ 11 }\) \({1}\) \({1}\) \({ 10 }\) \({ 10 }\) \({ 16 }\) \({7}\) \({1}\) \({4}\) \({ 40 }\) \({ 19 }\) \({2}\) \({5}\) \({7}\) \({ 22 }\) \({4}\) \({ 7}\)

According to the data in Table 11.4.2 the character space which we represent by \(\cspace\) is the most common character in English language texts. Since \(\mathtt{s}\) is the character in the cipher text with the highest number of occurrences, Eve tries decrypting the cipher text with the Caesar cipher that replaces \(\cspace\) with \(\mathtt{s}\text{.}\)

\(x\) \(\cspace\) \(\mathtt{a}\) \(\mathtt{b}\) \(\mathtt{c}\) \(\mathtt{d}\) \(\mathtt{e}\) \(\mathtt{f}\) \(\mathtt{g}\) \(\mathtt{h}\) \(\mathtt{i}\) \(\mathtt{j}\) \(\mathtt{k}\) \(\mathtt{l}\) \(\mathtt{m}\) \(\mathtt{n}\) \(\mathtt{o}\) \(\mathtt{p}\) \(\mathtt{q}\) \(\mathtt{r}\) \(\mathtt{s}\) \(\mathtt{t}\) \(\mathtt{u}\) \(\mathtt{v}\) \(\mathtt{w}\) \(\mathtt{x}\) \(\mathtt{y}\) \(\mathtt{z}\)
\(J(x)\) \(\mathtt{s}\) \(\mathtt{t}\) \(\mathtt{u}\) \(\mathtt{v}\) \(\mathtt{w}\) \(\mathtt{x}\) \(\mathtt{y}\) \(\mathtt{z}\) \(\cspace\) \(\mathtt{a}\) \(\mathtt{b}\) \(\mathtt{c}\) \(\mathtt{d}\) \(\mathtt{e}\) \(\mathtt{f}\) \(\mathtt{g}\) \(\mathtt{h}\) \(\mathtt{i}\) \(\mathtt{j}\) \(\mathtt{k}\) \(\mathtt{l}\) \(\mathtt{m}\) \(\mathtt{n}\) \(\mathtt{o}\) \(\mathtt{p}\) \(\mathtt{q}\) \(\mathtt{r}\)

When decrypting Eve reads the table from bottom to top. So Eve decrypts \(\mathtt{t}\) to \(\mathtt{a}\text{,}\) \(\mathtt{d}\) to \(\mathtt{l}\text{,}\) \(\mathtt{s}\) to \(\cspace\text{,}\) and so on. Eve obtains the following message  1 :

\begin{align*} \amp\mathtt{all}\cspace\mathtt{gaul}\cspace\mathtt{is}\cspace\mathtt{divided}\cspace\mathtt{into}\cspace\mathtt{three}\cspace\mathtt{parts}\cspace\mathtt{one}\cspace\mathtt{of}\cspace\\ \amp\mathtt{which}\cspace\mathtt{the}\cspace\mathtt{belgae}\cspace\mathtt{inhabit}\cspace\mathtt{the}\cspace\mathtt{aquitani}\cspace\mathtt{another}\\ \amp\mathtt{}\cspace\mathtt{those}\cspace\mathtt{who}\cspace\mathtt{in}\cspace\mathtt{their}\cspace\mathtt{own}\cspace\mathtt{language}\cspace\mathtt{are}\cspace\mathtt{called}\cspace\\ \amp\mathtt{celts}\cspace\mathtt{in}\cspace\mathtt{ours}\cspace\mathtt{gauls}\cspace\mathtt{the}\cspace\mathtt{third}\cspace\mathtt{all}\cspace\mathtt{these}\cspace\mathtt{differ}\\ \amp\mathtt{}\cspace\mathtt{from}\cspace\mathtt{each}\cspace\mathtt{other}\cspace\mathtt{in}\cspace\mathtt{language}\cspace\mathtt{customs}\cspace\mathtt{and}\cspace\mathtt{laws} \end{align*}

Frequency analysis can also be effective for much shorter texts.

Decrypt the cipher text

\begin{equation*} \mathtt{qexldmwdedpsxdsjdjyr} \end{equation*}

that was encrypted using a Caesar cipher. By how many characters does the Caesar cipher shift ?

Solution.

We count the number of occurrences of each character in the cipher text. For our convenience we start by counting the number of occurrences of the first character in the cipher text, namely \(\mathtt{q}\text{,}\) continue with the second character \(\texttt{e}\) and so on. Only considering the characters that actually occur in the text we get:

character \(\mathtt{q}\) \(\mathtt{e}\) \(\mathtt{x}\) \(\mathtt{l}\) \(\mathtt{d}\) \(\mathtt{m}\) \(\mathtt{w}\) \(\mathtt{p}\) \(\mathtt{s}\) \(\mathtt{j}\) \(\mathtt{y}\) \(\mathtt{r}\)
count 1 2 2 1 5 1 1 1 2 2 1 1

The character with the highest count in the cipher text is \(\mathtt{d}\text{.}\) So we try decrypting with the Caesar cipher that encrypts \(\cspace\) as \(\mathtt{d}\text{.}\) Thus the encryption function was:

\(x\) \(J(x)\)
\(\cspace\) \(\mathtt{d}\)
\(\mathtt{a}\) \(\mathtt{y}\)
\(\mathtt{b}\) \(\mathtt{f}\)
\(\mathtt{c}\) \(g\)
\(\mathtt{d}\) \(\mathtt{h}\)
\(\mathtt{e}\) \(\mathtt{i}\)
\(\mathtt{f}\) \(\mathtt{j}\)
\(\mathtt{g}\) \(\mathtt{k}\)
\(\mathtt{h}\) \(\mathtt{l}\)
\(\mathtt{i}\) \(\mathtt{m}\)
\(\mathtt{j}\) \(\mathtt{n}\)
\(\mathtt{k}\) \(\mathtt{o}\)
\(\mathtt{l}\) \(\mathtt{p}\)
\(\mathtt{m}\) \(\mathtt{q}\)
\(\mathtt{n}\) \(\mathtt{r}\)
\(\mathtt{o}\) \(\mathtt{s}\)
\(\mathtt{p}\) \(\mathtt{t}\)
\(\mathtt{q}\) \(\mathtt{u}\)
\(\mathtt{r}\) \(\mathtt{v}\)
\(\mathtt{s}\) \(\mathtt{w}\)
\(\mathtt{t}\) \(\mathtt{x}\)
\(\mathtt{u}\) \(\mathtt{y}\)
\(\mathtt{v}\) \(\mathtt{z}\)
\(\mathtt{w}\) \(\cspace\)
\(\mathtt{x}\) \(\mathtt{a}\)
\(\mathtt{y}\) \(\mathtt{b}\)
\(\mathtt{z}\) \(\mathtt{c}\)

For decryption we read the table from left to right, that is, cipher text to plain text. Avoiding duplication we get:

\begin{align*} J(\mathtt{q})\amp=\mathtt{m}\\ J(\mathtt{e})\amp=\mathtt{a}\\ J(\mathtt{x})\amp=\mathtt{t}\\ J(\mathtt{l})\amp=\mathtt{h}\\ J(\mathtt{d})\amp=\cspace\\ J(\mathtt{m})\amp=\mathtt{i}\\ J(\mathtt{w})\amp=\mathtt{s}\\ J(\mathtt{p})\amp=\mathtt{l}\\ J(\mathtt{s})\amp=\mathtt{o}\\ J(\mathtt{j})\amp=\mathtt{f}\\ J(\mathtt{y})\amp=\mathtt{u}\\ J(\mathtt{r})\amp=\mathtt{n} \end{align*}

So the plain text was \(\mathtt{math{\cspace}is{\cspace}a{\cspace}lot{\cspace}of{\cspace}fun}\text{.}\) From the table for the encryption function \(J\) we get that the Caesar cipher shifted by 23 characters.

We represent the character space by \(\mathtt{-}\text{.}\)

The eavesdropper Eve knows that Alice and Bob use a Caesar Cipher in their secure communication. Eve intercepts the following message sent form Alice to Bob:

\begin{equation*} \mathtt{cuyiuhinuqvhnuniuqvmnzunbcmugiyzf} \end{equation*}

Eve counts the frequency of the characters and concludes that the character \(\mathtt{-}\) (space) was encrypted as the character .

This tells Eve which encryption function Alice used. Alice's encryption function \(E:\mathbb{A}\to\mathbb{A}\) is given by:

\(E(\mathtt{-})=\),

\(E(\mathtt{a})=\),

\(E(\mathtt{b})=\),

\(E(\mathtt{c})=\),

\(E(\mathtt{d})=\),

\(E(\mathtt{e})=\),

\(E(\mathtt{f})=\),

\(E(\mathtt{g})=\),

\(E(\mathtt{h})=\),

\(E(\mathtt{i})=\),

\(E(\mathtt{j})=\),

\(E(\mathtt{k})=\),

\(E(\mathtt{l})=\),

\(E(\mathtt{m})=\),

\(E(\mathtt{n})=\),

\(E(\mathtt{o})=\),

\(E(\mathtt{p})=\),

\(E(\mathtt{q})=\),

\(E(\mathtt{r})=\),

\(E(\mathtt{s})=\),

\(E(\mathtt{t})=\),

\(E(\mathtt{u})=\),

\(E(\mathtt{v})=\),

\(E(\mathtt{w})=\),

\(E(\mathtt{x})=\),

\(E(\mathtt{y})=\),

\(E(\mathtt{z})=\),

Now Eve also knows the decryption function \(D:\mathbb{A}\to\mathbb{A}\text{.}\) Recall that the input for the decryption function is the cipher text, and the output of the decryption function is the plain text. The decryption function \(D\) is given by:

\(D(\mathtt{-})=\),

\(D(\mathtt{a})=\),

\(D(\mathtt{b})=\),

\(D(\mathtt{c})=\),

\(D(\mathtt{d})=\),

\(D(\mathtt{e})=\),

\(D(\mathtt{f})=\),

\(D(\mathtt{g})=\),

\(D(\mathtt{h})=\),

\(D(\mathtt{i})=\),

\(D(\mathtt{j})=\),

\(D(\mathtt{k})=\),

\(D(\mathtt{l})=\),

\(D(\mathtt{m})=\),

\(D(\mathtt{n})=\),

\(D(\mathtt{o})=\),

\(D(\mathtt{p})=\),

\(D(\mathtt{q})=\),

\(D(\mathtt{r})=\),

\(D(\mathtt{s})=\),

\(D(\mathtt{t})=\),

\(D(\mathtt{u})=\),

\(D(\mathtt{v})=\),

\(D(\mathtt{w})=\),

\(D(\mathtt{x})=\),

\(D(\mathtt{y})=\),

\(D(\mathtt{z})=\),

Using the function \(D\) Eve decrypts the message and obtains:

Frequency analysis can also be used to decrypt text that was encrypted with other substitution ciphers. In general this requires a more careful analysis of the number of occurrences of each character in the cipher text.

The symmetric key ciphers used today are block ciphers, that is, a larger block of characters is encrypted at a time. AES (Advanced Encryption Standard) is one such cipher that is widely used.

A standing figure is conversing with another figure who is sitting on computer desk chair. Behind the sitting figure is a desk with a flat-panel monitor. Standing figure: Alice sends a message to Bob saying to meet her somewhere. Sitting figure: Uh huh. Standing figure: But Eve sees it, too, and goes to the place. Sitting figure: With you so far. Standing figure: Bob is delayed, and Alice and Eve meet. Sitting figure: Yeah? I've discovered a way to get computer scientists to listen to any boring story. Title text: Changing the names would be easier, but if you're not comfortable lying, try only making friends with people named Alice, Bob, Carol, etc.

Changing the names would be easier, but if you're not comfortable lying, try only making friends with people named Alice, Bob, Carol, etc

Figure 8.5.7. Protocol by Randall Munroe (https://xkcd.com/1323).
from the English translation of Julius Caesar's De Bello Gallico (The Gallic Wars) by W. A. McDevitte and W. S. Bohn.