Section 8.5 Frequency Analysis
Suppose that the eavesdropper Eve intercepts the cipher text from Alice to Bob. In order to decrypt the message, Eve would need to know the decryption function for the substitution cipher. A simple exhaustive attack on a Caesar cipher would be for Eve to try out all 27 possible decryption functions (the 27 possible shifts) until she obtains a readable message.
We describe another method, called frequency analysis, that enables Eve to decrypt messages encrypted with a substitution cipher. This attack is based on the observation that in an English text, not all letters occur with the same frequency.
In Table 8.5.2 we give the frequency of letters and space in two classic novels. In a substitution cipher, if a letter or space is replaced by another symbol, the replacement symbol occurs with the same frequency as the original letter or space it is replacing. Counting the frequency of the symbols in the cipher text and comparing it with the expected frequency of the letters and space (as communicated in tables such as the one in Table 8.5.2) gives an indication of which letter or space could have been replaced by which symbol.
Use Table 8.5.2 to find the most frequently occurring characters from the set \(\A\) consisting of lower case letters and \(\cspace\) for space.
Checkpoint 8.5.1. Most frequent characters in \(\A\) in English texts.
Let
where \(\mathtt{}\) represents the character space.
Which four characters from \(\mathbb{A}\) occur most frequently in English language texts (written in lower case only)?
Most frequently:
2nd most frequently:
3rd most frequently:
4th most frequently:
For Caesar ciphers, this attack is particularly easy. We only have to find the plain text letter that corresponds to one cipher text letter. Doing so will yield the key \(n\) that provides the decryption function. The following problem demonstrates Eve's approach to deciphering the intercepted message that was encrypted using a Caesar cipher.
Alice in Wonderland  The Time Machine  
character  count  frequency  count  frequency 
\(\cspace\) (space)  27305  20.22%  32679  18.85% 
\(\mathtt{a}\)  8791  6.51%  11704  6.75% 
\(\mathtt{b}\)  1475  1.09%  1897  1.09% 
\(\mathtt{c}\)  2399  1.78%  3424  1.98% 
\(\mathtt{d}\)  4931  3.65%  6337  3.66% 
\(\mathtt{e}\)  13576  10.05%  17838  10.29% 
\(\mathtt{f}\)  2001  1.48%  3354  1.94% 
\(\mathtt{g}\)  2531  1.87%  3075  1.77% 
\(\mathtt{h}\)  7375  5.46%  8257  4.76% 
\(\mathtt{i}\)  7515  5.57%  10138  5.85% 
\(\mathtt{j}\)  146  0.11%  97  0.06% 
\(\mathtt{k}\)  1158  0.86%  1087  0.63% 
\(\mathtt{l}\)  4716  3.49%  6146  3.55% 
\(\mathtt{m}\)  2107  1.56%  4043  2.33% 
\(\mathtt{n}\)  7016  5.20%  9917  5.72% 
\(\mathtt{o}\)  8145  6.03%  9758  5.63% 
\(\mathtt{p}\)  1524  1.13%  2427  1.40% 
\(\mathtt{q}\)  209  0.15%  95  0.05% 
\(\mathtt{r}\)  5438  4.03%  7674  4.43% 
\(\mathtt{s}\)  6501  4.81%  8486  4.90% 
\(\mathtt{t}\)  10689  7.92%  13515  7.80% 
\(\mathtt{u}\)  3468  2.57%  3805  2.20% 
\(\mathtt{v}\)  846  0.63%  1295  0.75% 
\(\mathtt{w}\)  2676  1.98%  3225  1.86% 
\(\mathtt{x}\)  148  0.11%  236  0.14% 
\(\mathtt{y}\)  2262  1.68%  2679  1.55% 
\(\mathtt{z}\)  78  0.06%  144  0.08% 
sum  135026  173332 
In the video in Figure 8.5.3 we recap the motivation for frequency analysis and give first examples. Further examples are give below.
Problem 8.5.4. Eve breaks a Caesar cipher.
The following encrypted message from Alice to Bob is intercepted by Eve.
Eve knows that Alice and Bob are using a Caesar cipher. Decipher the message for Eve.
Counting the number of occurrences of the characters in the cipher text we get:
\(\cspace\)  \(\mathtt{a}\)  \(\mathtt{d}\)  \(\mathtt{e}\)  \(\mathtt{f}\)  \(\mathtt{g}\)  \(\mathtt{h}\)  \(\mathtt{i}\)  \(\mathtt{j}\)  \(\mathtt{k}\)  \(\mathtt{l}\)  \(\mathtt{m}\)  \(\mathtt{n}\)  \(\mathtt{o}\)  \(\mathtt{s}\)  \(\mathtt{t}\)  \(\mathtt{u}\)  \(\mathtt{v}\)  \(\mathtt{w}\)  \(\mathtt{x}\)  \(\mathtt{y}\)  \(\mathtt{z}\) 
\({15}\)  \({ 15 }\)  \({ 13 }\)  \({2}\)  \({ 12 }\)  \({ 11 }\)  \({1}\)  \({1}\)  \({ 10 }\)  \({ 10 }\)  \({ 16 }\)  \({7}\)  \({1}\)  \({4}\)  \({ 40 }\)  \({ 19 }\)  \({2}\)  \({5}\)  \({7}\)  \({ 22 }\)  \({4}\)  \({ 7}\) 
According to the data in Figure 11.4.2 the character space which we represent by \(\cspace\) is the most common character in English language texts. Since \(\mathtt{s}\) is the character in the cipher text with the highest number of occurrences, Eve tries decrypting the cipher text with the Caesar cipher that replaces \(\cspace\) with \(\mathtt{s}\text{.}\)
\(x\)  \(\cspace\)  \(\mathtt{a}\)  \(\mathtt{b}\)  \(\mathtt{c}\)  \(\mathtt{d}\)  \(\mathtt{e}\)  \(\mathtt{f}\)  \(\mathtt{g}\)  \(\mathtt{h}\)  \(\mathtt{i}\)  \(\mathtt{j}\)  \(\mathtt{k}\)  \(\mathtt{l}\)  \(\mathtt{m}\)  \(\mathtt{n}\)  \(\mathtt{o}\)  \(\mathtt{p}\)  \(\mathtt{q}\)  \(\mathtt{r}\)  \(\mathtt{s}\)  \(\mathtt{t}\)  \(\mathtt{u}\)  \(\mathtt{v}\)  \(\mathtt{w}\)  \(\mathtt{x}\)  \(\mathtt{y}\)  \(\mathtt{z}\) 
\(J(x)\)  \(\mathtt{s}\)  \(\mathtt{t}\)  \(\mathtt{u}\)  \(\mathtt{v}\)  \(\mathtt{w}\)  \(\mathtt{x}\)  \(\mathtt{y}\)  \(\mathtt{z}\)  \(\cspace\)  \(\mathtt{a}\)  \(\mathtt{b}\)  \(\mathtt{c}\)  \(\mathtt{d}\)  \(\mathtt{e}\)  \(\mathtt{f}\)  \(\mathtt{g}\)  \(\mathtt{h}\)  \(\mathtt{i}\)  \(\mathtt{j}\)  \(\mathtt{k}\)  \(\mathtt{l}\)  \(\mathtt{m}\)  \(\mathtt{n}\)  \(\mathtt{o}\)  \(\mathtt{p}\)  \(\mathtt{q}\)  \(\mathtt{r}\) 
When decrypting Eve reads the table from bottom to top. So Eve decrypts \(\mathtt{t}\) to \(\mathtt{a}\text{,}\) \(\mathtt{d}\) to \(\mathtt{l}\text{,}\) \(\mathtt{s}\) to \(\cspace\text{,}\) and so on. Eve obtains the following message ^{ 1 }:
Frequency analysis can also be effective for much shorter texts.
Problem 8.5.5. By how many characters did Alice shift ?
Decrypt the cipher text
that was encrypted using a Caesar cipher. By how many characters does the Caesar cipher shift ?
We count the number of occurrences of each character in the cipher text. For our convenience we start by counting the number of occurrences of the first character in the cipher text, namely \(\mathtt{q}\text{,}\) continue with the second character \(\texttt{e}\) and so on. Only considering the characters that actually occur in the text we get:
character  \(\mathtt{q}\)  \(\mathtt{e}\)  \(\mathtt{x}\)  \(\mathtt{l}\)  \(\mathtt{d}\)  \(\mathtt{m}\)  \(\mathtt{w}\)  \(\mathtt{p}\)  \(\mathtt{s}\)  \(\mathtt{j}\)  \(\mathtt{y}\)  \(\mathtt{r}\) 
count  1  2  2  1  5  1  1  1  2  2  1  1 
The character with the highest count in the cipher text is \(\mathtt{d}\text{.}\) So we try decrypting with the Caesar cipher that encrypts \(\cspace\) as \(\mathtt{d}\text{.}\) Thus the encryption function was:
\(x\)  \(J(x)\) 
\(\cspace\)  \(\mathtt{d}\) 
\(\mathtt{a}\)  \(\mathtt{y}\) 
\(\mathtt{b}\)  \(\mathtt{f}\) 
\(\mathtt{c}\)  \(g\) 
\(\mathtt{d}\)  \(\mathtt{h}\) 
\(\mathtt{e}\)  \(\mathtt{i}\) 
\(\mathtt{f}\)  \(\mathtt{j}\) 
\(\mathtt{g}\)  \(\mathtt{k}\) 
\(\mathtt{h}\)  \(\mathtt{l}\) 
\(\mathtt{i}\)  \(\mathtt{m}\) 
\(\mathtt{j}\)  \(\mathtt{n}\) 
\(\mathtt{k}\)  \(\mathtt{o}\) 
\(\mathtt{l}\)  \(\mathtt{p}\) 
\(\mathtt{m}\)  \(\mathtt{q}\) 
\(\mathtt{n}\)  \(\mathtt{r}\) 
\(\mathtt{o}\)  \(\mathtt{s}\) 
\(\mathtt{p}\)  \(\mathtt{t}\) 
\(\mathtt{q}\)  \(\mathtt{u}\) 
\(\mathtt{r}\)  \(\mathtt{v}\) 
\(\mathtt{s}\)  \(\mathtt{w}\) 
\(\mathtt{t}\)  \(\mathtt{x}\) 
\(\mathtt{u}\)  \(\mathtt{y}\) 
\(\mathtt{v}\)  \(\mathtt{z}\) 
\(\mathtt{w}\)  \(\cspace\) 
\(\mathtt{x}\)  \(\mathtt{a}\) 
\(\mathtt{y}\)  \(\mathtt{b}\) 
\(\mathtt{z}\)  \(\mathtt{c}\) 
For decryption we read the table from left to right, that is, cipher text to plain text. Avoiding duplication we get:
So the plain text was \(\mathtt{math{\cspace}is{\cspace}a{\cspace}lot{\cspace}of{\cspace}fun}\text{.}\) From the table for the encryption function \(J\) we get that the Caesar cipher shifted by 23 characters.
Checkpoint 8.5.6. Break a Caesar cipher.
We represent the character space by \(\mathtt{}\text{.}\)
The eavesdropper Eve knows that Alice and Bob use a Caesar Cipher in their secure communication. Eve intercepts the following message sent form Alice to Bob:
Eve counts the frequency of the characters and concludes that the character \(\mathtt{}\) (space) was encrypted as the character .
This tells Eve which encryption function Alice used. Alice's encryption function \(E:\mathbb{A}\to\mathbb{A}\) is given by:
\(E(\mathtt{})=\),
\(E(\mathtt{a})=\),
\(E(\mathtt{b})=\),
\(E(\mathtt{c})=\),
\(E(\mathtt{d})=\),
\(E(\mathtt{e})=\),
\(E(\mathtt{f})=\),
\(E(\mathtt{g})=\),
\(E(\mathtt{h})=\),
\(E(\mathtt{i})=\),
\(E(\mathtt{j})=\),
\(E(\mathtt{k})=\),
\(E(\mathtt{l})=\),
\(E(\mathtt{m})=\),
\(E(\mathtt{n})=\),
\(E(\mathtt{o})=\),
\(E(\mathtt{p})=\),
\(E(\mathtt{q})=\),
\(E(\mathtt{r})=\),
\(E(\mathtt{s})=\),
\(E(\mathtt{t})=\),
\(E(\mathtt{u})=\),
\(E(\mathtt{v})=\),
\(E(\mathtt{w})=\),
\(E(\mathtt{x})=\),
\(E(\mathtt{y})=\),
\(E(\mathtt{z})=\),
Now Eve also knows the decryption function \(D:\mathbb{A}\to\mathbb{A}\text{.}\) Recall that the input for the decryption function is the cipher text, and the output of the decryption function is the plain text. The decryption function \(D\) is given by:
\(D(\mathtt{})=\),
\(D(\mathtt{a})=\),
\(D(\mathtt{b})=\),
\(D(\mathtt{c})=\),
\(D(\mathtt{d})=\),
\(D(\mathtt{e})=\),
\(D(\mathtt{f})=\),
\(D(\mathtt{g})=\),
\(D(\mathtt{h})=\),
\(D(\mathtt{i})=\),
\(D(\mathtt{j})=\),
\(D(\mathtt{k})=\),
\(D(\mathtt{l})=\),
\(D(\mathtt{m})=\),
\(D(\mathtt{n})=\),
\(D(\mathtt{o})=\),
\(D(\mathtt{p})=\),
\(D(\mathtt{q})=\),
\(D(\mathtt{r})=\),
\(D(\mathtt{s})=\),
\(D(\mathtt{t})=\),
\(D(\mathtt{u})=\),
\(D(\mathtt{v})=\),
\(D(\mathtt{w})=\),
\(D(\mathtt{x})=\),
\(D(\mathtt{y})=\),
\(D(\mathtt{z})=\),
Using the function \(D\) Eve decrypts the message and obtains:
u
u
v
w
x
y
z

a
b
c
d
e
f
g
h
i
j
k
l
m
n
o
p
q
r
s
t
f
g
h
i
j
k
l
m
n
o
p
q
r
s
t
u
v
w
x
y
z

a
b
c
d
e
idonotwanttowastethismodel
Frequency analysis can also be used to decrypt text that was encrypted with other substitution ciphers. In general this requires a more careful analysis of the number of occurrences of each character in the cipher text.
The symmetric key ciphers used today are block ciphers, that is, a larger block of characters is encrypted at a time. AES (Advanced Encryption Standard) is one such cipher that is widely used.