Section 8.1 Character Encoding

A code is a system of rules to convert information from one form to another. When we convert given information into another representation, we are encoding. When we convert back to the original representation, we are decoding. We represent the rules for encoding and decoding by functions. To be able to recover the original information through decoding, the encoding function must be invertible.
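To see why invertibility matters, consider the following minimal Python sketch. The names `is_invertible`, `good_code`, and `bad_code` are illustrative assumptions and do not appear elsewhere in this text. A lookup table can only be undone when no two inputs are sent to the same output.

```python
def is_invertible(table):
    """Return True if no two keys of the lookup table share a value."""
    return len(set(table.values())) == len(table)

# Hypothetical toy codes on three characters.
good_code = {"a": 1, "b": 2, "c": 3}  # distinct outputs, so decoding is possible
bad_code = {"a": 1, "b": 2, "c": 1}   # 'a' and 'c' collide, so 1 is ambiguous

print(is_invertible(good_code))  # True
print(is_invertible(bad_code))   # False
```

With `bad_code` the number 1 could stand for either `a` or `c`, so the original information cannot be recovered.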

In this section we convert text into a sequence of numbers. Commonly used character encodings are ASCII (American Standard Code for Information Interchange) and Unicode. ASCII defines 128 characters, both printable and control characters, and was first standardized in 1963 by the American Standards Association (ASA). Unicode can handle the characters in most of the world’s writing systems.
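For comparison, Python exposes the Unicode code point of a character through the built-in functions `ord` and `chr`; for the 128 ASCII characters these code points agree with the ASCII codes. The short sketch below is only an illustration, and the variable names are assumptions.

```python
# ord maps a character to its code point; chr is the inverse map.
word = "code"
numbers = [ord(ch) for ch in word]
print(numbers)  # [99, 111, 100, 101]

# Applying chr to each number recovers the original text.
decoded = "".join(chr(n) for n in numbers)
print(decoded)  # code
```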

We use a simpler code that only encodes the characters in the set

\begin{equation*} \A=\{\cspace,\mathtt{a},\mathtt{b},\mathtt{c},\dots,\mathtt{z}\} \end{equation*}

into a sequence of numbers in

\begin{equation*} \Z_{27}=\{0,1,2,\ldots,26\}. \end{equation*}

Recall that when we write texts with the characters in \(\A\) we write \(\cspace\) instead of the character space.

To convert text into a sequence of numbers we use the encoding function \(C\) and its inverse, the decoding function \(C^{-1}\text{,}\) both given in Figure 8.1. In practice we only need one of these tables: instead of explicitly writing down the table for \(C^{-1}\) we can read the table for \(C\) from right to left.

\(x\)	\(C(x)\)
\(\cspace\)	0
\(\mathtt{a}\)	1
\(\mathtt{b}\)	2
\(\mathtt{c}\)	3
\(\mathtt{d}\)	4
\(\mathtt{e}\)	5
\(\mathtt{f}\)	6
\(\mathtt{g}\)	7
\(\mathtt{h}\)	8
\(\mathtt{i}\)	9
\(\mathtt{j}\)	10
\(\mathtt{k}\)	11
\(\mathtt{l}\)	12
\(\mathtt{m}\)	13
\(\mathtt{n}\)	14
\(\mathtt{o}\)	15
\(\mathtt{p}\)	16
\(\mathtt{q}\)	17
\(\mathtt{r}\)	18
\(\mathtt{s}\)	19
\(\mathtt{t}\)	20
\(\mathtt{u}\)	21
\(\mathtt{v}\)	22
\(\mathtt{w}\)	23
\(\mathtt{x}\)	24
\(\mathtt{y}\)	25
\(\mathtt{z}\)	26

\(y\)	\(C^{-1}(y)\)
0	\(\cspace\)
1	\(\mathtt{a}\)
2	\(\mathtt{b}\)
3	\(\mathtt{c}\)
4	\(\mathtt{d}\)
5	\(\mathtt{e}\)
6	\(\mathtt{f}\)
7	\(\mathtt{g}\)
8	\(\mathtt{h}\)
9	\(\mathtt{i}\)
10	\(\mathtt{j}\)
11	\(\mathtt{k}\)
12	\(\mathtt{l}\)
13	\(\mathtt{m}\)
14	\(\mathtt{n}\)
15	\(\mathtt{o}\)
16	\(\mathtt{p}\)
17	\(\mathtt{q}\)
18	\(\mathtt{r}\)
19	\(\mathtt{s}\)
20	\(\mathtt{t}\)
21	\(\mathtt{u}\)
22	\(\mathtt{v}\)
23	\(\mathtt{w}\)
24	\(\mathtt{x}\)
25	\(\mathtt{y}\)
26	\(\mathtt{z}\)

Figure 8.1. Tables that specify the encoding function \(C:\A\to\Z_{27}\) and its inverse the decoding function \(C^{-1}:\Z_{27}\to\A\text{.}\) The character space is represented by \(\cspace\text{.}\)
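The two tables can be sketched in Python as a pair of dictionaries. This is one possible representation, not the only one. We assume that the Python space character `" "` plays the role of \(\cspace\text{,}\) and the names `ALPHABET`, `C`, and `C_inv` are chosen here for illustration.

```python
# Assumption: the Python space character " " stands for the space symbol.
# The position of a character in this string is its code.
ALPHABET = " abcdefghijklmnopqrstuvwxyz"

# Encoding table C and decoding table C_inv, built from the same string.
C = {ch: i for i, ch in enumerate(ALPHABET)}
C_inv = {i: ch for ch, i in C.items()}

print(C["c"])     # 3
print(C_inv[26])  # z
```

Building `C_inv` by swapping the keys and values of `C` mirrors reading the table for \(C\) from right to left.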

When encoding words or longer segments of text, we apply the encoding function character by character and thus obtain a sequence of numbers (more precisely, a sequence of elements of \(\Z_{27}\)), which we separate by commas.

In the video in Figure 8.2 we give a detailed description of the use of the encoding function \(C\) and its inverse the decoding function \(C^{-1}\text{.}\)

Figure 8.2. Character encoding by Matt Farmer and Stephen Steward

In the following we give examples of how to apply the encoding and decoding functions.

Problem 8.3. Encoding a word.

Encode the word \(\mathtt{cookies}\) with the encoding function \(C\text{.}\)

Solution.

We evaluate the encoding function \(C\) at the characters in the word \(\mathtt{cookies}\text{.}\) We have:

\begin{align*} C(\mathtt{c})\amp=3\\ C(\mathtt{o})\amp=15\\ C(\mathtt{k})\amp=11\\ C(\mathtt{i})\amp=9\\ C(\mathtt{e})\amp=5\\ C(\mathtt{s})\amp=19 \end{align*}

Thus \(\mathtt{cookies}\) is encoded as the numbers

\begin{equation*} 3,15,15,11,9,5,19 \end{equation*}
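The character-by-character lookup used in this solution can be sketched in Python as follows. The function name `encode` and the string `ALPHABET` are assumptions made for this illustration, with the Python space character standing for \(\cspace\text{.}\)

```python
# Assumption: " " stands for the space symbol; position in the string is the code.
ALPHABET = " abcdefghijklmnopqrstuvwxyz"

def encode(text):
    """Apply C to each character; ALPHABET.index performs the table lookup."""
    return [ALPHABET.index(ch) for ch in text]

codes = encode("cookies")
print(", ".join(str(n) for n in codes))  # 3, 15, 15, 11, 9, 5, 19
```

In this sketch `ALPHABET.index` raises a `ValueError` for any character outside the 27-character alphabet, which matches the fact that \(C\) is only defined on \(\A\text{.}\)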

To obtain the text encoded in a sequence of numbers (more precisely, a sequence of elements of \(\Z_{27}\)), we apply the decoding function \(C^{-1}\) to each number.

Problem 8.4. Decoding.

Decode \(20, 15, 15, 0, 5, 1, 19, 25\) with the decoding function \(C^{-1}\text{.}\)

Solution.

We have

\begin{align*} C^{-1}(20)\amp=\mathtt{t}\\ C^{-1}(15)\amp=\mathtt{o}\\ C^{-1}(0)\amp=\cspace\\ C^{-1}(5)\amp=\mathtt{e}\\ C^{-1}(1)\amp=\mathtt{a}\\ C^{-1}(19)\amp=\mathtt{s}\\ C^{-1}(25)\amp=\mathtt{y} \end{align*}

Thus we obtain the words \(\mathtt{too{\cspace}easy}\text{.}\)
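The same decoding can be sketched in Python. The function name `decode` and the string `ALPHABET` are assumptions made for this illustration, with the Python space character standing for \(\cspace\text{.}\)

```python
# Assumption: " " stands for the space symbol; position in the string is the code.
ALPHABET = " abcdefghijklmnopqrstuvwxyz"

def decode(numbers):
    """Apply the inverse lookup to each number and join the characters."""
    for n in numbers:
        # Reject anything outside {0, ..., 26}; Python would otherwise
        # accept negative indices and count from the end of the string.
        if not 0 <= n < len(ALPHABET):
            raise ValueError(f"{n} is not an element of Z_27")
    return "".join(ALPHABET[n] for n in numbers)

print(decode([20, 15, 15, 0, 5, 1, 19, 25]))  # too easy
```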

In the following problems we do not explicitly evaluate the encoding function \(C\) or its inverse the decoding function \(C^{-1}\) anymore, since the encoding and decoding process is a simple lookup in Figure 8.1.

Problem 8.5. Encoding a sentence.

Encode the following text, from Devotions upon Emergent Occasions by John Donne (1624),

\begin{align*} \amp\mathtt{and{\cspace}therefore{\cspace}never{\cspace}send{\cspace}to{\cspace}know{\cspace}for{\cspace}}\\ \amp\mathtt{whom{\cspace}the{\cspace}bell{\cspace}tolls{\cspace}it{\cspace}tolls{\cspace}for{\cspace}thee} \end{align*}

with the function \(C\) from Figure 8.1.

Solution.

We obtain the sequence of elements of \(\Z_{27}\text{:}\)

1, 14, 4, 0, 20, 8, 5, 18, 5, 6, 15, 18, 5, 0, 14, 5, 22, 5, 18, 0, 19, 5, 14, 4, 0, 20, 15, 0, 11, 14, 15, 23, 0, 6, 15, 18, 0, 23, 8, 15, 13, 0, 20, 8, 5, 0, 2, 5, 12, 12, 0, 20, 15, 12, 12, 19, 0, 9, 20, 0, 20, 15, 12, 12, 19, 0, 6, 15, 18, 0, 20, 8, 5, 5
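A longer encoding such as this one is tedious to check by hand. As a hedged sketch, the Python snippet below encodes the text and confirms that decoding returns the original. The helper names `encode` and `decode` are assumptions made for this illustration, and the Python space character stands for \(\cspace\text{.}\)

```python
# Assumption: " " stands for the space symbol; position in the string is the code.
ALPHABET = " abcdefghijklmnopqrstuvwxyz"

def encode(text):
    """Look up the code of each character."""
    return [ALPHABET.index(ch) for ch in text]

def decode(numbers):
    """Look up the character of each code (codes assumed to lie in 0..26)."""
    return "".join(ALPHABET[n] for n in numbers)

text = ("and therefore never send to know for "
        "whom the bell tolls it tolls for thee")
codes = encode(text)

print(codes[:4])              # [1, 14, 4, 0]
print(decode(codes) == text)  # True
```

The round trip succeeds because the encoding function is invertible.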

Problem 8.6. Decoding a sentence.

Decode the sequence of elements of \(\Z_{27}\)

12, 5, 20, 0, 13, 5, 0, 14, 15, 20, 0, 20, 15, 0, 20, 8, 5, 0, 13, 1, 18, 18, 9, 1, 7, 5, 0, 15, 6, 0, 20, 18, 21, 5, 0, 13, 9, 14, 4, 19, 0, 1, 4, 13, 9, 20, 0, 9, 13, 16, 5, 4, 9, 13, 5, 14, 20, 19

with the function \(C^{-1}\) from Figure 8.1.

Solution.

We obtain the following text, from Shakespeare’s Sonnet 116 (1609):

\begin{align*} \amp\mathtt{let{\cspace}me{\cspace}not{\cspace}to{\cspace}the{\cspace}marriage{\cspace}of{\cspace}}\\ \amp\mathtt{true{\cspace}minds{\cspace}admit{\cspace}impediments} \end{align*}

Checkpoint 8.7. Encoding a word.

Let

\begin{gather*} C:\lbrace \mathtt{-},\mathtt{a},\mathtt{b},\mathtt{c},\dots,\mathtt{z}\rbrace \to \lbrace 0,1,2,3,\dots,26\rbrace,\\ C(\mathtt{-})=0,\ C(\mathtt{a})=1,\ \dots,\ C(\mathtt{z})=26. \end{gather*}

Encode the word into a sequence of integers, separated by commas, using the function \(C\text{:}\)

\begin{equation*} \mathtt{why} \end{equation*}

Checkpoint 8.8. Decoding a word.

We encode sequences of characters with the function \(C\) defined in Checkpoint 8.7.

Decode the sequence of integers into a string using the inverse of the encoding function \(C\text{:}\)

\begin{equation*} 13, 1, 20, 18, 9, 24 \end{equation*}
