This blog post summarizes the book titled “Information Theory”

What is Information ?

Information is measured in bits and one bit of information allows you to choose between to equally probable alternatives
If you are finding a treasure and it could be in any of the 8 different locations - If you are given data 001 and you zero in on to one of the locations, then the information you have is 3 bits
One bit of information means that there is information that can help you choose between two equally probable options
Shannon’s theory is super important as it gives precise ways to quantify signal and noise, put upper bounds to the rate at which information can be communicated within any system
Bits and Binary Digits are fundamentally different entities
One bit of information is the amount of information required to choose between two equally probable alternatives
If you have n bits of information, then you can choose from \(m=2^n\) equally probable alternatives.
Binary digit is the value of a binary variable, where the value can be either a 0 or a 1
Binary digit is not information per se
A bit is the amount of information required to choose between two equally probable alternatives, whereas a binary digit is the value of a binary variable
Telegraphy
- 26 different lines to transmit 26 alphabets
- Cooke and Wheatstone two needle system to transmit 26 alphabets
- Morse code - Efficient transmission using one wire
Encodings
- How you can encode the image ? Encoding means figuring out the information content in the message. This encoded message can be sent across to the other end to recreate the image
- Run-length encoding involves flattening the image and then encoding the greyscale values
- Difference encoding involves flattening the image and encoding only the difference values
What is information ? It is what remains after every iota of natural redundancy has been squeezed out of a message, and after every aimless syllable of noise has been removed

Entropy of Discrete variables

Terminology
- source
- messages
- symbols
- communication channel
- channel inputs
- codewords
- codebook
- code
- error
- decoder
- error rate
- channel capacity
A message comprising symbols \( s=(s_1,\ldots,s_k) \) is encoded by a function \( x=g(s) \) in to a sequence of codewords \( x= (x_1, x_2, \ldots, x_n) \), where number of symbols and codewords are not necessarily equal. These codewords are transmitted through a communication channel to product outputs \( y=(y_1, y_2, \ldots, y_n) \) which are decoded to recover the message \( s \)
Shannon’s information propertie
- Continuity
- Symmetry
- Maximal value
- Additive
Shannon information is a measure of surprise
\( h(x) = \log_2 p(x) \)
Entropy is average Shannon information
A variable with an entropy of \( H(X) \) bits provides enough Shannon information to choose between \( m=2^{H(X)} \) equally probable alternatives
Entropy is a measure of uncertainty
The average uncertainty of a variable X is summarized by its entropy \( H(X) \). If we are told the value of \( X \), then the amount of information we have been given is, one average, exactly equal to its entropy
Doubling the number of possible values of a variable adds one bit of entropy

The Source Coding Theorem

There are two reasons why information is so dilute in natural signals
- Values that are close to each other tend to have similar values
- Optimal distribution of values in a channel depends on the constraints that apply
Signal can be conveyed through a communication channel most efficiently if
- it is first transformed in to a signal with independent values
- the values of the transformed signal have a distribution which has been optimized for the particular channel
Shanon’s theorem does not talk about the way to encode the source messages. It talks only about the existence of encoding
The capacity \( C \) of a discrete noiseless channel is the maximum number of bits it can communicate, usually expressed in units of bits per second
The encoding process yields inputs with a specific distribution. The shape of this distribution determines its entropy \( H(X) \)
The key idea is that you take a set of symbols with an entropy \( H(S) \) and then you transform in a way that you can reach channel capacity
If a channel conveys \( x \) binary digits per second, then the maximum amount of information it can convey is an average of \( x \) bits per second

Shannon’s Source Coding Theorem

Let a source have entropy \(H \) (bits per symbol) and a channel have a capacity \(C\)(bits per second). Then it is possible to encode the output of the source in such a way to transmit at the average rate \(C/H - \epsilon\) per second over the channel where ε is arbitrarily small. It is not possible to transmit at an average rate greater than \( C/H \)

If the average number of binary digits in each codeword is \(L(X)\), then the coding efficiency of this simple code is \( H(S)/L(X) \)
If you consider the outcomes of a pair of dice, then the entropy of the outcomes is close to 3.27 bits/symbol. Since there are 11 possible values, one can use a 4 binary digits/symbole to code.In this case, the coding efficiency turns out to be 0.818 bits/binary digit. Using Huffman coding, it is possible to get a coding efficiency of 0.99 bits/binary digits
One can look at the english letters, find the relative frequency of the letters and compute the entropy of the symbol set. It is usually around 4.08 bits/letter
In order to encode this data, one might also have to think about the fact that the occurrence of the alphabets are not truly iid. The alphabets depend on the context. Shannon entropy goes down as we consider blocks of text. In fact if we consider increasing lengths of blocks, it can be empirically shown that the entropy reaches an asymptotic value of 1.8 bits/letter
Shannon proved the following in the case of dependent sequences
- the most common sub-sequences comprise a tiny proportion of the possible sub-sequences
- the common sub-sequences occur with a collective probability of about one
- Each of these sub-sequences occurs with about the same probability
Kolmogorov’s complexity is the length of the shortest computer program capable of describing an object
- Kolmogorov complexity is non-computable

The Noisy Channel Coding Theorem

The mutual information \( I(X,Y) \) between two variable \( X \) is the average information that we gain about \(Y\) after we have observed a single value of \(X\).
The amount of residual uncertainty that we have for \(Y\) after we \(X\) is called the conditional entropy \(H(Y|X)\). Also called noise entropy
Entropy of Joint probability distribution

\begin{align} H(X,Y) & = \mathbf E \left[ \log {1 \over p(x,y)} \right] \end{align}

If \(X\) and \(Y\) are independent, then the entropy of the joint distribution \(p(X,Y)\) is equal to the summed entropies of its marginal distributions

\begin{align} H(X,Y) & = H(X) + H(Y) \end{align}

The mutual information \(I(X,Y)\) can be represented as

\begin{align} H(X,Y) & = H(X) + H(Y) - I(X,Y) \end{align}

Entropy of Joint Distribution

\begin{align} H(X,Y) &= \sum_{i=1}^N \sum_{j=1}^M \log {1 \over p(x_i,y_j)} \end{align}

Conditional Entropy \(H(Y|X)\)

\begin{align} I(X,Y) &= H(Y) - H(Y|X) \\
I(X,Y) &= H(X) - H(X|Y) \\
H(Y|X) &= H(\eta) \\
H(X,Y) &= H(Y) + H(X|Y) \\
H(X,Y) &= H(X) + H(Y|X) \\
\end{align}

Transmission Efficiency

\begin{align} TE &= {I(X,Y) \over H(Y)} \end{align}

One can introduce some redundancy in the message using error correcting codes so as to reduce the effect of channel noise. The redundancy helps in detection and correction of incorrect output messages
Parity bit for row and column matrix of data is a simple yet effective way of detecting and correcting errors at the receiver end
There is a tradeoff between the robustness of the encoded message and the number of extra binary digits required to make the message robust
Capacity of a Noisy Channel

\begin{align} C &= \max_{p(X)} I(X,Y) , \text{bits} \\
C &= \max_{p(X)} H(X) - H(X|Y) , \text{bits} \end{align}

Shannon’s Noisy Channel Coding Theorem

Let a discrete channel have the capacity \(C\) and a discrete source the entropy per second \(H\). If \(H \leq C \) there exists a coding system such that the output of the source can be transmitted over the channel with an arbitrary small frequency of errors. If \(H \geq C \), it is possible to encode the source so that the equivocation is less than \( H-C+\epsilon \) where \( \epsilon \) is arbitrarily small. There is no method of encoding which gives an equivocation less than \(H-C\)

Fantastic example of a noisy type writer that can be used to communicate messages with zero error rate
When averaged over all possible codebooks, if the average error rate is \( \epsilon \), then three must exist at least one codebook which produces an error as small as \(\epsilon\)

Entropy of Continuous Variables

For a continuous variable, as the discretization goes down, the entropy diverges to infinity
The entropy of a continuous variable is infinite because it includes a constant term which is infinite. If we ignore the term, then we obtain the differential entropy

\begin{align} H_{dif}(X) &= \int^\infty_{-\infty} p(x) \log {1 \over p(x)} dx \end{align}

If a variable \(X\) has entropy \(H\), and if each value of the probability distribution \(p(X)\) is estimated as a relative frequency, then the resultant estimated entropy \(H_{MLE}\) tends to be smaller than \(H\)
Multiplying a continuous variable \(X\) by a constant \(c\) changes the range of values, which changes the entropy of \(X\) by an amount \(\log |c|\)
Adding a constant \(c\) to a continuous variable \(X\) has no effect on its entropy
Given an continuous variable \(X\) with a fixed range, the distribution with maximum entropy is the uniform distribution
Given a continuous positive variable \(X\) which has a mean \(\mu \), but is otherwise unconstrained, the distribution with maximum entropy is the exponential distribution
Given a continuous positive variable \(X\) which has a variance \(\sigma \), but is otherwise unconstrained, the distribution with maximum entropy is the Gaussian distribution
Noise limits the amount of information conveyed by a continuous variable, and, to all intents and purposes transforms it into a discrete variable with \(m\) discriminable values, where \(m\) decreases as noise increases
An initial uncertainty of 100% is reduced to 71% after receiving half a bit of information and to \(2^{-H}\) after receiving \(H\) bits.

Mutual Information: Continuous

Mutual information is sensitive to the strength of association between two variables, but it is essentially “blind” to the nature of that relationship. If the two variables are related to each other, then it does not matter how complex or subtle their relationships is.
Conditional Entropy \(H(Y|X\)): A slice through \(p(X,Y)\) at \(X=x_1 \) defines a one-dimensional distribution \(p(Y|x_1) \) with entropy \(H(Y|x_1) \). The conditional entropy \(H(Y|X)\) is the average entropy of all such slices through \(p(X,Y)\). where this average is taken over all values of \(X\)
Conditional Entropy \(H(X|Y\)): A slice through \(p(X,Y)\) at \(Y=y_1 \) defines a one-dimensional distribution \(p(X|y_1) \) with entropy \(H(X|y_1) \). The conditional entropy \(H(X|Y)\) is the average entropy of all such slices through \(p(X,Y)\). where this average is taken over all values of \(X\)
Conditional Entropy

\begin{align} H(Y|X) &= \mathbf E \left [ \log {1 \over p(y|x) }\right] \end{align}

Relation between Mutual Information and Conditional Entropy

\begin{align} I(X,Y) &= H(Y) - H(Y|X) \\
I(X,Y) &= H(X) - H(X|Y) \end{align}

The mutual information is that part of the joint entropy \(H(X,Y)\) that is left over once we have removed the part \(H(X|Y) + H(Y|X) \) due to noise
Kullback-Leibler Divergence

\begin{align} D_{KL}(p(X)||q(X)) &= \int_x p(x) \log {p(x) \over q(x)} \end{align}

Mutual information between \(X\) and \(Y\) is the KL-divergence between the joint distribution \(p(X,Y)\) and the joint distribution \(p(X)p(Y)\) obtained by the outer product of the marginal distributions
Understanding from Bayes perspective: Mutual information is the expected KL-divergence between the posterior and the prior.

\begin{align} I(X,Y) &= \mathbf E_y [ D_{KL}(p(X|y)||p(X))] \end{align}

Channel Capacity:Continuous

Key question under consideration is, What input distribution \(p(X)\) maximises the rate at which the information can be communicated through a noisy channel
Input distribution which maximises information transmission depends on the channel
- Channel has fixed variance and infinite range
- Channel has a finite range
Shannon’s equation(\(P\) is the signal power and \(N\) is the noise power)

\begin{align} C &= {1 \over 2 } \log \left (1 + {P \over N} \right) \end{align}

Given that \( Y = X + \eta \), where \(X = g(S) \), and that the noise is negligible, maximizing the mutual information between the bounded input \( X \) and the output \( Y \) amounts to finding a form for \(g\) which makes \(p(X)\) uniform

Thermodynamic Entropy and Information

Entropy has been defined at least twice in the history of science. First, it was defined in physics as thermodynamic entropy by Boltzmann and Gibbs in 1870’s. Subsequently it was defined in mathematics by Shannon
Shannon’s information entropy is a measure of information
Thermodynamic entropy is a measure of number of states a physical system can adopt
Why are two measures required ? Are they related - Yes
Relationship matters because thermodynamic entropy can be used to measure the energy cost of Shannon’s information entropy
A single macro state can have several micro-states
Each macro-state is consistent with many equally probably micro-states. So, macro-states with many micro-states are more probably than macro-states with few micro-states
If you throw a pair of dice, then total score has several possible values(2-12).For each of this value, there could several combinations of individual dice outcomes. Think of each score as macro-state and combinations as micro-states
Landauer Limit : Lower limit to the amount of energy required to acquire or transmit one bit of information
No matter how efficient any physical device is, it can acquire one bit of information only if it expends at least 0.693kT joules of energy
Second law of thermodynamics - Entropy of an isolated system increases until it reaches a maximum value
Maxwell’s demon cannot exist because of Landauer limit

Information as Nature’s Currency

The last chapter talks about several interesting applications of Information theory. I found the chapter fascinating and will re-read at a later point in time to appreciate it better.

Takeaway

Thanks to this book, I have understood some of the basic principles of Information theory. All I can say is that I have learnt the basic vocab of the subject. It is like learning syntax of a programming language. I need to think through this syntax and apply this syntax to a problem. The problem that I would be solving is the non parametric estimation of volatility surface for stocks listed on SGX.

My next step in the journey would be to read and work through some of the following books

The Information a history, a theory, a flood by James Gleick
A Mind at Play How Claude Shannon Invented the Information Age by Jimmy Soni, Rob Goodman
Information theory, inference, and learning algorithms by David J. C. MacKay
Elements of Information Theory by Thomas M. Cover, Joy A. Thomas
Decoding Universe
An Introduction to Information Theory Symbols, Signals and Noise by John R. Pierce
The Logician and the Engineer How George Boole and Claude Shannon Created the Information Age by Paul J. Nahin
Information Theory by Robert B. Ash (z-lib.org)
Information and Coding Theory by Gareth A. Jones, J.Mary Jones

I am grateful to the author, James V Stone, for making the subject accessible to a total novice like me. Hopefully I will work through some of the principles, internalize them and create a useful application.

Information Theory Book Review

Contents