# Information Theory Book Review

## Contents

This blog post summarizes the book “Information Theory” by James V Stone.

# What is Information?

- Information is measured in bits, and one bit of information allows you to choose between two equally probable alternatives
- Suppose you are hunting for a treasure that could be in any of 8 different locations. If you are given the data 001 and it lets you zero in on one of the locations, then the information you have received is 3 bits, since \(\log_2 8 = 3\)
- One bit of information means that there is information that can help you choose between two equally probable options
- Shannon’s theory is important because it gives precise ways to quantify signal and noise, and it puts upper bounds on the rate at which information can be communicated within any system
- Bits and Binary Digits are fundamentally different entities
- One bit of information is the amount of information required to choose between two equally probable alternatives
- If you have n bits of information, then you can choose from \(m=2^n\) equally probable alternatives.
- A binary digit is the value of a binary variable, where the value can be either a 0 or a 1
- A binary digit is not information per se
- A bit is the amount of information required to choose between two equally probable alternatives, whereas a binary digit is the value of a binary variable
- Telegraphy
    - 26 different lines to transmit the 26 letters of the alphabet
    - Cooke and Wheatstone's two-needle system to transmit the 26 letters
    - Morse code: efficient transmission using one wire
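
The treasure example from this section can be checked with a few lines of Python. This is only a minimal sketch: the helper names `bits_needed` and `locate` are made up for illustration, and it assumes the locations are numbered 0 to 7 and that each binary digit halves the set of remaining candidates.

```python
import math


def bits_needed(alternatives: int) -> float:
    """Information (in bits) needed to pick one of `alternatives` equally probable options."""
    return math.log2(alternatives)


def locate(binary_digits: str) -> int:
    """Use each binary digit to halve the candidate locations; return the one that remains."""
    low, high = 0, 2 ** len(binary_digits)  # candidates are low, ..., high - 1
    for digit in binary_digits:
        mid = (low + high) // 2
        if digit == "0":
            high = mid  # keep the lower half
        else:
            low = mid   # keep the upper half
    return low


print(bits_needed(8))  # 3.0
print(locate("001"))   # 1
```

Here the three binary digits carry three bits of information only because the 8 locations are assumed to be equally probable beforehand.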

- Encodings
    - How can you encode an image? Encoding means figuring out the information content in the message. The encoded message can be sent to the other end to recreate the image
    - Run-length encoding involves flattening the image and then encoding each run of identical greyscale values as a value and a run length
    - Difference encoding involves flattening the image and encoding only the differences between adjacent values
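
The two encodings above can be sketched as follows. This is an illustrative version rather than the book's own code: it assumes the image has already been flattened into a list of integer greyscale values, and the function names are hypothetical.

```python
from itertools import groupby


def run_length_encode(pixels):
    """Encode a flattened image as (value, run length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(pixels)]


def run_length_decode(pairs):
    """Invert run_length_encode."""
    return [value for value, count in pairs for _ in range(count)]


def difference_encode(pixels):
    """Keep the first value, then only the difference from the previous value."""
    if not pixels:
        return []
    return [pixels[0]] + [b - a for a, b in zip(pixels, pixels[1:])]


def difference_decode(diffs):
    """Invert difference_encode with a running sum."""
    out, total = [], 0
    for d in diffs:
        total += d
        out.append(total)
    return out


row = [5, 5, 5, 6, 6, 9]
print(run_length_encode(row))   # [(5, 3), (6, 2), (9, 1)]
print(difference_encode(row))   # [5, 0, 0, 1, 0, 3]
```

Both encodings are lossless, so decoding recovers the original row. They pay off when neighbouring pixels tend to be similar, which makes runs long and differences small.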

- What is information? It is what remains after every iota of natural redundancy has been squeezed out of a message, and after every aimless syllable of noise has been removed

# Entropy of Discrete Variables

- Terminology
    - source
    - messages
    - symbols
    - communication channel
    - channel inputs
    - codewords
    - codebook
    - code
    - error
    - decoder
    - error rate
    - channel capacity

- A message comprising symbols \( s=(s_1,\ldots,s_k) \) is encoded by a function \( x=g(s) \) into a sequence of codewords \( x= (x_1, x_2, \ldots, x_n) \), where the number of symbols and the number of codewords are not necessarily equal. These codewords are transmitted through a communication channel to produce outputs \( y=(y_1, y_2, \ldots, y_n) \), which are decoded to recover the message \( s \)
- Shannon’s information properties
    - Continuity
    - Symmetry
    - Maximal value
    - Additive

- Shannon information is a measure of surprise
- \( h(x) = \log_2 {1 \over p(x)} = -\log_2 p(x) \)
- Entropy is average Shannon information
- A variable with an entropy of \( H(X) \) bits provides enough Shannon information to choose between \( m=2^{H(X)} \) equally probable alternatives
- Entropy is a measure of uncertainty
- The average uncertainty of a variable \( X \) is summarized by its entropy \( H(X) \). If we are told the value of \( X \), then the amount of information we have been given is, on average, exactly equal to its entropy
- Doubling the number of possible values of a variable adds one bit of entropy
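
A short sketch makes these definitions concrete. The function names `surprisal` and `entropy` are illustrative choices, and the distributions are small made-up examples.

```python
import math


def surprisal(p: float) -> float:
    """Shannon information (in bits) of an outcome with probability p."""
    return math.log2(1 / p)


def entropy(probs) -> float:
    """Average Shannon information (in bits) of a discrete distribution."""
    return sum(p * surprisal(p) for p in probs if p > 0)


fair_coin = [0.5, 0.5]
biased_coin = [0.9, 0.1]
eight_sided = [1 / 8] * 8
sixteen_sided = [1 / 16] * 16

print(entropy(fair_coin))      # 1.0 bit
print(entropy(biased_coin))    # about 0.47 bits: less uncertainty, less information per toss
print(entropy(eight_sided))    # 3.0 bits, enough to choose between 2**3 = 8 alternatives
print(entropy(sixteen_sided))  # 4.0 bits: doubling the number of equally probable values adds one bit
```

Note that the "one extra bit per doubling" rule holds when the values are equally probable, as in the two uniform examples here.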

# The Source Coding Theorem

- There are two reasons why information is so dilute in natural signals
    - Neighbouring values in a signal tend to be similar to each other, so they are not independent
    - The optimal distribution of values in a channel depends on the constraints that apply

- A signal can be conveyed through a communication channel most efficiently if
    - it is first transformed into a signal with independent values
    - the values of the transformed signal have a distribution which has been optimized for the particular channel

- Shannon’s theorem does not say how to encode the source messages. It only guarantees that a suitable encoding exists
- The capacity \( C \) of a discrete noiseless channel is the maximum number of bits it can communicate, usually expressed in units of bits per second
- The encoding process yields inputs with a specific distribution. The shape of this distribution determines its entropy \( H(X) \)
- The key idea is that you take a set of symbols with an entropy \( H(S) \) and then transform them in such a way that the channel inputs reach the channel capacity
- If a channel conveys \( x \) binary digits per second, then the maximum amount of information it can convey is an average of \( x \) bits per second

## Shannon’s Source Coding Theorem

Let a source have entropy \(H\) (bits per symbol) and a channel have a capacity \(C\) (bits per second). Then it is possible to encode the output of the source in such a way as to transmit at the average rate \(C/H - \epsilon\) symbols per second over the channel, where \(\epsilon\) is arbitrarily small. It is not possible to transmit at an average rate greater than \( C/H \) symbols per second.

- If the average number of binary digits in each codeword is \(L(X)\), then the *coding efficiency* of the code is \( H(S)/L(X) \)
- If you consider the summed outcome of a pair of dice, then the entropy of the outcomes is close to 3.27 bits/symbol. Since there are 11 possible values, one can use 4 binary digits/symbol to code them. In this case, the coding efficiency turns out to be 0.818 bits/binary digit. Using Huffman coding, it is possible to get a coding efficiency of 0.99 bits/binary digit
- One can look at English letters, find the relative frequency of the letters and compute the entropy of the symbol set. It is usually around 4.08 bits/letter
- In order to encode this data, one also has to account for the fact that the letters are not truly iid. Each letter depends on its context. Shannon entropy per letter goes down as we consider blocks of text. In fact, if we consider blocks of increasing length, it can be empirically shown that the entropy reaches an asymptotic value of 1.8 bits/letter
- Shannon proved the following in the case of dependent sequences
    - the most common sub-sequences comprise a tiny proportion of the possible sub-sequences
    - the common sub-sequences occur with a collective probability of about one
    - each of these sub-sequences occurs with about the same probability
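
The pair-of-dice figures quoted above can be reproduced with a small sketch. It assumes the symbol is the sum of two fair dice, and it builds a binary Huffman code with the standard library's `heapq`. The helper name `huffman_lengths` is hypothetical, and this is one common way to construct such a code, not necessarily the construction used in the book.

```python
import heapq
import math
from collections import Counter
from itertools import product

# Number of ways (out of 36) to obtain each total with two fair dice.
counts = Counter(a + b for a, b in product(range(1, 7), repeat=2))
total_outcomes = sum(counts.values())  # 36

# Entropy of the total, in bits per symbol.
entropy = sum((n / total_outcomes) * math.log2(total_outcomes / n) for n in counts.values())


def huffman_lengths(weights):
    """Return {symbol: codeword length} for a binary Huffman code built from integer weights."""
    # Heap entries are (weight, tie_breaker, {symbol: depth}).
    # The unique tie_breaker guarantees the dicts are never compared.
    heap = [(w, i, {s: 0}) for i, (s, w) in enumerate(weights.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        w1, _, d1 = heapq.heappop(heap)
        w2, _, d2 = heapq.heappop(heap)
        # Merging two subtrees pushes every symbol in them one level deeper.
        merged = {s: depth + 1 for s, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (w1 + w2, next_id, merged))
        next_id += 1
    return heap[0][2]


lengths = huffman_lengths(counts)
huffman_avg_len = sum(counts[s] * lengths[s] for s in counts) / total_outcomes

fixed_efficiency = entropy / 4            # 4 binary digits for every symbol
huffman_efficiency = entropy / huffman_avg_len

print(round(entropy, 3), round(fixed_efficiency, 3), round(huffman_efficiency, 3))
```

Under these assumptions the fixed-length code comes out at roughly 0.82 bits/binary digit and the Huffman code at roughly 0.99, in line with the numbers in the notes.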

- Kolmogorov complexity is the length of the shortest computer program capable of describing an object
- Kolmogorov complexity is *non-computable*


# The Noisy Channel Coding Theorem

- The *mutual information* \( I(X,Y) \) between two variables \(X\) and \(Y\) is the average information that we gain about \(Y\) after we have observed a single value of \(X\)
- The amount of residual uncertainty that we have about \(Y\) after we have observed \(X\) is called the *conditional entropy* \(H(Y|X)\), also called the *noise entropy*
- Entropy of the joint probability distribution

\begin{align} H(X,Y) & = \mathbf E \left[ \log {1 \over p(x,y)} \right] \end{align}

- If \(X\) and \(Y\) are independent, then the entropy of the joint distribution \(p(X,Y)\) is equal to the summed entropies of its marginal distributions

\begin{align} H(X,Y) & = H(X) + H(Y) \end{align}

- The *mutual information* \(I(X,Y)\) is related to the joint entropy by

\begin{align} H(X,Y) & = H(X) + H(Y) - I(X,Y) \end{align}

- Entropy of Joint Distribution

\begin{align} H(X,Y) &= \sum_{i=1}^N \sum_{j=1}^M p(x_i,y_j) \log {1 \over p(x_i,y_j)} \end{align}

- Conditional Entropy \(H(Y|X)\)

\begin{align}
I(X,Y) &= H(Y) - H(Y|X) \\
I(X,Y) &= H(X) - H(X|Y) \\
H(Y|X) &= H(\eta) \\
H(X,Y) &= H(Y) + H(X|Y) \\
H(X,Y) &= H(X) + H(Y|X)
\end{align}
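
These identities can be verified numerically on a small example. The sketch below uses a made-up 2x2 joint distribution (rows index \(X\), columns index \(Y\)); the variable names are illustrative only.

```python
import math

# Hypothetical joint distribution p(x, y): rows are values of X, columns are values of Y.
joint = [[0.4, 0.1],
         [0.1, 0.4]]


def entropy(probs):
    """Entropy in bits of a list of probabilities."""
    return sum(p * math.log2(1 / p) for p in probs if p > 0)


p_x = [sum(row) for row in joint]        # marginal p(x)
p_y = [sum(col) for col in zip(*joint)]  # marginal p(y)

H_x = entropy(p_x)
H_y = entropy(p_y)
H_xy = entropy([p for row in joint for p in row])

# Conditional entropy H(Y|X): average entropy of each row after normalising it to p(y|x).
H_y_given_x = sum(
    p_x[i] * entropy([p / p_x[i] for p in joint[i]]) for i in range(len(joint))
)

mutual_info = H_x + H_y - H_xy

print(round(H_xy, 4), round(H_y_given_x, 4), round(mutual_info, 4))
```

For this table the mutual information is about 0.28 bits, and it agrees with both \(H(Y) - H(Y|X)\) and \(H(X) + H(Y) - H(X,Y)\) up to floating-point error.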

- Transmission Efficiency

\begin{align} TE &= {I(X,Y) \over H(Y)} \end{align}

- One can introduce some redundancy in the message using error correcting codes so as to reduce the effect of channel noise. The redundancy helps in detection and correction of incorrect output messages
- Parity bit for row and column matrix of data is a simple yet effective way of detecting and correcting errors at the receiver end
- There is a tradeoff between the robustness of the encoded message and the number of extra binary digits required to make the message robust
- Capacity of a Noisy Channel

\begin{align}
C &= \max_{p(X)} I(X,Y) \ \text{bits} \\
C &= \max_{p(X)} \left[ H(X) - H(X|Y) \right] \ \text{bits}
\end{align}
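
As an illustration of the maximisation over \(p(X)\), the sketch below searches over input distributions for a binary symmetric channel. The choice of channel is an assumption made here for the example (it is not taken from the notes above): each binary input is flipped with probability `f`, so the noise entropy \(H(Y|X)\) equals the binary entropy of `f`.

```python
import math


def binary_entropy(p: float) -> float:
    """Entropy in bits of a binary variable with P(1) = p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return p * math.log2(1 / p) + (1 - p) * math.log2(1 / (1 - p))


def mutual_information(q: float, f: float) -> float:
    """I(X,Y) = H(Y) - H(Y|X) for a binary symmetric channel with P(X=1) = q and flip probability f."""
    p_y1 = q * (1 - f) + (1 - q) * f
    return binary_entropy(p_y1) - binary_entropy(f)


def capacity(f: float, steps: int = 1000):
    """Grid search over input distributions; returns (capacity in bits, best P(X=1))."""
    return max((mutual_information(i / steps, f), i / steps) for i in range(steps + 1))


c, best_q = capacity(0.1)
print(round(c, 4), best_q)  # capacity is reached with equally probable inputs
```

For this channel the search agrees with the closed form \(C = 1 - H(f)\), which is about 0.53 bits per channel use when `f = 0.1`.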

## Shannon’s Noisy Channel Coding Theorem

Let a discrete channel have the capacity \(C\) and a discrete source the entropy per second \(H\). If \(H \leq C \), there exists a coding system such that the output of the source can be transmitted over the channel with an arbitrarily small frequency of errors. If \(H > C \), it is possible to encode the source so that the equivocation is less than \( H-C+\epsilon \), where \( \epsilon \) is arbitrarily small. There is no method of encoding which gives an equivocation less than \(H-C\).

- The noisy typewriter is a fantastic example of a noisy channel that can still be used to communicate messages with zero error rate
- When averaged over all possible codebooks, if the average error rate is \( \epsilon \), then there must exist at least one codebook which produces an error rate as small as \(\epsilon\)

# Entropy of Continuous Variables

- For a continuous variable, as the discretization becomes finer (the bin width shrinks towards zero), the entropy diverges to infinity
- The entropy of a continuous variable is infinite because it includes a constant term which is infinite. If we ignore the term, then we obtain the differential entropy

\begin{align} H_{dif}(X) &= \int^\infty_{-\infty} p(x) \log {1 \over p(x)} dx \end{align}
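
The divergence can be seen numerically. The sketch below is an illustration under stated assumptions: it discretizes a unit-variance Gaussian into bins of width `delta`, truncates the tails at ten standard deviations, and compares the resulting entropy with the differential entropy plus \(\log_2(1/\Delta x)\). The function names are made up for the example.

```python
import math


def gaussian_cdf(x: float, sigma: float = 1.0) -> float:
    """Cumulative distribution function of a zero-mean Gaussian."""
    return 0.5 * (1 + math.erf(x / (sigma * math.sqrt(2))))


def discretized_entropy(delta: float, sigma: float = 1.0, span: float = 10.0) -> float:
    """Entropy (bits) of a zero-mean Gaussian binned into intervals of width delta."""
    n = int(span * sigma / delta)
    h = 0.0
    for k in range(-n, n):
        p = gaussian_cdf((k + 1) * delta, sigma) - gaussian_cdf(k * delta, sigma)
        if p > 0:
            h += p * math.log2(1 / p)
    return h


def gaussian_differential_entropy(sigma: float = 1.0) -> float:
    """Differential entropy (bits) of a Gaussian with standard deviation sigma."""
    return 0.5 * math.log2(2 * math.pi * math.e * sigma ** 2)


for delta in (0.4, 0.2, 0.1, 0.05):
    approx = gaussian_differential_entropy() + math.log2(1 / delta)
    print(delta, round(discretized_entropy(delta), 3), round(approx, 3))
```

Each halving of the bin width adds about one bit, so the discrete entropy grows without bound, while the differential entropy term stays fixed.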

- If a variable \(X\) has entropy \(H\), and if each value of the probability distribution \(p(X)\) is estimated as a relative frequency, then the resultant estimated entropy \(H_{MLE}\) tends to be smaller than \(H\)
- Multiplying a continuous variable \(X\) by a constant \(c\) changes the range of values, which changes the entropy of \(X\) by an amount \(\log |c|\)
- Adding a constant \(c\) to a continuous variable \(X\) has no effect on its entropy
- Given a continuous variable \(X\) with a fixed range, the distribution with maximum entropy is the *uniform distribution*
- Given a continuous positive variable \(X\) which has a mean \(\mu \), but is otherwise unconstrained, the distribution with maximum entropy is the *exponential distribution*
- Given a continuous variable \(X\) which has a variance \(\sigma^2 \), but is otherwise unconstrained, the distribution with maximum entropy is the *Gaussian distribution*
- Noise limits the amount of information conveyed by a continuous variable, and, to all intents and purposes, transforms it into a discrete variable with \(m\) discriminable values, where \(m\) decreases as noise increases
- An initial uncertainty of 100% is reduced to 71% after receiving half a bit of information and to \(2^{-H}\) after receiving \(H\) bits.
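
A few of these statements can be checked with the standard closed-form differential entropies. The sketch below uses those closed forms (in bits) rather than numerical integration, and the function names are illustrative.

```python
import math


def h_gaussian(sigma: float) -> float:
    """Differential entropy (bits) of a Gaussian with standard deviation sigma."""
    return 0.5 * math.log2(2 * math.pi * math.e * sigma ** 2)


def h_uniform(width: float) -> float:
    """Differential entropy (bits) of a uniform distribution over an interval of the given width."""
    return math.log2(width)


def h_exponential(mean: float) -> float:
    """Differential entropy (bits) of an exponential distribution with the given mean."""
    return math.log2(math.e * mean)


sigma = 1.0
# A uniform distribution of width sigma * sqrt(12) has the same variance as the Gaussian.
print(h_gaussian(sigma), h_uniform(sigma * math.sqrt(12)))  # the Gaussian is larger

# A uniform distribution on [0, 2] has mean 1, the same as an exponential with mean 1.
print(h_exponential(1.0), h_uniform(2.0))                   # the exponential is larger

# Multiplying X by c = 3 scales sigma by 3 and shifts the entropy by log2(3).
print(h_gaussian(3 * sigma) - h_gaussian(sigma), math.log2(3))

# Residual uncertainty after receiving H bits is 2**(-H).
print(2 ** -0.5)  # about 0.71, i.e. 71% after half a bit
```

The comparisons only test each maximum-entropy distribution against one alternative with the same constraint, so they illustrate the claims rather than prove them.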

# Mutual Information: Continuous

- Mutual information is sensitive to the strength of association between two variables, but it is essentially “blind” to the nature of that relationship. If the two variables are related to each other, then it does not matter how complex or subtle their relationships is.
- Conditional Entropy \(H(Y|X)\): A slice through \(p(X,Y)\) at \(X=x_1 \) defines a one-dimensional distribution \(p(Y|x_1) \) with entropy \(H(Y|x_1) \). The conditional entropy \(H(Y|X)\) is the average entropy of all such slices through \(p(X,Y)\), where this average is taken over all values of \(X\)
- Conditional Entropy \(H(X|Y)\): A slice through \(p(X,Y)\) at \(Y=y_1 \) defines a one-dimensional distribution \(p(X|y_1) \) with entropy \(H(X|y_1) \). The conditional entropy \(H(X|Y)\) is the average entropy of all such slices through \(p(X,Y)\), where this average is taken over all values of \(Y\)
- Conditional Entropy

\begin{align} H(Y|X) &= \mathbf E \left [ \log {1 \over p(y|x) }\right] \end{align}

- Relation between Mutual Information and Conditional Entropy

\begin{align}
I(X,Y) &= H(Y) - H(Y|X) \\
I(X,Y) &= H(X) - H(X|Y)
\end{align}

- The mutual information is that part of the joint entropy \(H(X,Y)\) that is left over once we have removed the part \(H(X|Y) + H(Y|X) \) due to noise
- Kullback-Leibler Divergence

\begin{align} D_{KL}(p(X)||q(X)) &= \int_x p(x) \log {p(x) \over q(x)} \, dx \end{align}

- Mutual information between \(X\) and \(Y\) is the KL-divergence between the joint distribution \(p(X,Y)\) and the product distribution \(p(X)p(Y)\) obtained as the outer product of the marginal distributions
- Understanding from Bayes perspective: Mutual information is the expected KL-divergence between the posterior and the prior.

\begin{align} I(X,Y) &= \mathbf E_y [ D_{KL}(p(X|y)||p(X))] \end{align}
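
Both views of mutual information can be compared on a small discrete example, where the integral in the KL-divergence becomes a sum. The joint table below is made up for illustration, and the helper name `kl_divergence` is hypothetical.

```python
import math


def kl_divergence(p, q):
    """D_KL(p || q) in bits for two discrete distributions given as equal-length lists."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical joint distribution p(x, y): rows are values of X, columns are values of Y.
joint = [[0.4, 0.1],
         [0.1, 0.4]]
n_x, n_y = len(joint), len(joint[0])

p_x = [sum(row) for row in joint]
p_y = [sum(col) for col in zip(*joint)]

# View 1: KL-divergence between the joint and the outer product of the marginals.
flat_joint = [joint[i][j] for i in range(n_x) for j in range(n_y)]
flat_product = [p_x[i] * p_y[j] for i in range(n_x) for j in range(n_y)]
mi_from_kl = kl_divergence(flat_joint, flat_product)

# View 2: expected KL-divergence between the posterior p(X|y) and the prior p(X).
mi_bayes = sum(
    p_y[j] * kl_divergence([joint[i][j] / p_y[j] for i in range(n_x)], p_x)
    for j in range(n_y)
)

print(round(mi_from_kl, 4), round(mi_bayes, 4))
```

The two numbers coincide (about 0.28 bits for this table), which is what the identity above predicts.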

# Channel Capacity: Continuous

- The key question under consideration is: what input distribution \(p(X)\) maximises the rate at which information can be communicated through a noisy channel?
- The input distribution which maximises information transmission depends on the channel
    - Channel has fixed variance and infinite range
    - Channel has a finite range

- Shannon’s equation (\(P\) is the signal power and \(N\) is the noise power)

\begin{align} C &= {1 \over 2 } \log \left (1 + {P \over N} \right) \end{align}
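
A quick numerical reading of this equation, assuming the logarithm is taken to base 2 so that the capacity is in bits per transmitted value. The function name `gaussian_capacity` is a made-up label for this sketch.

```python
import math


def gaussian_capacity(signal_power: float, noise_power: float) -> float:
    """Capacity in bits per transmitted value, C = 0.5 * log2(1 + P/N)."""
    return 0.5 * math.log2(1 + signal_power / noise_power)


for snr in (1, 3, 15, 255):
    print(snr, gaussian_capacity(snr, 1.0))
```

With noise power fixed at 1, signal-to-noise ratios of 3, 15 and 255 give 1, 2 and 4 bits, so each extra bit requires roughly squaring \(1 + P/N\).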

- Given that \( Y = X + \eta \), where \(X = g(S) \), and that the noise is negligible, maximizing the mutual information between the bounded input \( X \) and the output \( Y \) amounts to finding a form for \(g\) which makes \(p(X)\) uniform

# Thermodynamic Entropy and Information

- Entropy has been defined at least twice in the history of science. First, it was defined in physics as thermodynamic entropy by Boltzmann and Gibbs in the 1870s. Subsequently, it was defined in mathematics by Shannon
- Shannon’s information entropy is a measure of information
- Thermodynamic entropy is a measure of number of states a physical system can adopt
- Why are two measures required? Are they related? Yes
- The relationship matters because thermodynamic entropy can be used to measure the energy cost of Shannon’s information entropy
- A single macro state can have several micro-states
- Each macro-state is consistent with many equally probable micro-states. So, macro-states with many micro-states are more probable than macro-states with few micro-states
- If you throw a pair of dice, then the total score has several possible values (2-12). For each of these values, there could be several combinations of individual dice outcomes. Think of each score as a macro-state and the combinations as micro-states
- **Landauer Limit**: Lower limit to the amount of energy required to acquire or transmit one bit of information
- No matter how efficient any physical device is, it can acquire one bit of information only if it expends at least \(0.693kT\) joules of energy (that is, \(kT \ln 2\), where \(k\) is Boltzmann's constant and \(T\) is the absolute temperature)
- Second law of thermodynamics: the entropy of an isolated system increases until it reaches a maximum value
- Maxwell’s demon cannot exist because of the Landauer limit
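
Two of the points in this section lend themselves to a quick calculation: the number of micro-states behind each dice macro-state, and the size of the Landauer limit. The sketch below assumes a temperature of 300 K purely as an example, and the names are illustrative.

```python
import math
from collections import Counter
from itertools import product

BOLTZMANN_K = 1.380649e-23  # Boltzmann's constant in joules per kelvin


def landauer_limit(temperature_kelvin: float) -> float:
    """Minimum energy in joules per bit, k * T * ln(2)."""
    return BOLTZMANN_K * temperature_kelvin * math.log(2)


# Macro-state = total score of two dice; micro-states = ordered pairs giving that score.
microstates = Counter(a + b for a, b in product(range(1, 7), repeat=2))

for score in sorted(microstates):
    print(score, microstates[score], round(microstates[score] / 36, 3))

print(landauer_limit(300.0))  # on the order of 3e-21 joules per bit
```

A score of 7 has six micro-states while 2 and 12 have one each, which is why 7 is the most probable macro-state.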

# Information as Nature’s Currency

The last chapter talks about several interesting applications of Information theory. I found the chapter fascinating and will re-read at a later point in time to appreciate it better.

# Takeaway

Thanks to this book, I have understood some of the basic principles of Information theory. All I can say is that I have learnt the basic vocab of the subject. It is like learning the syntax of a programming language. I need to think through this syntax and apply it to a problem. The problem that I would be solving is the nonparametric estimation of the volatility surface for stocks listed on SGX.

My next step in the journey would be to read and work through some of the following books

- The Information: A History, a Theory, a Flood by James Gleick
- A Mind at Play: How Claude Shannon Invented the Information Age by Jimmy Soni, Rob Goodman
- Information Theory, Inference, and Learning Algorithms by David J. C. MacKay
- Elements of Information Theory by Thomas M. Cover, Joy A. Thomas
- Decoding the Universe by Charles Seife
- An Introduction to Information Theory: Symbols, Signals and Noise by John R. Pierce
- The Logician and the Engineer: How George Boole and Claude Shannon Created the Information Age by Paul J. Nahin
- Information Theory by Robert B. Ash
- Information and Coding Theory by Gareth A. Jones, J. Mary Jones

I am grateful to the author, James V Stone, for making the subject accessible to a total novice like me. Hopefully I will work through some of the principles, internalize them and create a useful application.