Probability Essentials : Summary

I came back to this book after a gap of 3 years because I had conveniently forgotten some important material on martingale theory. Now that my work demanded a thorough application of this theory, I had to go over it again. To refresh my memory, I decided to return to this book by Jacod & Protter, where I had first read about martingales.

However, after some thought, I went over the book from scratch. Why? A book is like a good old friend who always tells you something you can connect with. So, instead of merely reading up on martingale theory, I re-read the entire book. After 2 years of being away from this book, the re-read was worth the effort. This book cannot be the first book for someone looking to understand probability. It is too precise. However, for someone who is already exposed to modern probability concepts, this book would appear awesome, because such a person would appreciate precision more than redundant ranting about stuff.

Ok, the purpose of this post is to summarize the chapters in plain English. Writing about math without equations sometimes skirts the danger of appearing ugly. Anyways let me give it a try.

The book starts off by defining the triple (state space, events and probability). The state space is the set of all outcomes of an experiment. An event is a property that can be observed after the experiment is done (a subset of the state space), and probability is a mapping from the family of all events to a number between 0 and 1. The three main ingredients of probability theory are clearly defined. Subsequently, the book introduces the random variable and cautions the reader not to confuse it with a variable in the analytical sense. A random variable is in fact a function of the outcome of the experiment, and the probabilities associated with X are termed the law of X, to distinguish it from the original probability measure on the entire state space.

Two axioms are given in Chapter 2 from which almost the entire theory of probability flows. These axioms are 1) the probability of the state space is 1, and 2) countable additivity, which is a stronger requirement than finite additivity. These axioms are the foundations of the entire probability theory. The basic difference between the two types of additivity is shown with the help of a few examples. In the subsequent chapter (Chapter 3), a basic definition of conditional probability is given, and the linkage between conditional probability and independence of events is explained using the Partition equation and Bayes’ theorem.

Chapter 4 gives the initial flavour of the method to construct probabilities. By focusing on state spaces which are countable, the authors tie together the frequentist intuition that we all have AND the definition of a probability measure on a countable space. Well, if it is a finite space with equally likely outcomes, we all know that the probability of an event A is nothing but the proportion of outcomes in the state space that belong to A. The same intuition is shown using the two axioms of probability stated earlier in the book. For a finite or countable space, one usually constructs a probability measure by defining the probability of each atom of the space. Once you define probabilities for the atoms of the sample space, you can easily compute the probability of any event by summing over the atoms it contains. So, the finite case is a trivial case where all the intuitive knowledge about classical probability comes true.
Chapter 5 of the book is about formalizing the definition of a random variable on a countable space.
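
To make the "define probabilities on atoms, then sum" idea concrete, here is a minimal sketch. The fair die, the helper names `prob` and `law`, and the parity variable are my own illustrative assumptions, not something taken from the book.

```python
from fractions import Fraction

# Atoms of a finite sample space: the six faces of a fair die (an assumed example).
atom_prob = {face: Fraction(1, 6) for face in range(1, 7)}

def prob(event):
    """Probability of an event (a set of outcomes) as the sum over its atoms."""
    return sum((atom_prob[w] for w in event), Fraction(0))

def law(random_variable):
    """Law of a random variable: maps each value x to P(X = x)."""
    dist = {}
    for w, p in atom_prob.items():
        x = random_variable(w)
        dist[x] = dist.get(x, Fraction(0)) + p
    return dist

even = {2, 4, 6}
parity_law = law(lambda w: w % 2)  # X = 1 for odd faces, 0 for even faces

print(prob(even))    # 1/2
print(parity_law)
```

The `law` function is the countable-space version of the "law of X" mentioned earlier: it pushes the probability on outcomes forward to a probability on the values of X.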

Chapter 6 moves on to constructing a probability measure on a measurable space where there is no longer the restriction that the state space is countable. Now this throws up a very large family of subsets, and hence a convenient smaller collection of sets (a sigma-algebra) is used for defining the measure. I did not understand this aspect for quite some time after the first time I read this book years ago. However, after getting used to these terms and seeing them in various theorems, I understand this stuff better. I will attempt to verbalize my understanding. If you want all subsets of R, you have to sacrifice countable additivity: a measure that gives every interval its length cannot be countably additive on all subsets of R. However, if you restrict your universe to a large class of subsets of R, the measurable sets (the Borel or Lebesgue sigma algebra), then countable additivity holds good for such sets and you can gladly compute the probability of every event in that sigma algebra. That’s the key trade off, which is not usually mentioned properly in most of the books, or maybe it is mentioned and I never understood it in the first attempt. Now how does one go about constructing a measure on an uncountable space? The Extension theorem is the tool / technique. I think one must spend whatever time it takes to understand the Extension theorem properly, as it serves as a foundation for probability theory.

One usually ends up defining a measure on a semi-algebra and then extends it to the sigma algebra after imposing some conditions on the sets. In this book, though, the authors start off with an algebra and then extend the measure to a sigma algebra. This chapter is very hard to understand, as the authors leave out the derivation of the key idea, the existence of the measure, and ask the reader to refer to other books. So, what’s the point in knowing that this measure somehow exists when all you end up reading is its uniqueness? Ideally one should read this construction from a better book rather than trying to understand the terse statements in this chapter. Takeaway from this chapter: skip it and read the existence and uniqueness proofs from a better book :)

Chapter 7 is a special case of Chapter 6 where the state space is R. One comes across the distribution function induced by the probability P on the space( R, B). The distribution function F characterizes the probability. Basic properties of distribution function are given and a laundry list of the most popular continuous distributions is provided.

In the general case where a random variable maps outcomes into some target space, the task is to construct the random variable in such a way that for every measurable set in the target space, its pre-image belongs to the sigma algebra of the domain space. Hence the concept of a measurable function becomes important.

Chapter 8 talks about the generic case where the spaces ( Ω, ƒ , Ρ) and (R,Β) are defined and the mapping between them becomes a measurable function. Thus a word like random variable, technically speaking, means a measurable function between two measurable spaces. Why should one take the effort to understand this? Well, because measures on Ω are difficult to construct mathematically, whereas measures on (R,B) can be constructed and worked on. This is the key idea in thinking about the distribution of X.
The concepts related to measurable functions are gradually built in this chapter. After giving a basic definition of a measurable function, it starts off with an important theorem that can be used to check whether a function is measurable. The basic condition is that the inverse image of every Borel set in the Borel sigma algebra on R should be present in ƒ, the sigma algebra of the domain space. However, checking every Borel set in R is painful, and hence the theorem states that it is enough to check any class of sets which generates the Borel sigma algebra. Thus if you can check a class of intervals like (-inf, a], then it is good enough for the entire Borel sigma algebra. The good thing about measurable functions is that a lot of operations retain measurability: sup, inf, lim sup, lim inf, addition, multiplication, min and max. Moreover, if a sequence of random variables converges pointwise, the limit is also measurable. So, in a sense, the gamut of measurable functions is very large and it is really tough to produce a non-measurable function. In fact, it took a while after Lebesgue introduced these functions in 1901 for someone to come up with a non-measurable function. The key idea of this chapter, though, is that one can define a law for a random variable X and thus create a valid probability triple ( R,B,Px), so that one can forget about the original domain space and happily work with this space, as it is analytically more attractive.

Chapter 9 talks about the role of integration with respect to the probability measure. Why does integration figure in probability? Well, if the state space is countable, then the expectation of a random variable is a sum of its values weighted by their respective probabilities. But if the state space is uncountable, the summation is replaced by an integral with respect to the probability measure.

Another nifty way to explain the connection between integration and probability measure is through the definition of expectation: take the collection of all simple random variables that are less than or equal to the (non-negative) random variable being studied, take the expectation of each one, and find the supremum; you get the expectation of the random variable. Without writing the equation and instead explaining in words, the above description sucks. In any case, the point to be understood is this: the Lebesgue integral is obtained as the supremum of the integrals of simple functions, and for bounded Riemann integrable functions on a bounded interval, the Riemann integral and the Lebesgue integral coincide. That’s the reason for the connection between integral and expectation.
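
Since the verbal description is clumsy, here is a small sketch of the "supremum over simple random variables" idea. It assumes U is uniform on [0, 1) and uses the dyadic staircase X_n = floor(2^n U) / 2^n as the approximating simple variables; both the example and the function name are my own choices for illustration.

```python
from fractions import Fraction

def expectation_of_dyadic_approximation(n):
    """E[X_n] where X_n = floor(2^n * U) / 2^n and U is uniform on [0, 1).

    X_n is a simple random variable: it takes the value k / 2^n on the
    interval [k / 2^n, (k + 1) / 2^n), which has probability 2^-n.
    """
    step = Fraction(1, 2 ** n)
    return sum((k * step * step for k in range(2 ** n)), Fraction(0))

approximations = [expectation_of_dyadic_approximation(n) for n in range(1, 11)]
for n, value in enumerate(approximations, start=1):
    print(n, value, float(value))
```

Each X_n sits below U, the expectations increase with n, and they climb towards E[U] = 1/2, which is the supremum the definition talks about.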

This chapter starts off by defining simple random variables and describing their properties. Using simple random variables, the expectation of a general random variable is built up. Key theorems such as the Monotone Convergence theorem and the Dominated Convergence theorem are described. Towards the end of the chapter, inequalities like Cauchy-Schwarz and Chebyshev’s are introduced. I did not really understand the reason for introducing them arbitrarily in this chapter. So, in that sense, this part of the chapter is really tangential.

Chapter 10 is about independence of random variables. The idea is pretty straightforward: if the joint distribution function splits into the product of the marginals, then the two variables are independent. However, this chapter is something that I had always skipped or speed read. It was a little different this time, as I could understand the various arguments made. The chapter starts off with the notion of independence of random variables and ties it to the fact that the sub sigma algebras generated by those random variables are independent. This is the correct view to hold, rather than what is usually taught in analytically focused courses on probability. At least I remember the definition this way: P(A ∩ B) = P(A)P(B). This is true but fails to give the right picture. Only when you think in terms of sub sigma algebras do things become clear, and you can extend the definition into equivalent forms such as: if X and Y are independent, f(X) and g(Y) are independent too for every pair (f,g) of measurable functions. Similarly, independence also means E[ f(X) g(Y) ] = E[ f(X) ] E[ g(Y) ] for bounded measurable f and g. Product sigma algebras are explored, as they become crucial for defining joint distributions and marginal distributions. The chapter ends with the Borel-Cantelli theorem. I found the proof of this theorem simpler in Rosenthal’s book. It is nifty, simple and clear. Protter’s proof is somewhat roundabout. The Borel-Cantelli theorem is striking, as it says that if the events {An} are independent, then P(lim sup An) is either 0 or 1. It is not ½ or ¼ etc. For a simple application, here is an example: in an infinite coin toss space, let Hn be the event that the nth toss is heads. Since the probabilities P(Hn) = ½ sum to infinity, the theorem says that P(Hn infinitely often) = 1, meaning that with probability 1 an infinite sequence of coin tosses will contain infinitely many heads.
Without the knowledge of these theorems and terms, if one were asked the same question, one could only answer based on intuition, and sometimes that can lead to wrong answers!
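
For reference, here is the standard statement of the two halves of the Borel-Cantelli theorem as I remember it, written out in symbols (the notation A_n, H_n follows the paragraph above; the book's exact wording may differ):

```latex
\[
\sum_{n=1}^{\infty} P(A_n) < \infty
\;\Longrightarrow\;
P\Big(\limsup_{n \to \infty} A_n\Big) = 0,
\]
\[
\sum_{n=1}^{\infty} P(A_n) = \infty \ \text{ and the } A_n \text{ are independent}
\;\Longrightarrow\;
P\Big(\limsup_{n \to \infty} A_n\Big) = 1.
\]
% Coin tossing: P(H_n) = 1/2 for every n, so \sum_n P(H_n) = \infty and the
% second half gives P(H_n \text{ infinitely often}) = 1.
```

For independent events exactly one of the two sums applies, which is why the answer can only be 0 or 1.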

After a very abstract and conceptually difficult chapter, sanity is restored for a reader like me in Chapter 11 :) where things can be related to real life applications! For certain measures, we can find a density function, so that probability and area under a curve can be connected. Such a probability measure determines its density only up to a set of Lebesgue measure zero. One must note the subtle difference between almost everywhere and almost surely. The former is used for functions with respect to a general measure such as Lebesgue measure, while the latter is the same notion in the probability context. The chapter then talks about the “law of the unconscious statistician”, where the expectation of a function of a random variable is computed. Most of the practical applications of probability involve an appropriate function of random variables. For example, the payoff of a plain vanilla option is a function of the random variable S denoting the price of the underlying security. One needs a method to construct the density of a transformed random variable. For a specific transformation, one can investigate the monotonicity and differentiability of the function to split its domain into intervals on which it is bijective, and then apply a few theorems mentioned in this chapter to arrive at the density. One example which is always quoted in this context is the chi square distribution, i.e. transforming a standard normal into a chi square variable.

Chapter 12 is the extension of the previous chapter to n dimensional space. The key idea in this chapter is the computation of the density of a transformed multivariate random variable. There are at least three different methods illustrated using examples, but the easiest one is Jacobi’s transformation formula (the change of variables formula involving the Jacobian).
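
The chi square example from Chapter 11 is easy to sanity check numerically. The sketch below is my own illustration, not the book's derivation: it squares simulated standard normals and compares the empirical distribution with the closed form P(Z² ≤ y) = erf(√(y/2)); the density function follows from the change of variables over the two branches z = ±√y. The seed and sample size are arbitrary assumptions.

```python
import math
import random

def chi_square_1_cdf(y):
    """CDF of Z**2 for standard normal Z: P(Z**2 <= y) = erf(sqrt(y / 2))."""
    return math.erf(math.sqrt(y / 2.0)) if y > 0 else 0.0

def chi_square_1_density(y):
    """Density from the change of variables formula, summing both branches z = +/- sqrt(y)."""
    return math.exp(-y / 2.0) / math.sqrt(2.0 * math.pi * y)

rng = random.Random(0)  # fixed seed so the run is reproducible
samples = [rng.gauss(0.0, 1.0) ** 2 for _ in range(200_000)]

for y in (0.5, 1.0, 2.0, 4.0):
    empirical = sum(s <= y for s in samples) / len(samples)
    print(y, round(empirical, 4), round(chi_square_1_cdf(y), 4))
```

The empirical and closed form columns should agree to a couple of decimal places, and the density is just the derivative of the CDF.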

One of the uses of transformations like Fourier and Laplace is to formulate the solution of a problem in the transformed space and map it back to the original space. In the context of probability measures, the Fourier transform of the measure has a name, the “characteristic function”. Characteristic functions are dealt with in Chapter 13 and Chapter 14. Typically these are extremely useful in computing higher order moments, testing the independence of random variables etc. A laundry list of characteristic functions for common random variables is stated. Also, uniqueness of the Fourier transform of a probability measure is proved. This means that if two measures have the same characteristic function, then the two measures are identical. This is a useful thing to keep in mind.

In stats, most of what is done involves linear transformations of random variables. Thus the sum of independent random variables needs to be studied, as there are tons of applications in real life. Take something as simple as a sample average. It involves a sum of random variables, and one needs tools to compute the probabilities and densities of such sums. Chapter 15 talks about the convolution product of the probability measures of the individual random variables and provides a methodology to compute the distribution of the sum of independent random variables. Characteristic functions are used in all the examples to easily compute the distribution of a sum of iids.

Chapter 16 is a very important one, as it deals with Gaussian variables in multi dimensional space. It is necessary to first analytically identify whether a set of variables is indeed multivariate normal. It is the form of the characteristic function that plays an important role. For a set of variables to be called multivariate normal, a simple litmus test is that every linear combination of the variables involved should be normally distributed. As an example, a linear regression model is taken and the distribution of its estimates is computed. The authors avoid using matrix algebra for deriving the distributions and thus make the computations very ink-intensive :). The chapter ends by mentioning 6 important properties which make the multivariate normal distribution analytically attractive. One casual remark we often hear is “Normal distribution is everywhere in nature”. If one thinks about it, normal distributions do not really exist in nature. The normal distribution arises via a limiting procedure (Central Limit Theorem) and thus is an approximation of reality, and often it is an excellent approximation.
The irony is that the normal distribution is a great approximation to the true distribution of most natural phenomena (which itself is not precisely known!!!).
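
Going back to the convolution idea of Chapter 15 for a moment, here is a minimal discrete sketch. The two fair dice, the helper names `convolve` and `char_fn`, and the test point t = 0.7 are assumptions of mine for illustration. It computes the law of a sum of two independent variables by convolution and checks that the characteristic function of the sum is the product of the individual characteristic functions.

```python
import cmath
from fractions import Fraction

def convolve(pmf_a, pmf_b):
    """Law of A + B for independent A, B given as {value: probability} dicts."""
    out = {}
    for a, pa in pmf_a.items():
        for b, pb in pmf_b.items():
            out[a + b] = out.get(a + b, Fraction(0)) + pa * pb
    return out

def char_fn(pmf, t):
    """Characteristic function E[exp(i t X)] of a discrete law."""
    return sum(float(p) * cmath.exp(1j * t * x) for x, p in pmf.items())

die = {face: Fraction(1, 6) for face in range(1, 7)}
two_dice = convolve(die, die)

print(two_dice[7])  # 1/6
t = 0.7
print(abs(char_fn(two_dice, t) - char_fn(die, t) ** 2))  # close to 0
```

This is why characteristic functions make sums of iids easy: a messy convolution in the original space turns into a plain product in the transformed space.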

Whenever we talk about limiting procedures and approximations, we need tools to compute and think with. Most of classical stats is developed using asymptotics, where the limiting behaviour of random variables is invoked to justify hypothesis tests and inferences about the parameters of a model. Hence the study of convergence of random variables becomes important, and it is dealt with in Chapter 17. One usually comes across pointwise convergence in calculus courses, but such pointwise convergence is too harsh / precise to be applicable to the probabilistic world. The chapter discusses 3 other types of convergence, which are almost sure convergence, convergence in pth mean and convergence in probability. Here is a nice visual summarizing the relationship between the various modes of convergence:

[Figure: relationship between the various modes of convergence]
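
In symbols, the standard implications between these modes are the following. This is my own summary from general knowledge, not a reproduction of the picture, and the last arrow involves convergence in distribution, which only appears in the next chapter:

```latex
\[
X_n \xrightarrow{\text{a.s.}} X \;\Longrightarrow\; X_n \xrightarrow{P} X,
\qquad
X_n \xrightarrow{L^p} X \;\Longrightarrow\; X_n \xrightarrow{P} X,
\qquad
X_n \xrightarrow{P} X \;\Longrightarrow\; X_n \xrightarrow{\mathcal{D}} X .
\]
```

None of these arrows can be reversed in general, and almost sure convergence and convergence in pth mean do not imply each other without extra conditions.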

Chapter 18 introduces the most important type of convergence in the context of statistics, weak convergence or convergence in distribution. Most of the stuff you come across in stats uses convergence in distribution to make statements. With this type of convergence you can make statements relating the distribution of Xn to that of X without worrying about whether there is any relation between Xn and X as random variables. Xn and X can even live on different probability spaces. It doesn’t matter, and the weak form can be applied away to glory. This is the strong point :) of the weak form of convergence. Firstly, how does one check whether Xn converges in distribution to X? lim E[ f(Xn) ] should be equal to E[ f(X) ] for every continuous and bounded function f. There is also a theorem which reduces the test cases: instead of testing all bounded continuous functions, one can test only bounded Lipschitz continuous functions. Slutsky’s theorem, which is very useful in statistics, is also derived. It says that if Xn converges in distribution to X and the distance |Xn - Yn| converges to 0 in probability, then Yn also converges in distribution to X.

Chapter 19 establishes the relationship between weak convergence and characteristic functions. This relationship forms the key to limit theorems. When we say that, irrespective of the underlying distribution, the centered sample mean divided by its standard deviation converges to a standard normal, the proof depends on this critical relationship between weak convergence and characteristic functions. Thus the three chapters 17, 18 and 19 prepare the ground for developing limit theorems.

Chapter 20 talks about the strong law of large numbers: if X1, ..., Xn are independent and identically distributed variables with finite variance, then their average converges (almost surely and in L2) to the population mean as n grows. The weak law is the same as above, but the convergence is in the probability sense. The proofs of these laws are elegantly shown using the various modes of convergence discussed in the previous chapters. Finally, an example of the strong law is shown in the Monte Carlo world, where the integral of a complex function can be computed using a simulation of uniforms.
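
The Monte Carlo application can be sketched in a few lines. The integrand exp(-u²), the seed and the sample sizes below are my own assumptions for illustration; the point is only that the sample mean of f(U_i) over iid uniforms U_i settles down to the integral of f over [0, 1], which is the strong law at work.

```python
import math
import random

def monte_carlo_integral(f, n, seed=0):
    """Estimate the integral of f over [0, 1] as the sample mean of f(U_i), U_i iid uniform."""
    rng = random.Random(seed)
    return sum(f(rng.random()) for _ in range(n)) / n

def f(u):
    """Assumed test integrand."""
    return math.exp(-u * u)

# Closed form of the integral of exp(-u^2) over [0, 1], used only as a reference value.
exact = math.sqrt(math.pi) / 2.0 * math.erf(1.0)

for n in (100, 10_000, 200_000):
    estimate = monte_carlo_integral(f, n)
    print(n, estimate, abs(estimate - exact))
```

The error typically shrinks roughly like 1/sqrt(n), which is the CLT rate discussed in the next chapter rather than anything the strong law itself promises.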

Chapter 21 is all about the most widely used theorem in stats, the Central Limit Theorem, which essentially says that, irrespective of the underlying distribution of the random variables (as long as the variance is finite), a particular transformation of the “sum of random variables” converges in distribution to a standard normal. The proof of the theorem uses the relationship between weak convergence and characteristic functions. The chapter also provides the CLT in the multidimensional case. There is also some stuff where you get to know the rate of convergence in the strong law vis-a-vis the CLT.
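
Here is a small simulation sketch of the CLT statement. The uniform summands, n = 50, the number of trials and the seed are all assumptions I picked for illustration. It standardizes sums of iid uniforms and compares their empirical distribution function with the standard normal one.

```python
import math
import random

def standard_normal_cdf(x):
    """Phi(x), written with the error function from the standard library."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def standardized_sum(rng, n):
    """(S_n - n*mu) / (sigma*sqrt(n)) for S_n a sum of n iid Uniform(0, 1) variables."""
    mu, sigma = 0.5, math.sqrt(1.0 / 12.0)
    total = sum(rng.random() for _ in range(n))
    return (total - n * mu) / (sigma * math.sqrt(n))

rng = random.Random(1)  # fixed seed so the run is reproducible
n, trials = 50, 20_000
draws = [standardized_sum(rng, n) for _ in range(trials)]

for x in (-1.0, 0.0, 1.0, 2.0):
    empirical = sum(d <= x for d in draws) / trials
    print(x, round(empirical, 3), round(standard_normal_cdf(x), 3))
```

Swapping the uniform for a skewed distribution should still work, though a larger n would probably be needed before the two columns line up as closely.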

Chapter 22 might sound rather abstract. What’s the point in understanding that there are Hilbert spaces? What’s the connection anyway between Hilbert spaces and probability? Such questions are only answered in Chapter 23. So, a reader needs to understand these concepts and have a vague notion that they will be used somewhere in the book. Frankly, when I read this chapter for the first time, I was swamped by the sheer terminology: complete spaces, inner product spaces, normed vector spaces, metric spaces, orthogonal operators etc. For any reader who is in a similar situation, my suggestion would be to put this book aside for some time and read up on metric spaces, vector spaces and inner product spaces thoroughly. Understand their relevance and historical significance to general mathematics. You should be at least familiar with some basic functional analysis, in the sense that you must be able to cogently answer all the following questions:

What are metric spaces?
How is a metric defined?
Can the same metric space have two different metrics?
What do you mean by metric space being complete?
What is vector space? What is normed vector space?
Can a metric always induce a norm?
Can a norm always induce a metric?
What is inner product space?
Is inner product space subset of vector space? Is it a subset of metric space?
Can inner product induce a norm on the vector space?
Can inner product induce a metric?
What is complete normed vector space?
What is complete inner product space?
How to check whether a space is complete, be it metric/ normed vector / inner product space?

Unless you convince yourself that you know the answers to the above questions, it is better to keep this book aside and work through metric spaces. Once you are comfortable with the above questions, this chapter can be read easily.

Chapter 23 is one of THE MOST important chapters of the book from a math fin perspective. In almost all cases, be it option pricing, hedging or econometrics based forecasting, one always deals with a conditional expectation model. Regression, which is considered the workhorse of statisticians, is a conditional expectation model E(Y|X). Undergrad intro courses in probability usually introduce conditional probability and leave it at that. Or maybe they scratch the surface of conditional expectation by spelling out a formula for E(X|B) where B is some event, or E(X|Y) where Y is a discrete random variable. In all such cases, the formula based approach hides the complexities behind computing conditional expectation. When both X and Y are general random variables, how does one go about computing E(X|Y)? You need exposure to Hilbert spaces to understand conditional expectation. This is where all the slog one goes through in understanding sigma algebras, Borel functions etc. pays off. There are three things that one realizes when computing E(X|Y). First, E(X|Y) is itself a random variable. Secondly, the sigma algebra generated by this random variable (the collection of its inverse images) is a subset of the sigma algebra generated by Y. One must not take this statement at face value. Just cook up some example and check it out for yourself. A simple example of a die thrown twice, computing E(X|Y) where Y is the first throw and X is the sum of the two throws, will convince you that the sigma algebra of E(X|Y) is indeed a subset of the sigma algebra of Y. The third aspect to be kept in mind is that on any set A belonging to the sigma algebra of Y, the expectation of E(X|Y) restricted to A matches the expectation of X restricted to A, i.e. E[ E(X|Y) 1_A ] = E[ X 1_A ]. Again, this is abstract and makes sense once you work with an example and see for yourself that this is indeed the case. So, basically, the last two are the properties which are damn important while thinking about conditional expectations.
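
The die example is small enough to check exhaustively. The sketch below is my own working of it (the names `omega`, `Z`, `expect` are just illustrative): it computes E(X|Y) for Y = first throw and X = sum of the two throws, and then verifies the averaging property on every one of the 64 sets in the sigma algebra of Y.

```python
from fractions import Fraction
from itertools import combinations, product

# Sample space: ordered pairs (first throw, second throw), each with probability 1/36.
omega = list(product(range(1, 7), repeat=2))
p = Fraction(1, 36)

def X(w):
    return w[0] + w[1]   # sum of the two throws

def Y(w):
    return w[0]          # first throw

def cond_exp_given_Y(y):
    """E(X | Y = y): average of X over the (equally likely) outcomes where Y equals y."""
    block = [w for w in omega if Y(w) == y]
    return sum((Fraction(X(w)) for w in block), Fraction(0)) / len(block)

# E(X | Y) as a random variable on omega: it depends on w only through Y(w).
Z = {w: cond_exp_given_Y(Y(w)) for w in omega}

def expect(f, event):
    """E[f * 1_event] for an event given as a list of outcomes."""
    return sum((f(w) * p for w in event), Fraction(0))

# Every set in sigma(Y) is {Y in S} for some subset S of {1,...,6}; check all 64 of them.
matches = all(
    expect(lambda w: Z[w], A) == expect(X, A)
    for r in range(7)
    for S in combinations(range(1, 7), r)
    for A in [[w for w in omega if Y(w) in S]]
)
print(cond_exp_given_Y(1), matches)  # 9/2 True
```

Because Z is built as a function of Y alone, any set of the form {Z in B} is automatically of the form {Y in S}, which is the "sub sigma algebra" property in action.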

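(a) E(X|Y) is measurable with respect to the sigma algebra of Y.
(b) For every set A in the sigma algebra of Y, E[ E(X|Y) 1_A ] = E[ X 1_A ].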

Conditions (a) and (b) impose conflicting restrictions on E(X|Y). On the one hand, E(X|Y) needs to have a rich enough structure to satisfy (b). On the other hand, it cannot be too rich, or the sigma-field it generates will no longer be contained in the sigma-field of Y, as (a) requires. Meeting the two conditions simultaneously calls for a compromise.

So, the conditional expectation is calculated implicitly, in such a way that the above two conditions are satisfied. I did not understand this aspect of conditional expectation for a very looooooong time. However, once you understand that E(X|Y) is defined implicitly, you begin to appreciate the Radon-Nikodym theorem. So, the takeaway from this chapter is that E(X|Y) should be looked at from the E(X | σ(Y)) perspective. You no longer care about Y in the sense of what values it takes; all you are interested in is the sigma algebra of Y. This is the key to understanding conditional expectation.

Ok, I did not mention here the relevance of Hilbert spaces. Here is the connection: if one looks at complete inner product spaces, one can use the orthogonality concept, and E(X|Y) becomes the best (least squares) estimate of X given the information in Y. This is essentially projecting X onto the subspace of square integrable random variables that are measurable with respect to the sigma algebra of Y. For projection to make sense, the concept of inner product is used, and the first construction of conditional expectation is done on Hilbert spaces, i.e. for X in L2. However, L2 is only a subset of L1, which is the space most of us would be interested in. The extension of conditional expectation from L2 to L1 is done using the standard procedure of showing that 1) it works for indicator functions, 2) it works for simple functions, 3) it works for non negative random variables, and 4) it works for general random variables. Another key aspect to understand about conditional expectation is that it is only unique in the “almost sure” sense, meaning there could be more than one random variable that meets the criteria, where the variables differ only on sets of measure 0.
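
The projection view can also be checked on the two-throw die example (Y = first throw, X = sum of the two throws). The sketch below is an illustration under my own choice of competing guesses g(Y): it compares the mean squared error of E(X|Y) = Y + 7/2 with a few other functions of Y, and checks that the residual X - E(X|Y) is orthogonal to one function of Y.

```python
from fractions import Fraction
from itertools import product

omega = list(product(range(1, 7), repeat=2))  # two throws of a fair die
p = Fraction(1, 36)

def mse(g):
    """E[(X - g(Y))^2] with X = sum of both throws, Y = first throw."""
    return sum(((a + b - g(a)) ** 2 * p for a, b in omega), Fraction(0))

best = mse(lambda y: y + Fraction(7, 2))  # g(Y) = E(X | Y)
others = [mse(lambda y: y + 3), mse(lambda y: 2 * y), mse(lambda y: Fraction(7))]

# Orthogonality: the residual X - E(X|Y) has zero inner product with h(Y) = Y^2.
residual_dot = sum(((b - Fraction(7, 2)) * a ** 2 * p for a, b in omega), Fraction(0))

print(best, others, residual_dot)
```

A handful of alternatives is of course not a proof, but the orthogonality of the residual to functions of Y is exactly what makes E(X|Y) the minimizer in the Hilbert space argument.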

Chapter 24 deals with the properties of a sequence of random variables (Xn) that is not necessarily a sequence of iids, which is usually the norm. A specific type of sequence of random variables that is relevant to the math fin area is the “martingale”. A martingale, with respect to an increasing sequence of sigma algebras (Fn), is a sequence of random variables with the following properties

Each element of the sequence is in L1
Xn is Fn measurable
The most important being E[ Xn | Fm ] = Xm for m ≤ n

Several properties of martingales are very appealing from a fin modeling perspective. Martingales have constant expectation, and hence, when attacking a problem like option valuation, one is always hopeful of ending up with a martingale.
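
To see the martingale property and the constant expectation in the simplest possible case, here is a sketch with a symmetric ±1 random walk started at 0 (my own choice of example, and the function names are illustrative). It enumerates every path up to length 6 and checks the one step property E[ X(n+1) | Fn ] = Xn exactly; the general property for m ≤ n then follows by conditioning repeatedly.

```python
from fractions import Fraction
from itertools import product

def walk(steps):
    """Position of the walk after the given +1/-1 steps, started at 0."""
    return sum(steps)

def cond_exp_next(history):
    """E[X_{n+1} | first n steps = history] for fair, independent +1/-1 steps."""
    return sum((Fraction(1, 2) * walk(history + (s,)) for s in (1, -1)), Fraction(0))

n = 6
is_martingale = all(
    cond_exp_next(h) == walk(h)
    for k in range(n)
    for h in product((1, -1), repeat=k)
)
expectations = [
    sum((Fraction(1, 2 ** k) * walk(h) for h in product((1, -1), repeat=k)), Fraction(0))
    for k in range(n + 1)
]
print(is_martingale, expectations)
```

Here "conditioning on Fn" simply means fixing the first n steps, since those steps generate the sigma algebra Fn for this walk.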

The other class of variables discussed in the chapter is stopping times. A martingale sampled at a bounded stopping time T keeps its expectation, E[ X_T ] = E[ X_0 ]. Stopping times are extremely useful in American option pricing. Doob’s Optional Sampling theorem is also discussed in this chapter.

Chapter 25 explains super martingales and sub martingales, which form a class of useful mathematical objects in financial modeling. Most importantly, it talks about decomposing a sub martingale (super martingale) into a martingale and an increasing (decreasing) process. Chapter 26 and Chapter 27 explore martingale inequalities and the Martingale Convergence Theorem. Chapter 28 is about the Radon-Nikodym theorem. This theorem is used in a ton of places in the math-fin area. However, this chapter uses martingales to prove the theorem. Ideally it would have been better if the theorem was proved using measure theory concepts. So, in that sense, the organization of the last section of the book was a little challenging for me. The book introduces martingales and then introduces measure change. As stated earlier, there is no neat closed formula for E(X|Y) where X and Y are both general random variables. The Radon-Nikodym derivative provides a neat way to show the existence of such a variable. One can always show that such a variable exists for Hilbert spaces, but for L1 spaces one has to prove it using indicator functions, simple functions, non negative random variables and general random variables.

Takeaway:

I think this book is too concise for someone looking to understand concepts. Existence theorems are conveniently skipped for some important mathematical objects. However, this book is an awesome reference for most of the theorems of modern probability.