[Image: Nicholas Constantine Metropolis]

This is a great write-up on the nuts and bolts of the Metropolis-Hastings algorithm. I like papers that summarize everything about an algorithm with a sound balance of rigor and simplicity. Nowadays, even for a simple statistical analysis, one tends to specify a BUGS model and run MCMC. With the BUGS software, Bayesian analysis has become accessible to a whole lot of data analysts. The heart of BUGS is the Gibbs sampling algorithm, which is a special case of the Metropolis-Hastings algorithm. A crystal-clear paper on Gibbs sampling has been written by George Casella and Edward George. The authors of this paper, Siddhartha Chib and Edward Greenberg, say that one of their motives in writing the article was to publish a Casella-and-George-style exposition of the MH algorithm that can be read and understood by everyone. In Gibbs sampling, one can gloss over many mathematical details and still give a good enough overview; not so with the MH algorithm, where some understanding of the underlying math is inevitable. Having said that, the authors do not make prior knowledge of continuous state-space Markov chains a prerequisite. In that sense, the article almost starts from scratch and gives a thorough explanation of the various flavors of the MH algorithm.

In this post, I will attempt to summarize the paper.

Introduction :
The Metropolis algorithm was developed by Metropolis, Rosenbluth, Rosenbluth, Teller, and Teller (1953) and subsequently generalized by Hastings (1970). The MH algorithm has been used extensively in physics, but it was little known to statisticians until recently. Papers by Müller and Tierney in 1993-1994 were instrumental in exposing the value of this algorithm and stimulating interest among statisticians in its use. This paper gives a tutorial introduction to the algorithm, deriving it from first principles.

Acceptance-Rejection Sampling:

The classic simulation techniques generate non-Markov (independent) samples. An important method in this class is the acceptance-rejection (A-R) method. The basic idea behind this algorithm is to find a blanketing function that dominates the target density throughout its support. The blanketing function is chosen so that simulating from it is easy. The crux of the method is to accept or reject a draw from the blanketing density based on the relative density levels of the target and the blanketing density. What is the relevance of discussing an i.i.d. random variate generation method? The MH algorithm has a similar accept/reject structure, so some intuition about how the A-R method works is useful in understanding MH.
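As a concrete illustration (my own minimal sketch, not from the paper), here is accept-reject sampling of a Beta(2, 2) target using a Uniform(0, 1) blanketing density scaled by a constant c that dominates the target everywhere on (0, 1):

```python
import numpy as np

rng = np.random.default_rng(0)

def target(x):
    # Unnormalized Beta(2, 2) density on (0, 1)
    return x * (1 - x)

c = 0.25  # max of x*(1 - x) on (0, 1), so c * Uniform(0, 1) blankets the target

def ar_sample(n):
    """Accept-reject: draw from the blanket, keep with probability target / blanket."""
    samples = []
    while len(samples) < n:
        y = rng.uniform(0, 1)        # candidate from the blanketing density
        u = rng.uniform(0, 1)
        if u <= target(y) / c:       # accept with probability f(y) / (c * g(y))
            samples.append(y)
    return np.array(samples)

draws = ar_sample(10_000)
print(draws.mean())  # should be close to 0.5, the mean of Beta(2, 2)
```

Every accepted draw is an exact, independent sample from the target, which is what distinguishes A-R from the Markov chain methods below.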

Markov chain Monte Carlo Simulation:

One of the defining characteristics of a continuous state-space Markov chain is its transition kernel, a conditional density that governs the moves from one state to another. A central question in Markov chain theory is to determine the conditions under which the chain converges to an invariant distribution: given a transition kernel, if you start from an arbitrary state and keep moving, does the distribution of the chain settle down to a fixed distribution after many transitions? MCMC methods turn this theory around. The invariant (target) density is known, and the hunt is for a transition kernel that has it as its invariant distribution. This hunt is made easy by the reversibility principle, which is essentially a condition on the transition densities for moving between two states.

If one can find a transition kernel whose transition density obeys reversibility, then we can be certain of obtaining the target distribution.
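Concretely, reversibility (also called detailed balance) requires that the target density π and the transition density p satisfy π(x) p(x, y) = π(y) p(y, x) for every pair of states x and y: the probability flow from x to y under π equals the flow from y to x. Any kernel built to satisfy this condition has π as an invariant distribution.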

The Metropolis Hastings algorithm:
The MH algorithm does precisely that: it constructs a transition density that satisfies the reversibility condition. The various steps of the algorithm are derived, and at each step the intuition is given by relating it to the corresponding step of the accept-reject algorithm. As in any MCMC method, the draws are regarded as a sample from the target density only after the chain has passed its transient stage and the effect of the fixed starting value has become small enough to be ignored. Convergence to the invariant distribution occurs under the conditions of irreducibility and aperiodicity. The rate of convergence of an MCMC sampler to its target distribution is an ongoing research topic.
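To make the steps concrete, here is a minimal sketch of the MH loop (my own, with an assumed standard normal target and a symmetric normal random-walk candidate, not code from the paper): draw a candidate y from q(x, ·), accept it with probability α(x, y) = min{π(y) q(y, x) / [π(x) q(x, y)], 1}, otherwise repeat the current value.

```python
import numpy as np

rng = np.random.default_rng(0)

def log_target(x):
    # Unnormalized log density of the (assumed) standard normal target
    return -0.5 * x**2

def metropolis_hastings(n_draws, x0=0.0, scale=1.0):
    """MH loop with a symmetric normal random-walk candidate q(x, y)."""
    x = x0
    draws = np.empty(n_draws)
    for i in range(n_draws):
        y = x + scale * rng.standard_normal()   # candidate from q(x, .)
        # For a symmetric q the q-ratio cancels, leaving the Metropolis ratio
        log_alpha = min(0.0, log_target(y) - log_target(x))
        if np.log(rng.uniform()) < log_alpha:   # accept with probability alpha(x, y)
            x = y                               # move to the candidate
        draws[i] = x                            # otherwise the chain repeats x
    return draws

chain = metropolis_hastings(20_000)
burned = chain[5_000:]                # discard the transient (burn-in) portion
print(burned.mean(), burned.var())    # roughly 0 and 1 for a standard normal
```

Note that rejected candidates are not discarded silently: the chain records the current value again, which is exactly what keeps the target distribution invariant.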

Implementation Issues: Choice of q(x, y):
Implementing the MH algorithm requires selecting a suitable candidate-generating density q(x, y). The paper discusses five families of candidate densities (the first two lead to especially simple acceptance probabilities, noted just after the list):

  1. Random walk MH

  2. Independence chain MH

  3. Exploiting the known form of the target density

  4. Pseudo-dominating density

  5. Autoregressive chains
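For the first two families the acceptance probability has a particularly simple form. With a random-walk candidate, q(x, y) = q1(y − x), and if q1 is symmetric the q terms cancel, giving α(x, y) = min{π(y)/π(x), 1}, which is the original Metropolis rule. For an independence chain, q(x, y) = q2(y), the probability becomes α(x, y) = min{w(y)/w(x), 1}, where w(x) = π(x)/q2(x) is the importance-weight-like ratio of target to candidate density.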

The scale or spread of the candidate-generating density is a critical question. The spread affects the behavior of the chain in at least two dimensions: one is the acceptance rate, and the other is the region of the sample space that is covered by the chain. Two applications of the MH algorithm are mentioned in the paper, one involving the accept-reject method and the other using block-at-a-time scans. In fact, the Gibbs sampling algorithm is a special case of block-at-a-time algorithms. The paper ends with a few simulations that give further insight into the methods discussed in the paper.
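As a small illustration of the scale trade-off (again my own sketch on an assumed standard normal target, not an example from the paper), compare the acceptance rates of a normal random-walk candidate for a tiny, a moderate, and a huge spread:

```python
import numpy as np

rng = np.random.default_rng(1)

def log_target(x):
    return -0.5 * x**2   # standard normal target, up to a constant

def rw_mh(n_draws, scale, x0=0.0):
    """Random-walk MH; returns the chain and its acceptance rate."""
    x, accepted = x0, 0
    draws = np.empty(n_draws)
    for i in range(n_draws):
        y = x + scale * rng.standard_normal()
        if np.log(rng.uniform()) < log_target(y) - log_target(x):
            x, accepted = y, accepted + 1
        draws[i] = x
    return draws, accepted / n_draws

for scale in (0.1, 2.5, 25.0):
    _, rate = rw_mh(50_000, scale)
    print(f"spread {scale:5.1f}  ->  acceptance rate {rate:.2f}")
# Tiny spread: nearly every candidate is accepted but the chain crawls through
# the space; huge spread: most candidates land far in the tails and are rejected.
```

Neither extreme explores the target well, which is why tuning the spread (rather than maximizing the acceptance rate) is the practical concern the paper highlights.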

I think this paper needs to be read at least a couple of times to grasp the multitude of paths that have been explored by various scientists and practitioners. One thing is certain: this paper will be my go-to reference for refreshing concepts relating to the MH algorithm.