## Lecture 38—Wednesday, April 18, 2007

### What was covered?

• The Gibbs sampler
• Basic features of the BUGS language

### Bayesian statistics—why should a frequentist care?

• Statistical analysis from a Bayesian perspective is carried out using a set of techniques classified under the rubric Markov Chain Monte Carlo (MCMC) that make it possible to sample from the posterior distribution. These methods don’t return a formula for the posterior distribution, just a set of values. With a large enough sample from this set we can characterize most features of interest about the distribution—quantiles, means, medians, etc.—just like we did with bootstrap approximations to sampling distributions.
• The term Monte Carlo refers to the casino. Technically a Monte Carlo method is a stochastic simulation in which sampling is done from a probability distribution. The parametric bootstrap is an example of a Monte Carlo method.
• Markov chain Monte Carlo refers to sampling from a particular stochastic process called a Markov chain. A stochastic process is an indexed set of random variables (the index is typically time). A Markov chain is a stochastic process in which the next step in the process depends only on the current state.
• While there are many good reasons for taking a Bayesian perspective, I address this section to someone who is committed to doing statistics from a frequentist perspective. So the question becomes why should a frequentist care about the Bayesian approach and grapple with the issue of choosing prior probabilities? I give two reasons.
1. If you have lots of data, then the choice of prior won’t matter much. So parameter estimates estimated from the posterior distribution will resemble parameter estimates for the likelihood. Thus one can get MLEs via MCMC.
• For instance, in a simple estimation problem with a normal likelihood and a normal prior it turns out that the posterior mean is just a weighted sum of the sample mean and prior mean, posterior mean = w1 sample mean + w2 prior mean, where the weights are measures of relative precision. The weights are of the form  and  where n is the sample size,  and  are the variances of the sample and prior respectively, and k is a normalizing constant. Clearly from the weights if the sample size n is large the sample mean will dominate this sum.
2. With only a moderate amount of data we can still carry out an "objective" Bayesian analysis. In this approach we choose a noninformative (vague, flat, diffuse) prior so that it has minimal impact on determining the posterior distribution. That this is possible is also clear from the weighted mean example above. If the variance of the normal prior is large (yielding a noninformative prior) the corresponding weight will be small.
• Upstart: MCMC can be used to obtain parameter estimates for distributions when it is far too complicated to do so using ordinary frequentist optimization methods. The typical situations where the Bayesian perspective may be necessary are the following:
1. Models with many random effects
2. Models using exotic probability distributions. Keep in mind that most probability models become exotic when coupled with random effects. For instance there are limited frequentist tools available for fitting negative binomial regressions with random effects.
3. There are many parameters to estimate
• These three conditions are all par for the course in environmental science. The basic problem with Markov chain Monte Carlo is that there is a temptation to toss parsimony out the window and fit extremely complicated models. This temptation should be resisted!

### The Gibbs sampler

• The Gibbs sampler is a Markov chain Monte Carlo method implemented in the software BUGS.
• Suppose  depends on k parameters . Then the Gibbs sampler requires k conditional probability statements of the form

These are readily obtainable in hierarchical models when viewed from a Bayesian perspective. The basic regression model is a conditional statement, random effects are conditional on the values of parameters of a normal distribution, and all the remaining parameters are conditional on their priors.

• The algorithm is given starting values  and then updates these values by drawing new values from the following probability distributions.

• The set of values obtained for each parameter form a Markov chain—a stochastic process (a sequence of random quantities indexed by time) with the property that at any particular point in the sequence (also called a chain) the current value only depends on the last value.
• It’s also an example of an ergodic Markov chain.  (Ergodic chains are irreducible—every point in subspace can be reached from every other, recurrent—the expected number of returns to any set of points is infinite, and aperiodic). Ergodic Markov chains are special in that they possess a limiting distribution.

,

i.e., the probability of going from  to  in n steps in the limit depends only on the final state. When a Gibbs sampler is formulated from the likelihood and priors from a Bayesian model, the limiting probability  is in fact the desired posterior probability.

• So we set up the Gibbs sampler and we let it run for a while (the burn-in period) to give the chain time to "reach" its limiting distribution. Once there the chain visits values roughly in proportion to their probabilities. Using a large enough sample we can obtain estimates of these probabilities and use them to estimate other parameters of interest.
• Bayesian problems are well-suited for this because they are already based on a series of conditional probability statements. The good news is that we don’t have to set up the equations for the Gibbs sampler ourselves, BUGS does it for us.

### The BUGS program

• BUGS is available only for Windows and Linux operating systems. The web site for downloading the software is http://www.mrc-bsu.cam.ac.uk/bugs/winbugs/contents.shtml
• For Macintosh users there are a couple of options.
1. For non-Intel Macs use a PC emulation program, for instance Virtual PC from Microsoft or the cheaper alternative, iEmulator (\$25.00 from www.iemulator.com).
2. For Intel Macs there is Boot Camp from Apple or Virtual Machine (with Virtualization) from Parallels. These two options are compared in the following document.

http://www.notebookreview.com/default.asp?newsID=2990&article=Apple+Bootcamp+versus+Parallels

• BUGS code
1. BUGS code looks just like R code but with important differences.
2. BUGS code is used to specify the model. The lines of code are not instructions to be executed. In fact BUGS code is not executed at all. Although lines are entered sequentially, they’re not interpreted sequentially. BUGS parses the entire model and then runs a process.
3. BUGS makes use of for loops for model specification but this is just a matter of convenience.
4. BUGS uses R probability functions but with a twist. For instance the dnorm function is parameterized in terms of precision (reciprocal variance) rather than the standard deviation.
• Typical BUGS objects
1. Modeled data. In code these are defined with a ~. For example y ~ followed by a probability distribution. The variable y here is the response in our regression model. Modeled data are objects that are assigned probability distributions.
2. Unmodeled data are objects that are not assigned probability distributions. Examples include predictors, constants, and index variables.
3. Modeled parameters: these are given informative “priors” that themselves depend on parameters called hyperparameters. These are what a frequentist would call random effects.
4. Unmodeled parameters: these are given noninformative priors. [So in truth all parameters are modeled]. All the rest of the frequentist parameters fall under this category.
5. Derived quantities: these objects are typically defined with the assignment arrow, <-
6. Looping indexes: i, j, etc.
• The issue of priors
• How noninformative priors are set up can have an effect on estimation. BUGS can become computationally unstable when parameters are defined to have too wide a range.
• With log-transformed model there is generally no problem because the normal range is already reduced. For all other situations the variables need to be rescaled so that parameters have a limited range. A good rule is to rescale things so that parameter values lie in the interval –100 to 100.
• Basic strategies.
1. For nonnegative quantities (standard deviations) or restricted range quantities(correlations) use a uniform prior.
2. For all other quantities use a normal prior.
• When there are more than two random effects that are correlated formulating priors becomes more complex and it is necessary to use a multivariate distribution, such as a Wishart distribution, that guarantees the matrix that results will be a well-defined covariance matrix.
• The most important modeling strategy when using BUGS is to start simple and complexify gradually each time checking that the current model works and makes sense before moving to a more complex model