Lecture 14—Wednesday, October 8, 2008
What was covered?
Terminology Defined
Predictive simulation
- Predictive simulation is an approach to goodness of fit that can be used to assess the fit of any regression model, no matter how complicated. In predictive simulation the regression model is used to repeatedly generate new data. Because predictive simulation uses a parametric model to generate new data, it is also an example of a parametric bootstrap.
- If the model does fit the data then you would expect that data generated from that model will resemble the data that were actually collected. To assess fit we look to see if the observed data set appears to be a typical member of the simulation set. The easy part is generating the data using the model. The hard part is figuring out what to look at when comparing the observed data to the simulated data.
- There are two steps to the process of generating the simulated data sets.
- Incorporating the inferential uncertainty of the model into the model predictions.
- Adding predictive uncertainty to the model predictions.
Inferential uncertainty
- Inferential uncertainty addresses the the fact that our parameter estimates are only estimates and thus are expected to vary with repeated sampling. This variability is captured in their standard errors. Furthermore the individual parameters estimated from a single data set are not independent and possess a covariance.
- Consider the negative binomial regression model that was fit in lab 5.


- The parameters here are
. The variance-covariance matrix of their estimates takes the following form.

- In R covariance matrices can be extracted from fitted model objects with the vcov function. The glm.nb function of R does not estimate all three parameters of the negative binomial model simultaneously. Instead it alternates between estimating the regression coefficients β0 and β1 and estimating the dispersion parameter θ. As a result the covariance matrix of
and
is obtained separately from the variance of
. Accordingly we have to assume that
is independent of the regression parameters.
- Likelihood theory tells us that the maximum likelihood estimates of the parameters have an asymptotic normal distribution. If we let β denote the vector of parameter estimates,
and
, and Σ their covariance matrix, then theory tells us

where the distribution shown is the multivariate normal distribution, the multivariate analog of the ordinary univariate normal distribution.
- Because
is estimated separately from the regression model coefficients it has its own standard error. Using likelihood theory we assume
.
- To introduce inferential uncertainty, we sample from the asymptotic multivariate and univariate normal distributions of β and θ, respectively. In R this is accomplished with the random multivariate normal function mvrnorm from the MASS package and with the rnorm function from base R.
- Suppose n separate samples of β and θ are generated. The negative binomial regression model uses β to predict log μ, where μ is the mean of a negative binomial distribution, according to the following equation.

For a given choice of β this equation can be used to predict the log mean richness of each of the 29 Galapagos islands in the data set. Repeating this process for the n different samples of β yields an n × 29 matrix of log means.
Predictive uncertainty: uncertainty in the individual outcomes
- The last step in predictive simulation is to use the predicted means obtained above along with the samples of the dispersion θ to generate the corresponding species richness values. In R this is done with rnbinom function that generates random values from a negative binomial distribution. Since there are 29 estimated mean values for each simulation, one for each island, and n simulations, we again need an n × 29 matrix to hold the results. Each row of this matrix corresponds to a simulated richness data set.
- There are no guidelines for determining which data features to examine, but obvious choices are those features of the data set that are highly relevant to the research problem of interest.
- We can graph the simulated data sets to determine if graphs of simulated data resemble the same graph of the actual data.
- More typically we can calculate a statistic of interest for each simulated data set, display the distribution of this statistic across all simulated data sets, and ask whether the same statistic when calculated from the actual data looks to be a typical member of this distribution.
Predictive simulation from a Bayesian perspective
- By taking a Bayesian perspective predictive simulation is made easier because WinBUGS does most of the work for us. Inferential uncertainty is accounted for in the posterior distributions that are returned for each parameter. The potential correlations among the parameters are addressed when WinBUGS samples from the posterior distributions at each step of the Gibbs sampler. As a result the frequentist protocol of sampling from the theoretical distributions of the parameters using the mvrnorm and rnorm functions is unnecessary here. The samples we need are automatically returned by WinBUGS.
- In particular, each column of the $sims.matrix component of the object returned by the bugs function is a sample from the posterior distribution of one of the model parameters. Each row of the $sims.matrix component corresponds to a single run of the Gibbs sampler for all the sampled parameters and any correlations that exist between the parameters will manifest themselves in the values reported here.
- Because we can include the island means in the list of parameters we wish returned, we can obtain samples from the posterior distributions of each of the 29 individual means directly. Because this accounts for the inferential uncertainty in the estimated means, there's no need to carry out the step in the frequentist approach of using the estimated regression equation to predict the individual means . The $sims.matrix component of the bugs object that is returned contains the samples we need for each of the 29 means in separate columns.
- So when we use WinBUGS to do predictive simulation all that is left for us to do is carry out the last step: use the rnbinom function to generate the simulated richness values. We need to do this separately for each row of the $sims.matrix component. Each row produces a different simulated data set.
Negative binomial regression in WinBUGS
- As is often the case, the real challenge when using WinBUGS is to formulate the BUGS model in such a way that the Gibbs sampler actually converges to the posterior distributions of the model parameters. Negative binomial regression can be particularly challenging in this regard.
- The built-in negative binomial distribution that is available in the BUGS language represents the mathematician's definition of the negative binomial distribution, the so-called Pascal distribution. This is not the version favored by ecologists, the so-called Polya distribution. The Pascal distribution models the number of trials until the rth success. In the ecological parameterization of the negative binomial distribution this interpretation gets lost because the parameter r becomes the overdispersion parameter. More important than the cosmetic changes that arise with this parameterization is the fact that ecologists require the overdispersion parameter to be continuous, while in the Pascal distribution the r parameter is necessarily discrete. Thus the WinBUGS built-in negative binomial distribution function is of no use to us. (For additional commentary on these issues see lecture 4.)
- Therefore we need to construct the negative binomial distribution in some other way. Recall that one way the negative binomial distribution arises in practice is as a nonhomogeneous Poisson process in which the rate parameter
has a gamma distribution. In a homogeneous Poisson process the Poisson rate constant
is the same for all observational units. In a nonhomogeneous Poisson process the rate constant
is allowed to vary according to some distribution. Given a particular realization from this distribution, say
, the resulting random variable X will have a Poisson distribution with
.
- We can express this formally using the notion of conditional probability as follows.

- The function
that appears in the above integral is called the mixing distribution for the Poisson. The choice of mixing distribution that yields the the ecologist's parameterization of the negative binomial distribution is the gamma distribution with α playing the role of θ.
- The gamma distribution is a two-parameter continuous distribution that takes the following form.

Here α and β are positive parameters (called the shape and scale parameters, respectively) and
is the gamma function. If
then the mean and variance of X are as follows.

- If our goal is to model the mean of the negative binomial distribution in terms of predictors we measure, then the above formulation turns out to be awkward to implement in WinBUGS. This is because the mean of the negative binomial distribution, the parameter that we wish to model in our regression equation, does not appear explicitly as a parameter in either the Poisson or gamma distributions given above.
- It turns out there is a second way to use the gamma distribution as a mixing distribution and still obtain a negative binomial distribution in the end that makes building a regression model much easier. This second way appears to be only a trivial modification of what we've described above but is different enough so that it is not immediately obvious that it will yield the same result. This approach is presented in a standard textbook on negative binomial regression (Hilbe 2007) and typically appears in the literature as the preferred way to implement negative binomial regression from a Bayesian perspective using WinBUGS (Durham et al. 2004).
It requires two modifications to the standard approach of working with a nonhomogeneous Poisson process in which the mixing distribution is a a gamma distribution.
- The first modification is the manner in which the Poisson distribution is formulated. We now assume X is Poisson with "parameter"
, i.e., we write the usual single Poisson parameter λ as a product of two parameters, r and μ. The choice of symbol for the second parameter is propitious because μ turns out to be the usual mean of the negative binomial distribution.
- The second modification is in how the mixing distribution is specified. We assume that the first parameter r in the Poisson "product" rate parameter is distributed as a gamma random variable in which the shape and scale parameters are identical.
- Thus our new assumptions are the following:
where
.
- With these choices it turns out that X has a negative binomial distribution with mean μ and dispersion parameter α.

- So, in this approach we write the mean of the Poisson distribution as a product of two parameters in which the second parameter is a random variable drawn from a gamma distribution with a mean of 1. (Recall that the mean of a gamma distribution is given by the ratio of its two parameters which in this case are same.) It will turn out that the μ appearing in the Poisson "parameter" is in fact the mean of the negative binomial distribution that results. Hence the attraction of this formulation is that it makes modeling the negative binomial mean in a regression setting easy because the mean appears as an explicit parameter in our model.
- Here's the way this model is formulated in WinBUGS.
model{
for(i in 1:n) {
y[i]~dpois(mustar[i])
mustar[i]<-rho[i]*mu[i]
#log link
log(mu[i])<-b0+b1*x[i]
rho[i]~dgamma(alpha,alpha)
}
b0~dnorm(0,.00001)
b1~dnorm(0,.00001)
alpha<-exp(logalpha)
logalpha~dnorm(0,0.0001)
}
- In Monday's class we'll implement both the frequentist and the Bayesian approach to predictive simulation for the Galapagos Island species area curve example at which time we'll flesh out some of the details that are missing in the above discussion.
Generalized linear models (continued)
A family tree of statistical models
- Historically a number of different modeling approaches have arisen in statistics that at least initially seem to be unrelated to each other.
- Over time the various modeling strategies involving a normally distributed response—inear regression, analysis of variance (ANOVA), and analysis of covariance (ANCOVA)—were recognized as being variations on the same theme. To reflect this unifying perspective these models are called general linear models (GLMs).
- The next major step in the synthesis was the recognition that many of the specialized techniques developed for non-normal response variables—logistic regression, Poisson regression, and the like—also possessed a common thread. The common theme found in all these models defines what is now called the generalized linear model and is the modeling perspective we began discussing last time.

Exponential family of distributions
- Crucial to the definition of the generalized linear model is the concept of the exponential family of probability distributions. Formally, a probability distribution is a member of the exponential family if its probability density (mass) function can be written in the following form.

- In this expression θ is called the canonical parameter. It's also referred to as the location parameter of the distribution.
- The parameter φ is referred to as the dispersion parameter. It's also called a scale parameter.
- a, b, and c are functions whose form will vary between different members of the exponential family. There is no constraint on what form these functions can take except for which arguments they are allowed to take.
- Observe that b is only a function of θ. In particular it is not a function of y.
- a is a function of φ only.
- c is a function of y and φ only. It is not a function of θ.
- Observe also that the variable y occurs multiplied by the canonical parameter θ.
- The fact that a large number of probability distributions can be written in this generic form means that an efficient way to obtain maximum likelihood estimates is to write a single optimization routine based on this generic form. Thus a single protocol can be used to find maximum likelihood estimates for regression parameters in models employing a diverse assortment of random components.
The Poisson distribution is a member of the exponential family
- I next demonstrate that the Poisson distribution can be written in the generic form of the exponential family. The probability mass function for the Poisson distribution is shown below.

- Getting this expression into the proper form for the exponential family is just an exercise in algebra. The first step in doing this is always the same. We use the fact that exp(x) and log(x) are inverse functions and hence
.

- Using this last expression we can match up the generic functions and parameters in the formula for the exponential family with the specific values from the Poisson probability mass function. Thus we see
- Since θ is the canonical parameter for the exponential family we should call log λ the canonical parameter for the Poisson distribution.
- Recall that λ is the mean of the Poisson distribution. Thus the canonical parameter for the Poisson distribution can also be written
from which it follows that
.
- This suggests that a sensible choice for the link function is
because then we would have, using the link function to link the systematic and random components, the following.

- From the last equation we see that although η can take on any value from –∞ to ∞, μ is constrained to be positive. Thus the constraint that the mean of the Poisson distribution has to be positive is satisfied if we use the logarithm as the link function. Because this choice of link function arose directly from the canonical parameter, it is called the canonical link for the Poisson distribution.
- Note: the canonical link is not the only link function one might choose, but it is the "natural" one. Using the canonical parameter as the canonical link also turns out to have some advantages in the numerical estimation of the parameters in the model.
Details on the probability distributions and link functions used in GLIMs
- The table below lists some of the more common probability distributions typically used as the random component in generalized linear models along with their canonical link functions. The table also lists other link functions available in R, and the historic name for models now subsumed under the GLIM rubric.
Probability Distribution |
Canonical Link g(μ) |
Other links supported in R |
Historic name for these models |
Poisson |
log:  |
identity, sqrt |
Poisson regression, loglinear model |
Normal (Gaussian) |
identity:  |
log, inverse |
ordinary linear regression |
Binomial |
logit:  |
probit, cloglog, log |
logistic regression, probit analysis |
Gamma |
inverse:  |
identity, log |
gamma regression |
"Negative Binomial" |
|
log, sqrt, identity |
negative binomial regression |
- The negative binomial distribution deserves additional comment.
- The two-parameter negative binomial distribution is not a standard member of the exponential family. But if we treat the dispersion parameter as a known, fixed constant, then it is a member.
- In order to fit a negative binomial regression model as a GLIM, an initial guess is made for the value of the dispersion parameter. Using this guess the parameters of the linear predictor, η, are estimated within the generalized linear model framework. Using the parameter estimates obtained for η, a new estimate is obtained for the dispersion parameter. This process is repeated until the estimated values of both sets of parameters no longer change.
- The canonical link for the negative binomial distribution is rather complicated and hard to interpret, so it is rarely used. Instead to facilitate comparisons with the Poisson generalized linear model, a log link is typically used.
- For a binomial random component there are three common link functions: the logit, probit, and cloglog. All three links satisfy the boundary constrains of a proportion because when inverted they each map p into the interval (0, 1). They have similar behavior except near 0 or 1.
- The logit is popular because of its interpretation as a log odds.
- The probit link is the inverse cumulative distribution function (cdf) for the standard normal distribution and is denoted
.
- The cloglog link, complementary log log link is defined as
.
Link functions versus Transformations of the Response
- In lecture 11, the two models ranked best among the various species area models considered. They were a normal linear model with log-transformed richness as the response and log area as the predictor (model 4), and a negative binomial linear model with richness as the response and log area as the predictor (model 6). In the negative binomial model a log link function was used. From a practical point of view the choice between these two models is a no-brainer.
- The negative binomial model operates within the original scale of the response. It predicts species richness.
- The log-Arrhenius model operates on a log-transformed scale of the response. While there is a temptation to just back-transform the log-Arrhenius model to the original scale of the response by exponentiating both sides of the equation, such an approach is technically invalid. Our fitted model is for the mean of the log of species richness. When we back transform we do not get the mean of species richness.

Exponentiating this yields the nth root of the product of the individual values. This is called the geometric mean. There is a famous mathematical inequality that relates the geometric and arithmetic means that states that the geometric mean is always less than or equal to the arithmetic mean. Thus the log-Arrhenius model can perhaps be interpreted as estimating the geometric mean of the data as a function of log(area).
- Back transformation of the negative binomial model poses no problems. The log is in the link function; we never altered the raw responses in any way. Our model for the raw responses says that their mean grows exponentially and the exponent is a linear function in log(area).

- Observe that the final model for the mean is just the Arrhenius model. The difference though is that here the equation represents the mean of a negative binomial distribution (while in the standard Arrhenius model it represents the mean of a normal distribution). A negative binomial distribution allows the scatter around the fitted curved to increase as the mean gets larger. This is not allowed in the classical Arrhenius model.
- Clearly for the Galapagos data, a model that allows there to be heteroscedasticity on the scale of the original response is to be preferred.
The offset in count data regression models
- It can happen in obtaining count data that the observed counts are not equivalent.
- If the counts are obtained over time, the lengths of time, ti, may vary for each observation.
- If the counts are obtained in space, the areas in which the counts occur, Ai, may vary between observations.
- Even if the time interval and area are standardized, populations densities, Ni, may vary across sample units.
- Note in each of these cases it would make sense to work with the rate of occurrence—number of observations per unit time, number of observations per unit area, or a per capita rate. This should usually be avoided because the array of tools available for analyzing count data far exceeds what is available for analyzing rate data. If you really are interested in analyzing rates you should instead follow the guidelines described next.
- A standard approach to dealing with such a lack of equivalence while still treating this as a model of count data is to include what's called an offset in the regression model for these data. An offset here means a term of the form,
,
, or
included in the model but with coefficient constrained to be equal to 1. Thus an offset is a term for which a coefficient is not estimated.
- To understand what this is supposed to accomplish consider the scenario where the counts are obtained from different sized areas. Suppose that our model has p predictors,
. We fit a count regression model for the mean (either as a Poisson or negative binomial regression) using a log link and an offset.

- An equivalent way of writing this equation is the following.

Thus by including an offset we end up fitting a model for the rate of occurrence, as was desired.
- The use of an offset then is just a trick that allows us to use Poisson or negative binomial regression, which are only appropriate for count data, to fit a rate model.
- Note: The inclusion of an offset is primarily for purposes of interpretation. If the interpretation of the response as a rate is not important, but the controlling for the lack of equivalence is, it is perfectly legitimate to include a time, area, or population term in the model as a covariate. When included as a covariate rather than an offset, its regression coefficient is estimated instead of being set to one. Also when a variable is included as a covariate it may not be necessary or desirable to log-transform it first as is done when it is entered as an offset.
- Covariate is the term used for a predictor that is included for control purposes only, i.e., to reduce bias in or to increase the precision of parameters of interest. Covariates generally are variables that are of no interest by themselves but are included solely because it is believed that to omit them would deleteriously alter the results.
Cited References
- Durham, Catherine A., Iain Pardoe, and Esteban Vega. 2004. A methodology for evaluating how product characteristics impact choice in retail settings with many zero observations: An application to restaurant wine purchase. Journal of Agricultural and Resource Economics 29(1): 112–131.
- Hilbe, Joseph M. 2007. Negative Binomial Regression. Cambridge University Press.
- McArdle, Brian H. and Marti J. Anderson. 2004. Variance heterogeneity, transformations and models of species abundance: a cautionary tale. Canadian Journal of Fisheries and Aquatic Sciences 61: 1294–1302.
Course Home Page