Lecture 18—Monday, February 13, 2006
What was covered?
- Using AIC to compare models in which some, but not all, of the models use a transformed response variable
- Models for excess zeros
Terminology Defined
AIC for comparing models when some, but not all, use a transformed response variable
- As noted last time, when using AIC to compare models all the models being compared must use exactly the same response (all y, or all log y, etc.). I address here how to adjust the loglikelihoods of models that use transformed responses so that the AIC values that result can be compared to the AIC from models in which the response is not transformed.
- Here is a simple example of when this would be necessary. We fit the following two models to the same count data set using maximum likelihood estimation.
Model 1: Y ~ Poisson(λ)
Model 2: log(Y) ~ Normal(μ, σ2)
- We obtain the loglikelihood from each fit and use it to calculate the AIC, but unfortunately these two AIC values are not comparable. The loglikelihood for model 1 is the loglikelihood of Y, but the loglikelihood for model 2 is the loglikelihood of log(Y). What we need from model 2 is the loglikelihood of Y.
- Note: my example here is a bit contrived. We know that if log(Y) ~ Normal(μ, σ2), then Y~ lognormal (μ, σ2). Since R has a set of lognormal probability functions, dlnorm, plnorm, etc., we could use them to construct the likelihood of Y directly. But in practice this is rarely done. Instead a transformation is carried out and then ordinary normal-based methods are used in evaluating the model constructed with the transformed response.
- More importantly the methodology I describe here will work for transformations other than the logarithm, as long as the transformation is monotonic (and hence one-to-one). Basically we obtain the maximum likelihood estimates from a model that uses a transformed response, but then use these estimates in the correct likelihood for the untransformed response to obtain the AIC.
- Let g(y) denote the desired probability density function for the untransformed response. Recall that
where
, the cumulative probability distribution function for Y. Starting with G(y) I can write the following.

- The last equality follows because the set of sample points that satisfy the first inequality, Y ≤ y, are the same sample points that satisfy the second inequality, logY ≤ logy. Formally, random variables are functions defined on a sample space, where the sample space is the set of all possible outcomes from a random experiment. Probabilities are set functions defined on this sample space. So if Ω is the sample space and ω is a typical outcome in that sample space, then formally probability statements have the following meaning:

So here we see that the random variable Y only serves to identify the set of ω elements whose probability we wish to calculate. Any function of Y that yields the same set of ω will also yield the same probability. The logarithm is such a function.
- Because the logarithm function is monotonic it follows that

i.e., the sample space elements that make up the set on the left side of the equality also make up the set on the right side of the equality, and vice versa. This follows because the logarithm is a one-to-one mapping (Fig. 1). Hence each logY that satisfies logY ≤ log y corresponds to a unique and distinct Y satisfying Y ≤ y and vice versa. Hence the set of elements from our sample space that satisfy one inequality automatically satisfy the other.
- Continuing with our argument, I let Z denote log Y. Since we assume that the logarithm is a normalizing transformation here it follows that
, where
, a normal distribution. Thus we have the following.

- Next we differentiate G(y) to obtain g(y). Recall the following result from calculus. If we differentiate an integral with respect to a variable that occurs in the limits of integration we have the following.

where
. When only the upper limit of integration is a function of y, we have

- Using this result and continuing with our construction of g(y) we have the following.

- Thus the desired likelihood is the following.

To calculate AIC from the above likelihood we would use the maximum likelihood estimates of μi and σ2 obtained from the log-transformed model.
- In general then, if Z = T(y) is some transformation of the response variable such that T(y) has probability density function f(z). The likelihood as a function of y is given by the following expression.

- Some additional examples.
- Suppose we decide a square root transform is called for after which we're willing to assume that the transformed variable is normally distributed, i.e.,
. Then using our general result above we have

- If some of the observed values of the response variable are zero, we have to modify our protocol slightly because the logarithm function is undefined at zero. The standard approach is to add a small constant c, e.g. c = 1, to all values before applying the logarithm. In this case if we once again assume that the logarithm is a normalizing transformation we obtain the following for the likelihood of the untransformed variable.

Models for excess zeros
- A model for excess zeros, also called a zero-inflation model, starts with a probability distribution that underestimates the number of zeros in a data set and then "corrects" this distribution in a particular way to account for the extra zeros. Thus there are zero-inflated Poisson (ZIP) models, zero-inflated negative binomial (ZINB) models, etc. An excess zero model can be tacked onto virtually any other probability distribution.
- Excess zeros models fall into two broad categories depending on how the "correction" is done. They are the following.
- Mixture models (zero-inflated models proper). Lambert (1992) is the usual citation
- Conditional models (called hurdle models in the econometrics literature). In ecology Welsh et al. (1996) is often cited.
- All excess zero models begin by assuming that two probability models,
and
, determine the distribution of the random variable X. Where the two categories of excess zero models diverge is in the origin of the zero class.
- A mixture model assumes the zeros can come from both
and
.
- The conditional model assumes the zeros can only come from
.
- There seems to be no general agreement on a taxonomy for excess zero models, largely because they have taken a separate evolutionary path in different applied disciplines. The nomenclature I propose here is one that is commonly used in the literature, but there are exceptions.
- For example, the term zero-inflated Poisson (ZIP) model is routinely used as a label for both of the model types described below. Here I restrict it to the mixture model class of excess zeros models.
- The only way to be sure which version of these models is being used by a particular author is to read his/her protocol carefully. In addition, one can check their literature citations to see if one but not the other of Lambert (1992) or Welsh et al. (1996) is cited.
A derivation of the probability distributions for excess zero models
- The probability formulas for all excess zeros models have the same probabilistic basis. I develop this generic formula first and then specialize it for mixture and conditional excess zero models.
- Let A denote some event of interest and let B denote any other event and Bc its complement. Following the usual probability convention I let U denote the universe, or sample space, of which A, B, and Bc are subsets. By definition

Then we have the following.

where in the last step I use the definition of conditional probability for these events.

- To apply this formula to excess zero models, I make the following identifications.
- Let A be the event X = x.
- Let B be the event X ~ g1(x)
- Let Bc be the event X ~ g2(x)
Thus we explicitly assume that the distribution of X is entirely governed by the two distributions g1 and g2.
- Using these new identifications in the formula for P(A) above I obtain the following.

- Next I let
. θ is a parameter that will be estimated in fitting the model. As noted above the mixture and conditional version of the excess zeros model only differ in how they define g1 and g2. In what follows I assume the excess zeros are being measured relative to a Poisson distribution.
Zero-inflated Poisson (ZIP) model (mixture model)
- Define g1 and g2 as follows. I use the R function dpois to represent the Poisson probability mass function.

- Using the formula above for P(X = x) we get different results depending on whether or not x = 0.


- Combining these two results into one single formula yields the following.

Hurdle Poisson model (conditional model)
- Define g1 and g2 as follows. I use the R function dpois to represent the Poisson probability mass function.

The expression that appears in the formula for
is called a truncated Poisson distribution. It's truncated because in this case we've removed the zero category from the Poisson distribution. The denominator is there to renormalize the probabilities so that they still sum to 1.
- Using the formula above for P(X = x) we get different results depending on whether or not X = 0.


- Combining these two results into one single formula yields the following.

Comparing the two excess zero models
Mixture Poisson (ZIP) model
- In the mixture model we assume the observed zero counts are a heterogeneous mixture and come from two sources.
- Some of the observations are zero just by chance, i.e., the Poisson law predicts that some proportion of events will yield zero counts. This proportion is given by the g2 distribution.
- The rest of the zeros are generated by a different probability generating mechanism altogether. This is the g1 distribution.
- A good example of where a heterogeneous population of zeros might arise is in habitat suitability models.
- If we're dealing with a rare species then we would expect there to be zero counts in our sample because some of the sampled habitat is simply unsuitable for the species to exist, either for biological or nonbiological reasons.
- On the other hand even if we restrict ourselves to suitable habitat, the species might by chance still fail to occur in some of our samples. We would classify these sites as locations where the species could theoretically thrive, but for whatever reason doesn't, either because it hasn't reached the site yet or perhaps it has reached the site but was then locally extirpated, or any number of other reasons.
Conditional Poisson (Hurdle) Model
- Hurdle models are easy to motivate when it makes sense to dichotomize processes into those that lead to the presence or absence of a population and those that facilitate continued maintenance of a population. This would be the case in habitat suitability models that treat colonization and growth as separate processes and treat the possibility of subsequent extinction as unlikely.
- Often in both mixture and conditional models the primary goal is to develop separate regression models for θ and λ. Hurdle models are attractive in this regard because in formulating the loglikelihood for hurdle models the parameters involved in explaining θ and those involved in explaining λ occur in distinct portions of the loglikelihood.
- This facilitates estimation because that means the presence-absence portion of the model can be fit separately from the model for the nonzero counts.
- The parameters also have a clean interpretation. The parameters related to θ deal with habitat invasion while those used to predict λ relate to sustaining the population once it gets established and there is no overlap between them.
- This is not the case for the classical ZIP model. In the loglikelihood for the ZIP model θ occurs both by itself and also with terms that involve λ.
- Hurdle models easily generalize to continuous distributions. Thus one can have a point mass associated with the zero values while the remainder of the nonzero continuous observations are assumed to come from a density function with strictly positive support, one such as a lognormal or gamma distribution. Thus hurdle models are popular in fisheries research where a continuous measure such as catch per unit effort is used to quantify fishing success, but many fishing trips wind up catching no fish.
Some final remarks
The R packages pscl and zicounts provide implementations of both kinds of excess zero models. Poisson and negative binomial distributions for the nonzero counts are supported. The statistical package Stata also has functions dedicated to fitting ZIP and ZINB models.
There is the suggestion in the literature that zero-inflated negative binomial (ZINB) models often have convergence problems (Famoye and Singh 2006).
Zero-inflated models have a long history in the econometrics literature but the interest in these kinds of models in disciplines such as ecology appear to stem from Lambert (1992).
It's worth noting that Warton (2005) argues that many of the published uses of excess zero models are probably unnecessary. He argues that the negative binomial probability model by itself is sufficient to handle most occurrences of zero-inflation (relative to a Poisson distribution) in environmental and ecological data.
Some references on excess zero models
- Bohning, D., E. Dietz, P. Schlattmann, L. Mendonca, and U. Kirchner. 1999. The zero-inflated Poisson model and the decayed, missing, and filled teeth index in dental epidemiology. Journal of the Royal Statistical Society A 162: 265–280.
- Christian, R. R., and W. O. Pipes. 1983. Frequency distribution of coliforms in water distribution systems. Applied and Environmental Microbiology 45(2): 603–609.
- El-Shaarawi, A. H. 1985. Some goodness-of-fit methods for the Poisson plus added zeros distribution. Applied and Environmental Microbiology 49(5): 1304–1306.
- Famoye, F. and K. P. Singh. 2006. Zero-inflated generalized Poisson regression model with an application to domestic violence data. Journal of Data Science 4: 117–130. http://www.sinica.edu.tw/~jds/JDS-257.pdf
- Lambert, D. 1992. Zero-inflated Poisson regression, with an application to defects in manufacturing. Technometrics 34: 1–14.
- Mayer, D., D. Roy, J. Robbins, I. Halliday, and M. Sellin. 2005. Modeling zero-inflated fish counts in estuaries—A comparison of alternate statistical distributions. In Zerger, A. and Argent, R. M. (eds) MODSIM 2005 International Congress on Modelling and Simulation. Modelling and Simulation Society of Australia and New Zealand, December 2005, pp. 170–176. ISBN: 0-9758400-2-9. http://www.mssanz.org.au/modsim05/papers/mayer.pdf
- O'Neill, M. F. and M. J. Faddy. 2003. Use of binary and truncated negative binomial modelling in the analysis of recreational catch data. Fisheries Research 60: 471–477.
- Pearce, J., and S. Ferrier. 2001. The practical value of modelling relative abundance of species for regional conservation planning: a case study. Biological Conservation 98: 33–43.
- Podlich, H. M., M. J. Faddy, and G. K. Smyth. 2002. A general approach to modeling and analysis of species abundance data with extra zeros. Journal of Agricultural, Biological, and Environmental Statistics 7: 324–334.
- Ridout, M. S., C. G. B. Demetrio, and J. P. Hinde. 1998. Models for count data with many zeros. Proceedings of the XIXth International Biometric Conference, Cape Town, Invited Papers, pp. 179–182. http://www.kent.ac.uk/IMS/personal/msr/ibc_fin.pdf
- Warton, D. I. 2005. Most multivariate abundance data do not have extra zeros, compared to the negative binomial. Environmetrics 16: 275–289.
- Welsh, A. H., R. B. Cunningham, C. F. Donnellly, and D. B. Lindenmayer. 1996. Modelling the abundance of rare species: statistical models for counts with extra zeros. Ecological Modelling 88: 297–308.
Course Home Page