Lecture 16—Wednesday, February 8, 2006
What was covered?
- Information-theoretic approaches to model selection
- An interpretation of AIC based on Kullback-Leibler information
Terminology Defined
Information-Theoretic Approaches to Model Selection
- Choosing a best model from a collection of candidates is a nontrivial task for which statisticians can only offer rough guidelines. The first problem of course is choosing a suitable collection of candidate models. This is fundamentally a biological problem rather than a statistical one (although statistics can offer guidelines in, for example, the choice of probability generating mechanisms). In what follows I will assume a manageable set of models based on clear biological principles has already been assembled.
- Having determined the field of candidates the next problem is choosing a criterion on which to evaluate them. Methods based on significance testing are of limited utility here.
- Approaches such as the LR test require that models be nested and are of no help when they are not.
- If you wish to compare models using different probability generating mechanisms, such as Poisson, negative binomial, lognormal, or some other transformation to normality, significance testing is helpful in only special circumstances.
- Significance testing has a lot of baggage associated with it and has been heavily criticized in many scientific disciplines (including ecology) as being divorced from the scientific method.
- Information-theoretic approaches offer an alternative to significance testing. A number of criteria fall under this rubric. I will focus on three: AIC, AICc, and BIC.
- In what follows I use the following notation.
- θ is the set (vector) of model parameters.
- $L(\hat{\theta})$ is the likelihood of our model given the data when evaluated at the maximum likelihood estimate of θ.
- n is the number of observations.
- K is the number of estimated parameters in our model.
- Using this notation the three information-theoretic criteria I mentioned are defined as follows.
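- AIC: Akaike Information Criterion

$$\text{AIC} = -2\log L(\hat{\theta}) + 2K$$

- AICc: corrected AIC for small samples (use when $n/K < 40$)

$$\text{AICc} = -2\log L(\hat{\theta}) + 2K + \frac{2K(K+1)}{n-K-1}$$

- BIC: Bayesian Information Criterion (also called SIC for Schwarz Information Criterion)

$$\text{BIC} = -2\log L(\hat{\theta}) + K\log n$$
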
Best models are those that yield the smallest value of these statistics.
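As a rough illustration of these definitions, the sketch below computes the three criteria from a maximized loglikelihood, K, and n. The function names and the three candidate models (with their loglikelihoods and parameter counts) are hypothetical values invented for the example, not results from any real analysis.

```python
import math

def aic(loglik, k):
    """AIC = -2 log L + 2K."""
    return -2.0 * loglik + 2.0 * k

def aicc(loglik, k, n):
    """AICc = AIC + 2K(K+1)/(n-K-1); only defined when n > K + 1."""
    if n - k - 1 <= 0:
        raise ValueError("AICc requires n > K + 1")
    return aic(loglik, k) + 2.0 * k * (k + 1) / (n - k - 1)

def bic(loglik, k, n):
    """BIC = -2 log L + K log n."""
    return -2.0 * loglik + k * math.log(n)

# Hypothetical candidates: (name, maximized loglikelihood, K), all fit to n = 50 observations.
n = 50
candidates = [("small", -120.0, 2), ("medium", -112.5, 4), ("large", -111.9, 7)]

table = {name: (aic(ll, k), aicc(ll, k, n), bic(ll, k, n)) for name, ll, k in candidates}
best_by_aic = min(table, key=lambda name: table[name][0])
best_by_bic = min(table, key=lambda name: table[name][2])

for name, (a, ac, b) in table.items():
    print(f"{name:7s} AIC={a:8.3f}  AICc={ac:8.3f}  BIC={b:8.3f}")
print("AIC-best:", best_by_aic, " BIC-best:", best_by_bic)
```

In this made-up example the "large" model has the highest loglikelihood, yet it is not ranked best by any of the three criteria because its gain in fit is small relative to its three extra parameters.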
- Observe that all of these statistics involve the loglikelihood. Since the likelihood is the probability of obtaining the data you did under the given model, it makes sense to choose a model that makes this probability as large as possible. Taking the log doesn't change the ranking of models, because the logarithm is an increasing function, but putting the minus sign in front does. Now instead of maximizing you need to minimize, which is why you want the smallest value of these statistics.
- All three statistics share the same first term but they differ in the rest. Notice that in each case the terms being added to $-2\log L(\hat{\theta})$ are positive. Thus these terms make the value of the corresponding statistic bigger, in turn making it harder for that model to be chosen best.
- Typically you'll see in the literature that these terms are referred to as penalty terms and their function is to prevent overfitting. Thus you'll often see it stated that AIC and BIC are more or less equivalent statistics differing only in that they use different "penalty" terms, 2K for AIC versus K log n for BIC. Since log n > 2 whenever n ≥ 8, the BIC penalty grows more quickly with K than the AIC penalty for all but the smallest samples, so BIC favors smaller models than does AIC.
- While it's useful to think of model building as a balance between fit improvement (increased loglikelihood) and parsimony (fewer parameters), treating the additional term in the expression of AIC as a penalty term is really incorrect and derives from a complete lack of understanding of how AIC is derived. As we'll see the 2K term is an essential part of the definition of AIC and is not just some add-on penalty term.
- AIC has two primary champions, David Anderson and Ken Burnham, both of Colorado State University. Anderson is a wildlife biologist and Burnham is a statistical ecologist. These men have been promoting AIC as part of a comprehensive program of model selection in the biological sciences. They have done so through
- a series of articles (a partial list can be found at the end of this document),
- a book, Model Selection and Multimodel Inference, now in its second edition and with over 1300 citations in Science Citation Index, mostly in ecology journals, and
- workshops and courses.
- Anderson and Burnham have been fairly successful in their mission. As an example, the Journal of Wildlife Management now actively promotes the use of AIC instead of significance testing for model selection in solicited manuscripts. Roughly half of the editors of the journal are well-versed in information-theoretic approaches to model selection and are promoters of it.
Kullback-Leibler Information and AIC
Basic Definitions
- The Kullback-Leibler (K-L) information between models f and g is defined for continuous distributions as follows.

$$I(f, g) = \int f(x) \log\left(\frac{f(x)}{g(x \mid \theta)}\right) dx$$

- Here f and g are probability distributions. The verbal description of I(f, g) is that it represents the distance from model g to model f. Alternatively it is the information lost when using g to approximate f.
- Typically, g is taken to be an approximating model while f is taken to be the true state of nature. The quantity θ represents the various parameters used in the specification of g.
- As a measure of distance, I(f, g) is a strange beast. Because of the asymmetry in the way f and g are treated in the integral, in general I(f, g) ≠ I(g, f).
- The form of K-L information for discrete probability models may be a bit more enlightening. Let the true state of nature be

$$f:\ p_1, p_2, \ldots, p_k$$

and the approximating model be

$$g:\ \pi_1, \pi_2, \ldots, \pi_k$$

The K-L information between models f and g is defined for discrete distributions to be

$$I(f, g) = \sum_{i=1}^{k} p_i \log\left(\frac{p_i}{\pi_i}\right)$$

- Since the log of a quotient is the difference of logs, the K-L information is also given by

$$I(f, g) = \sum_{i=1}^{k} p_i \log p_i - \sum_{i=1}^{k} p_i \log \pi_i$$

You may recognize that the first of the two terms in this difference has, apart from its sign, the same form as H, the Shannon-Wiener diversity index, another information-based measure.
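The discrete form is easy to check numerically. The following is a minimal sketch, assuming two made-up three-category distributions (the probabilities are hypothetical); it computes the K-L information in both directions to show the asymmetry and verifies the difference-of-logs decomposition.

```python
import math

def kl_information(p, q):
    """I(f, g) = sum_i p_i * log(p_i / q_i), with p the truth and q the approximation.

    Categories with p_i = 0 contribute nothing to the sum.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical distributions over three categories.
truth = [0.5, 0.3, 0.2]    # plays the role of f
approx = [0.4, 0.4, 0.2]   # plays the role of g

i_fg = kl_information(truth, approx)   # information lost using g to approximate f
i_gf = kl_information(approx, truth)   # roles reversed

# Difference-of-logs form: sum p log p minus sum p log pi.
first_term = sum(p * math.log(p) for p in truth)
second_term = sum(p * math.log(q) for p, q in zip(truth, approx))

print("I(f, g) =", i_fg)
print("I(g, f) =", i_gf)
print("decomposition =", first_term - second_term)
```

The two directions give different positive numbers, and a distribution compared with itself gives zero.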
- Note: the logarithm function plays a key role in any formulation of information. This is the case because while independent bits of information multiply on a probability scale, they should add on an information scale.
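- For independent events A and B, the probability of observing both is the product of the individual probabilities.

$$P(A \cap B) = P(A)\,P(B)$$

- But if the events are independent then the information accrued in observing both should be additive.

$$\text{Information}(A \cap B) = \text{Information}(A) + \text{Information}(B)$$
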
- Because the logarithm turns products into sums, log(ab) = log a + log b, it is the natural candidate for transforming probabilities into information.
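A small worked example may make the additivity concrete. It assumes, as is conventional but not stated in these notes, that the information in an event is measured by the negative base-2 logarithm of its probability (so the units are bits). Take independent events with $P(A) = 1/2$ and $P(B) = 1/4$.

$$P(A \cap B) = \tfrac{1}{2} \cdot \tfrac{1}{4} = \tfrac{1}{8}$$

$$-\log_2 P(A \cap B) = -\log_2 \tfrac{1}{8} = 3 = 1 + 2 = -\log_2 \tfrac{1}{2} - \log_2 \tfrac{1}{4}$$

The probabilities multiply while the information values add, which is the property described above.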
The True State of Nature Drops Out as a Constant
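- Using properties of logarithms the integral form for I(f, g) can be written as a difference of integrals.

$$I(f, g) = \int f(x) \log f(x)\, dx - \int f(x) \log g(x \mid \theta)\, dx = E_f[\log f(x)] - E_f[\log g(x \mid \theta)]$$

where in the last step I use the fact that the form of each integral is that of an expectation.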
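- Now suppose we have a second approximating model h, with parameters φ, for the true state of nature f. The information lost in using h to approximate f is given by the following.

$$I(f, h) = E_f[\log f(x)] - E_f[\log h(x \mid \phi)]$$

- Observe that $E_f[\log f(x)]$ is a common term in each expression. If we want to compare model g to model h it makes sense to consider the difference I(f, g) − I(f, h). If we do so then we find

$$I(f, g) - I(f, h) = E_f[\log h(x \mid \phi)] - E_f[\log g(x \mid \theta)]$$
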
- Observe that $E_f[\log f(x)]$ has canceled out. Thus if our goal is to compare models, truth drops out! For a generic model g then all we need to estimate is the following

$$E_f[\log g(x \mid \theta)]$$

- This last expression is called relative Kullback-Leibler information. Its absolute magnitude has no meaning. It is only useful for measuring how far apart two approximating models are. Relative K-L information is measured on an interval scale, a scale without an absolute zero. (The absolute zero is truth and is no longer part of the expression.) This interval-scale property will later carry over to AIC.
- So if our goal is only model comparison our overall objective can be more limited. Rather than estimate K-L information for a model we can estimate instead relative K-L information.
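The cancellation can be verified numerically in the discrete case. This is a sketch under assumed inputs: the three distributions below are hypothetical, and `expected_log` is an illustrative helper name. It checks that the difference in K-L information between two approximating models equals the difference in their expected log probabilities, with the term involving only f playing no role.

```python
import math

def expected_log(f, model):
    """E_f[log model(x)] for discrete distributions given as probability lists."""
    return sum(fi * math.log(mi) for fi, mi in zip(f, model) if fi > 0)

def kl_information(f, model):
    """I(f, model) = E_f[log f(x)] - E_f[log model(x)]."""
    return expected_log(f, f) - expected_log(f, model)

# Hypothetical truth f and two approximating models g and h.
f = [0.5, 0.3, 0.2]
g = [0.4, 0.4, 0.2]
h = [0.6, 0.2, 0.2]

# Difference computed from the full K-L information (uses E_f[log f]).
diff_full = kl_information(f, g) - kl_information(f, h)

# Difference computed from relative K-L information only (E_f[log f] never appears).
diff_relative = expected_log(f, h) - expected_log(f, g)

print("I(f, g) - I(f, h)          =", diff_full)
print("E_f[log h] - E_f[log g]    =", diff_relative)
```

Here the difference is negative, meaning g loses less information than h when approximating this particular f.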
Akaike's Contribution
- Hirotugu Akaike (Ah-kah-ee-key), currently professor emeritus at the Institute of Statistical Mathematics in Japan, observed in a classic paper in 1973 that there is yet an additional problem in trying to estimate relative K-L information. In our approximating model we typically won't know the exact value of θ. Instead we will have to use an estimate $\hat{\theta}$. Since this estimate is likely to be in error, another layer of uncertainty is added to the mix.
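- So Akaike suggested that what we should do is to calculate the average value of relative Kullback-Leibler information over all possible values of $\hat{\theta}$. In terms of expectation we would call this quantity expected relative K-L information and write it (suppressing the reference to f) as

$$E_y\left[E_x\left[\log g(x \mid \hat{\theta}(y))\right]\right]$$

where x and y are independent samples from f and $\hat{\theta}(y)$ is the estimate of θ computed from y.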
- The notation makes this expression look fairly intimidating but in fact Akaike found an estimator of it that is approximately unbiased in large samples. The estimator is

$$\log L(\hat{\theta}) - K$$

Here $\log L(\hat{\theta})$ is the loglikelihood function for model g evaluated at the maximum likelihood estimate of the parameter set θ. K is the number of parameters that are estimated in maximizing the likelihood.
- Strictly for historical reasons, Akaike chose to multiply this quantity by −2. The resulting quantity has come to be known as Akaike's information criterion or AIC.

$$\text{AIC} = -2\log L(\hat{\theta}) + 2K$$

- Models with smaller values of AIC are better models than models with larger values of AIC. Thus AIC can be used to compare models.
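To show AIC being used to compare models, here is a minimal sketch with a small made-up data set (the twelve measurements and the two group labels are hypothetical). It compares a normal model with one common mean (K = 2: a mean and a variance) against a normal model with separate group means and a common variance (K = 3), using the closed-form maximum likelihood estimates.

```python
import math

def gaussian_loglik_at_mle(residuals):
    """Maximized normal loglikelihood; the ML variance estimate is RSS / n."""
    n = len(residuals)
    sigma2_hat = sum(r * r for r in residuals) / n
    return -0.5 * n * (math.log(2 * math.pi * sigma2_hat) + 1)

def aic(loglik, k):
    """AIC = -2 log L + 2K."""
    return -2.0 * loglik + 2.0 * k

def mean(values):
    return sum(values) / len(values)

# Hypothetical measurements from two groups.
group_a = [4.1, 5.0, 4.6, 5.3, 4.8, 5.1]
group_b = [6.9, 7.4, 6.5, 7.2, 7.8, 6.7]
pooled = group_a + group_b

# Model 1: one common mean. K = 2 (mean, variance).
resid_one = [y - mean(pooled) for y in pooled]

# Model 2: separate group means, common variance. K = 3 (two means, variance).
resid_two = ([y - mean(group_a) for y in group_a]
             + [y - mean(group_b) for y in group_b])

aic_values = {
    "one mean": aic(gaussian_loglik_at_mle(resid_one), 2),
    "two means": aic(gaussian_loglik_at_mle(resid_two), 3),
}
best = min(aic_values, key=aic_values.get)

# Only differences in AIC are meaningful, so report each model relative to the best.
delta = {model: value - aic_values[best] for model, value in aic_values.items()}

for model in aic_values:
    print(f"{model:10s} AIC={aic_values[model]:8.3f}  delta={delta[model]:7.3f}")
```

With only 12 observations the ratio n/K is far below 40, so in practice the small-sample correction AICc would be the more appropriate choice for data like these; plain AIC is used here only to mirror the definition above.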
Some Additional Comments
- AIC can only select from the models we provide it. If all of the models in the candidate list are bad models then AIC will pick the best of a bad lot.
- Once an AIC-best model has been determined this does not excuse you from examining other measures of fit, such as $R^2$, and checking model diagnostics. You will still need to decide if the model that AIC has selected is in fact a good model. Remember AIC is a relative measure, not an absolute measure.
- Having said this, if among the candidates there are some models that fit the data well and some models that fit the data poorly, AIC will not pick the poor models over the good models.
References on AIC
Note: a number of David Anderson's papers can be downloaded from his web site.
- Akaike, Hirotugu. 1973. Information theory and an extension of the maximum likelihood principle. Proceedings of the 2nd International Symposium on Information Theory (edited by B. N. Petrov and F. Csaki). Akademiai Kiado, Budapest. (Reproduced in S. Kotz and N. L. Johnson (editors), 1992, Breakthroughs in Statistics, New York: Springer-Verlag, pp. 610–624.)
- Anderson, D. R., K. P. Burnham, and W. L. Thompson. 2000. Null Hypothesis Testing: Problems, Prevalence, and an Alternative. Journal of Wildlife Management 64(4): 912–923.
- Anderson, D. R. and K. P. Burnham. 2002. Avoiding pitfalls when using information-theoretic methods. Journal of Wildlife Management 66(3): 912–918.
- Anderson, D. R., K. P. Burnham, W. R. Gould, and S. Cherry. 2001. Concerns about finding effects that are actually spurious. Wildlife Society Bulletin 29(1): 311–316.
- Burnham, K. P. and D. R. Anderson. 2001. Kullback-Leibler information as a basis for strong inference in ecological studies. Wildlife Research 28: 111–119.
- Burnham, K. P. and D. R. Anderson. 2002. Model Selection and Multimodel Inference. Springer-Verlag, New York.
- Burnham, K. P. and D. R. Anderson. 2004. Multimodel inference: understanding AIC and BIC in model selection. Sociological Methods & Research 33: 261–304.
- Franklin, A. B., T. M. Shenk, D. R. Anderson, and K. P. Burnham. 2001. Statistical model selection: the alternative to null hypothesis testing. Pages 75–90 in T. M. Shenk and A. B. Franklin (editors), Modeling in Natural Resource Management: Development, Interpretation, and Application, Island Press, Washington, D. C.
- Kullback, S. and R. A. Leibler. 1951. On information and sufficiency. Annals of Mathematical Statistics 22(1): 79–86.
…and a few words from the critics
- Guthery, F. S., L. A. Brennan, M. J. Peterson, and J. J. Lusk. 2005. Information theory in wildlife science: critique and viewpoint. Journal of Wildlife Management 69(2): 457–465.
- Richards, Shane A. 2005. Testing ecological theory using the information-theoretic approach: examples and cautionary results. Ecology 86(10): 2805–2814.
- Stephens, P. A., S. W. Buskirk, G. D. Hayward, and C. M. del Rio. 2005. Information theory and hypothesis testing: a call for pluralism. Journal of Applied Ecology 42(1): 4–12.