Assignment 5
Due Date: Friday, February 24, 2006
Comparing Models with AIC
Data
slugsurvey.txt (from the web site of the textbook Statistical Computing by Michael Crawley)
The Problem
- In this assignment I ask you to finish the work we began in class on Tuesday, Feb 14, in which we fit a series of models to Crawley's slug data set and used AICc to compare them. So far we have fit the following models.
- Common Poisson model
- Poisson model in which λ is allowed to differ in the two field types (Nursery and Rookery)
- Common ZIP model
- ZIP model in which λ is allowed to differ in the two field types (Nursery and Rookery)
- ZIP model in which θ is allowed to differ in the two field types (Nursery and Rookery)
- A common normal model using a log-transformed response
- A normal model using a log-transformed response in which the mean μ is allowed to differ in the two field types (Nursery and Rookery)—not done in class but will be added to the notes.
- To this list of models I ask that you add the 8 models listed below.
- A ZIP model in which both λ and θ are allowed to differ in the two field types (Nursery and Rookery)
- A normal model using a log-transformed response in which the standard deviation σ is allowed to differ in the two field types (Nursery and Rookery)
- A normal model using a log-transformed response in which both the mean μ and the standard deviation σ are allowed to differ in the two field types (Nursery and Rookery)
- A common normal model using a square root transformed response
- A common negative binomial model
- A negative binomial model in which the mean μ is allowed to differ in the two field types (Nursery and Rookery)
- A negative binomial model in which the dispersion θ is allowed to differ in the two field types (Nursery and Rookery)
- A negative binomial model in which both the mean μ and the dispersion θ are allowed to differ in the two field types (Nursery and Rookery)
- What you should turn in.
- The code that you used to fit the 8 new models. For the normal models that use a transformed response this should include both the code you used to fit the models as well as the code you used to construct the loglikelihoods.
- A final AIC table using the format recommended by Burnham & Anderson (2002) as developed in class. This table should include all 15 models, the 8 new models as well as the 7 old models.
- An explanation of what the table tells you. In particular
address the following things.
- Does the table reveal that there is a single best model?
- Explain what the column of Akaike weights tell you about the various models.
- A goodness of fit test of your choice for what you deem to be the "best" model. Note: if there are a couple of models that are really close in terms of AICc you may wish to choose the more parsimonious model as your "best" model even if it isn't the winner in terms of AICc. In addition to having philosophical appeal, choosing simpler models will buy you extra degrees of freedom if you use the parametric version of the Pearson chi-squared test as your lack of fit test.
Hints
- The square root transformed model should be handled in the same way the log-transformed model was handled in class. As was the case there you will need to write a function to fit the model and another function for calculating the loglikelihood in terms the original response. The formula for the likelihood you will want to use in your loglikelihood function appears at the end of the notes to Lecture 18.
- Added 2/21/06. In your square root model you would think it would not be necessary to first add a constant to the response variable because unlike the logarithm the square root function is defined at zero. BUT, in the likelihood expression in which we back-transform to the original response this would lead to division by zero. Thus in order to calculate the likelihood for comparison with other models we do need to add a constant. For consistency this should be done both in the negative loglikelihood function you provide to nlm and the likelihood function you write to back-transform things in terms of the original response.
- For the four NB models I recommend setting up your functions just like the ZIP functions we wrote in class. Begin by creating the field.dummy variable and then include two separate lines that define mu and theta in terms of the components of the vector p and perhaps the field.dummy variable. Unlike the ZIP model you won't need to split the data into zero terms and other terms nor will you need to use the ifelse construction we used. The one line you used in the function you wrote to fit the NB model in Assignment 4 can be adapted here. Thus your four NB functions should only differ in the mu and theta lines. The rest of the code will be the same for each function.
- To produce the AIC table I ask you to turn in, just use the AIC.func code we created in class on Tuesday (and will also appear in the notes to that class). To use this function you will only need to change the list of model names and the list of models that you use as input arguments to the function.
- For the goodness-of-fit test of the model you select, just collapse the data into a single set of count categories. It is not necessary (for this exercise anyway) for you to assess the fit separately in the two field types or by concatenating the counts from the two field types into one long vector. Of course, if one of the models that allows parameters to differ across field types ends up being the best model you will need to fit the model separately for the two field types and obtain separate expected frequencies. Having done so you may then combine the results into a single set of expected frequencies for your goodness of fit test.
- Added 2/22/06. Let's assume for the moment that your best model has different parameters for nursery and rookery slugs. Using the output from nlm, calculate the values of the parameters needed in the probability function for nursery slugs. Then using this probability function calculate the expected probabilities for nursery slugs. Denote the vector of probabilities you get as pn. Don't forget the tail probabilities.
- Next calculate the values of the parameters for rookery slugs for the same probability function. Use this probability function to calculate the expected probabilities for rookery slugs. Denote the vector of probabilities you get as pr.
- Suppose there are n nursery slugs and r rookery slugs total. If all the categories in pn and pr match up, you can find the average expected probabilities for all slugs as follows.

Because n = r in your data, this amounts to just averaging the two sets of probabilities.
- Now if you want to do the parametric version of the Pearson chi-squared goodness of fit test, multiply
by the total number of slugs, then pool categories as needed. If you want to do the randomization test, use
as the probability vector and the total number of slugs in the various categories as your observed frequencies.
- Note: you could also do separate goodness of fit tests for nursery and rookery slugs. Because this further dilutes your cell counts I did not recommend this, but the randomization test would still be viable here. If you were to do the randomization test separately for the two field types you'll reach the same conclusion as with the pooled data.
- On Tuesday, Feb 21 we will revisit the slug data set and attempt to refit some of these models as generalized linear models. You may wish to use the results we obtain then as a further check on your work.
Cited Reference
Burnham, K. P. and D. R. Anderson. 2002. Model Selection and Multimodel Inference. Springer-Verlag, New York.
Course Home Page