Problem 1
Data
There is no required data set for this problem, but you might use the parasite data set from Assignment 7 to test the function that you've written.
Background
In Question 2 of Assignment 7 it turned out that there was no way to create categories for the expected counts and still satisfy the assumptions of the parametric Pearson chi-squared goodness of fit test that permit the use of a chi-squared distribution for the test statistic. The quadratic logistic regression model that was fit to the data required three parameters to be estimated. Thus in order to have any degrees of freedom left for the chi-squared distribution of the Pearson statistic, we needed to have at least four categories of expected successes. Using quartiles of the predictor age to create the four categories, it turned out that 2 of the 8 cells of the resulting contingency table had expected cell counts less than five. This exceeds the 20% threshold for cells with small counts, which is the recommended threshold in order for the calculated Pearson statistic to actually have a chi-squared distribution. For these data increasing the number of categories beyond four only made the situation worse. Thus there is no way to carry out the parametric version of Pearson test and still be assured that the use of chi-squared distribution to assess statistical significance of the test statistic was valid.
Clearly what's needed in a problem like this is a non-parametric alternative to the Pearson test, something analogous to the simulation-based p-value that is available using the base R function chisq.test. Unfortunately the chisq.test function can't be used to test the fit of logistic regression models. Although the results of logistic regression can be summarized in a 2 × n table of expected successes and failures, the columns of the table correspond to independent trials. Thus the sum of the expected probabilities of all the cells in the table is equal to n rather than to one, a situation for which the chisq.test function is not designed.
The Question
Write a function that calculates a simulation-based p-value for the Pearson statistic when calculated from a goodness of fit test for a logistic regression model. Your function should take three arguments.
- One argument is the matrix of expected counts, the 2 × n table of expected counts estimated from the model. The two rows correspond to successes and failures and the columns correspond to the different groupings (natural or artificial) of the response.
- A second argument is the matrix of observed counts, a 2 × n table of observed values that is arranged in exactly the same fashion as the table of expected counts.
- The third argument should denote the number of simulations, numsims, that are to be used in calculating the p-value for the simulation-based test.
Your function should return two values:
- the observed value of the Pearson statistic calculated using the table of expected counts obtained from the model and the table of observed counts, and
- the simulation-based p-value for the observed value of the Pearson statistic.
Demonstrate the use of your function in two ways.
- Use it to test the fit of the final quadratic model for the parasite data from Question 2 of Assignment 7 using five categories for the successes (thus corresponding to a 2 × 5 matrix).
- Also use it to test the fit of the final model from Question 1 of Assignment 7 using the observed and expected counts for all 33 trays, i.e., without first merging any of the categories.
**************
Problem 1 Hints
The simulation-based test I'm asking you to construct is based on the following principle. If a model provides a good fit to the data, then data generated using the model should be indistinguishable from the data that were actually observed. More specifically if we were to use the model to generate new data (pseudo-data) and treat these new data as the observed counts in a Pearson chi-squared goodness of fit test (with the expected counts being those predicted by the model), then the Pearson statistic we calculate using the pseudo-data should resemble the Pearson statistic we obtain when we use the actual observed values. By repeatedly generating pseudo-data and calculating the corresponding Pearson statistic, we can produce a probability distribution of pseudo-Pearson statistics. If the observed Pearson statistic looks to be a typical member of this distribution, then we can conclude that the model fits the data.
Construct your function for this problem as follows.
- Each of the n columns of the table of expected counts has a different probability of success. Obtain that probability by dividing the number of successes in that column by the total number of observations in that column. Arrange those probabilities in a vector p and collect the total number of observations for the different columns in a vector n.
- Denote the probability of success in column i by pi and the total number of observations in that column by ni. In terms of the vectors constructed in part 1, pi = p[i] and ni = n[i]. Use the rbinom function separately on each column to generate one random observation from a binomial distribution with parameters pi and ni. These are the arguments p= and size= of the rbinom function. Write this calculation in the form of a function such that the function takes a column number i and returns a binomial observation (number of successes) appropriate for that column.
- Use the sapply function to obtain separate binomial observations for each of the n columns. Also calculate the corresponding number of failures in each column and arrange your pseudo-data in a 2 × n table.
- Calculate the Pearson statistic using the expected counts predicted by the model and letting your pseudo-data play the role of the observed counts.
- Embed steps 3 and 4 in a for loop so that they are repeated the number of times specified in the numsims argument. Be sure to save the calculated pseudo-Pearson statistic at the end of each loop so that in the end you have a total of numsims separate Pearson statistics calculated using numsims different sets of pseudo-data.
- Finally calculate the observed value of the Pearson statistic using the expected counts predicted by the model and the actual observed counts.
- Count up the number of pseudo-Pearson statistics that are equal to or exceed the observed Pearson statistic. Be sure to count the observed Pearson statistic as one of these.
- To calculate the simulation-based p-value take the total obtained in step 7 and divide it by numsims + 1. The "+ 1" counts the observed value of the Pearson statistic as one of the simulations.
- Your function should return the simulation-based p-value as well as the observed value of the Pearson statistic each with appropriate labels.
Problem 2
Data
The file http://www.unc.edu/courses/2008fall/ecol/563/001/data/assign6/corals.csv contains the data for this problem. It is a comma-delimited text file and is the same data set that was used in Question 2 of Assignment 6. The variables of interest to us this time are the following.
- PREV_1: the number of coral colonies infected with white syndrome
- CORAL_COVE: the percentage of the bottom covered by living coral
- WSSTA: a thermal stress metric developed by the authors
- DATE: the calendar date at which the observation was made
- SECTOR: a rough geographical identifier of the reef's location
Background
The data were described in Assignment 6 where we reproduced the analysis of Bruno et al. (2007) and examined the fit of their model. The model they chose to fit is a negative binomial regression in which PREV_1 is the response and CORAL_COVE, WSSTA, and the interaction of CORAL_COVE and WSSTA are the predictors. Using predictive simulation and simple goodness of fit tests we found some evidence that the model does not fit the data. Here we examine an additional problem with their model and try to fix it.
PLOS Biology solicits online comments to the articles they publish. The following portion of a comment on this paper appears at the journal's web site.
It is not unusual in ecology that data collected for one project are used for another purpose altogether, a purpose for which the original sampling design is less than ideal. The authors cite the Australian Institute of Marine Science (AIMS) Long-term Monitoring Program as the source of their disease data. As is clearly documented in the AIMS annual reports (cited by the authors) the 48 reefs included in the monitoring program were chosen for logistical and historical reasons and are not a random sample. Particularly relevant here is the fact that the 48 reefs were not independently sampled but were visited in groups each year in five or six cruises. During a cruise the reefs in a particular geographic region were all visited in rapid succession within a few days of each other. A hiatus of a few weeks or months then followed after which another cruise commenced and the reefs in a different location were surveyed. Thus reefs that are spatially close were all sampled together and at roughly the same time of year. Clearly space and time are totally confounded in these data. Correctly accounting for this fact can have profound effects on parameter estimates and their standard errors, substantially decreasing the effective sample size.
[snip]
The defining structure of their data is a spatiotemporal unit corresponding to a group of reefs measured at a specific time (a week or two of a particular year). Observations from the same spatiotemporal unit will be far less variable than observations coming from different units in the same year or from different years. One way to account for this in their regression would be with a mixed effects model in which a separate random intercept is included for each spatiotemporal unit.
The objective of this exam question is to determine if the issues described above invalidate the analysis done by the authors and if so to fix their analysis by following the recommendations made in the comment.
The Questions
- Verify that the basic argument made in the comment above is correct, namely that the observations are clustered in time and that those observations that were sampled at roughly the same time also turn out to be geographically close to each other. Show that this pattern permeates the entire data set.
- Calculate the temporal autocorrelation in the raw data, i.e., the temporal autocorrelation in the response variable PREV_1 at different time lags, and display the results in a graph.
- Calculate the temporal autocorrelation in the deviance residuals extracted from the negative binomial regression model that was fit by the authors and display the results in a graph. Has the inclusion of the predictors WSSTA, CORAL_COVE, and their interaction, which themselves vary over time and space, accounted for the autocorrelation in the response that was seen in question 2?
- Create a categorical variable that groups together those observations that were both sampled roughly at the same time and are geographically close. Based on your results from question 1 a reasonable classification criterion might be the following: when the observations are arranged consecutively by the date they were sampled, a new group should be started any time the time gap between consecutive observations is more than two weeks.
- Use the categorical variable created in question 4 as the structural variable in a random effects model. Fit the authors' same negative binomial regression model but include random intercepts. The random intercepts should be assigned such that observations with the same value of the structural variable have the same intercept, but observations with different values of the structural variable should have different intercepts. Does this random intercepts model effectively account for the autocorrelation in the data? Provide evidence in the form of a table of autocorrelations and a graph.
- Extra credit. There is a commercial package called AD Model Builder that can be used to fit negative binomial regression models with random effects as well as many other kinds of models. The developers of this software have created a freely available R package called glmmADMB that fits negative binomial regression models with random effects in R. Their package is not available at the CRAN site, but can be downloaded from http://otter-rsch.com/admbre/examples/glmmadmb/glmmADMB.html. Use their glmm.admb function to fit the authors' negative binomial regression in which random intercepts are included using the group variable from question 4 as the structural variable. Extract the deviance residuals from this model and test them for autocorrelation. Are the results in agreement with the results obtained using WinBUGS in question 5?
Cited reference
Bruno, John F., Selig, Elizabeth R., Casey, Kenneth S., Page, Cathie A., Willis, Bette L., Harvell, C. Drew, Sweatman, Hugh, and Melendy, Amy M. 2007. Thermal stress and coral cover as drivers of coral disease outbreaks. PLOS Biology 5(6): 1220–1227.
**************
Problem 2 Hints
Question 1
- Download the dates package from the CRAN site. Create a new variable by using the as.date function from this package to convert the DATE variable that is currently in the coral data set into a Julian date. Use this new variable to sort the observations in the corals data set by calendar date. The as.date function acts only on character data so you'll first need to use the as.character function on the variable DATE to convert its values into character data. (Currently the variable DATE is a factor.)
- In the code below ordered.coral is the name I use for the corals data set after it has been sorted so that the observations are in date order. This line of code calculates the number of days that separate consecutive observations in the ordered.coral data frame.
as.date(as.character(ordered.coral$DATE))[2:(dim(ordered.coral)[1])]- as.date(as.character(ordered.coral$DATE))[1:(dim(ordered.coral)[1]-1)]
- Next pre-append the value 0 to the vector of differences created using the line of code above so that the resulting vector has length 280, equal to the number of observations in the corals data set. Create a new data frame consisting of this vector of differences and the vector of SECTOR names obtained from the ordered.coral data frame. Scroll through this data frame and notice that all the large difference values occur exactly at the time when the sample switches to a different sector.
- The map below identifies the sectors by name and number (the number that is assigned to the levels of the variable SECTOR by R when it was made into a factor). Verify that whenever the time difference between consecutive sample times is 9 days or less the consecutively sampled observations either both come from the same sector or come from sectors that are adjacent to each other geographically (typically SW and CB but also once from TO and WH).

Question 2
- Calculating the temporal autocorrelation in these data cannot be done using the built-in R functions acf (base R) or ACF (from the nlme package). The problem is that the observations are not all equally spaced in time. So we need to calculate the temporal distances between all pairs of observations and then calculate the autocorrelation ourselves. To simplify your life I'm supplying you with the code you need to do these calculations. Now in truth we don't need to calculate the temporal distance between every pair of observations, only those pairs that are close enough in time that they contribute to the autocorrelation at the particular lags we're interested in. The following one line of code generalizes the calculation that was done in Question 1 so that not only are the time differences between consecutive observations calculated, but also the time differences between observations that are separated by one other observation, the time differences between observations that are separated by two other observations, etc. As before the code assumes that ordered.coral is the name that was given to the corals data set after it was sorted by sample date.
sapply(1:10, function(i) as.date(as.character(ordered.coral$DATE))[(1+i): (dim(ordered.coral)[1])]- as.date(as.character(ordered.coral$DATE))[1: (dim(ordered.coral)[1]-i)])->the.lags
The variable the.lags that is created by this line of code is a list variable. Each component of the list is a vector of time differences between pairs of observations. The list component the.lags[[1]] is the vector of time differences between consecutive observations that was used in Question 1. Currently the last component the.lags[[10]] is the vector of time differences between pairs of observations that are separated by 9 other observations. The smallest time difference in this last list component turns out to be 12.
min(the.lags[[10]])
[1] 12
This tells us that we have obtained all the pair-wise differences needed to calculate autocorrelations at lag 11 or less. This is adequate for our purposes. (If you want to calculate autocorrelations to greater lags you should change 10 in the above sapply to a larger number.)
- I've written a function that calculates autocorrelations when the observations are not equally spaced. It's called get.acf and can be found here. It has three required arguments:
- the maximum number of lags at which the autocorrelation is to be calculated,
- the variable for which the autocorrelation is being calculated (this variable must be sorted in time), and
- a list variable that contains all the relevant pair-wise temporal differences between observations. The get.acf function assumes that the entry for this third argument has the structure of the variable the.lags that was created above.
The output of the get.acf function is a matrix.
- Column 1 identifies the lag.
- Column 2 is the calculated autocorrelation at that lag.
- Column 3 is the sample size—how many pairs of values were used in calculating the autocorrelation.
So, to obtain the autocorrelation of the raw prevalence data up to lag 10 I would enter the following.
get.acf(10, ordered.coral$PREV_1, the.lags)
I chose 10 for the first argument because we know that the.lags variable, as it was created above, contains enough information to correctly calculate autocorrelations up to lag 10 (to lag 11 in fact).
- I've also written a function that plots the output from the get.acf function. I call it plot.acf and it can be found here. It takes as its only required argument the output from the get.acf function. It has an optional second argument, alpha.n, that allows the user to employ a Bonferroni correction in the plotted confidence bands if desired. The value of alpha.n should equal the number of autocorrelations that are being formally tested (against the null hypothesis that the autocorrelation is zero). So, for instance, if the output of get.acf is stored in the variable acf1, the function call plot.acf(acf1,10) plots the autocorrelations at various lags and adds confidence bands that incorporate a Bonferroni correction to the individual error rate so that the nominal family-wise error rate when 10 independent tests are carried out does not exceed α = .05.
Question 3
- The residuals function when applied to a glm.nb object extracts the deviance residuals by default. For instance, if the fitted model is called out.nb, then the syntax for extracting the residuals is residuals(out.nb) or more formally residuals(out.nb, type='deviance').
- Be sure when you fit the model that you use the data set in which the observations are sorted by observation date.
Question 4
- Create a Boolean expression that tests when a temporal gap in the vector of consecutive time differences exceeds two weeks. You should find that there are 29 such gaps thereby dividing the data set into a total of 30 spatiotemporal groups. Use the same Boolean expression to locate the numerical positions (vector indices) where these gaps occur. (Just create a vector of consecutive position numbers and use the Boolean expression inside a set of brackets to pull out the positions where the Boolean condition evaluates to TRUE.) Subtracting the adjacent numerical values in this vector of positions gives you how many observations occur in each group except for the first and last groups. The number of observations in the first and last groups can be easily obtained by inspection (and knowing how long the original vector is).
- Use the rep function of R and the group membership numbers just obtained to create a new vector that assigns the correct group number to each observation. This vector is the structural variable needed in fitting the random intercepts model.
Question 5
- The formula for calculating the squared deviance residual in a negative binomial regression model is given below.

Here
is the observed count for observation i,
is the estimated negative binomial mean for that observation, and θ is the overdispersion parameter. The deviance residual is then calculated as follows.

where sign(x) is the sign function that is defined as follows.

The WinBUGS code for obtaining the deviance residuals from a negative binomial regression is given below.
val1[i]<-equals(y[i],0)*2*alpha*log((alpha+mu[i])/alpha)
term1[i]<-2*y[i]*log(y[i]/mu[i])
term2[i]<-2*(alpha+y[i])*log((alpha+y[i])/(alpha+mu[i]))
val2[i]<-equals(val1[i],0)*(term1[i]-term2[i])
val[i]<-sqrt(val1[i]+val2[i])
e[i]<-step(y[i]-mu[i])*val[i] - step(mu[i]-y[i])*val[i]
The alpha appearing in the code is the overdispersion parameter (as well as the same named parameter of the gamma distribution), mu is the mean of the negative binomial distribution, and y is the response variable. The last line of BUGS code generates the deviance residuals, e. The rest of the variables in the code just define intermediate calculations. This code should be placed within the first for loop of the WinBUGS code in which the level-1 model is defined.
- I suggest that you begin by refitting your WinBUGS model from Assignment 6 after including the above deviance residual code in your model definition file. Practice extracting the deviance residuals from the final run of the Markov chains (the mean of the posterior distribution is an adequate summary statistic) and then running the autocorrelation functions I've provided on your results. The autocorrelations and autocorrelation plot you obtain should closely resemble what you obtained using the deviance residuals extracted from the same model when it was fit using the glm.nb function.
- The random intercepts negative binomial regression model you need for this problem is a fairly trivial amalgamation of the random intercepts model we constructed in class for a normal response and the negative binomial regression model without random effects that you've used previously. If you find yourself working too hard here then you are doing something wrong.
- It took a very large number of iterations on my computer and a lot of thinning before I obtained WinBUGS results in which all of the diagnostics looked good. I found that there are at least a couple of random intercepts for which the chains exhibit a severe amount of autocorrelation and mix very slowly. For my final run I ran nearly 800,000 iterations using the default thinning rate (which in this case thins the chains so that only every 2400th observation is returned) but specifying a minimal burn-in period. You may have an easier time than I did with your own hardware.
- [Added 12/5] Here's the way I would set about writing the WinBUGS model for this problem.
- Obtain the WinBUGS code for the negative binomial model that appears in the solution to Assignment 6 (available online but without a link until I finish editing it). You can think of the negative binomial regression model of Assignment 6 as being what I called a common pooling model when we discussed hierarchical models.
- In the notes to lecture 23 find the section that shows the common pooling model and the separate intercepts model side by side. Make the changes shown there to turn your negative binomial common pooling regression model into a negative binomial separate intercepts model. The variable species should be replaced by the grouping variable you created in Question 4 that groups the different reefs together according to their location and when they were sampled.
- Next switch to the section in the notes to lecture 23 that shows the separate intercepts model and the random intercepts model side by side. Make the changes shown there to turn your negative binomial separate intercepts regression model into a negative binomial random intercepts model. You're done.
- Back in R you will need to do the following.
- Define J so that it records the correct number of level-2 units as they were defined in Question 4.
- Be sure you use the data that you sorted by date (not the original data) when you create the individual variables that get passed to WinBUGS, otherwise your categories won't correspond to anything meaningful. This is one of the reasons I recommended above that you run the model first without the random intercepts, get the residuals and calculate their autocorrelation. If the result doesn't match what you got after fitting the model with glm.nb you're probably feeding WinBUGS the wrong data.
- The .inits and .parameters objects that get fed to the bugs function will need to be modified so that they combine features of the negative binomial run of Assignment 6 and the random intercepts run in lecture 23. For instance in the .inits object you will need J initial values for the intercept as well as initial values for the two parameters of the random intercepts distribution. Also be sure to add the names of the new variables, J and the grouping variable, to the .data object that gets fed to bugs.
- You'll probably need to do multiple runs of WinBUGS in which you use the last values of the previous run as initial values for the new run. After 50,000 or so iterations I kept doubling the number of iterations from the previous run until I got satisfactory chain behavior and diagnostics.
- Once you get convergence or near convergence and are ready to do your final run, add the deviance residuals, e, to the parameter list that you pass to WinBUGS.
Problem 3
Data
There is no required data set for this problem.
Background
I received the following question by email last week from a colleague.
I have conducted a series of experiments growing calcifying organisms in 4 different aquaria each formulated at different carbonate saturation states, which are predicted to occur in the oceans over the next 200 years because of rising atmospheric CO2. I grew 15 organisms in each of the four tanks. I then measured calcification rates after 60 days of growth in the experimental seawaters. The data are plotted as saturation state (0.7, 1.7, 2.1, 2.5) vs. calcification rate (continuous variable).
I recently submitted the manuscript for publication and the reviews were supportive of publication pending my ability to demonstrate that the results are indeed statistically significant. I originally employed a student's t-test to test for differences between treatments, most of which yielded p < 0.05. However, the reviewers argue that because all specimens were maintained within the same tank at each of the four saturation states, the n value is 4 (not 60), even though I had 15 specimens in each tank. They argue that the observations are not truly dependent because all organisms at a given saturation state were grown in the same tank. Do you think that there is a way to get around this problem without completely redoing these experiments?
The Questions
- Do you agree with the reviewers' criticisms? Why or why not?
- Do your best to answer the researcher's final question to me.
- [New!] It turns out this experiment was repeated 25 times for 25 different species and the researcher repeated all of his treatment comparisons using t-tests to test for treatment differences within each of his species. (With six treatment comparisons possible this means the author may have done upwards of 150 tests.) Suggest an alternative to the author's philatelic approach to data analysis.
Problem 3 Hints
Question 1
Here's an agricultural analog of the experiment that was done in Problem 3. Suppose four plots of land were chosen and four different fertilizer concentrations were applied to each plot. 100 seeds were sewn in each plot so that a total of 400 seeds were used. At the end of the experiment the height of the seedlings was measured as the response. Assume all seeds germinated and there are no missing data. The goal is to assess the effect of fertilizer concentration on plant growth. Is the sample size for assessing the treatment effect in this experiment 4 or 400?
Let's change the experiment slightly. Suppose there were 8 plots of land to which were randomly assigned one of the four fertilizer concentrations so that in the end each fertilizer concentration was used twice. 50 seeds were sewn in each plot so that a total of 400 seeds were used. At the end of the experiment the height of the seedlings was measured as the response. Assume all seeds germinated and there are no missing data. The goal is to assess the effect of fertilizer concentration on growth. Is the sample size for assessing the treatment effect in this experiment 8 or 400?
The sample size for a treatment is determined by the number of observational units to which that treatment was independently applied. Was the treatment in this experiment applied independently to plots of soil or to individual seeds? In terms of multilevel modeling would you call fertilizer concentration a level-1 or a level-2 predictor?
Maybe this is a cleaner example. Suppose we randomly select four apple trees and we apply one of four different concentrations of fertilizer to the four trees. On each tree we randomly select fifteen apples and we randomly apply one of fifteen different concentrations of insecticide to the apples. At the end of the experiment we measure the size of the 60 apples on the four trees as our response. In testing the effect of insecticide, what is the sample size? In testing the effect of fertilizer, what is the sample size? Why are the sample sizes not the same?
I claim that if we remove the insecticide treatment from this last experiment so that there is only the fertilizer treatment, then we have the design used in Problem 3.
Question 2
Here's a simulation of the experiment described in Problem 3. I start by randomly generating four groups of observations consisting of 15 observations each using the rnorm function. The groups have different means but the same variance. The numerical value generated by rnorm will serve as the simulated calcification rate for that observation.
sapply(c(4,6,8,10),function(x) rnorm(15, mean=x, sd=1))->temp
#organize into a data frame adding saturation state and tank number
my.dat<-data.frame(as.vector(temp), rep(c(0.7,1.7,2.1,2.5), rep(15,4)), rep(1:4, rep(15,4)))
names(my.dat)<-c('y','trt','tank')
#Test if saturation state affects calcification rate
#The wrong analysis (according to the reviewers)
lm(y~factor(trt), data=my.dat)->hey1
summary(hey1)
#The correct analysis (according to the reviewers)
library(nlme)
lme(y~factor(trt), random=~1|tank, data=my.dat, method='ML')->hey2
summary(hey2)
The p-values for the test of individual treatment differences are missing in the last analysis. The individual mean differences cannot be formally tested because there are no degrees of freedom available for the t-statistic (because each group has a sample size of one). Is there another option available to us here? Is there some other way to include the variable trt in the model besides as a factor?
Problem 4
Data
The same data set that was used in Exercise 9: http://www.unc.edu/courses/2008fall/ecol/563/001/data/assign9/birds.csv
The Questions
Repeat your analysis from Exercise 9 but this time assume that the bird richness counts have a Poisson distribution.
- Do any of the basic results of your analysis from Exercise 9 change?
- Explain how the interpretations of the regression coefficients change when using the Poisson model. Illustrate these differences in terms of your final model here and in Exercise 9.
- The researcher would have preferred to model density, species richness per unit area, rather than richness per se. Explain how such an analysis can be carried out in the context of Poisson regression and then perform the analysis using these data.
- Having carried out the researcher's wishes in question 3, explain why modeling density rather than species richness is clearly a bad idea with these data.
Problem 4 Hints
- In a Poisson multilevel model there is no level-1 variance and hence an intra-class correlation coefficient can't be calculated.
- It's not necessary to go through all the parts of Exercise 9 again. Just focus on those portions pertinent to the new analysis and that are also likely to change under the new probability model.
- [Added 12/5] The AIC function doesn't work directly on a model created by the lmer function. You have to apply it to logLik of the model instead. Here's an illustration of what I mean.
mod1<-lmer(y~x+(1|group), data=mydata, REML=FALSE)
#obtain AIC
AIC(logLik(mod1))
To use this on multiple models with sapply you need to write it as a generic function: function(x) AIC(logLik(x))
- Note: At this moment (December 2008) in the evolution of the lme4 package, the log-likelihood being returned by lmer for a Poisson model is not correct. It's apparently missing a scaling factor. The upshot is that while you can use the reported log-likelihoods and AIC values to compare and test among different Poisson mixed effects models, they cannot be used to compare, for instance, a Poisson mixed effects model with a normal mixed effects model. I'll have more to say about this in the solutions.
Problem 5
Data
The same data set that was used in Exercise 9: http://www.unc.edu/courses/2008fall/ecol/563/001/data/assign9/birds.csv
Background
In subsequent emails with the researcher who supplied the data set of Exercise 9, she indicated that she might want to treat the landscape variable as a structural variable with patches nested in landscape. I don't know whether this is a reasonable thing to do here. Here's a situation for which it might make sense.
Suppose in her study area there were distinct regions identified as an agricultural area, an urban area, a forest preserve, and a bauxite mine from which she then took multiple samples yielding her different patches. In this scenario then not only do the patches from the same landscape share similar land use characteristics, but they are also probably close to each other. This introduces a new source of heterogeneity in her data and the variable landscape clearly denotes more than land use differences. Landscape becomes a proxy for all those characteristics shared by these patches due to a common geology, edaphic environment, microclimate, etc. By including a common random effect for patches from the same landscape we may be able to account for potential spatial correlation in the data set.
On the other hand suppose her study area is just a patchwork of land usages with farms, bauxite mines, urban areas, and forest preserves scattered across the landscape. If her patches comprise a random sample from this patchwork, then landscape would probably better be treated as a predictor rather than a structural variable.
The Questions
- Regardless of whether it's appropriate or not redo Exercise 9 but this time treat landscape as a structural variable rather than a predictor. Do any of your conclusions change?
- Quantify the amount of variability present at each of the three levels.
- What is the correlation among patches within the same level of landscape?
- Based on the results of this analysis plus your analysis in Exercise 9, do you see any reason to treat landscape as a structural variable rather than a predictor? Explain.