Lecture 43 (lab 11)—Tuesday, April 4 , 2006
What was covered?
- R topics covered
- nlme package for mixed effect models
- xyplot for generating multi-panel displays of structured data sets
- lme syntax for multilevel models
- VarCorr function for extracting variance components
- intervals function for obtaining confidence intervals of model parameters
- Graphing structured data sets using the lattice package and directly as groupedData objects
- Basic syntax of the lme function
- Unconditional means model
- Extracting variance components and computing within group correlations
- Random intercepts model
- Random slopes and intercepts model (unconditional growth model)
R functions and commands demonstrated
- groupedData (from nlme) creates a groupedData object
- intervals (from nlme) produces confidence intervals for all estimated parameters
- lme (from nlme) sets up and fits linear mixed effect models
- sample for selecting a random sample of a specified size from a vector
- set.seed for resetting the random number stream
- trellis.par.set (from lattice) for setting graphics parameters in trellis graphics
- VarCorr (from nlme) extracts variance components from lme models
- xyplot (from lattice) creates individual graphs in panels each panel corresponding to a level-2 unit
- %in% is used to find the intersection of sets
R function options
- fixed= (argument to lme) for defining from the fixed effects portion of a linear mixed effects model
- method= (argument to lme) for changing the estimation method used in obtaining parameter estimates. The default setting is "REML"
- panel= (argument to xyplot) used to define a panel object that describes what is to be plotted in each panel
- random= (argument to lme) for defining from the random effects portion of a linear mixed effects model
- strip= (argument to xyplot) used to modify default characteristics of the strip that appears at the top of each panel
R packages used
- nlme defines the functions lme, VarCorr, intervals, and groupedData for mixed effect models
- lattice defines the functions for lattice graphics—xyplot and various panel functions
Overview of the data
- We analyze the larval development time data set we've been discussing in class.
> #obtain data
> inverts<- read.table('http://www.unc.edu/courses/2006spring/ecol/145/001/data/lab11/74species.csv', sep=',', header=TRUE)
> names(inverts)
[1] "species" "meantemp" "temp" "PLD" "taxon" "climate.3"
[7] "feeding.type" "lntemp" "lnPLD" "n" "PLD.15c" "lnPLD.15c"
[13] "X1000.k" "lograte"
- The variables of interest are lnPLD, lntemp, species, feeding.type, and climate.3.
- lnPLD is the response
- lntemp is the level-1 predictor
- species defines the level-2 units in the data set
- feeding.type is a level-2 predictor
- climate.3 is a level-2 predictor
- I display a few lines of the data set so that the structure is clear. Observe that both lnPLD and lntemp vary within species (and hence are level-1 variables) while feeding.type and climate.3 are constant within species.
> inverts[1:10, c("species", "lnPLD", "lntemp", "feeding.type", "climate.3")]
species lnPLD lntemp feeding.type climate.3
1 Amphiprion melanopus 2.509599 3.218876 P tropical
2 Amphiprion melanopus 2.197225 3.332205 P tropical
3 Armases miersii 2.960105 3.173878 L tropical
4 Armases miersii 2.397895 3.377588 L tropical
5 Balanus amphitrite 1 2.833213 2.708050 P temperate
6 Balanus amphitrite 1 2.639057 2.995732 P temperate
7 Balanus amphitrite 1 2.302585 3.135494 P temperate
8 Balanus amphitrite 1 2.079442 3.218876 P temperate
9 Balanus amphitrite 1 2.197225 3.295837 P temperate
10 Balanus eburneus 2.028148 2.995732 P temperate
- The model we wish to fit is a nonlinear power law model relating pelagic larval duration time (PLD) with temperature. The standard approach is to linearize this equation by log-transforming both sides of the equation.

- We discussed the issue of linearizing nonlinear equations back when we considered the Arrhenius and log-Arrhenius models for the species-area relation. We pointed out then that although the original model and the linearized model are algebraically equivalent, they are not statistically equivalent. Since the researchers chose to log-transform their model, I'll follow their lead and ignore the ramifications this decision might have for data analysis.
- The primary goal of the analysis is to determine if a single model provides a good fit to all species. The authors speculate that a common slope model will work but not a common intercept model. Thus the models we build will largely be geared to addressing this question. Let's see how much data there is.
> dim(inverts)
[1] 218 14
> length(unique(inverts$species))
[1] 74
- So we have 218 observations scattered among 74 species. How many observations do we have for each species?
> table(tapply(inverts$PLD,inverts$species,length))
2 3 4 5 6
29 29 9 5 2
- So we see that most of the species are contributing very little data. The notion of fitting a separate regression model to each species is clearly not very attractive. Species with only two observations we will fit perfectly, an unrealistic result because it ignores the uncertainty in the data. A multilevel model is ideally suited to such a situation, because it draws strength from all species simultaneously weighting them by the amount and reliability of the data they provide.
Examining the data structure
- As with all statistical analysis a sensible first step is to take a look at the data graphically. Because the data here are structured our graphical displays should reflect that structure. The lattice graphics package of R is especially suited for displaying structured data.
- With 74 species to consider, looking at all the data at once is unwieldy. Instead I'll randomly sample enough species to get a picture of what the data look like. The sample function of R can be used to extract random samples of a specific size from a data set. I use the sample function in conjunction with the set.seed function, to reset the random number stream, so that I can repeatedly extract the same "random" sample.
> set.seed(10)
> sub1<-sample(unique(inverts$species),16)
> sub1
[1] Limulus polyphemus Dendraster excentricus Haliotis rufescens
[4] Ostrea lurida Callianassa tyrrhena Chthamalus stellatus
[7] Crassostrea virginica Spirorbis spirorbis Lytechinus variegatus
[10] Gadus morhua Mactra solidissima Hydroides elegans
[13] Cancer irroratus Laqueus californianus Cryptasterina pentagona
[16] Elminius modestus
74 Levels: Amphiprion melanopus Armases miersii Balanus amphitrite 1 ... Upeneus tragula
- So sub1 contains a list of randomly selected species. I next use the %in% operator, the intersection operator, to extract the observations from the data set that correspond to these species.
> sub2<-inverts[inverts$species %in% sub1,]
> dim(sub2)
[1] 50 14
- So now we have 50 observations allocated to 16 different species to examine. I first demonstrate the default graphics that are available from the nlme package for structured data. First we need to identify this structure. This is done by defining what's called a grouped data object using the groupedData function.
> library(nlme)
> sub3<-groupedData(lnPLD~lntemp|species, data=sub2)
- The first argument to the groupedData function is a model formula in which we identify the response and a predictor. For a multilevel model these should be level-1 variables. Observe that the predictor is followed by a vertical bar, |, after which appears the variable that defines the level-2 units in the data set. If we plot this grouped data object we get a panel display in which the response variable is plotted against the predictor separately for each unique value of species (Fig. 1).
> plot(sub3)
> library(lattice)
> trellis.par.set(col.whitebg())
- Next I write a function to produce the desired panel display described. The function call appears next followed by a brief description.
> xyplot(lnPLD~lntemp|species, data=sub2, panel=function(x,y) {
panel.xyplot(x,y)
panel.abline(lm(y~x), lty=1, col=4)
},
strip=strip.custom(par.strip.text = list(cex=0.75)))
- xyplot is the lattice function used to generate the overall plot. Its first argument is the structured formula described previously for the groupedData function in which species is used as the grouping variable. This is followed by the data set name.
- The third argument to xyplot is the panel function. It defines the contents of each panel (where each panel corresponds to a different species).
- panel.xyplot produces a scatter plot.
- panel.abline plots a line in this case a regression line since its first argument is a linear model.
- The strip argument that appears last customizes the strip that appears at the top of each graph. I change the size of the text that is used so that the entire species name is readable in each panel.
- Fig. 2 shows the result. The fit of the linear regression lines to the points appears to be quite good. (This is of course a bit deceptive since nearly half of the species displayed have only two data points.) It is also clear that the slopes and intercepts vary quite a bit across species.
Model building
- There are two packages in R for fitting multilevel models. The older and more comprehensive package is nlme, an acronym for nonlinear mixed effects models. It's limitation is that it only fits normal-based models and was not designed to fit mixed models to nonhierarchical data. The newer package is lme4. It can handle generalized linear mixed effect regression models such as logistic and Poisson regression. It currently lacks the nonlinear features of nlme though.
- Since the examples I wish to consider are all normal-based I will focus on the nlme package. We may have time later to consider the lme4 package, at least to examine the syntax differences that exist. The primary reference for the nlme package is Pinheiro and Bates (2000)
The unconditional means model
- I begin by fitting the unconditional means model. Recall that although this is a model without predictors it does account for the structure in the data. To allow for heterogeneity across level 2 units, it includes a random effect for each level 2 unit (species).

where 
To understand how the R specification of this model works, I substitute the level-2 equation into the level-1 equation to obtain the composite model.

- This last equation maps directly into R as shown next. The function in the nlme packaged for fitting linear mixed effect models is lme.
model0<-lme(fixed=lnPLD~1,random=~1|species,data=inverts,method="ML")
- The first argument defines the fixed part of the model. It specifies the response and the predictor, in this case only an intercept. Since we will always list the fixed part of the model first, it is unnecessary to include the argument name fixed=, and I will omit it from now on.
- The second argument random= defines the random part of the model. Observe that the response is not specified again in the random part so the model statement begins with just a ~. Because only the intercept includes a random effect in its level 2 equation, the intercept is identified as the predictor on the right hand side of the model expression. Alternatively in the composite model if we think of
as being a coefficient then the predictor it is multiplying is 1. The vertical bar once again separates the model specification from the structural specification. The variable species is listed as defining the level-2 units.
- The third argument data = inverts defines the data set.
- The last argument method="ML" requests that estimates be obtained using full maximum likelihood rather than the default estimation method of REML, restricted (residual) maximum likelihood. This is the appropriate setting for the tests we will carry out today.
- The summary output from the model is displayed below.
> summary(model0)
Linear mixed-effects model fit by maximum likelihood
Data: inverts
AIC BIC logLik
451.2372 461.3906 -222.6186
Random effects:
Formula: ~1 | species
(Intercept) Residual
StdDev: 0.8198609 0.4529645
Fixed effects: lnPLD ~ 1
Value Std.Error DF t-value p-value
(Intercept) 2.888758 0.1008001 144 28.65829 0
Standardized Within-Group Residuals:
Min Q1 Med Q3 Max
-2.15822703 -0.58084922 -0.05855462 0.55220371 2.38062759
Number of Observations: 218
Number of Groups: 74
- I've highlighted in bold the quantities of interest to us.
- In the first line we're told that maximum likelihood is the estimation method that was used. Thus the AIC and loglikelihood that appear on lines 3 and 4 are correct and can be used to make comparisons between models with different fixed effects (or random effects).
- The next section lists the estimates from the random effects part of the model. Recall that we don't actually estimate the random effects, we instead estimate the parameters of their distribution,
and
for the unconditional means model. Notice that in the line where the numerical estimates appears the label is StdDev, so in fact what's displayed here are τ (Intercept) and σ (Residual). Since it is the squares of these quantities that are interpretable and there is a another way to obtain the squares, I'll wait to comment further on these numbers.
- In the Fixed effects section is the reported value of the intercept, its estimated standard error, and a Wald test for whether its value is significantly different from zero.
- Since the primary reason for fitting the unconditional means model is to obtain the variance components, I turn to these next. Variance components are extracted from the model with the VarCorr function.
> VarCorr(model0)
species = pdLogChol(1)
Variance StdDev
(Intercept) 0.6721719 0.8198609
Residual 0.2051768 0.4529645
- What's reported in the first column are the squares of the values reported in the summary table, so these are variances. From the output we determine that
= 0.672 and
= 0.205. From this we learn a couple of things.
- Since
> 0, heterogeneity is present. Put another way we have evidence of a correlation structure (as opposed to independent observations).
- Furthermore because the variability between species is nearly three times the variability within species, level-2 modeling in which predictors are included to explain differences between species is likely to be productive.
- We can use the output from VarCorr to calculate the intraclass correlation coefficient.

- Unfortunately VarCorr has not produced a matrix, but a table, and the elements of the table are actually character values.
> VarCorr(model0)[1,1]
[1] "0.6721719"
As a result our calculations will turn out to be rather awkward. I need to convert each character value to a numeric value before I can do arithmetic on it.
> tau.sq<-as.numeric(VarCorr(model0)[1,1])
> sigma.sq<-as.numeric(VarCorr(model0)[2,1] )
> tau.sq/(tau.sq+sigma.sq)
[1] 0.76614
So the correlation between observations coming from the same species is 0.766, a sizeable value and far from 0, the value required for ordinary linear regression to be appropriate here.
- The intervals function calculates confidence intervals for all the estimated parameters.
> intervals(model0)
Approximate 95% confidence intervals
Fixed effects:
lower est. upper
(Intercept) 2.689977 2.888758 3.08754
attr(,"label")
[1] "Fixed effects:"
Random Effects:
Level: species
lower est. upper
sd((Intercept)) 0.6844931 0.8198609 0.9819995
Within-group standard error:
lower est. upper
0.4034828 0.4529645 0.5085144
- Unfortunately the confidence intervals for the random effects part of the model are for the standard deviations, rather than the variances. In any case the standard errors used in calculating the confidence intervals for the variance components are generally regarded as not being very reliable.
The random intercepts model
- The random intercepts model enhances the unconditional means model by including a level-1 predictor. We add the variable lntemp to the model.

For completeness I've included equations at level 2 for all level-1 parameters. Since the slopes are not assumed to vary across species in this model, I set the level-2 equation for the slope for each species equal to a constant.
- The composite equivalent of the multilevel random intercepts model is shown next.

- From the composite model we can directly construct the R code for fitting this model.
> model1<-lme(lnPLD~lntemp, random=~1|species, data=inverts,method="ML")
This is identical to the unconditional means model specification except that we now have a predictor in the fixed effects portion of the model.
- Recall the R convention that if a predictor is included in regression model, the intercept is included automatically. Thus an equivalent way of writing the fixed part of the random intercepts model would be
fixed=lnPLD~1+lntemp
- The model summary is shown below. The only new information we obtain is the estimated coefficient of the level-1 predictor lntemp, shown here to be negative and significantly different from zero.
> summary(model1)
Linear mixed-effects model fit by maximum likelihood
Data: inverts
AIC BIC logLik
239.5597 253.0977 -115.7798
Random effects:
Formula: ~1 | species
(Intercept) Residual
StdDev: 0.8891414 0.2115642
Fixed effects: lnPLD ~ lntemp
Value Std.Error DF t-value p-value
(Intercept) 6.164731 0.17954444 143 34.33541 0
lntemp -1.161889 0.05168823 143 -22.47879 0
Correlation:
(Intr)
lntemp -0.811
Standardized Within-Group Residuals:
Min Q1 Med Q3 Max
-3.08734705 -0.45362977 -0.04267881 0.50275266 3.62335848
Number of Observations: 218
Number of Groups: 74
- The first use we can make of the random intercept model is in assessing how important it was to account for the structure of the data. To do so we need to first fit a model in which the structure is ignored. That would be an ordinary linear regression model with lntemp as the predictor.
> model.OLS<-lm(lnPLD~lntemp,data=inverts)
We can test for the importance of structure using a likelihood ratio test. The degrees of freedom for the test is just the difference in the number of estimated parameters. The random intercepts model estimates one additional parameter
.
> logLik(model.OLS)
'log Lik.' -273.5342 (df=3)
> logLik(model1)
'log Lik.' -115.7798 (df=4)
> 1-pchisq(2*(logLik(model1)-logLik(model.OLS)),1)
[1] 0
Clearly the difference in the loglikelihoods is gigantic. The AIC tells an identical story. The random intercepts model is a huge improvement over the ordinary least squares model. Clearly it is important that we account for data structure in fitting our models.
- Technically the likelihood ratio test we just carried out is incorrect. The null hypothesis we are testing asserts that the variance of the random effects is zero. Because variances are non-negative, a value of zero is a boundary value. One of the conditions for the chi-squared approximation to hold (called a regularity condition) is that the value being tested does not lie on the boundary of the parameter space. This condition fails here and hence the chi-squared distribution of the likelihood ratio statistic is suspect.
- A number of solutions have been proposed. One that is easy to implement is the following (Verbeke and Molenberghs 2000, p. 64–73) and I describe it for the general case. Suppose the models in question differ by one random effect. Thus the goal is to compare two nested models such that one model includes q random effects and the second model includes one additional random effect for a total of q + 1 random effects. The likelihood ratio statistic in this case has a distribution that is approximately a 50:50 mixture of two chi-squared distributions, one with q degrees of freedom and one with q + 1.
- Thus to compare the two models above we should use a mixture of a
and a
random variable with equal weights 0.5. A
random variable gives a probability mass of 1 to the value 0 and a mass of 0 to every other value. Thus in our example above the p-value of our test is calculated as follows.
> .5*0 + .5*(1-pchisq(2*(logLik(model1)-logLik(model.OLS)),1))
[1] 0
so in this instance our conclusion does not change.
- The second use we can make of the random intercept model is to assess the importance of the level-1 predictor. From the Wald test that appears in the summary table we already know that the level-1 predictor is statistically significant. We can obtain the likelihood ratio test version of this test by comparing this model to the unconditional means model, using R's anova function.
> anova(model1,model0)
Model df AIC BIC logLik Test L.Ratio p-value
model1 1 4 239.5597 253.0977 -115.7798
model0 2 3 451.2372 461.3906 -222.6186 1 vs 2 213.6775 <.0001
- We can also use these two models to obtain a pseudo-R2 statistic for the importance of the level-1 predictor. The relevant formula (taken from Lecture 42) is shown below.

The variance components are shown next.
> VarCorr(model1)
species = pdLogChol(1)
Variance StdDev
(Intercept) 0.7905725 0.8891414
Residual 0.0447594 0.2115642
> VarCorr(model0)
species = pdLogChol(1)
Variance StdDev
(Intercept) 0.6721719 0.8198609
Residual 0.2051768 0.4529645
Pulling out the relevant quantities we have the following.
> (as.numeric(VarCorr(model0)[2,1]) - as.numeric(VarCorr(model1)[2,1]))/ as.numeric(VarCorr(model0)[2,1])
[1] 0.7818496
So 78% of the variability of lnPLD at level 1 (the individual level) is explained by its linear relationship to lntemp.
- This calculation also highlights why this statistic is called a pseudo-R2 statistic. Observe from the variance component table that while
decreased (from 0.205 to 0.045) as a result of adding lntemp to the model at level 1,
went up (from 0.67 to 0.79). Thus if we were to calculate an R2 at level 2 here, it would be negative! Because of the inter-relationship between the variances at different levels it is difficult to take the pseudo-R2 calculation too seriously here.
The random slopes and intercepts model
- The random slopes and intercepts model adds a second equation at level 2 to the random intercepts model by allowing slopes to vary across species. The multilevel model formulation is shown below.

where

- The equivalent composite model formulation is shown next.

- From this we can write down the R formulation of the model.
> model2<-lme(lnPLD~1+lntemp, random=~1+lntemp|species, data=inverts, method="ML")
or equivalently
> model2<-lme(lnPLD~lntemp, random=~lntemp|species, data=inverts, method="ML")
- The summary output is slightly more complicated because there are now three parameters estimated for the random part of the model.
> summary(model2)
Linear mixed-effects model fit by maximum likelihood
Data: inverts
AIC BIC logLik
210.4372 230.7441 -99.21859
Random effects:
Formula: ~lntemp | species
Structure: General positive-definite, Log-Cholesky parameterization
StdDev Corr
(Intercept) 2.0544348 (Intr)
lntemp 0.5301851 -0.92
Residual 0.1520061
Fixed effects: lnPLD ~ lntemp
Value Std.Error DF t-value p-value
(Intercept) 7.088667 0.29554220 143 23.98530 0
lntemp -1.454417 0.08465556 143 -17.18040 0
Correlation:
(Intr)
lntemp -0.942
Standardized Within-Group Residuals:
Min Q1 Med Q3 Max
-2.56320100 -0.36679640 -0.01297869 0.41001698 3.64570103
Number of Observations: 218
Number of Groups: 74
- To assess the whether including random slopes has improved the model we should compare this model with the random intercepts model.
> anova(model2,model1)
Model df AIC BIC logLik Test L.Ratio p-value
model2 1 6 210.4372 230.7441 -99.21859
model1 2 4 239.5597 253.0977 -115.77985 1 vs 2 33.12252 <.0001
So the addition of random slopes is a significant improvement.
- Again this test is technically not correct. Instead to compare these two models we should use a mixture of
and
random variables with equal weights 0.5. Thus in our example above the p-value of our test is calculated as follows.
> .5*(1-pchisq(33.12252,1)) + .5*(1-pchisq(33.12252,2))
[1] 3.642662e-08
- I examine the variance components next.
> VarCorr(model2)
species = pdLogChol(lntemp)
Variance StdDev Corr
(Intercept) 4.22070229 2.0544348 (Intr)
lntemp 0.28109623 0.5301851 -0.92
Residual 0.02310587 0.1520061
- From the output we see that
= 4.22,
= 0.281, and
= 0.023. Notice that
is not reported. Instead an estimate of the correlation is shown. We can calculate the covariance from the formula
as follows .
> as.numeric(VarCorr(model2)[2,3]) * as.numeric(VarCorr(model2)[1,2]) * as.numeric(VarCorr(model2)[2,2])
[1] -1.002092
- The level-2 variance components are not directly comparable to each other because
needs to be scaled by the predictor lntemp. Any change in the scale of lntemp will affect the reported value of
. Thus although it's tempting at this point to say that there is greater variance in the intercept random effects than in the slope random effects (and thus directly address one of the research questions of the investigators) that conclusion does not follow. In truth the variance relationship changes with the value of lntemp. Furthermore there is a nonzero covariance between the random effects which means the two variances are not independent of each other.
- The large reported negative correlation between the random effects, –0.92, can potentially lead to estimation problems down the road (as we'll see tomorrow). A standard way to fix this problem is to center the predictor. Centering the predictor generally means subtracting off its mean. I refit the model after centering lntemp and check the correlation. I use the I() function so that the centering arithmetic is carried out before the model is fit.
> model2.5<-lme(lnPLD~I(lntemp-mean(lntemp)), random=~I(lntemp-mean(lntemp))|species, data=inverts, method="ML")
> VarCorr(model2.5)
species = pdLogChol(I(lntemp - mean(lntemp)))
Variance StdDev Corr
(Intercept) 0.81092490 0.9005137 (Intr)
I(lntemp - mean(lntemp)) 0.28109627 0.5301851 -0.444
Residual 0.02310586 0.1520061
Observe that the correlation has been reduced by half.
- The random slopes and intercepts model is typically the baseline upon which the level-2 model is built. We'll build a level-2 model tomorrow.
Cited References
- Pinheiro, J. C. and Bates, D. M. 2000. Mixed-Effects Models in S and S-Plus . Springer-Verlag, New York.
- Verbeke, G. and G. Molenberghs. 2000. Linear Mixed Models for Longitudinal Data. Springer-Verlag, New York.
Course Home Page