Lecture 29—Wednesday, December 3, 2008
What was covered?
Terminology defined
R functions and commands demonstrated
- dmvnorm (from mvtnorm) is the multivariate normal density function
- gnls (from nlme) is the generalized nonlinear least squares function. It is analogous to gls but for nonlinear models. It permits the modeling of group-level correlation and variance structures without also including random effects.
- if...else construction for conditional execution of R statements
- predict (method from nlme) to obtain the predicted means from a fitted lme model
- qq.plot (from car) is a version of qqnorm that adds confidence bounds to the plot
- sample is used to take a random sample from a vector of values
- segments is used to draw line segments between specified points
- split is used to divide a vector into a groups as dictated by the categories of a second vector
- write.table is used to write out text files from an R data frame
R function options
- las= (argument to axis and other graphics function) controls the orientation of displayed text. Choosing las=2 orients text perpendicular to an axis.
- lend= (argument to par function) defines the type of line ends used when drawing line segments. The default is "rounded". Setting lend=2 changes this to "square".
- level= (argument to predict for lme models) specifies the level in the multilevel model at which the mean should be computed. A choice of level=0 corresponds to the population level.
- log= (argument to plot function) changes the scale of the specified axis to logarithmic, for instance, log='x' is used to change the scale of just the x-axis.
- replace= (argument to sample) when set to TRUE causes the random sample to be drawn with replacement. Sampling with replacement is necessary for bootstrapping.
- weights= (argument to lme or nlme) is used to specify a variance function for modeling the variance structure of the level-1 residuals.
R packages used
- arm for the bugs function
- car for the qq.plot function
- lattice defines the functions for lattice graphics
- mvtnorm for the multivariate normal density function dmvnorm
- nlme for fitting linear and nonlinear mixed effects models
Variance Heterogeneity
- I use this last (online only) lecture to tie up some loose ends for two of the analyses we did in the course and to extend a result involving the likelihoods of transformed response variables (lecture 10) to the mixed model setting. The analyses I wish to re-examine are the following:
- the three-way analysis of variance we carried out for the tadpole data in lecture 2 and lecture 5, and
- the nonlinear mixed effects model we fit to the kestrel data in lecture 27 and lecture 28.
- Although the analyses we used in these two cases were quite different, both exhibit the same shortcoming in that they failed to adequately account the heterogeneous variances of the response variable. Variance heterogeneity is a problem when it arises in normal-based models. The basic assumption we make in such models is

Thus although it's permissible for different observations to come from normal distributions with different means,
, we require that each of these normal distributions share a common variance,
.
- How can variance heterogeneity manifest itself in the two analyses we did?
- In an ANOVA model each group (where by "group" I mean a specific treatment combination) is allowed to have a different mean, but the variability in each group should be the same. Thus if the variances differ for different groups we have evidence for variance heterogeneity.
- If the constant variance assumption holds in ordinary regression then we expect to see roughly the same variation about the regression surface at all points. If instead we see for instance that the spread about the regression surface increases as the predicted value of the response increases, that would be evidence of variance heterogeneity.
- So, if variance heterogeneity exists it means that our basic model is wrong. Contrary to our assumptions, the response is not normally distributed with a variable mean and a constant variance.
- There are two basic approaches for dealing with variance heterogeneity.
- Keep the normality assumption but allow the variance to vary in a manner as dictated by the data. This is clearly an ad hoc approach. We pretend that our probability model is sound and apply enough Band-Aids in the form of alternative variance models to fix what's wrong. The formal statistical methodology for doing this is weighted least squares or its extension generalized least squares.
- Reject the assumption of normality and choose instead a probability model in which the variance is not constant but varies in a some systematic fashion with the mean. Variance heterogeneity often arises in practice because a natural boundary exists that restricts the extent to which the data can vary in one direction but not the other. With biological data that natural boundary is usually zero—the value of the response cannot be negative. The normal distribution places no such restriction on its values. Two probability distributions that are commonly used for positive-only data are the lognormal and gamma distributions.
- Although I'm generally not fond of the first approach, using generalized least squares with an explicit model for the variance is a sensible alternative when
- there is no theory to assist in choosing a probability model and
- there data are inadequate for making the decision empirically.
- The second option of choosing an alternative probability model is the preferred one when the biology of the phenomenon under investigation makes it clear that a normal probability distribution is a silly probability model.
- In what follows I give examples of using both approaches by revisiting two previous analyses we've carried out.
Variance Heterogeneity in an ANOVA Model
- I reload and process the tadpoles data set just as was done in lecture 2.
tadpoles <- read.table( 'http://www.unc.edu/courses/2008fall/ecol/563/001/data/lab1/gooddat.txt', header=TRUE)
prelim <- as.character(tadpoles$var)
treatment<-substring(prelim,6,9)
response<-as.numeric(substring(prelim,10,nchar(prelim)))
- Next, I reproduce the various plots we used to compare the locations and variabilities of the distributions of each group.
#obtain treatment means
treat.mean <- tapply(response, treatment, mean, na.rm=T)
#produce plot with customized y-axis labels
boxplot(response~treatment, horizontal=T, axes=F, outline=F, xlab='Mitotic activity', ylim=range(response,na.rm=T))
axis(1)
axis(2,las=2,at=1:12,labels=names(treat.mean))
box()
#overlay raw data
points(response, jitter(as.numeric(factor(treatment))), pch=16, col='seagreen', cex=.6)
#add group means
points(treat.mean,1:12,pch=8,col=2)
Fig. 1 Graphical summary of the results of the tadpole experiment
apply(response,treatment, var, na.rm=T)->treat.var
dotplot(names(treat.var)~treat.var, xlab='Variance')
tapply(response, treatment, mad, na.rm=T)->treat.mad
dotplot(names(treat.mad)~treat.mad, xlab='MAD')
(a)  |
(b)  |
| Fig. 2 Dotplots of (a) group variances and (b) group MADs (median absolute deviations) |
- In lecture 2 I drew the following conclusions from these Figures.
- While there is evidence that the group variances are not homogeneous, the problem is not widespread. It appears that only one or two groups stand out as intrinsically more variable than the rest. In other groups with large variances the cause appears to be the presence of a few outlying observations. (CoD1 is an example.)
- Variance heterogeneity due to outliers is easy to deal with. We just do the analysis twice, once with and once without the outliers, and see if our results change at all. If they do then we have to make a decision about what to do about the outliers. If the results don't change then the effect of the outliers is unimportant and we can ignore the heterogeneity. Intrinsic variance heterogeneity on the other hand is a more complicated problem with which to deal.
- The boxplot display in Fig. 1 suggests that the variance heterogeneity is not having a large effect on the values of the means. If we were to rank groups based on their treatment means or on their treatment medians, we would get essentially the same ranking.
- Based on these considerations I'm willing to tentatively conclude that the variance heterogeneity we are seeing here is probably not all that important. So, for the moment I'm going to ignore it. I'm willing to proceed with the standard analysis of variance as if there were no problems present. At the end of the course we'll return to this data set and tackle the variance heterogeneity issue explicitly and see if doing so changes our conclusions in any way.
- The time to revisit these data has arrived! Should we be worried about the variance heterogeneity? The usual rule of thumb is that ANOVA is fairly resistant to variance heterogeneity when there is complete balance across the groups, i.e., when each group has the same number of observations. Is that the case here?
tapply(response, treatment, function(x) sum(!is.na(x))) -> grplength
grplength
CoD1 CoD2 CoS1 CoS2 NoD1 NoD2 NoS1 NoS2 RuD1 RuD2 RuS1 RuS2
13 17 24 24 19 24 22 24 12 24 12 24
Because of all the missing data it turns out there is a substantial imbalance across the groups. So we can't appeal to robustness.
- There is something else of interest in the above list of sample sizes. Observe that the three groups with the most extreme variances in Fig. 2 all have smaller sample sizes than do most of the rest. Variances require a lot of observations to obtain accurate estimates. Hence it's possible that the differences in variability that we're seeing may reflect our inability to get good variance estimates for those groups that have many missing values.
- There is nothing especially unusual about any of the distributions shown in Fig. 1 that would lead us to reject normality out of hand. The empirical distributions are symmetric or exhibit only a slight skew. Consequently using a probability model such as a lognormal distribution (equivalent to first log transforming the data and fitting an ordinary regression model) or a gamma distribution (and then fitting a generalized linear model) is probably not worth considering.
- Both of these distributional choices would make sense if the data were coming up against a natural boundary (such as zero) that forced the distributions to have long upper tails. The distributions here are all far removed from zero and are symmetric.
- Now it is the case that the distributions of a couple of the Co groups might be better represented by a lognormal or gamma, but they could also be "fixed" by the removal of a single aberrant outlying observation. In short choosing a probability model different from normal is not strongly supported by the data.
Weighted least squares
- A standard way of handling variance heterogeneity analytically is through weighted least squares. In weighted least squares a weighted sum of squared residuals is minimized to obtain parameter estimates. Typically the weights are chosen to reflect different group precisions and a sensible choice for the weights is the reciprocal of the variance of each group. One problem with weighted least squares when the weights are estimated from the data is that usual degrees of freedom for the F-distribution used in assessing the statistical significance of partial F-tests of model parameters is only approximate. As a result the accuracy of the p-values for these significance tests is uncertain. This problem can be resolved to some extent by bootstrapping as I illustrate.
- The code shown below carries out weighted least squares on the tadpole data. A sequence of F-tests is carried out in a logical fashion in which the three-factor interaction is tested first followed by separate tests of the different two-factor interactions. The same order for testing the two-factor interactions is used as was done in the unweighted analysis (where the least significant two-factor interaction was examined first).
fac1<-factor(substring(treatment,1,2))
fac2<-factor(substring(treatment,3,3))
fac3<-factor(substring(treatment,4,4))
tapply(response, treatment,var, na.rm=T)->grpvar
frame.var<-data.frame(names(grpvar), grpvar)
colnames(frame.var)[1]<-'treatment'
frame.var2<-data.frame(treatment, response, fac1, fac2, fac3)
frame.var3<-merge(frame.var, frame.var2, all=T)
frame.var3$wght<-1/frame.var3$grpvar
#fit full model
lm(response~fac1*fac2*fac3, data=frame.var3, weight=wght)->out1a
#weighted: test 3-factor interaction
update(out1a,.~.-fac1:fac2:fac3)->out2a
anova(out2a,out1a)
Analysis of Variance Table
Model 1: response ~ fac1 + fac2 + fac3 + fac1:fac2 + fac1:fac3 + fac2:fac3
Model 2: response ~ fac1 * fac2 * fac3
Res.Df RSS Df Sum of Sq F Pr(>F)
1 229 227.79
2 227 227.00 2 0.79 0.3949 0.6742
#weighted: test fac1:fac3 interaction
update(out2a,.~.-fac1:fac3)->out3a
anova(out3a,out2a)
Analysis of Variance Table
Model 1: response ~ fac1 + fac2 + fac3 + fac1:fac2 + fac2:fac3
Model 2: response ~ fac1 + fac2 + fac3 + fac1:fac2 + fac1:fac3 + fac2:fac3
Res.Df RSS Df Sum of Sq F Pr(>F)
1 231 229.587
2 229 227.790 2 1.797 0.9034 0.4066
#weighted: test fac1:fac2 interaction
update(out3a,.~.-fac1:fac2)->out4a
anova(out4a,out3a)
Analysis of Variance Table
Model 1: response ~ fac1 + fac2 + fac3 + fac2:fac3
Model 2: response ~ fac1 + fac2 + fac3 + fac1:fac2 + fac2:fac3
Res.Df RSS Df Sum of Sq F Pr(>F)
1 233 234.326
2 231 229.587 2 4.739 2.3841 0.09444 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#weighted: test fac2:fac3 interaction
update(out4a,.~.-fac2:fac3)->out5a
anova(out5a,out4a)
Analysis of Variance Table
Model 1: response ~ fac1 + fac2 + fac3
Model 2: response ~ fac1 + fac2 + fac3 + fac2:fac3
Res.Df RSS Df Sum of Sq F Pr(>F)
1 234 244.03
2 233 234.33 1 9.70 9.6449 0.002134 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
- We can compare the weighted results to the results of the unweighted analysis.
lm(response~fac1*fac2*fac3,data=frame.var3)->out1
#unweighted: test 3-factor interaction
update(out1,.~.-fac1:fac2:fac3)->out2
anova(out2,out1)
Analysis of Variance Table
Model 1: response ~ fac1 + fac2 + fac3 + fac1:fac2 + fac1:fac3 + fac2:fac3
Model 2: response ~ fac1 * fac2 * fac3
Res.Df RSS Df Sum of Sq F Pr(>F)
1 229 13.8533
2 227 13.7838 2 0.0695 0.5723 0.565
#unweighted: test fac1:fac3 interaction
update(out2,.~.-fac1:fac3)->out3
anova(out3,out2)
Analysis of Variance Table
Model 1: response ~ fac1 + fac2 + fac3 + fac1:fac2 + fac2:fac3
Model 2: response ~ fac1 + fac2 + fac3 + fac1:fac2 + fac1:fac3 + fac2:fac3
Res.Df RSS Df Sum of Sq F Pr(>F)
1 231 13.908
2 229 13.853 2 0.055 0.4542 0.6355
#unweighted: test fac1:fac2 interaction
update(out3,.~.-fac1:fac2)->out4
anova(out4,out3)
Analysis of Variance Table
Model 1: response ~ fac1 + fac2 + fac3 + fac2:fac3
Model 2: response ~ fac1 + fac2 + fac3 + fac1:fac2 + fac2:fac3
Res.Df RSS Df Sum of Sq F Pr(>F)
1 233 14.2100
2 231 13.9083 2 0.3017 2.5057 0.08384 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#unweighted: test fac2:fac3 interaction
update(out4,.~.-fac2:fac3)->out5
anova(out5,out4)
Analysis of Variance Table
Model 1: response ~ fac1 + fac2 + fac3
Model 2: response ~ fac1 + fac2 + fac3 + fac2:fac3
Res.Df RSS Df Sum of Sq F Pr(>F)
1 234 14.68
2 233 14.21 1 0.47 7.7069 0.005948 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Checking the sensitivity of the weighted analysis to the choice of weights
- The reported p-values from the weighted analysis may be unreliable when the weights used are estimated rather than exact. Following the recommendations of Kutner et al. (2005), we can evaluate the stability of the p-values by obtaining bootstrap samples from the treatment groups, which we use to estimate new weights for a weighted analysis of the original data, and then rerunning the sequential ANOVA tests. The role of the bootstrap here is to generate new samples of response values from a surrogate for the population from which we actually drew our sample.
- The logic behind the bootstrap is as follows. By assumption, the tadpoles used in this study are a random sample from each of the treatment groups. We're allowed to make this assumption because for each classification factor, such as sex, we've randomly selected tadpoles of each sex, and for each true treatment factor, such as diet, the treatment has been randomly assigned to individual tadpoles (after controlling for the classification factor). Under these assumptions the distribution of the response variable in the sample from each treatment group should be a reasonable approximation of the distribution of the response variable in the population for that treatment.
- Fig. 3 illustrates this idea. Here I've isolated the sample observations that received treatment RuS2. (The details on how this was done are given below.) What's shown is the raw sample distribution of the response (as a histogram) along with the smoothed version of the sample distribution (superimposed red density curve). The bootstrap assumption is that the sample's probability density curve shown here should be very similar to the true, but unknown, probability density of the population.
hist(z$RuS2, probability=T, xlim=c(3.8,4.8), xlab='Response', main='Treatment Group RuS2', col='grey90, cex.main=1.05')
lines(density(z$RuS2), col=2)

Fig. 3 Distribution of response for treatment RuS2
- To get a new sample from the population we just resample the sample values we already have. We carry out the resampling probabilistically so that the more common sample values are more likely to be selected. In other words we draw from the sample as dictated by the probability density shown in Fig. 3. The way to do this with a finite sample is to sample with replacement, i.e., whenever an observation is drawn we place it back in "population" so that it can be potentially drawn again.
- The function below that I call boot.wt.func obtains the bootstrap samples, recalculates the weights, fits the various regression models using the original data (not the resampled data) with these weights, and returns a matrix of p-values for tests of the four interactions listed in Table 1. This is then repeated enough times to get a stable estimate of the distribution of these p-values. The bootstrapping portion of boot.wt.func is carried out in the following four lines of code.
y<-response[!is.na(response)]
x<-treatment[!is.na(response)]
z<-split(y,x)
boot.samp<-lapply(z,function(x) sample(x, replace=T))
- The split function in line three divides the response variable y by categories of the treatment variable x. The result is a list of vectors. Below I show the first two components of that list.
z[1:2]
$CoD1
[1] 4.166 3.247 3.289 3.195 3.431 3.180 3.140 3.301 3.253 3.155 3.119
[12] 3.250 4.454
$CoD2
[1] 3.711 3.375 3.622 3.315 3.473 3.820 3.548 3.428 3.502 3.572 3.497
[12] 4.395 3.075 3.341 3.300 2.915 3.257
- The lapply function in line four above is used to apply the sample function (argument 2) separately to each component (treatment group) of the list z. We get a bootstrap sample because the sample function is used with the replace=T argument. Without specifying replace=T we would just get the original sample back again, albeit with the elements in a different order.
- In the lines below I compare the original sample for treatment group CoD1 with its bootstrap version.
sort(z$CoD1)
[1] 3.119 3.140 3.155 3.180 3.195 3.247 3.250 3.253 3.289 3.301 3.431
[12] 4.166 4.454
sort(boot.samp$CoD1)
[1] 3.119 3.140 3.140 3.155 3.247 3.247 3.247 3.253 3.289 3.301 3.431
[12] 3.431 3.431
- We can immediately some differences in the bootstrap sample. Observations 3.180, 3.195, 3.250, 4.166, and 4.454 were not selected in the bootstrap sample. Instead 3.140 was selected twice, 3.247 was selected three times, and 3.431 was selected three times.
- The rest of the boot.wt.func function uses the bootstrap samples from each treatment group to calculate new weights. These are then merged with the original data set (by first creating a new data frame containing the weights along with a common key field, in this case 'treatment'). These new weights then are used to fit the various linear models that allow us to carry out sequential tests of the three-factor interaction and the three two-factor interactions. The boot.wt.func function has a seed argument so that the same bootstrap sample can be obtained again if desired.
#num.boots is the number of times to resample
#myseed sets the random number seed for the bootstrap samples
boot.wt.func<-function(num.boots,myseed=10)
{
set.seed(myseed)
boot.results<-matrix(NA,num.boots,4)
#obtain bootstrap weights and run ANOVA tests
for(i in 1:num.boots) {
#split data into treatment groups
y<-response[!is.na(response)]
x<-treatment[!is.na(response)]
z<-split(y,x)
#resample each treatment group
boot.samp<-lapply(z,function(x) sample(x, replace=T))
#calculate weight for each treatment group
new.wt<-1/sapply(boot.samp,var)
#add weights to raw data
frame.wt<-data.frame(names(new.wt), new.wt)
colnames(frame.wt)<-c('treatment', 'wght')
frame.dat<-data.frame(treatment, response, fac1, fac2, fac3)
frame.boot<-merge(frame.wt, frame.dat, all=T)
#### carry out ANOVA tests ####
#test 3-factor interaction
lm(response~fac1*fac2*fac3, data=frame.boot, weight=wght) -> mod1
update(mod1,.~.-fac1:fac2:fac3) -> mod2
anova(mod2,mod1) -> anova.int3
p1 <- anova.int3["Pr(>F)"][2,1]
#test fac1 : fac3 interaction
update(mod2,.~.-fac1:fac3) -> mod3
anova(mod3,mod2) -> anova.int13
p2 <- anova.int13["Pr(>F)"][2,1]
#test fac1 : fac2 interaction
update(mod3,.~.-fac1:fac2) -> mod4
anova(mod4,mod3) -> anova.int12
p3 <- anova.int12["Pr(>F)"][2,1]
#test fac2 : fac3 interaction
update(mod4,.~.-fac2:fac3)->mod5
anova(mod5,mod4)->anova.int23
p4 <- anova.int23["Pr(>F)"][2,1]
boot.results[i,] <- c(p1, p2, p3, p4)
}
return(boot.results)
}
- I run the function 1000 times (with an initial seed of 15) and obtain the median along with 50% and 95% percentile-based confidence intervals of the p-value distributions.
boot.wt.func(1000,15)->boot.dist
apply(boot.dist,2,function(x) quantile(x, c(.025, .25, .5, .75, .975))) -> out
colnames(out) <- c('3-factor','fac1 x fac3','fac1 x fac2','fac2 x fac3')
out
3-factor fac1 x fac3 fac1 x fac2 fac2 x fac3
2.5% 0.6125987 0.2743471 0.04583161 0.0005427165
25% 0.6799251 0.3823111 0.08452139 0.0019939765
50% 0.7233457 0.4550797 0.11821048 0.0036579286
75% 0.7570539 0.5317377 0.18328789 0.0072876328
97.5% 0.8244707 0.8409729 0.42074773 0.0227337888
- From the distributions we can get a sense of what a typical p-value might be given the uncertainty in our estimations of the inverse variances that are used as weights. It's far more informative to display this information in a graph. The result shown in Fig. 4 is sometimes called a probability smear graph.
- Because the p-values span a couple orders of magnitude, I use a log scale for the x-axis in Fig. 4. This is accomplished by including a log argument in the plot function.
- The error bars are drawn with the segments function.
- I use the lend graphics option is used to reset the shape of the bar ends from the default round to square.
- I use the las argument to the axis function to rotate the tick mark labels so that they are printed horizontally.
#expand left margin to make room for tick labels
par(mar=c(5.1,6.1,1.1,1.1))
#change bar ends to square
par(lend=2)
plot(range(out) ,c(.75,4.25), type='n', log='x', axes=F, ylab=' ', xlim=c(.0001,1), xlab='p-value')
axis(1, at=10^(-4:0), labels=c('.0001', '.001', '.01', '.1', '1'), cex.axis=.9)
axis(2, at=1:4, labels=colnames(out), las=2, cex.axis=.9)
box()
rbind(1:4,out)->out.full
apply(out.full, 2, function(x) segments(x[2], x[1], x[6], x[1], lwd=3))
apply(out.full, 2, function(x) segments(x[3], x[1], x[5], x[1], lwd=6, col='grey70'))
abline(v=.05, col=2, lty=2)
apply(out.full ,2, function(x) points(x[4], x[1], pch=16, col='white'))
apply(out.full,2, function(x) points(x[4], x[1], pch=1, col=1, cex=1.1))
segments(.045, 1.5, .06, 1.5, lwd=12, col='white')
text(.05, 1.5, expression(alpha==.05), col=2, cex=.9)

Fig. 4 Probability smear graph summarizing the results from using bootstrapped weights
- Graphs of the sort shown in Fig. 4 have been popularized by Gelman and Hill (2006). The rationale they give for such a picture is that it gives a range of uncertainty for the statistic. They note, p. 18, that "a 50% interval is particularly easy to interpret since the true value should be as likely to be inside as outside the interval." Another way to view the display is as an uncertainty smear. Instead of treating the line segments literally as confidence intervals with all the accompanying frequentist intellectual baggage, the diagram can be thought of as contrasting a crap shoot, the 50% confidence interval, with a sure thing, the 95% interval. A reasonable compromise might lie somewhere between these two extremes.
- From the reported confidence intervals we see that although the p-values are very sensitive to the choice of weights, the conclusions we would draw from Fig. 4 are consistent with the conclusions we drew from Table 1. The 3-factor interaction and the factor 1 × factor 3 interaction are still not significant (confidence interval lies entirely to the right of α = .05), while the factor 2 × factor 3 interaction is clearly statistically significant (confidence interval lies entirely to the left of α = .05). The significance of the factor 1 × factor 2 interaction, which was ambiguous before, remains so.
- Because our goal is to validate the results reported in Table 1, it might make more sense to construct one-sided confidence intervals for this purpose: 95% lower-tailed intervals for those effects not found to be statistically significant in Table 1 and a 95% upper-tailed interval for the one effect found to be statistically significant. Fig. 4 gives us 97.5% one-tailed confidence intervals.
- From the results summarized in Fig. 4 we can be 97.5% confident that p-value for the test of fac2 × fac3 is less than .05.
- Similarly we can be 97.5% confident that the p-values for the tests of fac1 × fac3 and the three-factor interaction are greater than .05.
- At the 97.5% confidence level we are unable to say the p-value for the test of fac1 × fac2 is greater than .05.
- We can obtain 95% one-tailed tests by recalculating the percentile-based confidence intervals with a .05 probability in the appropriate tail.
apply(boot.dist[,1:3],2,function(x) quantile(x,.05))
[1] 0.62826954 0.30110470 0.05350924
quantile(boot.dist[,4],.95)
95%
0.01692254
The lower limit of the one-sided confidence interval for the fac1 × fac2 interaction p-value is 0.0535. Thus we see that 95% of the time the weights obtained through resampling yield p-values for the test of the factor 1 × factor 2 interaction that are larger than 0.05.
-
I examine the actual distribution of the p-values for the test of this interaction in Fig. 5. Notice that many of the weights yield extremely non-significant results and that the mode of the distribution is very close to the p-value of 0.09 reported in Table 1.
hist(boot.dist[,3], xlab='p-value', ylab='density', breaks=seq(0,.7,.025), main='', probability=T)
lines(density(boot.dist[,3]), col=2, lwd=2)

Fig. 5 Distribution of p-values for a test of the fac1 × fac2 interaction
using weights calculated from bootstrapped samples
- So, after having accounted for the observed variance heterogeneity by carrying out a weighted regression and having accounted for the uncertainty in the weights by using bootstrap estimates, our conclusions do not differ substantively from those we obtained using an unweighted analysis (lecture 5).
Variance Heterogeneity in a Mixed Effects Model with a Non-negative Response
Model diagnostics for the kestrel nonlinear growth model
- I reload the kestrel data set from lecture 27 and lecture 28 and refit the final Gompertz model in which the a0 parameter is affected by sex, treatment, and their interaction while the b0 parameter is influenced by just sex. As part of the preliminaries to fitting the model I also have to recalculate the various cell means for use as initial values for the various model parameters when fitting the model.
kestrel <- read.table('http://www.unc.edu/courses/2008fall/ecol/563/001/data/lab13/kestrel_data.csv', header=TRUE, sep=',', na.strings='.')
kestrel2<-kestrel[kestrel$egg==1 & !is.na(kestrel$nstlmass),]
kestrel3<-kestrel2[!is.na(kestrel2$sex),]
library(nlme)
#begin calculations to obtain initial estimates
as.numeric(names(table(kestrel2$nest)))[table(kestrel2$nest)>=4] -> good.nests
kestrel2.short<-kestrel2[kestrel2$nest %in% good.nests,]
just.enough<-kestrel2.short[,c("nstlmass", "age", "nest", "sex", "trtmnt")]
#least squares estimates from individual fits
out.nls<-nlsList(nstlmass~SSgompertz(age,a0,b0,b1)|nest, data=just.enough)
#make a data frame out of coefficient estimates
coef.mat<-data.frame(rownames(coef(out.nls)), coef(out.nls))
names(coef.mat)[1]<-'nest'
#make data frame of individual coefficient estimates
just.enough[tapply(1:nrow(just.enough), just.enough$nest, function(x) x[1]),]->abbrev
coef.nests<-merge(abbrev,coef.mat)
#initial estimates for b1
parm.means<- apply(coef(out.nls), 2, mean)
#initial estimates for a0
a0.means<-as.vector(tapply(coef.nests$a0, list(coef.nests$sex, coef.nests$trtmnt), mean))
amat<-matrix(c(1,1,1,1,0,1,0,1,0,0,1,1,0,0,0,1), ncol=4, nrow=4)
ainv<-solve(amat)
a0.guesses<-as.vector(ainv %*% a0.means)
#initial estimates for b0
tapply(coef.nests$b0, coef.nests$sex, mean) -> b0sex.means
guesses.b0sex<-c(a0.guesses, b0sex.means[1], b0sex.means[2]-b0sex.means[1], parm.means[3])
#fit model
a0b0.intsex <- nlme(nstlmass~SSgompertz(age,a0,b0,b1), fixed=list(a0~sex*trtmnt, b0~sex, b1~1), random=a0~1|nest, data=kestrel3, na.action=na.omit, start=guesses.b0sex, control=nlmeControl(maxIter=1000, niterEM=500, msMaxIter=500))
- Residual analysis is typically used to assess model adequacy in ordinary linear regression models, and the same is true with mixed models. With mixed models though there are residuals at each level of the hierarchy to consider.
- The level-1 residuals are comparable to the residuals of ordinary regression.
- The level-2 residuals are the predicted random effects.
- When the number of level-2 units is small, as it is here, the level-2 residuals are not especially useful as diagnostics. Hence I will focus on level-1 residual analysis. The level-1 residuals can be used to address a number of important questions.
- Has the level-1 model been properly specified?
- Have the fixed effects included in the final model coupled with the correlation structure induced by the random effects adequately accounted for the original correlations in the response?
- Is there evidence for variance heterogeneity across values of the level-1 predictor?
- The reported significance tests require that the conditional distribution of the response is normal. The level-1 residuals can be used to examine this assumption also.
- In addition to these standard statistical assumptions, there are some theoretical issues to consider. In particular the use of the normal distribution as a probability model for biological growth data poses some special problems
- Because we're dealing with growth, the response variable must be positive, being bounded below by zero. Initially when the organism is small the presence of this lower boundary on size will necessarily limit how much the size of organism can vary. Subsequently as the organism grows this lower boundary will become less important in limiting variability. Consequently we might expect variability to increase with age, something that is presently not accounted for in our Gompertz model. The subject-specific estimates from the mixed effects model yield an individual trajectory for each bird about which the observed responses are assumed to be normally distributed with constant variance. Thus our present model assumes that variability about the growth trajectory does not change with age.
- The second problem is a direct consequence of the same model assumptions that failed to account for variance heterogeneity. Because we are using a normal distribution for the response, positive and negative deviations from the mean trajectory are equally likely. A normal distribution with a Gompertz mean has a constant variance at all ages. Thus negative deviations when the bird is young and hence small have a non-negligible probability of yielding negative predictions of bird size. In particular, if we were to generate new data using this model it is likely that some of the simulations would produce negative values of size at the lower age classes. The problem arises because a normal probability model always has an infinite support set and hence any real value is possible. This becomes an acute problem with positive data only when the variance that was estimated for the normal distribution is such that it is likely that the boundary value will be surpassed.
Variance heterogeneity
- To check for variance heterogeneity we can proceed as with the ANOVA model above and plot model residuals against model predictions. To focus specifically on the relationship with age, we can plot model residuals against age. The method for predict when dealing with an nlme object, denoted predict.nlme, defaults to predicting the mean response at the highest level. In our case this means that it includes predictions of the nest-level random effects in its estimate of the mean response.
#Fig. 6a: plot residuals against fitted values
plot(predict(a0b0.intsex), residuals(a0b0.intsex, type='pearson'), xlab='Predicted values', ylab='Pearson residuals')
#Fig. 6b: plot residuals at each age
boxplot(residuals(a0b0.intsex,type='pearson')~factor(kestrel3$age), xlab='Age', ylab='Pearson Residuals', col='gray85', outline =F, ylim=range(residuals(a0b0.intsex,type='pearson')))
points(jitter(as.numeric(factor(kestrel3$age)), amount=.1), residuals(a0b0.intsex, type='pearson'), pch=19, cex=.6, col=4)
#add means to boxplots
points(1:6, tapply(residuals(a0b0.intsex, type='pearson'), factor(kestrel3$age), mean), pch=8, col=2)
(a)  |
(b)  |
Fig. 6 (a) Residuals from Gompertz model plotted against predicted values. There is a noticeable increase in spread from left to right particularly as reflected by the negative residual values. (b) Residuals plotted against age. The spread appears to increase with age. This is especially noticeable for the first three age classes. |
As can be seen residual variance increases both with age and the level of the predicted value. Much of the increased variability is probably due mostly to the increased spread in the negative residuals, which is largely a consequence of the model mean moving away from zero, the lower boundary restriction on the data.
Temporal correlation
- As we've discussed elsewhere, part of the rationale for including random effects in a model is to account for observational heterogeneity, the fact that observations made on the same nestling are likely to be correlated but should be uncorrelated with observations made on different nestlings. In addition to the correlation structure imposed by the random effects, the direct modeling of the mass-age relationship should also account for some of the within-nestling correlation. Still, because the observations on each nestling were made in temporal succession, something not accounted for by the random effects, correlation may still persist. Because the age measurements are equally spaced in time we can examine this possibility by plotting the empirical autocorrelation function (ACF) of the residuals.
- The nlme package provides an ACF function that when applied to an nlme model calculates the autocorrelations of the residuals at various lags. The ACF function, unlike the base acf function, automatically accounts for the fact that the observations come in clusters, one set per nest. It assumes the residuals are ordered correctly, although it supports a form argument that can be used to identify an ordering variable. I use it here, form=~age, but it's not necessary because the observations are already sorted by age within each nest.
plot(ACF(a0b0.intsex, form=~age), alpha=.05)
plot(ACF(a0b0.intsex, form=~age), alpha=.01)
(a)  |
(b)  |
Fig. 7 Autocorrelation plot of the residuals treating them in age order. Displayed confidence bands use (a) α = .05 and hence no correction for multiple testing; (b) α = .01 using a Bonferroni correction. |
- The ACF plot is shown in Fig. 7a with superimposed 95% confidence bands (not corrected for multiple testing). As we can see, there is a significant negative correlation at lag 2. In Fig. 7b a Bonferroni correction is used. For five tests this means we set α =.01. Now the negative correlation at lag 2 appears not to be statistically significant.
Normality
- The normality question is best addressed graphically with normal probability plots. Although the qqnorm function in base R can be used to produce such a plot, the qq.plot function in the car package adds some additional features to the plot that make the resulting graph easier to interpret. The car package is not part of the standard installation and needs to first be installed from a CRAN mirror site.
library(car)
qq.plot(residuals(a0b0.intsex, type='pearson'), ylab='Pearson residuals')

Fig. 8 Normal probability plot of level-1 residuals
-
Pearson residuals are a form of standardized residuals and are usually a good choice for diagnostic plots. As the normal probability plot in Fig. 5 shows, the level-1 residuals appear to be far from normal. The ordered residuals should closely hug the plotted line and all observations should fall within the displayed confidence bands. Instead we see a distinct curvilinear trend to the residuals indicating a skewed distribution.
- Thus although graphs of model trajectories we produced last time suggest that the model provides a reasonable fit to the data, the diagnostics discussed above say otherwise.
An ad hoc approach to dealing with variance heterogeneity
-
The violation of normality, the presence of variance heterogeneity, and the temporal autocorrelation of the residuals may derive from a common cause. In producing an autocorrelation plot it is assumed that the variance of the residuals is constant in order that the estimated correlation only depends on the time lag. Since we have already seen that the variance is not constant across age, this may be leading to the significant lag-2 autocorrelation.
- What I call here the "ad hoc approach" handles all of these problems by applying a Band-Aid to the normal distribution. Instead of rejecting the normal distribution as being an inadequate description of the data at hand, we extend it slightly to accommodate the vagaries of the data. We do this by relaxing the constant variance assumption and allow the variance of the normal distribution to change as a function of some measured variable. In this new model instead of having a sequence of constant variance normal distributions centered on the Gompertz curve for growth, the width of these normal distributions is allowed to vary in a systematic way.
-
The nlme package of R includes a suite of variance functions for modeling heteroscedasticity in mixed effects models. These functions are used to define a formula for the variance that in turn is entered as the value of the weights= argument of lme or nlme. I focus on functions that model the variance as a function of a covariate, which in our case will be age. The relevant functions are:
- varFixed in which the within-group variance is known up to a proportionality constant,
- varPower in which the variance is an estimated power of a covariate,
- varConstPower in which a constant is added in the power function, and
- varExp which models variance as an exponential function of a covariate.
- The variance models are used to create weights for each observation which are then incorporated into the model fit.
A plot of the residual variance in each age category versus age suggests a linear relationship might suffice (Fig. 8).
tapply(residuals(a0b0.intsex, type='pearson'), kestrel3$age,var) -> vars
plot(as.numeric(names(vars)), vars, xlab='Age', ylab='Residual Variance')
lines(lowess(vars~as.numeric(names(vars))), lty=1, col=2)
abline(lm(vars~as.numeric(names(vars))), lty=2, col=4)
legend(5, 130, c('lowess', 'linear regression'), col=c(2,4), lty=c(1,2), cex=.9, bty='n')

|
Fig. 9 Variance of the Pearson residuals at each level of age with locally weighted (lowess) and linear regression lines superimposed. |
- I try each of the variance models in turn and compare them using AIC. Table 2 summarizes the results. Note: To ensure convergence of the estimation algorithms I’ve rescaled the variable age so that its values are the integers 1 through 6, rather than the original units of 0, 5, 10, 15, 20, and 25. The new variable is denoted in the R code as ageclass. Each of these variance models was added to the Gompertz model a0b0.intsex. The use of the notation form= is optional for the variance function varFixed, but is mandatory for those variance functions that take multiple arguments.
kestrel3$ageclass<-as.numeric(factor(kestrel3$age))
var1<-update(a0b0.intsex, weights=varFixed(~ageclass))
var2<-update(a0b0.intsex, weights=varPower(form=~ageclass))
var3<-update(a0b0.intsex, weights=varConstPower(form=~ageclass))
var4<-update(a0b0.intsex, weights=varExp(form=~ageclass))
sapply(list(a0b0.intsex, var1, var2, var3, var4), AIC)
600.0304 578.2691 557.2259 559.2259 570.5778
Table 2 A comparison of the variance models
Variance Model |
R function |
K |
AIC |
None |
— |
9 |
600.03 |

|
varFixed |
9 |
578.27 |

|
varPower |
10 |
557.23 |

|
varConstPower |
11 |
559.23 |

|
varExp |
10 |
570.57 |
The AIC-best model expresses the variance as a power of age. Diagnostic plots for this model are shown in Fig. 10 below.
#Fig 10a: plot residuals against fitted values
plot(predict(var2), residuals(var2, type='pearson'), xlab='Predicted values', ylab='Pearson residuals')
#Fig 10b: plot residuals at each age
boxplot(residuals(var2, type='pearson')~factor(kestrel3$age), xlab='Age', ylab='Pearson Residuals', col='gray85', outline =F, ylim=range(residuals(var2, type='pearson')))
points(jitter(as.numeric(factor(kestrel3$age)), amount=.1), residuals(var2, type='pearson'), pch=19, cex=.6, col=4)
points(1:6, tapply(residuals(var2, type='pearson'), factor(kestrel3$age), mean), pch=8, col=2)
#Fig 10c: autocorrelation plot
plot(ACF(var2, form=~age), alpha=.05)
#Fig 10d: normal probability plot of the residuals
qq.plot(residuals(var2, type='pearson'), ylab='Pearson residuals')
- The diagnostic plots are much improved. The slight decrease in variability for the older ages in Fig 10(b) is probably due to the decrease in sample size as few birds survived this long. Abbreviated output from the summary function below gives the parameter estimates for this model. The exponent of the ageclass variable is estimated to be 1.6.
summary(var2)
Nonlinear mixed-effects model fit by maximum likelihood
Model: nstlmass ~ SSgompertz(age, a0, b0, b1)
Data: kestrel3
AIC BIC logLik
557.226 581.1704 -268.6130
Random effects:
Formula: a0 ~ 1 | nest
a0.(Intercept) Residual
StdDev: 0.003781860 1.509470
Variance function:
Structure: Power of variance covariate
Formula: ~ageclass
Parameter estimates:
power
1.609958
Fixed effects: list(a0 ~ sex * trtmnt, b0 ~ sex, b1 ~ 1)
Value Std.Error DF t-value p-value
a0.(Intercept) 160.96021 22.700680 57 7.09055 0.0000
a0.sexm -43.64042 15.471584 57 -2.82068 0.0066
a0.trtmntcontrol 7.50702 8.820041 57 0.85113 0.3983
a0.sexm:trtmntcontrol 26.60904 12.220438 57 2.17742 0.0336
b0.(Intercept) 2.74664 0.137231 57 20.01480 0.0000
b0.sexm -0.21784 0.125478 57 -1.73612 0.0879
b1 0.91324 0.008117 57 112.51116 0.0000
- Observe that the gender effect on b0 is no longer significant at α = .05. If I drop this term, AIC still argues that it be retained, although weakly.
guesses.var2a<-c(a0.guesses, parm.means[2:3])
var2a <- nlme(nstlmass~SSgompertz(ageclass,a0,b0,b1), fixed=list(a0~sex*trtmnt, b0~1, b1~1), random=a0~1|nest, data=kestrel3, na.action=na.omit, start=guesses.var2a, weights=varPower(form=~ageclass))
sapply(list(var2,var2a),AIC)
557.2259 558.3585
- Interestingly the variance component for the random effects is now negligible (relative to the level-1 variance), suggesting that once the variance has been explicitly modeled the random effects may no longer be needed.
#Variance components of model with modeled variance
VarCorr(var2)
nest = pdLogChol(list(a0 ~ 1))
Variance StdDev
1.430247e-05 0.003781860
Residual 2.278501e+00 1.509470331#Variance components of original model
VarCorr(a0b0.intsex)
nest = pdLogChol(list(a0 ~ 1))
Variance StdDev
174.74069 13.218952
Residual 54.49631 7.382162
- There may no longer be enough information in the data set to justify letting the asymptote of the model vary by nest. We can try dropping the random effects (while still keeping the variance structure). To do so we use the gnls function of the nlme package. The syntax is outlined in the call shown below. It is identical to the nlme call above, creating model var2a, except the random argument is omitted.
var2.gnls <- gnls(nstlmass~SSgompertz(ageclass,a0,b0,b1), params=list(a0~sex*trtmnt, b0~1, b1~1), data=kestrel3, na.action=na.omit, start=guesses.var2a, weights=varPower(form=~ageclass))
sapply(list(var2,var2.gnls),AIC)
557.2259 556.3585
- Based on the reported AIC values we could further simplify the model by dropping the random effects.
A direct approach to dealing with variance heterogeneity: a gamma probability model
-
While modeling the variance explicitly has corrected the problems observed in the residual plots, one could argue that this is a patchwork approach. The problem with the original model is that one of its basic assumptions is wrong. We began by assuming that the level-1 residuals were normally distributed with constant variance. We were forced to relax the constant variance assumption and allow the variance to increase with age. As a result what remains is not a single normal distribution for the residuals but a whole family of normal distributions indexed by age. A conceptually simpler approach is to abandon both the constant variance and normal distribution assumptions and instead choose a probability distribution in which the variance is already an increasing function of the mean. One such distribution is the gamma distribution.
We discussed this distribution in lecture 3.
-
The gamma distribution is often used as a prior probability for precisions (reciprocal of variances) in Bayesian models because the support of the gamma distribution is restricted to the positive real numbers. Like the normal distribution, the gamma is a two-parameter distribution. Its parameters are referred to variously as shape and scale. One common parameterization of the gamma distribution is the following.

Here
is the gamma function, the continuous analog of the factorial function. In the usual parameterization of the gamma distribution the variance is proportional to the square of the mean.
R supports the parameterization shown above. It calls α the shape parameter but refers to β as the rate parameter (with one over β being called the scale).
- An even better rationale for using a gamma distribution with these data is that it has positive support. In the formula given above the value of the density is nonzero only for x > 0. In other words when we use a gamma distribution as a probability model for growth, our model predictions will be constrained to be positive. This was not the case for the normal distribution where because of its symmetry and constant variance about the mean, negative values were possible with our estimated Gompertz growth model.
- Note: We simultaneously addressed the problem of negative predictions when we chose a variance model for the normal distribution that was a function of age. Because the variance was modeled as being positively monotonic with age, as age approaches zero so does the variance. As the variance shrinks in value, negative values become more unlikely (as do positive deviations from the mean). Thus when the mean is small the normal model with heterogeneous variance exhibits hardly any variability at all.
- The gamma distribution handles this differently. Typically as the gamma mean approaches zero the gamma distribution becomes asymmetric with a short lower tail (truncated at zero) and a longer upper tail. When the mean approaches zero the gamma variance can be large or small, but if it is large the variability is one-sided. Observations are permitted to vary above the mean, but not below.
- Thus an alternative to fitting a heterogeneous variances model coupled with a normal distribution as was done in the previous section, is to fit a Gompertz mixed effects model in which the probability model used for the response is a gamma distribution. As of this writing there is no way to fit such a model using R. The lme4 package should eventually have this capability. Currently the gamma distribution is permitted as a value of the family argument of the lmer (although the correctness of the output is questionable), but lmer cannot be used to fit nonlinear models in which predictors are used to model individual parameters. For the moment then we need to turn to other resources in order to fit this model.
- As is often the case, when frequentist methods come up short in fitting mixed effects models, the Bayesian approach is there to bail us out. I illustrate next how to fit a Gompertz mixed effect model with a gamma-distributed response using WinBUGS. Because of the great power and utility of SAS in fitting nonlinear mixed effects models, I also demonstrate how to get data from R into SAS after which I use SAS's Proc NLMixed to fit the desired model.
Using WinBUGS
Fitting the normal Gompertz mixed effects model in WinBUGS
- A good place to start is by refitting the Gompertz model with normal errors using WinBUGS.
- Because we can compare our results directly with the nlme results obtained previously, we will be able to check the correctness of our code. Modifying the BUGS code to let the response have a gamma distribution should then be easy.
- The likelihood WinBUGS uses in calculating its deviance statistic for mixed models is not the same as the marginal likelihood that is reported by nlme. (See lecture 26 for more information on this.) Thus we need the value of the deviance for a normal model as calculated by WinBUGS as a baseline value against which we can compare the deviance of the gamma regression model.
- The code for fitting the Gompertz model with a normal response is just the basic random intercepts model we considered in lecture 23, so I present it below without additional comment except for the following.
- For clarity I include an assignment statement for each Gompertz parameter, even for those that are constants and do not vary across nests.
- In the formula for the Gompertz function I use the BUGS pow function to generate the exponentiated term that appears inside the exponential function.
- Because the parameter mu.b1 is the base of an exponent, both its prior distribution and its initial value must be non-negative. In the code below I use a normal prior for mu.b1 without incident.
Gompertz model with normal errors: "gompertz bugs.txt" |
|
model{
#normal likelihood--level 1 model
for(i in 1:n) {
y[i]~dnorm(y.hat[i], tau.y)
#Gompertz mean
y.hat[i]<-a0[nest[i]]*exp(-b0[nest[i]]*pow(b1[nest[i]],age[i]))
}
#priors for level-1 parameters
tau.y<-pow(sigma.y,-2)
sigma.y~dunif(0,100)
#level-2 model for Gompertz parameters
for(j in 1:J){
a0[j]~dnorm(a.hat[j],tau.a)
a.hat[j]<-mu.a+g1*sex[j]+g2*trt[j]+g3*sextrt[j]
b0[j]<-mu.b0+g4*sex[j]
b1[j]<-mu.b1
}
#priors for level-2 parameters
mu.a~dnorm(0,.0001)
tau.a<-pow(sigma.a,-2)
sigma.a~dunif(0,1000)
mu.b0~dnorm(0,.0001)
mu.b1~dnorm(0,.0001)
g1~dnorm(0,.0001)
g2~dnorm(0,.0001)
g3~dnorm(0,.0001)
g4~dnorm(0,.0001)
#extra derived quantities
a0.femtrt<-mu.a
a0.femcontrol<-mu.a+g2
a0.maletrt<-mu.a+g1
a0.malecontrol<-mu.a+g1+g2+g3
b0.fem<-mu.b0
b0.male<-mu.b0+g4
} |
- I load the arm package and set the directory to where the WinBUGS model definition file "gompertz bugs.txt" is located.
setwd('D:\\Ecology 563 Fall 2008\\lecture29')
library(arm)
- I generate the variables y, age, nest, n, and J that appear in the above BUGS program from the data frame kestrel3.
y<-kestrel3$nstlmass
age<-kestrel3$age
n<-length(y)
nest<-as.numeric(factor(kestrel3$nest))
J<-length(unique(kestrel3$nest))
- I next create the level-2 predictors of the Gompertz parameters. Remember that we need to create these so that there is only one value per nest. The variable nest plays the role of the level-2 structural variable.
sex<-as.numeric(tapply(kestrel3$sex,kestrel3$nest, function(x) x[1]))-1
trt<-as.numeric(tapply(kestrel3$trt,kestrel3$nest, function(x) x[1]))-1
sextrt<-sex*trt
- The next three lines create the arguments needed by the bugs function. Choices for the initial values are guided by our nlme results as well as natural restrictions on certain of the parameters, such as b1.
kestrel.data<-list("y", "age", "n", "nest", "J", "sex", "trt", "sextrt")
kestrel.inits<-function() {list(a0=rnorm(J), mu.a=runif(1), g1=rnorm(1), g2=rnorm(1), g3=rnorm(1), g4=rnorm(1), mu.b0=runif(1), mu.b1=runif(1), sigma.y=runif(1), sigma.a=runif(1))}
kestrel.parameters<-c("a0", "mu.a", "mu.b0", "mu.b1", "g1", "g2", "g3", "g4", "sigma.y", "sigma.a")
kestrel.out<-bugs(kestrel.data,kestrel.inits,kestrel.parameters, "gompertz bugs.txt", n.chains=3, bugs.directory="C:/Program Files/WinBUGS14/", n.iter=10, debug=T)
- I run the model until the chains are mixing and the diagnostics look good.
kestrel.out<-bugs(kestrel.data,kestrel.inits,kestrel.parameters, "gompertz bugs.txt", n.chains=3, bugs.directory="C:/Program Files/WinBUGS14/", n.iter=1000, debug=T)
kestrel.out2<-bugs(kestrel.data,kestrel.out$last.values,kestrel.parameters, "gompertz bugs.txt", n.chains=3, bugs.directory="C:/Program Files/WinBUGS14/", n.iter=10000, debug=T)
max(kestrel.out2$summary[,"Rhat"])
[1] 1.018114
min(kestrel.out2$summary[,"n.eff"])
[1] 110
kestrel.out3<-bugs(kestrel.data, kestrel.out2$last.values, kestrel.parameters, "gompertz bugs.txt", n.chains=3, bugs.directory="C:/Program Files/WinBUGS14/", n.iter=20100, n.burnin=100, debug=T)
min(kestrel.out3$summary[,"n.eff"])
[1] 15
max(kestrel.out3$summary[,"Rhat"])
[1] 1.192872
kestrel.out4<-bugs(kestrel.data, kestrel.out3$last.values, kestrel.parameters, "gompertz bugs.txt", n.chains=3, bugs.directory="C:/Program Files/WinBUGS14/", n.iter=50100, n.burnin=100, debug=T)
max(kestrel.out3$summary[,"Rhat"])
[1] 1.081149
min(kestrel.out3$summary[,"n.eff"])
[1] 70
- I carry out one more run in which I return some extra derived quantities created in the BUGS program.
kestrel.parameters2<-c("mu.a", "g1", "g2", "g3", "g4", "sigma.y", "sigma.a","a0.femtrt","a0.femcontrol", "a0.maletrt", "a0.malecontrol", "b0.fem", "b0.male")
kestrel.out5<-bugs(kestrel.data, kestrel.out3$last.values, kestrel.parameters2, "gompertz bugs.txt", n.chains=3, bugs.directory="C:/Program Files/WinBUGS14/", n.iter=100100, n.burnin=100, debug=T)
Graph of parameter estimates: frequentist versus Bayesian
- To compare the frequentist with the Bayesian results I generate a probability smear graph of the effect estimates for predictors that appear in the equations for the asymptote, a0, and the parameter b0.
#fig 11a
#create a matrix of Bayesian results
sum.table1<-cbind((1:3)+.075, kestrel.out5$summary[2:4,"50%"], kestrel.out5$summary[2:4,"2.5%"], kestrel.out5$summary[2:4,"97.5%"], kestrel.out5$summary[2:4,"25%"], kestrel.out5$summary[2:4,"75%"])
#labels for y-axis
b.names<-c('male', 'control', expression('male' %*%'control'))
#expand left margin
par(mar=c(5.1,7.1,2.1,1.1))
#change bar ends to square
par(lend=2)
#set up plot
plot(range(sum.table1[,2:4]), c(1,3), type='n', axes=F, ylab='', xlab=expression(paste('Effect on ',a[0])), ylim=c(.5,3.75))
axis(1,cex.axis=.8)
axis(2,at=1:3, labels=b.names, cex.axis=.8, las=2)
#draw bayesian results
apply(sum.table1, 1, function(x) segments(x[3], x[1], x[4], x[1], lwd=4, col='grey75'))
apply(sum.table1, 1, function(x) segments(x[5], x[1], x[6], x[1], lwd=6, col='grey45'))
apply(sum.table1, 1, function(x) points(x[2], x[1], pch=16, col='white', cex=1.3))
apply(sum.table1, 1, function(x) points(x[2], x[1], pch=1, cex=1.4))
box()
abline(v=0,col=2,lty=2)
mtext('Female, Treatment',at=0,side=3,line=.5,cex=.8)
#create a matrix of frequentist results
sumtable.nlme1<-cbind((1:3)-.075, summary(a0b0.intsex)$tTable[2:4,1], summary(a0b0.intsex)$tTable[2:4,2])
nlme.df<-summary(a0b0.intsex)$tTable[2,3]
#draw frequentist results
apply(sumtable.nlme1,1, function(x) segments(x[2]+qt(.025,nlme.df)*x[3], x[1], x[2]+qt(.975,nlme.df)*x[3], x[1], lwd=4, col='lightpink'))
apply(sumtable.nlme1,1, function(x) segments(x[2]+qt(.25,nlme.df)*x[3], x[1], x[2]+qt(.75,nlme.df)*x[3], x[1], lwd=6, col='red'))
apply(sumtable.nlme1,1,function(x) points(x[2], x[1], pch=16, col='white', cex=1.3))
apply(sumtable.nlme1,1, function(x) points(x[2], x[1], pch=1, cex=1.4))
legend(-80,3.85, c('Bayesian','frequentist'), col=c('grey45','red'), lty=1, lwd=4, bty='n', cex=.8)
#fig 11b
sum.table2<-c(1+.025, kestrel.out5$summary[5,"50%"], kestrel.out5$summary[5,"2.5%"], kestrel.out5$summary[5,"97.5%"], kestrel.out5$summary[5,"25%"], kestrel.out5$summary[5,"75%"])
sumtable.nlme3<-c(1-.025, summary(a0b0.intsex)$tTable[6,1], summary(a0b0.intsex)$tTable[6,2])
nlme.df<-summary(a0b0.intsex)$tTable[6,3]
b.names<-'male'
par(mar=c(5.1,5.1,2.1,1.1))
plot(c(min(c(sum.table2[2:4], sumtable.nlme3[2]+ qt(.025,nlme.df)*sumtable.nlme3[3])), .10), c(.5,1.75), type='n', axes=F, ylab='', xlab=expression(paste('Effect on ', b[0])), ylim=c(.85,1.25))
axis(1,cex.axis=.8)
axis(2,at=1, labels=b.names, cex.axis=.8, las=2)
#Bayesian
x<-sum.table2
segments(x[3], x[1], x[4], x[1], lwd=4, col='grey75')
segments(x[5], x[1], x[6], x[1], lwd=6, col='grey45')
points(x[2],x[1],pch=16,col='white', cex=1.1)
points(x[2],x[1],pch=1, cex=1.2)
box()
abline(v=0,col=2,lty=2)
mtext('Female',at=0,side=3,line=.5,cex=.8)
#frequentist
x<-sumtable.nlme3
segments(x[2]+qt(.025,nlme.df)*x[3], x[1], x[2]+qt(.975,nlme.df)*x[3], x[1], lwd=4, col='lightpink')
segments(x[2]+qt(.25,nlme.df)*x[3], x[1], x[2]+qt(.75,nlme.df)*x[3], x[1], lwd=6, col='red')
points(x[2],x[1],pch=16,col='white', cex=1.1)
points(x[2],x[1],pch=1, cex=1.2)
legend(-1.3, 1.25, c('Bayesian', 'frequentist'), col=c('grey45', 'red'), lty=1, lwd=4, bty='n', cex=.8)
#reset line ends back to the default value
par(lend=0)
(a)  |
(b)  |
Fig. 11 Bayesian and frequentist estimates for the effect parameters. In (a) we have estimates for the two main effects, sex and treatment, and their interaction in the model for a0 and in (b) just the main effect of sex in the b0 equation. The y-axis is labeled by the effect that is represented. Recall that each "effect" is measured relative to the reference group, "female, treatment" (indicated as the vertical red dashed line at 0). Displayed are 95% confidence (credibility) intervals, 50% confidence (credibility) intervals, and point estimates (mean or median of posterior distribution). |
- There is fairly close agreement in the two sets of results. Notice though that if we were married to the α = .05 cut-off for significance testing then the Bayesian and frequentist interval estimates would lead us to different conclusions about the importance of the effect of the interaction term, male × control, on a0.
Graph of the treatment-sex combination estimates: frequentist versus Bayesian
- We can perhaps get more interpretable results by displaying the results for individual groups, rather than for effects. For the Bayesian approach these effects are already calculated in the BUGS program (in the section labeled "extra derived quantities"). These were then added to the parameter list of the last bugs run, so that we already have the posterior distributions for these quantities from which we can compute credibility intervals.
- For the frequentist approach we will have to calculate the standard errors needed for the confidence intervals ourselves. In lecture 28 we obtained formulas for the individual group means in terms of the parameters of the regression models by plugging in appropriate values for the predictors. For instance, in our Gompertz model the formula for the mean value of the asymptote parameter is

For "male controls" the values of the predictors are sex = 1 and trtmnt = 1. Thus we have

- This can be written as the following vector dot product.

where 1 is the displayed vector of ones and
is the vector of parameter estimates.
- This is an instance of a linear combination of the elements of
. More generally a linear combination is

where the elements c1, c2, c3, and c4 are typically real numbers and are referred to generically as scalars. In the formula for mean asymptote of the male controls we had c1 = 1, c2 = 1, c3 = 1, and c4 = 1.
- Now, it's a basic result from multivariate statistics that if

is the variance-covariance matrix of the parameter estimates
, then the variance of
is the following.

The expression on the right is called a quadratic form. Because we can extract the parameter variance-covariance matrix from a model object with R's vcov function, we can calculate the variance of any linear combination of parameter estimates we desire. Because we've estimated seven regression parameters (four for a0, two for b0, and one for b1), vcov(a0b0.intsex) returns a 7 × 7 covariance matrix. Thus our scalar vector c will contain 7 components. When we desire the estimate variance of a group mean for the b0 parameter, say, then necessarily the components of c corresponding to the a0 and b1 parameters should be set to zero.
- With this background I next display probability smear graphs for the estimated individual mean values of a0 and b0 for the various sex-treatment combinations that are allowed to be distinct in our final model.
#Fig. 12a
#extract derived quantities from WinBUGS output
sum.table<-cbind((1:4)+.075, kestrel.out5$summary[8:11,"50%"], kestrel.out5$summary[8:11,"2.5%"], kestrel.out5$summary[8:11,"97.5%"], kestrel.out5$summary[8:11,"25%"], kestrel.out5$summary[8:11,"75%"])
b.names<-c('female treatment', 'female control', 'male treatment', 'male control')
par(mar=c(5.1,7.1,2.1,1.1))
par(lend=2)
plot(range(sum.table[,2:4]), c(1,3), type='n', axes=F, ylab='', xlab=expression(paste('Estimate of ',a[0])), ylim=c(.5,4.75))
axis(1,cex.axis=.8)
axis(2,at=1:4, labels=b.names, cex.axis=.8, las=2)
#Bayesian results
apply(sum.table,1, function(x) segments(x[3], x[1], x[4], x[1], lwd=4, col='grey75'))
apply(sum.table,1, function(x) segments(x[5], x[1], x[6], x[1], lwd=6, col='grey45'))
apply(sum.table, 1, function(x) points(x[2], x[1], pch=16, col='white', cex=1.3))
apply(sum.table, 1, function(x) points(x[2], x[1], pch=1, cex=1.4))
box()
#frequentist calculations
#linear combinations for means of treatment-sex combinations
a0.femtreat<-c(1,0,0,0,0,0,0) %*% fixef(a0b0.intsex)
a0.maletreat<-c(1,1,0,0,0,0,0) %*% fixef(a0b0.intsex)
a0.femcontrol<-c(1,0,1,0,0,0,0) %*% fixef(a0b0.intsex)
a0.malecontrol<-c(1,1,1,1,0,0,0) %*% fixef(a0b0.intsex)
#quadratic forms for variance extraction
sd.a0.femtreat<-sqrt(c(1,0,0,0,0,0,0) %*% vcov(a0b0.intsex)%*% c(1,0,0,0,0,0,0))
sd.a0.maletreat<-sqrt(c(1,1,0,0,0,0,0) %*% vcov(a0b0.intsex)%*% c(1,1,0,0,0,0,0))
sd.a0.femcontrol<-sqrt(c(1,0,1,0,0,0,0) %*% vcov(a0b0.intsex) %*% c(1,0,1,0,0,0,0))
sd.a0.malecontrol<-sqrt(c(1,1,1,1,0,0,0) %*% vcov(a0b0.intsex) %*% c(1,1,1,1,0,0,0))
sumtable.nlme<-cbind((1:4)-.075, c(a0.femtreat, a0.femcontrol, a0.maletreat, a0.malecontrol), c(sd.a0.femtreat, sd.a0.femcontrol, sd.a0.maletreat, sd.a0.malecontrol))
a0b0.intsex$fixDF$X[1]->nlme.df
#frequentist
apply(sumtable.nlme,1, function(x) segments(x[2]+qt(.025,nlme.df)*x[3], x[1], x[2]+qt(.975,nlme.df)*x[3], x[1], lwd=4, col='lightpink'))
apply(sumtable.nlme,1, function(x) segments(x[2]+qt(.25,nlme.df)*x[3], x[1], x[2]+qt(.75,nlme.df)*x[3], x[1], lwd=6, col='red'))
apply(sumtable.nlme, 1, function(x) points(x[2], x[1], pch=16, col='white', cex=1.3))
apply(sumtable.nlme, 1, function(x) points(x[2], x[1], pch=1, cex=1.4))
legend(40,4.85,c('Bayesian', 'frequentist'), col=c('grey45', 'red'), lty=1, lwd=4, bty='n', cex=.8)
#Fig. 12b
#extract derived quantities from WinBUGS output
sum.table4<-cbind((1:2)+.075, kestrel.out5$summary[12:13,"50%"], kestrel.out5$summary[12:13,"2.5%"], kestrel.out5$summary[12:13,"97.5%"], kestrel.out5$summary[12:13,"25%"], kestrel.out5$summary[12:13,"75%"])
#frequentist calculations
#linear combinations for means of treatment-sex combinations
b0.fem<-c(0,0,0,0,1,0,0) %*% fixef(a0b0.intsex)
b0.male<-c(0,0,0,0,1,1,0 )%*% fixef(a0b0.intsex)
#quadratic forms for variance extraction
sd.b0.fem<-sqrt(c(0,0,0,0,1,0,0) %*% vcov(a0b0.intsex) %*%c (0,0,0,0,1,0,0))
sd.b0.male<-sqrt(c(0,0,0,0,1,1,0) %*% vcov(a0b0.intsex) %*% c(0,0,0,0,1,1,0))
sumtable.nlme4<-cbind((1:2)-.075, c(b0.fem,b0.male), c(sd.b0.fem,sd.b0.male))
a0b0.intsex$fixDF$X[1]->nlme.df
b.names<-c('female', 'male')
par(mar=c(5.1,4.1,2.1,1.1))
par(lend=2)
plot(range(range(cbind(sum.table4[,2:4]), sumtable.nlme4[,2]+ qt(.025,nlme.df)* sumtable.nlme4[,3])), c(1,2), type='n', axes=F, ylab='', xlab=expression(paste('Estimate of ',b[0])), ylim=c(.65,2.9))
axis(1,cex.axis=.8)
axis(2,at=1:2, labels=b.names, cex.axis=.8, las=2)
#Bayesian graph
apply(sum.table4,1, function(x) segments(x[3], x[1], x[4], x[1], lwd=4, col='grey75'))
apply(sum.table4,1, function(x) segments(x[5], x[1], x[6], x[1], lwd=6, col='grey45'))
apply(sum.table4, 1, function(x) points(x[2], x[1], pch=16, col='white', cex=1.1))
apply(sum.table4, 1, function(x) points(x[2], x[1], pch=1, cex=1.2))
box()
#frequentist graph
apply(sumtable.nlme4, 1, function(x) segments(x[2]+qt(.025,nlme.df)*x[3], x[1], x[2]+qt(.975,nlme.df)*x[3], x[1], lwd=4, col='lightpink'))
apply(sumtable.nlme4, 1, function(x) segments(x[2]+qt(.25,nlme.df)*x[3], x[1], x[2]+qt(.75,nlme.df)*x[3], x[1], lwd=6, col='red'))
apply(sumtable.nlme4, 1, function(x) points(x[2], x[1], pch=16, col='white', cex=1.1))
apply(sumtable.nlme4, 1, function(x) points(x[2], x[1], pch=1, cex=1.2))
legend(3, 3, c('Bayesian', 'frequentist'), col=c('grey45', 'red'), lty=1, lwd=4, bty='n', cex=.8)
#reset line ends back to the default value.
par(lend=0)
(a)  |
(b)  |
Fig. 12 Bayesian and frequentist estimates for the estimated group means. In (a) we have estimates of mean a0 for each combination of sex and treatment. In (b) we have the estimated mean b0 for the two sexes. Displayed are 95% confidence (credibility) intervals, 50% confidence (credibility) intervals, and point estimates (mean or median of the posterior distribution). |
- As Fig. 12 shows, the Bayesian and frequentist results are reasonably close, indicating that we've probably formulated the BUGS model correctly. We're ready to make the switch to a gamma distribution for the response.
Using WinBUGS—gamma regression model 1
- While most of the BUGS code for the normal error model carries over without modification to a model with a gamma-distributed response, there are a few new wrinkles.
- In our regression model the Gompertz curve represents the mean of the level-1 probability distribution, now assumed to be gamma. At each point of the curve the observed value of the response variable is assumed to have a gamma distribution with mean given by the curve. The formula for the gamma distribution as given above is not parameterized in terms of its mean. Thus in order to fit a gamma regression model we will need to express the mean of the gamma distribution in terms of shape and rate parameters, α and β, of the gamma distribution and substitute it into our formula for the gamma density.
- The mean of a gamma distribution must be positive (because the support of the gamma distribution is positive). Thus we will need to constrain the priors for some of our parameters to guarantee that during the simulation its Markov chain doesn't accidentally wander off into an area that leads temporarily to negative estimates for the mean. If this happens, WinBUGS will grind to halt and return a trap error.
- After substituting the mean into the formula of the gamma density, the remaining parameter of the gamma distribution, be it α or β, must also be positive so we will need to choose an appropriate uninformative prior that accounts for this fact.
Parameterization of the Gamma Distribution
- For the parameterization of the gamma distribution used in R and WinBUGS (and whose formula was given above), the mean of the distribution is

Thus we can eliminate the shape parameter, α, and rewrite it in terms of the mean and rate parameters, μ and β. This is one of two possible parameterizations we could use. We'll consider the second parameterization later.
Non-negativity Constraints on Parameters
- In generalized linear models the usual way to constrain a mean to be non-negative is by using a log link. (With a log link the mean is recovered by exponentiating the linear predictor and hence is positive). A side effect of using a log link is that parameters that affect the log mean in an additive fashion will affect the raw mean in a multiplicative fashion. We could use a log link for the mean of each Gompertz parameter but there is some advantage in sticking with an additive model so that we can directly compare our results and parameter estimates to the normal model that was fit previously.
- Without a log link we'll need some other way to constrain the parameters in our WinBUGS model to be positive. One possibility is to use the Bugs censoring function, I(lower,upper), which has two arguments here labeled lower and upper. These two arguments are used to place lower and/or upper bounds on a distribution. For instance, if we want to place a normal prior on a parameter a, but we also want to exclude any negative values obtained during sampling, we can write the following.
a ~ dnorm(0, .001)I(0,)
Here the lower argument is set to zero. The fact that the upper argument is not specified means the upper bound is infinite. The results is that dnorm(0, .001)I(0,) forces negative values that are generated from the specified normal distribution to be censored as zeroes. The BUGS manual stresses that this construction does not formally yield a truncated probability distribution. Consequently it should not be used when truncation is truly desired.
- I've used the I( , ) function in two distinct ways in the BUGS code. One way is consistent with the BUGS manual recommendations, but the second way is not.
- In the statement a ~ dnorm(0, .001)I(0,) no unknown parameter values appear. Both parameters of dnorm are fully specified and are known. Consequently this construction has no substantive impact on the likelihood. The specification only serves to ensure that no negative values are sampled for a.
- In the statement a0[j]~dnorm(a.hat[j],tau.a)I(0, ) The I(0,) function is being used to ensure that a negative value is never generated for a0, If a0 was negative then the mean of the gamma distribution of the response would also be negative, an impossible situation. Unlike the first case, here both parameters of dnorm are unknown and need to be estimated. Consequently, the choice of distribution will dictate the manner in which each a0 contributes to the likelihoods for tau.a and the parameters that make up a.hat[j]. By using an inappropriate distribution we affect the posterior distribution that will be obtained for these parameters.
- So, a normal distribution is not an appropriate distribution in either case and we really should use a truncated normal distribution (available as the WBDev shared component djl.dnorm.trunc that is available online) or even a gamma distribution. However, it is only in the second case that our choice may actually affect the results.
- My rationale for inappropriately using the I( , ) function in the second case is threefold.
- The problem is already complicated enough without introducing a WinBUGS distribution contributed by users, djl.dnorm.trunc, or using a gamma distribution a second time.
- I want the starting assumptions of the Bayesian model to differ as little as possible from what we did in the nlme implementation.
- If we had used a truncated normal or gamma distribution for a0[j], the results would not have appreciably changed.
- Let me expand on this last point a little bit. Because the sample estimates of a.hat[j] are large for all j and the value of tau.a (the precision) is also large, the probability of obtaining a negative value for a0[j] is negligible. Although negligible, unfortunately it's not zero. Hence when we run the code without the censoring function WinBUGS invariably returns a trap error at some point while generating samples causing the program to crash. These samples occur so infrequently that if we were to eliminate them by changing to a more appropriate probability model for a0[j] or just evade them by keeping the normal model and discarding the offending values when they arise (as we're currently doing), the effect on the posterior distribution is minimal.
Priors for Gamma Distribution Parameters
- When previously working with variances in WinBUGS, we've obtained a vague prior by letting the log of the standard deviation have a diffuse normal prior, defined as being a normal distribution with a small value for the precision parameter (a large variance). This then translated into a vague prior for the variance. Another way to achieve a vague prior for a positive random variable is to use a gamma distribution with small values for both the shape and rate parameters. A typical choice is dgamma(.001, .001). When graphed this looks anything but uninformative because it has a large spike at the origin. What makes it vague though is its large variance. The variance of a gamma distribution is given by

The choice of α = .001 and β = .001 yields a gamma distribution with mean = 1 and variance = 1000. Consequently these choices permit a very wide range of positive values of the parameter to be possible. Note: For the current problem this range turns out to be too wide as it led to occasional trap errors, so I used a slightly more informative dgamma(.01, .01) prior for the rate parameter β.
- Another way to understand this choice of prior is to observe that β is playing a role analogous to a precision parameter in the formula for the gamma variance. As the formula shows β and the variance are inversely related.

As β decreases, the variance increases and hence the precision decreases. Likewise when β increases the precision also increases. Thus when we use a prior for β that places most of its mass near the origin this is roughly equivalent to placing a prior on the precision that is concentrated near zero. Note: Although we haven't done so in this course, it is common to use a prior such as dgamma(.001, .001) to model the precision parameter in a normal distribution. See, e.g., McCarthy (2007).
- With this background, here's what the BUGS code for the Gompertz model looks like when modified so that the response has a gamma distribution. The code shown in blue indicate changes from the normal model BUGS code given previously.
Gompertz model gamma parameterization #1: "gompertz gamma 1 bugs.txt" |
model{
#gamma likelihood--level 1 model
for(i in 1:n) {
y[i]~dgamma(alpha[i],beta)
alpha[i] <- y.hat[i]*beta
#Gompertz mean
y.hat[i]<-a0[nest[i]]*exp(-b0[nest[i]]*pow(b1[nest[i]],age[i]))
}
#priors for level-1 parameters
beta ~ dgamma(.01,.01)
#level-2 model for Gompertz parameters
for(j in 1:J){
#the next use of I(0,) is technically incorrect!
a0[j]~dnorm(a.hat[j],tau.a)I(0,)
a.hat[j]<-mu.a+g1*sex[j]+g2*trt[j]+g3*sextrt[j]
b0[j]<-mu.b0+g4*sex[j]
b1[j]<-mu.b1
}
#priors for level-2 parameters
mu.a~dnorm(0,.0001)I(0,)
tau.a<-pow(sigma.a,-2)
sigma.a~dunif(0,1000)
mu.b0~dnorm(0,.0001)I(0,)
mu.b1~dnorm(0,.0001)I(0,)
g1~dnorm(0,.0001)
g2~dnorm(0,.0001)
g3~dnorm(0,.0001)
g4~dnorm(0,.0001)
#extra quantities
a0.femtrt<-mu.a
a0.femcontrol<-mu.a+g2
a0.maletrt<-mu.a+g1
a0.malecontrol<-mu.a+g1+g2+g3
b0.fem<-mu.b0
b0.male<-mu.b0+g4
} |
- I run the model until the diagnostics look good.
kestrel.data<-list("y","age","n","nest","J","sex","trt","sextrt")
kestrel.g.inits<-function() {list(a0=runif(J,50,250), mu.a=runif(1,50,250), mu.b0=runif(1,1,2), mu.b1=runif(1), beta=runif(1), sigma.a=runif(1), g1=rnorm(1), g2=rnorm(1), g3=rnorm(1), g4=rnorm(1))}
kestrel.g.parameters<-c("a0","mu.a","mu.b0","mu.b1", "beta", "sigma.a", "g1", "g2", "g3", "g4")
kestrel.g.out2<-bugs(kestrel.data, kestrel.g.inits, kestrel.g.parameters, "gompertz gamma 1 bugs.txt", n.chains=3, bugs.directory="C:/Program Files/WinBUGS14/", n.iter=10, debug=T)
kestrel.g.out2<-bugs(kestrel.data, kestrel.g.inits, kestrel.g.parameters, "gompertz gamma 1 bugs.txt", n.chains=3, bugs.directory="C:/Program Files/WinBUGS14/", n.iter=1000, debug=T)
max(kestrel.g.out2$summary[,"Rhat"])
[1] 1.038671
kestrel.g.out3<-bugs(kestrel.data, kestrel.g.out2$last.values, kestrel.g.parameters, "gompertz gamma 1 bugs.txt", n.chains=3, bugs.directory="C:/Program Files/WinBUGS14/", n.iter=10000, debug=T)
max(kestrel.g.out3$summary[,"Rhat"])
[1] 1.037423
min(kestrel.g.out3$summary[,"n.eff"])
[1] 93
kestrel.g.parameters2<-c("a0","mu.a","mu.b0","mu.b1", "beta", "sigma.a", "g1", "g2", "g3", "g4","a0.femtrt","a0.femcontrol", "a0.maletrt", "a0.malecontrol", "b0.fem", "b0.male")
kestrel.g.out4<-bugs(kestrel.data, kestrel.g.out3$last.values, kestrel.g.parameters2, "gompertz gamma 1 bugs.txt", n.chains=3, bugs.directory="C:/Program Files/WinBUGS14/", n.iter=50100, n.burnin=100, debug=T)
max(kestrel.g.out4$summary[,"Rhat"])
[1] 1.012623
min(kestrel.g.out4$summary[,"n.eff"])
[1] 160
-
First we check if the gamma model is an improvement over the model with normal errors. I extract the deviance, DIC, and pD from each model.
results<-data.frame(c('normal', 'gamma'), c(kestrel.out5$summary["deviance","50%"], kestrel.g.out4$summary["deviance","50%"]), c(kestrel.out5$DIC, kestrel.g.out4$DIC), c(kestrel.out5$pD, kestrel.g.out4$pD))
colnames(results)<-c('model', 'deviance', 'DIC', 'pD')
results
model deviance DIC pD
1 normal 558.6 579.134 19.415
2 gamma 552.3 570.948 17.540
- Both the deviance and the DIC are lower for the gamma model. Technically the normal model and gamma model estimate the same number of parameters so whether we use deviance or DIC to choose should be immaterial. Having said that notice that the reported values of pD are not the same suggesting that the estimate of the effective number of parameters is slightly unstable.
Graph of parameter estimate: gamma 1 versus normal
- I next compare the estimates obtained from the Gompertz model with normal errors to the Gompertz model with a gamma distribution for the response.
#fig 13a
#create a matrix of Bayesian normal results
gnames<-c("g1","g2","g3")
sum.table1<-cbind((1:3)+.075, kestrel.out5$summary[gnames,"50%"], kestrel.out5$summary[gnames,"2.5%"], kestrel.out5$summary[gnames,"97.5%"], kestrel.out5$summary[gnames,"25%"], kestrel.out5$summary[gnames,"75%"])
#create a matrix of Bayesian gamma results
sum.table.g1<-cbind((1:3)-.075, kestrel.g.out4$summary[gnames,"50%"], kestrel.g.out4$summary[gnames,"2.5%"], kestrel.g.out4$summary[gnames,"97.5%"], kestrel.g.out4$summary[gnames,"25%"], kestrel.g.out4$summary[gnames,"75%"])
#labels for y-axis
b.names<-c('male', 'control', expression('male' %*%'control'))
#expand left margin
par(mar=c(5.1,7.1,2.1,1.1))
#change bar ends to square
par(lend=2)
#set up plot
plot(range(cbind(sum.table1[,2:4],sum.table.g1[,2:4])), c(1,3), type='n', axes=F, ylab='', xlab=expression(paste('Effect on ',a[0])), ylim=c(.5,3.75))
axis(1,cex.axis=.8)
axis(2,at=1:3, labels=b.names, cex.axis=.8, las=2)
#draw bayesian-normal results
apply(sum.table1, 1, function(x) segments(x[3], x[1], x[4], x[1], lwd=4, col='grey75'))
apply(sum.table1, 1, function(x) segments(x[5], x[1], x[6], x[1], lwd=6, col='grey45'))
apply(sum.table1, 1, function(x) points(x[2], x[1], pch=16, col='white', cex=1.3))
apply(sum.table1, 1, function(x) points(x[2], x[1], pch=1, cex=1.4))
box()
abline(v=0,col=2,lty=2)
mtext('Female, Treatment',at=0,side=3,line=.5,cex=.8)
#draw bayesian gamma results
apply(sum.table.g1, 1, function(x) segments(x[3], x[1], x[4], x[1], lwd=4, col='lightpink'))
apply(sum.table.g1, 1, function(x) segments(x[5], x[1], x[6], x[1], lwd=6, col='red'))
apply(sum.table.g1,1,function(x) points(x[2], x[1], pch=16, col='white', cex=1.3))
apply(sum.table.g1,1, function(x) points(x[2],x[1], pch=1, cex=1.4))
legend(-80,3.85, c('Bayesian-normal','Bayesian-gamma'), col=c('grey45','red'), lty=1, lwd=4, bty='n', cex=.8)
#fig 13b
sum.table1b<-c(1+.025, kestrel.out5$summary["g4","50%"], kestrel.out5$summary["g4","2.5%"], kestrel.out5$summary["g4","97.5%"], kestrel.out5$summary["g4","25%"], kestrel.out5$summary["g4","75%"])
sum.table.g1b<-c(1-.025, kestrel.g.out4$summary["g4","50%"], kestrel.g.out4$summary["g4","2.5%"], kestrel.g.out4$summary["g4","97.5%"], kestrel.g.out4$summary["g4","25%"], kestrel.g.out4$summary["g4","75%"])
b.names<-'male'
par(mar=c(5.1,5.1,2.1,1.1))
plot(c(min(c(sum.table1b[2:4], sum.table.g1b[2:4])), .10), c(.5,1.75), type='n', axes=F, ylab='', xlab=expression(paste('Effect on ', b[0])), ylim=c(.85,1.25))
axis(1,cex.axis=.8)
axis(2,at=1, labels=b.names, cex.axis=.8, las=2)
#Bayesian normal
x<-sum.table2
segments(x[3], x[1], x[4], x[1], lwd=4, col='grey75')
segments(x[5], x[1], x[6], x[1], lwd=6, col='grey45')
points(x[2],x[1],pch=16,col='white', cex=1.1)
points(x[2],x[1],pch=1, cex=1.2)
box()
abline(v=0,col=2,lty=2)
mtext('Female',at=0,side=3,line=.5,cex=.8)
#Bayesian gamma
x<-sum.table.g1b
segments(x[3], x[1], x[4], x[1], lwd=4, col='lightpink')
segments(x[5], x[1], x[6], x[1], lwd=6, col='red')
points(x[2],x[1],pch=16,col='white', cex=1.1)
points(x[2],x[1],pch=1, cex=1.2)
legend(-1.3, 1.25, c('Bayesian-normal', 'Bayesian-gamma'), col=c('grey45', 'red'), lty=1, lwd=4, bty='n', cex=.8)
#reset line ends back to the default value.
par(lend=0)
(a)  |
(b)  |
Fig. 13 Bayesian estimates for the effect parameters for different probability models: normal and gamma. In (a) we have estimate for the two main effects, sex and treatment, and their interaction in the model for a0 and in (b) just the main effect of sex in the b0 equation. The y-axis is labeled by the effect that is represented by that estimate relative to the reference group, "female, treatment group" (indicated at the top of the graph labeling an effect of zero). Displayed are 95% confidence (credibility) intervals, 50% confidence (credibility) intervals, and point estimates (mean or median of posterior distribution). |
- The effect estimates for the a0 parameter are similar for the normal and gamma distribution models, although the gamma regression model appears to lead to slightly narrower intervals. If we look at the actual estimates of a0 in the four treatment-sex groupings, the two models yield very different predictions, but we'll examine this later.
- The b0 sex effect appears to be quite different in the two models. The gamma regression credibility interval is nearly entirely contained within that of the the normal errors interval and is much narrower.
- There's no particular value in examining residual plots from this model. The behavior of residuals from gamma regression models is not well-studied. In particular, the residuals are not normally distributed nor are they expected to be homoscedastic. The best way to evaluate the fit of a gamma regression model is through predictive simulation. For further discussion of predictive simulation see lecture 14 and lecture 15.
Using WinBUGS—gamma regression model 2
- As was remarked previously, there is more than one way that the gamma regression model can be parameterized in terms of its mean. Because the mean of the gamma distribution is the ratio of the shape parameter, α, to the rate parameter, β, we can solve for either one and then use the resultant expression as a substitute. Previously, in what I will now call parameterization 1, we solved for the shape parameter and replaced it with a function of the mean μ and β. In parameterization 2 we instead solve for the rate parameter β.

- I incorporate this parameterization into the BUGS code for parameterization 1. The changes are highlighted in blue. Because α is constrained to be positive, I place a vague normal prior on its log.
Gompertz model gamma parameterization #2: "gompertz gamma 2 bugs.txt" |
model{
#gamma likelihood--level 1 model
for(i in 1:n) {
y[i]~dgamma(alpha,beta[i])
beta[i] <- alpha / y.hat[i]
#Gompertz mean
y.hat[i]<-a0[nest[i]]*exp(-b0[nest[i]]*pow(b1[nest[i]],age[i]))
}
#priors for level-1 parameters
logalpha ~ dnorm(0,.000001)
alpha<-exp(logalpha)
#level-2 model for Gompertz parameters
for(j in 1:J){
#the next use of I(0,) is technically incorrect!
a0[j]~dnorm(a.hat[j],tau.a)I(0,)
a.hat[j]<-mu.a+g1*sex[j]+g2*trt[j]+g3*sextrt[j]
b0[j]<-mu.b0+g4*sex[j]
b1[j]<-mu.b1
}
#priors for level-2 parameters
mu.a~dnorm(0,.0001)I(0,)
tau.a<-pow(sigma.a,-2)
sigma.a~dunif(0,1000)
mu.b0~dnorm(0,.0001)I(0,)
mu.b1~dnorm(0,.0001)I(0,)
g1~dnorm(0,.0001)
g2~dnorm(0,.0001)
g3~dnorm(0,.0001)
g4~dnorm(0,.0001)
#extra quantities
a0.femtrt<-mu.a
a0.femcontrol<-mu.a+g2
a0.maletrt<-mu.a+g1
a0.malecontrol<-mu.a+g1+g2+g3
b0.fem<-mu.b0
b0.male<-mu.b0+g4
} |
- I run the model in WinBUGS until the diagnostics look good.
kestrel.data<-list("y","age","n","nest","J","sex","trt","sextrt")
kestrel.g2.inits<-function() {list(a0=runif(J,50,250), mu.a=runif(1,50,250), mu.b0=runif(1,1,2), mu.b1=runif(1), alpha=runif(1), sigma.a=runif(1), g1=rnorm(1), g2=rnorm(1), g3=rnorm(1), g4=rnorm(1))}
kestrel.g2.parameters<-c("a0","mu.a","mu.b0","mu.b1", "alpha", "sigma.a", "g1", "g2", "g3", "g4")
kestrel.g2.out2<-bugs(kestrel.data, kestrel.g2.inits, kestrel.g2.parameters, "gompertz gamma 2 bugs.txt", n.chains=3, bugs.directory="C:/Program Files/WinBUGS14/", n.iter=10, debug=T)
kestrel.g2.out2<-bugs(kestrel.data, kestrel.g2.inits, kestrel.g2.parameters, "gompertz gamma 2 bugs.txt", n.chains=3, bugs.directory="C:/Program Files/WinBUGS14/", n.iter=1000, debug=T)
max(kestrel.g2.out2$summary[,"Rhat"])
[1] 5.294687
kestrel.g2.out3<-bugs(kestrel.data, kestrel.g2.out2$last.values, kestrel.g2.parameters, "gompertz gamma 2 bugs.txt", n.chains=3, bugs.directory="C:/Program Files/WinBUGS14/", n.iter=10000, debug=T)
max(kestrel.g2.out3$summary[,"Rhat"])
[1] 1.045594
min(kestrel.g2.out3$summary[,"n.eff"])
[1] 61
kestrel.g2.out4<-bugs(kestrel.data, kestrel.g2.out3$last.values, kestrel.g2.parameters, "gompertz gamma 2 bugs.txt", n.chains=3, bugs.directory="C:/Program Files/WinBUGS14/", n.iter=20000, n.burnin=100, debug=T)
max(kestrel.g2.out4$summary[,"Rhat"])
[1] 1.093501
min(kestrel.g2.out4$summary[,"n.eff"])
[1] 27
kestrel.g2.parameters2<-c("a0","mu.a","mu.b0","mu.b1", "alpha", "sigma.a", "g1", "g2", "g3", "g4","a0.femtrt","a0.femcontrol", "a0.maletrt", "a0.malecontrol", "b0.fem", "b0.male")
kestrel.g2.out5<-bugs(kestrel.data, kestrel.g2.out4$last.values, kestrel.g2.parameters2, "gompertz gamma 2 bugs.txt", n.chains=3, bugs.directory="C:/Program Files/WinBUGS14/", n.iter=50100, n.burnin=100, debug=T)
max(kestrel.g2.out5$summary[,"Rhat"])
[1] 1.008313
min(kestrel.g2.out5$summary[,"n.eff"])
[1] 260
- I collect the median deviances, DIC, and pD values from the three Bayesian models we've fit—the normal model and the two gamma models using different parameterizations.
results<-data.frame(c('normal', 'gamma 1','gamma 2'), c(kestrel.out5$summary["deviance","50%"], kestrel.g.out3$summary["deviance","50%"], kestrel.g2.out5$summary["deviance","50%"]), c(kestrel.out5$DIC, kestrel.g.out3$DIC, kestrel.g2.out5$DIC), c(kestrel.out5$pD, kestrel.g.out3$pD, kestrel.g2.out5$pD))
colnames(results)<-c('model', 'deviance', 'DIC', 'pD')
results
model deviance DIC pD
1 normal 558.6 579.134 19.415
2 gamma 1 552.7 571.336 17.719
3 gamma 2 555.2 566.327 11.131
- So, what do we conclude from this? Whether we use deviance or DIC to choose models, we conclude that the gamma models are superior to the normal model. If we try to select from among the gamma models we find that DIC and the deviance give conflicting recommendations with DIC favoring the second gamma model while deviance favors the first.
- Recall that DIC is analogous to AIC in that it "corrects" the deviance by including a penalty for the effective number of parameters estimated. The effective number of parameters is recorded in pD and is substantially lower for the second gamma model thus effectively counteracting the fact that it has a higher deviance than does the first gamma model.
- These differences in pD do not make any sense. The number of estimated parameters in these three models should be the same. All three probability models include a mean, which is estimated with exactly the same set of parameters in all three models, plus one additional parameter either a variance (normal), rate (gamma 1), or shape (gamma 2) parameter.
- Thus I'm compelled to disregard the DIC results and rate the models solely based on their deviance values. Using deviance the first gamma parameterization gives rise to a slightly better model than does the gamma 2 parameterization.
Comparing the estimates from the normal, gamma 1, and gamma 2 models
- I start by recreating Fig. 13 in which I display the parameter estimates of the regression equations for all three models.
#fig 14a -- a0 Gompertz parameter
#create a matrix of Bayesian normal results
gnames<-c("g1","g2","g3")
sum.table1<-cbind((1:3)+.125, kestrel.out5$summary[gnames,"50%"], kestrel.out5$summary[gnames,"2.5%"], kestrel.out5$summary[gnames,"97.5%"], kestrel.out5$summary[gnames,"25%"], kestrel.out5$summary[gnames,"75%"])
#create a matrix of Bayesian gamma 1 results
sum.table.g1<-cbind((1:3), kestrel.g.out4$summary[gnames,"50%"], kestrel.g.out4$summary[gnames,"2.5%"], kestrel.g.out4$summary[gnames,"97.5%"], kestrel.g.out4$summary[gnames,"25%"], kestrel.g.out4$summary[gnames,"75%"])
#create a matrix of Bayesian gamma 2 results
sum.table.g2<-cbind((1:3)-.125, kestrel.g2.out5$summary[gnames,"50%"], kestrel.g2.out5$summary[gnames,"2.5%"], kestrel.g2.out5$summary[gnames,"97.5%"], kestrel.g2.out5$summary[gnames,"25%"], kestrel.g2.out5$summary[gnames,"75%"])
#labels for y-axis
b.names<-c('male', 'control', expression('male' %*%'control'))
#expand left margin
par(mar=c(5.1,7.1,2.1,1.1))
#change bar ends to square
par(lend=2)
#set up plot
plot(range(cbind(sum.table1[,2:4],sum.table.g1[,2:4], sum.table.g2[,2:4])), c(1,3), type='n', axes=F, ylab='', xlab=expression(paste('Effect on ',a[0])), ylim=c(.5,3.75))
axis(1,cex.axis=.8)
axis(2,at=1:3, labels=b.names, cex.axis=.8, las=2)
#draw bayesian-normal results
apply(sum.table1, 1, function(x) segments(x[3], x[1], x[4], x[1], lwd=4, col='grey75'))
apply(sum.table1, 1, function(x) segments(x[5], x[1], x[6], x[1], lwd=6, col='grey45'))
apply(sum.table1, 1, function(x) points(x[2], x[1], pch=16, col='white', cex=1.3))
apply(sum.table1, 1, function(x) points(x[2], x[1], pch=1, cex=1.4))
box()
abline(v=0,col=2,lty=2)
mtext('Female, Treatment',at=0,side=3,line=.5,cex=.8)
#draw bayesian gamma 1 results
apply(sum.table.g1, 1, function(x) segments(x[3], x[1], x[4], x[1], lwd=4, col='lightpink'))
apply(sum.table.g1, 1, function(x) segments(x[5], x[1], x[6], x[1], lwd=6, col='red'))
apply(sum.table.g1,1,function(x) points(x[2], x[1], pch=16, col='white', cex=1.3))
apply(sum.table.g1,1, function(x) points(x[2],x[1], pch=1, cex=1.4))
#draw bayesian gamma 2 results
apply(sum.table.g2, 1, function(x) segments(x[3], x[1], x[4], x[1], lwd=4, col='lightblue'))
apply(sum.table.g2, 1, function(x) segments(x[5], x[1], x[6], x[1], lwd=6, col='blue'))
apply(sum.table.g2,1,function(x) points(x[2], x[1], pch=16, col='white', cex=1.3))
apply(sum.table.g2,1, function(x) points(x[2],x[1], pch=1, cex=1.4))
legend(-80,3.85, c('normal','gamma 1' ,'gamma 2'), col=c('grey45','red', 'blue'), lty=1, lwd=4, bty='n', cex=.8)
#fig 14b -- b0 Gompertz parameter
sum.table1b<-c(1+.05, kestrel.out5$summary["g4","50%"], kestrel.out5$summary["g4","2.5%"], kestrel.out5$summary["g4","97.5%"], kestrel.out5$summary["g4","25%"], kestrel.out5$summary["g4","75%"])
sum.table.g1b<-c(1, kestrel.g.out4$summary["g4","50%"], kestrel.g.out4$summary["g4","2.5%"], kestrel.g.out4$summary["g4","97.5%"], kestrel.g.out4$summary["g4","25%"], kestrel.g.out4$summary["g4","75%"])
sum.table.g2b<-c(1-.05, kestrel.g2.out5$summary["g4","50%"], kestrel.g2.out5$summary["g4","2.5%"], kestrel.g2.out5$summary["g4","97.5%"], kestrel.g2.out5$summary["g4","25%"], kestrel.g2.out5$summary["g4","75%"])
b.names<-'male'
par(mar=c(5.1,5.1,2.1,1.1))
plot(c(min(c(sum.table1b[2:4], sum.table.g1b[2:4], sum.table.g2b[2:4])), .10), c(.5,1.75), type='n', axes=F, ylab='', xlab=expression(paste('Effect on ', b[0])), ylim=c(.85,1.25))
axis(1, cex.axis=.8)
axis(2, at=1, labels=b.names, cex.axis=.8, las=2)
#Bayesian normal
x<-sum.table1b
segments(x[3], x[1], x[4], x[1], lwd=4, col='grey75')
segments(x[5], x[1], x[6], x[1], lwd=6, col='grey45')
points(x[2], x[1], pch=16, col='white', cex=1.1)
points(x[2], x[1], pch=1, cex=1.2)
box()
abline(v=0, col=2, lty=2)
mtext('Female', at=0, side=3, line=.5, cex=.8)
#Bayesian gamma 1
x<-sum.table.g1b
segments(x[3], x[1], x[4], x[1], lwd=4, col='lightpink')
segments(x[5], x[1], x[6], x[1], lwd=6, col='red')
points(x[2], x[1], pch=16, col='white', cex=1.1)
points(x[2], x[1], pch=1, cex=1.2)
#Bayesian gamma 2
x<-sum.table.g2b
segments(x[3], x[1], x[4], x[1], lwd=4, col='lightblue')
segments(x[5], x[1], x[6], x[1], lwd=6, col='blue')
points(x[2], x[1], pch=16, col='white', cex=1.1)
points(x[2], x[1], pch=1, cex=1.2)
legend(-1.3, 1.25, c('normal', 'gamma 1', 'gamma 2'), col=c('grey45', 'red', 'blue'), lty=1, lwd=4, bty='n', cex=.8)
#reset line ends back to the default value
par(lend=0)
(a)  |
(b)  |
Fig. 14 Bayesian estimates for the effect parameters for three different probability models: a normal and two parameterizations of the gamma distribution. In (a) we have estimates for the two main effects of sex and treatment as well as their interaction in the equation for the a0 Gompertz parameter and in (b) we have just the main effect of sex in the b0 equation. The y-axis is labeled by the effect that is represented by that estimate relative to the reference group (indicated at the top of the graph with an effect of zero). Displayed are 95% confidence (credibility) intervals, 50% confidence (credibility) intervals, and point estimates (median of posterior distribution). |
- The conclusions we draw from the experiment are the same regardless of the probability model used. Males have a smaller value of the asymptote, a0, than do females but there appears to be a substantive interaction present suggesting that the diminished asymptotic size is most dramatic for males that underwent treatment.
- The three models differ in their estimate of the effect of sex on b0. While the precision of the estimated effect is greater in the two gamma models than in the normal model, the estimated effect is also less negative. Consequently the credibility intervals in the gamma models are shifted to the right to such an extent that in the gamma 2 model the interval barely misses including 0.
- To better understand what's happening I display the values of the full set of Gompertz parameter fixed effects estimates for the three models.
cbind(kestrel.out4$summary[c("mu.a", "mu.b0", "mu.b1", "g1", "g2", "g3", "g4"), "50%"], kestrel.g.out4$summary[c("mu.a", "mu.b0", "mu.b1", "g1", "g2", "g3", "g4"), "50%"], kestrel.g2.out5$summary[c("mu.a", "mu.b0", "mu.b1", "g1", "g2", "g3", "g4"), "50%"]) -> models.out
colnames(models.out)<-c('normal', 'gamma 1', 'gamma 2')
models.out
normal gamma 1 gamma 2
mu.a 110.10000 127.5500 157.1000
mu.b0 2.99500 2.6175 2.7365
mu.b1 0.86440 0.8961 0.9132
g1 -46.49500 -49.9300 -52.2500
g2 11.37000 12.7400 10.8550
g3 30.69000 31.4850 29.9800
g4 -0.66755 -0.4209 -0.3129
- The points estimates of the parameters are rather close with one notable exception, mu.a. This parameter by itself corresponds to the value of the asymptote for the reference group, treated females. The estimate is lowest in the normal model, higher for the gamma 1 model, and much higher still for the gamma 2 model. Because g1, g2, and g3 vary only a little across the three models (as we saw in Fig. 14), but mu.a appears in the formula for the asymptote of all four treatment groups, it follows that the asymptotes will be ranked in the same order across models for all four treatment × sex combinations.
Comparing the predictions from the normal, gamma 1, and gamma 2 models
- While the above results concerning the asymptote would seem to indicate that the three models disagree in their predictions, this is not the case. To demonstrate this I plot the individual Gompertz curves for the eighteen nests in a lattice display. Each panel of the lattice represents a different nest and displays the predicted subject-specific Gompertz curves for the normal model and the two parameterizations of the gamma probability model.
#number of observations per nest
apply(table(kestrel3$nest, kestrel3$age),1,sum) -> num.ages
#obtain one value of sex and trtmnt per nest
tapply(kestrel3$sex, kestrel3$nest, function(x) x[1])-1 -> unique.sex
tapply(kestrel3$trtmnt, kestrel3$nest, function(x) x[1])-1 -> unique.trt
#collect nest-level values in single data frame
model.dat<-data.frame(as.numeric(names(num.ages)), num.ages, unique.trt, unique.sex, kestrel.g2.out5$summary[1:18,'50%'], kestrel.g.out4$summary[1:18,'50%'], kestrel.out4$summary[1:18,'50%'])
colnames(model.dat)<-c('nest', 'num.ages', 'trtmnt', 'sex', 'G2.a0', 'G1.a0', 'N.a0')
#Gompertz parameters for gamma 2 model
model.dat$G2.b0<-kestrel.g2.out5$summary['mu.b0','50%']+ kestrel.g2.out5$summary['g4','50%']*model.dat$sex
model.dat$G2.b1<-kestrel.g2.out5$summary['mu.b1','50%']
#Gompertz parameters for gamma 1 model
model.dat$G1.b0<-kestrel.g.out4$summary['mu.b0','50%']+ kestrel.g.out4$summary['g4','50%']*model.dat$sex
model.dat$G1.b1<-kestrel.g.out4$summary['mu.b1','50%']
#Gompertz parameters for normal model
model.dat$N.b0<-kestrel.out4$summary['mu.b0','50%']+ kestrel.out4$summary['g4','50%']*model.dat$sex
model.dat$N.b1<-kestrel.out4$summary['mu.b1','50%']
#combine nest-level data with full data set
model.dat.short<-model.dat[,c(1:2,5:(ncol(model.dat)))]
kestrel3.model<-merge(kestrel3, model.dat.short)
library(lattice)
xyplot(nstlmass~age|factor(nest), data=kestrel3.model, xlab='Age', ylab='Nestling Mass', subscripts=T, panel=function(x,y,subscripts) {
#raw data
panel.xyplot(x,y,pch=16)
#normal Gompertz curve
panel.curve(SSgompertz(x, kestrel3.model$N.a0[subscripts][1], kestrel3.model$N.b0[subscripts][1], kestrel3.model$N.b1[subscripts][1]), lwd=2, col='gray60')
#gamma 1 Gompertz curve
panel.curve(SSgompertz(x, kestrel3.model$G1.a0[subscripts][1], kestrel3.model$G1.b0[subscripts][1], kestrel3.model$G1.b1[subscripts][1]), lwd=2, col='red', lty=2)
#gamma 2 Gompertz curve
panel.curve(SSgompertz(x, kestrel3.model$G2.a0[subscripts][1], kestrel3.model$G2.b0[subscripts][1], kestrel3.model$G2.b1[subscripts][1]), lwd=1, col='blue', lty=1)
},
key=list(lines=list(lty=c(1,2,1), col=c('gray60',2,4), lwd=c(2,2,1)), text=list(c('Normal','Gamma 1', 'Gamma 2'), cex=c(.75,.75,.75)), border=0, x=.8, y=.85, corner=c(0,0)))

Fig. 15 Comparing fitted subject-specific Gompertz curves from different probability models
- What we see is that the differences in the predicted asymptotes does not cause any important differences in the Gompertz curves over the actual range of the data. This is especially apparent for those nests that have less than the full suite of six ages represented, e.g., nests 3, 5, 6, 8, 11, 12, 13, 15, 27, and 29. Here we see that the curves start to diverge at the last data value after which we have no objective basis to choose among them. Over the actual range of the data the curves are barely distinguishable for individual nestlings.
Comparing the estimated normal, gamma 1, and gamma 2 probability distributions
- Except for some small differences in the overall model deviances, we thus far have little basis for choosing one probability model over another. In a generalized linear model, mixed or otherwise, there is assumed to be a probability distribution of possible observed data values spread out along the vertical axis at each point of the mean Gompertz curves shown in Fig. 15. In lecture 3 we superimposed the predicted probability distributions at various points along the regression curve. Formally these probability distributions should define a third dimension to the plot with the probability mass sticking out of the x-y plane of the paper. Such displays can quickly get unwieldy.
- Instead I select a nestling and at each observed age of that nestling I draw the probability distributions of nestling mass that are predicted by the normal model and the two parameterizations of the gamma model. The different ages of that nestling then define the separate panels of the lattice graph. For reference I also indicate the observed nestling mass (blue square,
) at each age at the bottom of the panels.
#get a list of nests with 5 or more observations
my.nests<- unique(kestrel3.model.short[kestrel3.model.short$num.ages>=5, "nest"])
#function for drawing probability distributions
nest.function<-function(i) {
model.nest<-kestrel3.model[kestrel3.model$nest==my.nests[i],]
histogram(~nstlmass|factor(age), data=model.nest, type='density', xlab='Nestling Mass', subscripts=T, main=paste('Nest ', my.nests[i]), xlim=c(-20,170),
ylim=c(-.002, .1), panel=function(x,subscripts) {
#normal distribution
panel.curve(dnorm(x,mean=SSgompertz( model.nest$age[subscripts][1], model.nest$N.a0[subscripts][1], model.nest$N.b0[subscripts][1], model.nest$N.b1[subscripts][1]), sd=kestrel.out4$summary["sigma.y",'50%']), lwd=2, col='gray60')
#gamma 1 distribution
panel.curve(dgamma(x,SSgompertz(model.nest$age[subscripts][1], model.nest$G1.a0[subscripts][1], model.nest$G1.b0[subscripts][1], model.nest$G1.b1[subscripts][1])* kestrel.g.out4$summary['beta','50%'], kestrel.g.out4$summary['beta','50%']), lwd=2, col='red', lty=2)
#gamma 2 distribution
panel.curve(dgamma(x, kestrel.g2.out5$summary['alpha', '50%'], kestrel.g2.out5$summary['alpha','50%']/ SSgompertz(model.nest$age[subscripts][1], model.nest$G2.a0[subscripts][1], model.nest$G2.b0[subscripts][1], model.nest$G2.b1[subscripts][1])), lwd=1, col='blue', lty=1)
#raw data
panel.points(x, 0, pch=15)
}, par.settings = list(par.main.text = list(cex = 1)),
key=list(columns=3, lines=list(lty=c(1,2,1), col=c('gray60',2,4), lwd=c(2,2,1)), text=list(c('Normal', 'Gamma 1', 'Gamma 2'), cex=c(.75,.75,.75))))
}
#draw graph for ninth nest in the list
nest.function(9)
 |
Fig. 16 Probability distributions for nestling mass. The label appearing at the top of a panel is the age of the nestling. The blue square at the bottom of each panel indicates the observed mass of the nestling. |
- A few observations about Fig. 16.
- It's clear that the observed value (small blue square) would be viewed as a fairly typical value from any one of the three distributions. While these are only the data for nest 25, a similar conclusion would have been drawn if we had used any of the other nests. Thus there are no severe outliers in these data for any of the three distributions.
- The first panel (age = 0) offers one rationale for rejecting the normal distribution out of hand. The fixed variance of the normal distribution permits the distribution to go negative here because the average nestling mass is small. Observe that the two gamma distributions compensate for the presence of a lower boundary at zero by being more concentrated and less variable.
- Both gamma distributions are less variable than the normal distribution at early ages (small nestling masses), but become more variable than the normal distribution at later ages (larger nestling masses).
- The two gamma distributions while not especially distinguishable early on (ages 0, 5, and 10), become markedly different by ages 20 and 25. Specifically the distribution we're calling gamma 1 has a smaller variance than does gamma 2.
- The last observation describes a fundamental difference in the two gamma parameterizations, which I now demonstrate.
- In the parameterization I've been calling gamma 1 we used the mean,
, to replace the parameter α. Thus in the gamma 1 model there are two parameters, μ and β. The variance in terms of these two parameters is the following.

So, the variance in the gamma 1 parameterization is just a constant multiple of the mean.
- Turning to the gamma 2 parameterization, there we used the formula for the mean to replace the parameter β. Thus in the gamma 2 parameterization there are two parameters, μ and α. The variance in terms of these two parameters is the following.

So, the variance in the gamma 2 parameterization is a constant multiple of the square of the mean.
- This explains what we observed in Fig. 16. As the mean nestling mass increases with age, the variances of both gamma distributions also increase. The variance of the gamma 2 parameterization increases much more because the variance increases as the square of the mean while in the gamma 1 parameterization it only increases linearly with the mean. The variance in the normal distribution, of course, is constant because the mean and variance are independent.
- An inspection of Fig. 16 for the other nests suggests that there is nothing in these data to suggest that the extra variability permitted by gamma 2 distribution is needed. When we couple this with the ranking of the models by deviance and the nonsensical predictions made by the normal distribution at age 0, the gamma 1 distribution appears to be the best choice for these data.
Using SAS to Reproduce the Results from nlme and WinBUGS
- SAS has long been considered the gold standard in statistical analysis. This is because the SAS Institute has an extremely rigorous quality control program to tests the accuracy of its statistical programs. This doesn't mean that SAS doesn't have bugs, but the bugs it does have tend not to be egregious ones.
- That being said, SAS is not easy to use. Interacting with SAS today is not too different from what it was like over 40 years ago when SAS ran only on main frame computers. The user writes a SAS program in which data manipulation takes place in a Data step and all statistical analyses are relegated to individual procedures (called Procs) that are dedicated to that statistical methodology. My goal here is not to teach SAS but merely to show you
- how get data from R into SAS,
- how to do a quick analysis in SAS to take advantage of a capability that SAS has but that is lacking in R, and
- then how to get the results back to R again.
- There are three SAS Procs that can be used to fit mixed effects models: Proc Mixed, Proc Glimmix, and Proc NLMixed.
- Proc Mixed is the SAS analogue of the lme function of the nlme package of R although it precedes it chronologically by almost ten years. Proc Mixed only fits normal-based linear mixed effects models.
- Proc Glimmix is a recent innovation that extends SAS Proc Mixed to non-normal distributions. Glimmix is the direct competitor of the lme4 package of R. It's latest incarnation in SAS 9.2 includes maximum likelihood estimation of generalized linear mixed effects models. It supports the same probability distributions that will eventually be implemented in the lmer function of R. Unlike lmer (at least supposedly), Glimmix can only be used to fit linear models.
- Proc NLMixed extends Proc Mixed to nonlinear models. It is the most general of these three packages because it allows the user to fit a mixed effects model using any probability distribution at all. In this sense NLMixed is the frequentist analogue of WinBUGS. Although there is no limitation on the number of random effects that are included in a model, the one shortcoming of NLMixed is that it can only be used to fit hierarchical models with two levels.
- Because NLMixed is the most general of these three Procs and is the only one that can be used to fit nonlinear mixed effects models, I use it here to fit a Gompertz mixed effects model with a gamma-distributed response variable.
Preparing the data for SAS
- SAS has spectacular data entry capabilities, far better than does R. To make things as simple as possible I'm going to ignore this capability and do all the necessary pre-processing of data using R. I carry out the following manipulations of the kestrel data set in R:
- remove all observations with missing values in any of the variables that will be used in the regression model,
- convert factors into numeric 0-1 dummy variables, and
- create a data frame that includes only those variables that will be needed in the analysis.
setwd('D:/Ecology 563 Fall 2008/lecture29')
kestrel3<- kestrel2[!is.na(kestrel2$sex),]
for.sas<- kestrel3[,c("nest", "age", "nstlmass")]
for.sas$trtmnt<- as.numeric(kestrel3$trtmnt)-1
for.sas$sex<- as.numeric(kestrel3$sex)-1
write.table(for.sas, 'kestrelsas.csv', sep=',', row.names=F)
- I don't think we've previously used write.table in this course. It's the counterpart of read.table. Instead of reading in data from external text files, it is used to write R data frames to an external text file.
- The first argument of write.table is the R data frame to be exported.
- The second argument is the name, given in quotes, for the external text file that write.table will create. In an earlier line I used the setwd function to establish the external directory location. If I hadn't done so the second argument of write.table would have needed to include the full directory specification as part of the name.
- There are many additional arguments supported by write.table. Here I use two more.
- The sep= argument specifies the separator to use to separate the variable fields. Here I specify that the file be comma-delimited.
- I also specify row.names=F to ensure that the row names that are part of a R data frame are not included in the exported text file.
- We're ready to load the data into SAS. I start up SAS by selecting Start > All Programs > SAS > SAS 9.1 (English) from the Windows menu system. As of this writing the current version of SAS is SAS 9.2 but the differences from SAS 9.1 are slight. Fig. 17 shows the start-up display for SAS 9.1.

Fig. 17 Start-up screen for SAS 9.1
- There are a number of SAS windows present but only three that are relevant to us here—Editor, Log, and Output. These are indicated on the three tabs displayed at the bottom of the SAS parent window. Two of them, Editor and Log, are tiled and appear by default in the start-up screen.
- The Editor window is where SAS program code is entered. It is analogous to the script window of R. SAS code is entered in this window and then submitted to the SAS system for execution.
- The Log window displays information about previous SAS runs including error messages if there are mistakes in the SAS program code or problems in carrying out the analysis.
- The Output window displays the analytical results of the SAS run.
- Our first objective is to import the text data file we created in R and to check that it was interpreted correctly by SAS. This can be accomplished by selecting File > Import Data… from the SAS menu and then negotiating through a series of menu choices. If we do this and then tell SAS to transcribe our menu choices into SAS code we would obtain the following SAS program.
PROC IMPORT OUT= WORK.KESTREL2
DATAFILE= "D:\Ecology 563 Fall 2008\lecture29\kestrelsas.csv"
DBMS=CSV REPLACE;
GETNAMES=YES;
DATAROW=2;
RUN;
- Some of what appears in the code should be understandable from our experience with R. I highlight some of the similarities and differences in the bullets below.
- R statements end when you press the return key on the keyboard. SAS statements end when you enter a semicolon. Thus we see that there are only four SAS program statements in the above block of code even though there are six lines.
- SAS is not case-sensitive. Thus keywords can be written all lower-case, all upper-case, or a mixture of the two. In addition SAS does not distinguish upper-case and lower-case versions of the same variable. This can be a problem if you made use of this feature in naming your variables in R.
- The SAS keyword PROC is used to identify a SAS procedure that should implement the code that follows. In the code above we specified the IMPORT procedure.
- The DATAFILE argument identifies the external file that we wish to import into SAS.
- The OUT argument assigns a SAS name to the external file. SAS data set names are two-part names with the two parts separated by a period. By default the first part of the name is given the name WORK by SAS. Because WORK is the default, we can refer to this data set as WORK.KESTREL2 or just KESTREL2. In R we would have accomplished the first two lines of the IMPORT procedure by writing the following.
WORK.KESTREL2 <- read.table("D:\\Ecology 563 Fall 2008\\lecture29\\kestrelsas.csv")
- DBMS=CSV tells SAS that the external file is a comma-delimited file. This is like the sep=',' option of R in the read.table function.
- REPLACE tells SAS to replace KESTREL2 if it already exists in the SAS workspace.
- GETNAMES=YES; tells SAS that the variable names occur in the external file. This is like the header=T option of R in the read.table function.
- DATAROW=2; tells SAS that the actual data values begin on the second row of the external file. This is the default in R with the header=T option.
- The RUN; statement closes off the lines of code that are to be submitted to the IMPORT procedure and signifies that IMPORT can begin to run.
- SAS and R have different naming conventions for variables. Thus it is useful to determine if SAS has altered any of the names of our variables from what they were in R. As was noted above, SAS is not case-sensitive and thus does not distinguish upper-case and lower-case versions of the same variable. Furthermore SAS does not support the use of periods in variable names. The IMPORT procedure typically changes periods in variable names to underscores.
- A quick way to see how SAS has interpreted your variable names is to run the CONTENTS procedure. The code for doing this is shown below. The data argument tells the CONTENTS procedure to use the SAS data set we just created, WORK.KESTREL2.
proc contents data=kestrel2;
run;
- The Program Editor window with all of these lines of SAS code is shown in Fig. 18.

Fig. 18 SAS Editor window with SAS code
- To execute the code in the Editor window we need to submit the lines of code to SAS. When the Editor window is the active window this can be done by selecting Run > Submit from the menu or more simply by clicking the Submit button on the SAS toolbar (Fig. 19).

Fig. 19 Location of the submit button on the SAS toolbar
- After the code is submitted, the Log window should display that WORK.KESTREL2 was successfully created and has 81 observations and 5 variables. The Output window displays the default output of the CONTENTS procedure. The useful information for us appears at the end of the display and is shown in Fig. 20. The second column lists the variable names that the SAS IMPORT procedure has created. As we can see SAS has made no changes to the variable names from what they were in R.

Fig. 20 Output from the CONTENTS procedure
Running a nonlinear mixed effects model in SAS
- As was noted above we'll be using the NLMIXED procedure of SAS to fit nonlinear mixed effects models. NLMIXED has a number of built-in probability models, but also permits user-specified probability models. The latter is accomplished by writing down the log-likelihood of the desired probability model. I'll demonstrate both approaches in what follows.
- To make things more transparent I start by fitting a normal model, the same model we fit using the nlme package in R. The code for doing this appears below. In SAS comments start with an asterisk, *, rather than the pound sign as they do in R, and end in a semicolon. Comments are shown in green below, the same way that they appear in the SAS Editor window.
*normal model;
proc nlmixed data=kestrel2;
parms a0=127 a0gen=-49.9 a0trt=12 a0int=34.5
b0=2.56 b0gen=-.93 b1=.8 s2u1=225 s2e=10
;
a0ind=a0+a0gen*sex+a0trt*trtmnt+a0int*sex*trtmnt+u1; *individual a0;
b0ind=b0+b0gen*sex; *individual b0;
b1ind=b1; *individual b1;
eta=a0ind*exp(-b0ind*b1**age); *Gompertz formula;
model nstlmass~normal(eta,s2e);
random u1~normal(0,s2u1) subject=nest;
run;
- The code begins with a declaration of the procedure to be used, in this case proc nlmixed, followed by a data argument that specifies the name of the SAS data set to use. The run statement that appears in the last line closes off the procedure.
- There are three standard statements needed to define the basic model in nlmixed. These are the three statements that begin with a key word that appears in blue.
- The parms statement is used to provide initial estimates for all the parameters in the model. The closer these values are to the actual maximum likelihood estimates, the more likely it is that the algorithm will converge to the maximum likelihood solution. Remember that we had to do something similar in the nlme function of R. There the initial values appeared in a start= argument. The only difference between proc nlmixed and R's nlme function in this regard is that nlmixed also requires initial estimates of the variance terms also, here labeled s2u1, the variance of the random effects, and s2e, the level-1 variance.
- The model statement declares the basic probability model for the response. This should remind you of how things are specified in WinBUGS. The normal function in SAS takes two arguments: the mean of the normal distribution followed by its variance. Here I call the mean eta, defined earlier in the code, and the variance, s2e, whose initial estimate was specified in the parms statement.
- The random statement declares the number of random effects, their distribution, and the structure of the data set. In the above code I specify that there is a single random effect, u1, that has a normal distribution with mean 0 and variance s2u1. (Note that s2u1 was given an initial estimate in the parms statement.) The subject=nest specification indicates that observations from the same nest should have the same value of the random effect u1. Observe that the variable u1 appears earlier in the code in the formula for the asymptote, labeled a0ind. Thus the random statement in SAS along with the inclusion of the random effect in the asymptote equation is equivalent to the notation random=a0~1|nest that we used in the nlme function of R.
- The remaining lines of code define intermediate variables in SAS. The statements that define the variables a0ind, b0ind, and b1ind are just the level-1 equations for the three parameters of the Gompertz model. The variable eta is the value of the Gompertz function as a function of the variable age.
- There's no reason to submit the entire Editor window to SAS and cause it to rerun the import and contents procedures again. Instead just drag over the all of the lines of the nlmixed procedure from proc nlmixed all the way to and including the run statement. With these lines highlighted, click the submit button,
, on the SAS toolbar to submit just the highlighted lines.
- Many of the SAS produce a lot of output, what critics of SAS like to call "statistical diarrhea".
- The first section of the output, labeled Specification, describes the various methods SAS has used to obtain the maximum likelihood solution.
- The next section labeled Dimensions lists some details about the data set being used.
- The section labeled Parameters recites the initial values that were given to the model parameters as well as the initial value of the negative log-likelihood, NegLogLike, evaluated at the initial values of the parameter estimates.
- This is followed by the iteration history which describes the search history of the optimization algorithm. Recall that maximum likelihood estimation for mixed effects models is done numerically. The log-likelihood expression defines a surface over the parameter space and the initial values for the parameters place us at some location on that surface. The numerical optimization method then tries to move us on that surface in such a way that after a finite number of steps we reach the global maximum of that surface (if it exists). At the desired maximum likelihood estimate, the tangent plane to log-likelihood surface should be horizontal (and thus have a gradient that is approximately zero).
- Below I list the last line of the iteration history reported by SAS.
Iter Calls NegLogLike Diff MaxGrad Slope
29 58 291.025945 3.009E-6 0.051791 -7.52E-7
NOTE: GCONV convergence criterion satisfied.
- So at the 29th iteration the value of the negative log-likelihood was 291.026. This differs (Diff) from its value at the previous iteration by 0.000003. So, the log-likelihood is barely changing.
- The gradient is a vector with a component corresponding to each parameter in the model. The elements of the gradient contain the values of the partial derivative of the negative log-likelihood with respect to each parameter at the current values of the parameters. When all components of the gradient are zero it means that the current choice of values for the parameters has yielded a point on the log-likelihood surface that is a local maximum or minimum. MaxGrad is the maximum value of the absolute projected gradient at the current iteration. (The projected gradient is just the ordinary gradient adjusted for boundary conditions.) The reported value of 0.05 is small but certainly not zero.
- Slope corresponds to the value of the directional derivative with respect to the search direction that the algorithm has selected as being "best". The reported value is quite small indicating the log-likelihood will not change very much if the algorithm were to take a step in this direction.
- The final line indicates that the so-called GCONV convergence criterion has been satisfied. This is one of many built-in stopping criteria that NLMIXED uses. GCONV stands for "relative gradient convergence criterion" and corresponds to the value of the normalized predicted function reduction. This is just the amount the objective function would change relative to the current size of the objective function.
- Following this we get a list of fit statistics—deviance, AIC, AICc and BIC. Observe that the reported value of AIC is the same that we obtained previously with nlme. This provides some evidence that we've specified the model correctly in SAS.
Fit Statistics
-2 Log Likelihood 582.1
AIC (smaller is better) 600.1
AICC (smaller is better) 602.6
BIC (smaller is better) 608.1
- Finally we get a list of parameter estimates, standard errors, p-values (labeled Pr > |t|), symmetric confidence intervals based on a t-distribution (Lower and Upper), and the value of the gradient at the last iteration. Most of the values are pretty close to the nlme and WinBUGS estimates. The symmetric confidence intervals and statistical tests for the variance components are clearly nonsense. Notice that the lower bound on the confidence interval for the variance of the random effects is negative.
Parameter Estimates
Standard
Parameter Estimate Error DF t Value Pr > |t| Alpha Lower Upper Gradient
a0 112.89 10.3727 17 10.88 <.0001 0.05 91.0018 134.77 0.001503
a0gen -51.3346 13.2582 17 -3.87 0.0012 0.05 -79.3069 -23.3623 -0.00371
a0trt 8.3383 11.4471 17 0.73 0.4763 0.05 -15.8129 32.4895 -0.00229
a0int 37.1117 16.5013 17 2.25 0.0381 0.05 2.2970 71.9264 0.005007
b0 2.9813 0.2419 17 12.33 <.0001 0.05 2.4711 3.4916 0.005889
b0gen -0.7148 0.2518 17 -2.84 0.0113 0.05 -1.2460 -0.1835 0.003008
b1 0.8657 0.01098 17 78.87 <.0001 0.05 0.8425 0.8889 -0.05179
s2u1 224.70 128.86 17 1.74 0.0993 0.05 -47.1803 496.58 0.004316
s2e 53.7607 9.7246 17 5.53 <.0001 0.05 33.2435 74.2778 -0.00016
Running a nonlinear mixed effects gamma regression model in SAS
- With that bit of success, we're ready to try to fit a gamma model. SAS has a built in gamma function. Its parameterization differs from that of WinBUGS. Instead of a rate parameter β, it uses a shape parameter, φ, that is the reciprocal of β. Thus in terms of this parameterization the mean of the gamma distribution is the following.
|  |
(1a)
(1b) |
So, as before we have two different ways to include the mean as an explicit parameter in the gamma distribution corresponding to what we called the gamma 1 and the gamma 2 model before when we fit this model in WinBUGS.
- With WinBUGS we had no difficulty fitting a gamma model using either parameterization. Using frequentist methods one parameterization turns out to be relatively easy to fit, while the other is nearly impossible to fit. To understand why, consider the formula for the gamma probability distribution in terms of α and φ.

with the corresponding log-likelihood.

- As in most frequentist approaches to mixed models, SAS works with the marginal likelihood in which the random effects are integrated out of the log-likelihood. Rather than maximize the marginal log-likelihood, SAS chooses to minimize the negative marginal log-likelihood, eqn (2).
|  |
(2) |
- To obtain maximum likelihood estimates this expression would need to be differentiated with respect to the parameters of interest. Thus SAS needs to minimize an expression that combines both an integration, which needs to be carried out numerically, with a differentiation that also needs to be carried out numerically.
- As a final complication we still need to introduce μ into the expression. Recall that μ itself is a nonlinear expression of the actual parameters of interest, the coefficients of the effects due to treatment and sex.

- We need to insert this expression for μ into the negative log-likelihood of eqn (2) using either the formula for α, the gamma 1 model given in eqn (1a), or the formula for φ, the gamma 2 model given in eqn (1b). After the substitution it is the parameters that appear in the expression for μ whose maximum likelihood estimates are desired.
- I'm stressing the nature of the complexity of the expression we're dealing with here, because it affects whether it's possible to obtain estimates using a generic estimation program such as Proc NLMIXED. It turns out in the frequentist paradigm that we are often forced to choose parameterizations not because we prefer their interpretation, but because we can obtain estimates using them. That turns out to be the situation here. I was able to obtain model convergence when I used the the gamma 2 parameterization, but not when I used the gamma 1 parameterization.
- This should not be too surprising. If we examine eqn (2) we see that φ appears as its reciprocal and as the argument of a logarithm. The parameter α similarly appears as a simple coefficient in two places but α also occurs as the argument of the log gamma function.
- Recall that the gamma function is an improper integral (improper because its limits are 0 to infinity). So if we substitute for α in eqn (2), the parameters of interest end up occurring in a nonlinear function that is in turn buried in the logarithm of an improper integral all of which then occurs as part of the integrand of an integral that also needs to be differentiated.
- The take-home message for frequentist estimation is that if we need to make a substitution we should try to choose that substitution so that it occurs in the simplest portions of the function that needs to be optimized.
- The Proc NLMIXED code for fitting the gamma 2 parameterization of the gamma regression model is shown below. In the first run I illustrate fitting the model using the built-in SAS gamma function.
- A new parameter, logalpha, is added to the parms list.
- As in WinBUGS a statement is added to constrain the scale parameter, α, to be positive.
- A statement is added that defines the substitution for the shape parameter φ in terms of the mean, the gamma 2 parameterization.
- The gamma distribution replaces the normal distribution in the model statement.
- The two ods lines appearing the program create SAS data sets from portions of the nlmixed output. They will be explained later.
*gamma 2 model using gamma function;
proc nlmixed data=kestrel2;
parms a0=127 a0gen=-49.9 a0trt=12 a0int=34.5
b0=2.56 b0gen=-.93 b1=.8 s2u1=225 logalpha=-1
;
alpha=exp(logalpha); *gamma shape parameter constrained to be positive;
a0ind=a0+a0gen*sex+a0trt*trtmnt+a0int*sex*trtmnt+u1; *individual a0;
b0ind=b0+b0gen*sex; *individual b0;
b1ind=b1; *individual b1;
eta=a0ind*exp(-b0ind*b1**age); *Gompertz formula for the mean;
phi=eta/alpha; *gamma 2 model substitution;
model nstlmass~gamma(alpha,phi);
random u1~normal(0,s2u1) subject=nest;
ods output ParameterEstimates=parmests; *extract parameter estimates;
ods output FitStatistics=fits; *extract logLik and AIC;
run;
- In the second run I show how the same model can be fit instead by writing down the log-likelihood explicitly. This approach is extremely useful because it means that Proc NLMIXED can be used to fit a random intercepts model using any probability distribution of interest for the response. To carry this out here the gamma function is replaced with general in the nlmixed model statement. The single argument of general is the log-likelihood of the model, an expression that was defined in an earlier statement. Notice in the definition of the log-likelihood, called ll in the code, that SAS has a built-in function lgamma to calculate the log of the gamma function.
*gamma 2 model using log-likelihood
proc nlmixed data=kestrel2;
parms a0=127 a0gen=-49.9 a0trt=12 a0int=34.5
b0=2.56 b0gen=-.93 b1=.8 s2u1=225 logalpha=-1
;
alpha=exp(logalpha); *gamma shape parameter constrained to be positive;
a0ind=a0+a0gen*sex+a0trt*trtmnt+a0int*sex*trtmnt+u1; *individual a0;
b0ind=b0+b0gen*sex; *individual b0;
b1ind=b1; *individual b1;
eta=a0ind*exp(-b0ind*b1**age); *Gompertz formula for the mean;
phi=eta/alpha; *gamma 2 model substitution;
yvar=nstlmass;*response variable;
ll=(alpha-1)*log(yvar)- alpha*yvar/eta-lgamma(alpha) + alpha*log(alpha)- alpha*log(eta);*gamma 2 log-likelihood;
model yvar~general(ll);
random u1~normal(0,s2u1) subject=nest;
ods output ParameterEstimates=parmests; *extract parameter estimates;
ods output FitStatistics=fits; *extract logLik and AIC;
run;
- The output from the two runs are identical, so I just show it once. First I display the last line of the iteration history reported by SAS.
Iter Calls NegLogLike Diff MaxGrad Slope
39 97 273.370747 3.943E-7 0.006569 -9.62E-7
NOTE: GCONV convergence criterion satisfied.
- Next we have the fit statistics. Notice that the displayed negative log-likelihood and AIC are smaller than what we obtained for the normal model above.
Fit Statistics
-2 Log Likelihood 546.7
AIC (smaller is better) 564.7
AICC (smaller is better) 567.3
BIC (smaller is better) 572.8
- Finally appear the parameter estimates.
Parameter Estimates
Standard
Parameter Estimate Error DF t Value Pr > |t| Alpha Lower Upper Gradient
a0 154.37 19.8155 17 7.79 <.0001 0.05 112.56 196.17 -0.00001
a0gen -53.0026 15.3797 17 -3.45 0.0031 0.05 -85.4509 -20.5542 -3.87E-6
a0trt 8.4424 9.3765 17 0.90 0.3805 0.05 -11.3403 28.2252 6.693E-6
a0int 30.5906 12.6521 17 2.42 0.0271 0.05 3.8970 57.2842 9.836E-7
b0 2.7156 0.1284 17 21.16 <.0001 0.05 2.4448 2.9864 0.000913
b0gen -0.7148 0.2518 17 -2.84 0.0113 0.05 -1.2460 -0.1835 0.003008
b1 0.9110 0.007970 17 114.31 <.0001 0.05 0.8942 0.9278 0.006569
s2u1 11.0982 49.6237 17 0.22 0.8257 0.05 -93.5987 115.80 -1.47E-6
logalpha 3.3921 0.1724 17 19.67 <.0001 0.05 3.0284 3.7559 0.000099
Returning SAS output to R
- Each portion of the SAS output defines what's called a table. Each table has a name to identify it and a table can be extracted and stored separately as a SAS data set. The two lines in the SAS program code above that begin with ods output extract the values contained in two of these tables. The table called FitStatistics contains the log-likelihood and AIC values, while the table called ParameterEstimates contains the parameter estimates. I give these tables the names fits and parmests respectively. If we wish to see the contents of these SAS data sets we can use the SAS Print procedure.
proc print data=parmests;
run;
proc print data=fits;
run;
- My purpose in extracting these tables as SAS data sets is to then export them from SAS and read them into R for further analysis. SAS data sets can be exported by selecting File > Export from the menu and negotiating through a series of menu choices. If you save the menu choices you made as a SAS command file, you would obtain code like the following. The two runs of Proc Export separately export the parameter estimates and fit statistics. The data sets were saved as comma-delimited text files.
PROC EXPORT DATA = WORK.PARMESTS
OUTFILE= "D:\Ecology 563 Fall 2008\lecture29\sas gamma ests.csv"
DBMS=CSV REPLACE;
RUN;
PROC EXPORT DATA = WORK.FITS
OUTFILE= "D:\Ecology 563 Fall 2008\lecture29\sas gamma ests.csv"
DBMS=CSV REPLACE;
RUN;
- Back in R we can read in either one of these files in the usual fashion.
#back in R
setwd('D:/Ecology 563 Fall 2008/lecture29')
sas.parms<-read.table('sas gamma ests.csv', header=T, sep=',')
sas.fits<-read.table('sas gamma fits.csv', header=T, sep=',')
sas.fits
Descr Value
1 -2 Log Likelihood 546.7
2 AIC (smaller is better) 564.7
3 AICC (smaller is better) 567.3
4 BIC (smaller is better) 572.8
dim(sas.parms)
[1] 9 10
sas.parms[1:4,]
Parameter Estimate StandardError DF tValue Probt Alpha Lower Upper Gradient
1 a0 154.3700 19.8155 17 7.79 <.0001 0.05 112.5600 196.1700 -1.000e-05
2 a0gen -53.0026 15.3797 17 -3.45 0.0031 0.05 -85.4509 -20.5542 -3.870e-06
3 a0trt 8.4424 9.3765 17 0.90 0.3805 0.05 -11.3403 28.2252 6.693e-06
4 a0int 30.5906 12.6521 17 2.42 0.0271 0.05 3.8970 57.2842 9.836e-07
- The AIC returned by SAS is directly comparable to the AIC values we've calculated with the nlme package. Here's a comparison of the Gompertz mixed effects normal model fit in R, the Gompertz normal model with the best variance heterogeneity function fit in R, and the gamma 2 mixed effects Gompertz model estimated in SAS.
c(sapply(list(a0b0.intsex, var2.gnls),AIC), sas.fits[2,2])
600.0308 556.3585 564.7000
- The AIC indicates that the gamma model is far superior to the normal model, but it is beaten by the variance heterogeneity model in which we fit the Gompertz normal model but allowed the variance to vary as a power function of the nestling's age.
Using AIC to Compare Mixed Effect Models with Transformed and Untransformed Responses
Normal mixed effects model log-likelihood with a transformed response variable
- Let y be the n × 1
vector of values of the response variable to which we fit the general linear
mixed effects model
.
This is the usual Laird-Ware formulation where X is the design matrix for the fixed
effects portion of the model, Z is
the design matrix for the random effects portion, β is the vector of fixed coefficients, and b is the vector of random coefficients. We assume that
and
with b independent of ε.
- The notation
represents
a diagonal matrix with the constant
on the
diagonal. Coupled with normality this gives us the standard assumptions of independence and constant variance
for the level-1 errors.
represents Var(b), the covariance matrix of the random effects. The subscript θ is there to remind us that the covariance matrix typically depends on a set of parameters. For instance, in the
random slopes and intercepts model θ would reference three parameters: the variance of the random intercepts,
, the variance of the random slopes,
, and the covariance of the random intercepts and
slopes,
.
- In developing a formula for the
likelihood of a model for the transformed response T(y), I restrict myself to the case
where conditional on the values of the random effects, the distribution of T(y) is normal. This
is the only scenario for which we can explicitly derive the unconditional
distribution of y. For conditional distributions that are not normal we can approximate the unconditional distribution by first integrating out the random effects using numerical methods.
- A vector
is said
to have an n-variate (multivariate)
normal distribution with mean μ and variance-covariance matrix Σ,
if its density function is given by
.
|
(3) |
Here
denotes
the determinant of the matrix Σ. Let
in the mixed effects model equation
. Because e is the sum of two independent normal random variables each with mean 0, it follows that e has a (multivariate) normal distribution with mean 0 and covariance matrix obtained as
follows.

- Because e has a
multivariate normal distribution and Xβ is just a constant, it follows that y also has a
multivariate normal distribution with mean Xβ and the same covariance matrix
, where
. From eqn (3) the density function of y is therefore the following.

This joint density also is the likelihood of our model for the given data set. Using the notation
for the
likelihood, or more simply
, the likelihood of the mixed effects model is given
by eqn (4).

|
(4) |
- Suppose
we log-transform the response vector y to create a new random vector
and fit a
general linear mixed effects model to w,
using the
same assumptions as before. The density (likelihood) function for this model
given the data values w,
, is just eqn (4) with y replaced by w.
- To be able to compare this likelihood (using AIC for example) to that of a second model in which the response vector is y rather than log y, we need to re-express this likelihood solely in terms of the original variable y rather than in terms of its
log-transformed version w. The formula that allows us to do this is the multivariate version of the univariate
solution to this problem we constructed in lecture 10. In the univariate version we started with, for example, the transformation
with
density function
and corresponding likelihood
.
We then showed that we could express this likelihood
function solely in terms of the original variable y as follows.

The general result is that if
has density function
,
then the likelihood expressed in terms of the original variable y is the following.

|
(5) |
- To extend this result to the random
effects setting, we first need the multivariate change of variables formula for
integration. The generic version of this formula considers a more general case in which there is a different
transformation associated with each observation. In particular suppose the new values w are generated from y as follows.

where conceivably
are
different functions. The original distribution
is unknown to us, but we do know that the transformed observations have joint density
. The general change-of-variables formula that
allows us to rewrite this density in w as a density in terms of y is the
following.

|
(6) |
- The second term on the right hand side that appears inside the absolute value sign is the multivariate analogue of the derivative term that appears in eqn (5). It is
called the Jacobian and is defined to be the determinant of a matrix of first partial
derivatives as shown below.

Now in our case the transformations
are quite
simple.

So, the required Jacobian is easy to compute.

|
(7) |
Combining eqns (4), (6), and (7) we obtain the following expression
for the likelihood of w = log y under the random effects model written in terms of the individual
components of y.

|
(8) |
Implementing the log-likelihood transformation formula in R for a random intercepts model
- The implementation of eqn (8) in R is complicated by the fact that the variance-covariance matrix of the random effects,
, is not available in a convenient format from R. The VarCorr function returns the values of
but as character values not organized in a matrix. Thus we need to extract these individually, convert them to numbers, and then arrange them in a matrix. Also VarCorr returns correlations rather than covariances and these will also need to be converted. As a result there is no easy way to write a function that will return the log-likelihood for a generic random coefficients model.
- To illustrate the use of eqn (8) I write a function in R to express the log-likelihood of a random intercepts model in terms of the raw response when the model was fit to a log-transformed response. In this case the matrix
is quite simple; it is just the scalar element
. To construct the log-likelihood we can work with eqn (8) directly or more simply use the multivariate normal density function dmvnorm that is found in the mvtnorm package. I illustrate both approaches below.
Using eqn (8) directly
- I begin by taking the logarithm of eqn (8) to yield the following expression for the log-likelihood.

|
(9a) |
I set up the function I call trans.loglik so that it takes four arguments—three required and one optional. The arguments are the following.
- lme.model: the first argument is the lme object that is produced when a random intercepts model is fit using the lme function of the nlme package.
- lme.group: the second argument is the structural variable, the variable that defines how observations are grouped in the data set. This is the variable that was specified to the right of the vertical bar, |, in the random argument of lme.
- lme.y: the third argument is the untransformed response variable.
- log.tran: the fourth argument is a logical argument that indicates whether or not the response variable is log-transformed in the fitted lme model. The default setting for this argument is TRUE indicating that a log transformation was used when the model was fit.
I include the fourth argument as a check on the code. When the fourth argument is set to FALSE and the lme model specified in the first argument was fit to the raw response, then the trans.loglik function should agree with the values returned by the logLik and AIC functions when applied to the same model.
- The if...else sections of the code check the value of the argument log.tran and use it to choose the appropriate term of the log-likelihood. To compute the mean response I use the predict function with its level argument set to level=0. This choice causes predict to return the value we've previously called the population average value. It is equal to Xβ in the formulas given above.
#loglik for a log-transformed normal random intercepts model
trans.loglik<-function(lme.model, lme.group, lme.y, log.tran=T)
{
#number of observations and number of groups
n.obs<-length(lme.group)
n.grp<-length(unique(lme.group))
#extract variance components
tau2<-as.numeric(VarCorr(lme.model)[1,1])
sigma2<-as.numeric(VarCorr(lme.model)[2,1])
#design matrix for random intercepts
zmat<-outer(as.numeric(factor(lme.group)), 1:n.grp,'==')+0
#variance matrix
zmat%*%t(zmat)*tau2/sigma2+diag(n.obs)->Sig.theta
inv.Sig<-solve(Sig.theta)
#components of multivariate density: deviation from mean
if(log.tran) y.diff<-log(lme.y)-predict(lme.model, level=0) else
y.diff<-lme.y-predict(lme.model, level=0)
#exponential term in the original likelihood
exp.term<- -t(y.diff) %*% inv.Sig %*% y.diff/(2*sigma2)
#log-likelihood: log(Jacobian) is in blue
LL<- if(log.tran) -.5*(n.obs*log(2*pi*sigma2)+ log(det(Sig.theta)))+ exp.term -sum(log(lme.y)) else -.5*(n.obs*log(2*pi*sigma2)+ log(det(Sig.theta)))+ exp.term
AIC.model<- -2*LL + 2*attr(logLik(lme.model),'df')
my.out<-c(LL,AIC.model)
names(my.out)<-c('loglik','AIC')
my.out
}
- To illustrate the use of the function, I read in the kestrel data set and fit a random intercepts linear model with age as the only predictor. I do this separately using nstlmass and log(nstlmass) as the responses.
kestrel<- read.table('http://www.unc.edu/courses/2008fall/ecol/563/001/data/lab13/kestrel_data.csv', header=TRUE, sep=',', na.strings='.')
kestrel2<-kestrel[kestrel$egg==1 & !is.na(kestrel$nstlmass),]
The grouping variable is kestrel2$nest and the raw response is kestrel2$nstlmass.
library(nlme)
#log-transformed response
out1.log<-lme(log(nstlmass)~age, data=kestrel2, random=~1|nest, method='ML')
trans.loglik(out1.log,kestrel2$nest,kestrel2$nstlmass)
loglik AIC
-322.7917 653.5833
#untransformed response
out1<-lme(nstlmass~age, data=kestrel2, random=~1|nest, method='ML')
trans.loglik(out1,kestrel2$nest,kestrel2$nstlmass,F)
loglik AIC
-326.7366 661.4732
As a check, the second set of results for the untransformed response should match the the values returned by logLik and AIC.
c(logLik(out1),AIC(out1))
[1] -326.7366 661.4732
- Because we've used the change-of-variables formula to calculate the log-likelihood of the model with the log-transformed response, we can compare the results. From the reported values we see that the model using a log-transformed response is an improvement over the same model using the raw response. On the other hand if we compare these values to the ones obtained for the nonlinear Gompertz model we fit previously we see that the log-likelihoods and AICs are far worse than what we obtained with the Gompertz model.
#Gompertz model without any level-2 predictors
c(logLik(a0.int),AIC(a0.int))
-305.6560 621.3121
Using the mvtnorm package
- If we use the dmvnorm function then the log-likelihood corresponding to eqn (9a) is the following.

|
(9b) |
Using the dmvnorm function simplifies our code somewhat. I call the new function trans.loglik2.
#loglik for a log-transformed normal random intercepts model
#uses dmvnorm function from mvtnorm package
trans.loglik2<-function(lme.model, lme.group, lme.y, log.tran=T)
{
#number of observations and number of groups
n.obs<-length(lme.group)
n.grp<-length(unique(lme.group))
#extract variance components
tau2<-as.numeric(VarCorr(lme.model)[1,1])
sigma2<-as.numeric(VarCorr(lme.model)[2,1])
#design matrix for random intercepts
zmat<-outer(as.numeric(factor(lme.group)), 1:n.grp, '==')+0
#variance matrix
zmat%*%t(zmat)*tau2+sigma2*diag(n.obs)->Sigma
#multivariate normal log-likelihood
LL<- if(log.tran) log(dmvnorm(log(lme.y), predict(lme.model, level=0), Sigma)) - sum(log(lme.y)) else log(dmvnorm(lme.y, predict(lme.model, level=0), Sigma))
AIC.model<- -2*LL + 2*attr(logLik(lme.model),'df')
my.out<-c(LL,AIC.model)
names(my.out)<-c('loglik','AIC')
my.out
}
- I load the mvtnorm package and run the function first on the model out1.log that was fit above with the log-transformed response, and then on the model out1 that was fit using the raw response.
library(mvtnorm)
trans.loglik2(out1.log,kestrel2$nest,kestrel2$nstlmass)
loglik AIC
-322.7917 653.5833
trans.loglik2(out1,kestrel2$nest,kestrel2$nstlmass,F)
loglik AIC
-326.7366 661.4732
The results are the same as those obtained using the trans.loglik function above.
Cited references
- Gelman, Andrew and Jennifer Hill. 2006. Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press: New York.
-
Kutner, Michael H., Christopher J. Nachtsheim, John Neter, and William Li. 2004. Applied Linear Statistical Model. McGraw-Hill: Boston.
- McCarthy, Michael A. 2007. Bayesian Methods for Ecology. Cambridge University Press: New York.
- Pinheiro, J. C. and D. M. Bates. 2000. Mixed-Effects Models in S and S-Plus. Springer-Verlag: New York.
Course Home Page