Read in the data.
> lakes<- read.table('http://www.unc.edu/courses/2007spring/enst/562/001/assignments/final/lakes.txt', header=T,sep='\t')
> names(lakes)
[1] "Lake" "Latitude" "Longitude" "X1976" "X1977" "X1978"
[7] "X1981"
The data are currently in "lake-level" format in which each lake has a single record. We need to put this into "lake-period" format with multiple records for each lake, in this case one record for each measurement occasion. This requires unlisting the records X1976, X1977, X1978, and X1981 stacking them into a single column, replicating the Lake, Latitude, and Longitude variables four times, and adding a new column that lists the year the measurement was taken.
> newlakes<-data.frame(rep(lakes$Lake,4), rep(lakes$Latitude,4), rep(lakes$Longitude,4), unlist(lakes[,4:7]), rep(c(1976:1978,1981), rep(dim(lakes)[1],4)))
> colnames(newlakes)<-c('lake', 'latitude', 'longitude', 'SO4', 'year')
> newlakes[1:10,]
lake latitude longitude SO4 year
X19761 1 58.0 7.2 6.5 1976
X19762 2 58.1 6.3 5.5 1976
X19763 4 58.5 7.9 4.8 1976
X19764 5 58.6 8.9 7.4 1976
X19765 6 58.7 7.6 3.7 1976
X19766 7 59.1 6.5 1.8 1976
X19767 8 58.9 7.3 2.7 1976
X19768 9 59.1 8.5 3.8 1976
X19769 10 58.9 9.3 8.4 1976
X197610 11 59.4 6.4 1.6 1976
As was discussed in lecture 3 there are a number of guidelines for helping us choose a probability model for the response variable SO4. I repeat some of these here.
Clearly SO4 is a continuous variable and, being a concentration, is bounded below by zero (but in principle unbounded above). Continuous distributions were covered in lecture 6 where four distributions were discussed: the normal, lognormal, gamma, and beta distributions. Of these we can immediately reject the beta distribution because the data aren't bounded between 0 and 1. Of the remaining three both the lognormal and gamma are bounded below by zero, while the normal is unbounded. Of course if the data are displaced far enough from zero, then the fact that the normal distribution is unbounded below need not be a problem.
Because the goal is to model SO4 concentration against year the probability model we're seeking must hold at each year separately rather than in the aggregate. Because both the lognormal and gamma distributions tend to be skewed while the normal is symmetric, a histogram of SO4 concentration at each year is a useful display.
> par(mar=c(5.1,3.1,1.1,1.1))
> par(mfrow=c(2,2))
> sapply(c(1976:1978,1981), function(x) hist(newlakes$SO4[newlakes$year==x], xlab=expression("SO"[4]), main=x, col='lightblue'))

Fig. 1 Histograms of SO4 concentration in each year
From Fig. 1 it's pretty clear that the normal distribution will be inappropriate. The actual distributions are skewed and with many values close to zero. Thus a model based on a normal distribution is likely to predict negative concentrations. Both the lognormal and gamma distributions have the same mean-variance relation (quadratic), so the best way to evaluate them is to actually use them in a model and compare the results with AIC. Even though we know a normal model is unlikely to be appropriate I also include it among the candidate models.
I fit each model with year as a linear predictor. The normal and lognormal models can be fit with lm, the lognormal by first log-transforming the response SO4. The gamma distribution is fit with glm and the family=Gamma argument.
> out.norm<-lm(SO4~year,data=newlakes)
> out.gamma<-glm(SO4~year,data=newlakes,family=Gamma)
> out.lnorm<-lm(log(SO4)~year,data=newlakes)
The loglikelihood of the log-transformed model is not directly comparable to models with an untransformed response without additional work. We wrote a function in lecture 22 that carried out the necessary steps and I reproduce it below.
> norm.loglike<-function(data,var,model)
{
t.y<-log(data[,var])
sigma2<-(sum(residuals(model)^2))/dim(data)[1]
loglike<-sum(log(dnorm(t.y, mean=predict(model), sd=sqrt(sigma2))*1/(data[,var])))
out<-list(loglike, c(coef(model), sigma2))
out
}
To get this function to work we need to give it a version of the data set in which the missing values of the response are removed. (These observations were automatically removed by the lm function when the model was fit.) We can do this by subsetting the data set using the !is.na( ) construction. The first argument to the function is the data set, the second is the name of the response, and the third is the model.
> norm.loglike(newlakes[!is.na(newlakes$SO4),], "SO4", out.lnorm)->out.loglike
> out.loglike
[[1]]
[1] -342.0175
[[2]]
(Intercept) Year
41.66739080 -0.02050562 0.37574603
Each of the models has three estimated parameters (β0, β1, and σ2), so we could just as well have compared loglikelihoods rather than AIC.
> -2*out.loglike[[1]]+2*length(out.loglike[[2]])
[1] 690.0349
> AIC(out.gamma)
[1] 705.6806
> AIC(out.norm)
[1] 786.8488
The lognormal model has the lowest AIC by a sizeable amount. We'll proceed with a lognormal probability model for the response.
The variable year has four unique values. Hence there are a limited number of choices for the linear predictor. Without any theory to guide us we can
Each model was fit assuming a lognormally distributed response. AIC can be used to compare the models without any adjustment because all three models use the same response, log SO4.
> out.lnorm2<-lm(log(SO4)~year+I(year^2),data=newlakes)
> out.lnorm3<-lm(log(SO4)~factor(year),data=newlakes)
> sapply(list(out.lnorm,out.lnorm2,out.lnorm3),AIC)
[1] 318.3179 320.3132 322.3114
Clearly the linear model is to be preferred. It has both the lowest AIC and is simpler.r To see what's going on I graph the data (jittered) and superimpose the three models. For the separate means model I just plot the estimated mean at each year as predicted by the model.
> plot(log(SO4)~jitter(year), data=newlakes, xlim=c(1975.5,1981.5), xlab='year', ylab=expression(log('SO'[4])))
> abline(out.lnorm, col='grey80', lwd=5)
> curve(coef(out.lnorm2)[1]+ coef(out.lnorm2)[2]*x+ coef(out.lnorm2)[3]*x^2,col=2, lty=2, add=TRUE)
> points(c(1976:1978,1981), c(coef(out.lnorm3)[1], coef(out.lnorm3)[1]+ coef(out.lnorm3)[2],
coef(out.lnorm3)[1]+ coef(out.lnorm3)[3], coef(out.lnorm3)[1]+coef(out.lnorm3)[4]), col=3, pch=16, cex=2)
> legend(1979,2.5, c('linear','quadratic','means'), lty=c(1,2,NA), lwd=c(3,1,1), pch=c(NA,NA,16), col=c('grey80',2,3), bty='n', cex=.9)

Fig. 2 Three lognormal models using different versions of the linear predictor for year
Our conclusions remaing the same even if we use a normal or a gamma probability model for the response. A model linear in year is best.
> out.gamma1<-glm(SO4~year+I(year^2),data=newlakes,family=Gamma)
> out.gamma2<-glm(SO4~factor(year),data=newlakes,family=Gamma)
> sapply(list(out.gamma,out.gamma1,out.gamma2),AIC)
[1] 705.6806 707.4164 709.2101
> out.norm1<-lm(SO4~year+I(year^2),data=newlakes)
> out.norm2<-lm(SO4~factor(year),data=newlakes)
> sapply(list(out.norm,out.norm1,out.norm2),AIC)
[1] 786.8488 788.6666 790.4781
The table below summarizes the common pooling models that were fit.
Model |
Linear predictor | AIC |
lognormal |
linear year | 690.03 |
| quadratic year | 692.03 |
|
| categorical year | 694.03 |
|
gamma |
linear year | 705.68 |
| quadratic year | 707.42 |
|
| categorical year | 709.21 |
|
normal |
linear year | 786.85 |
| quadratic year | 786.67 |
|
| categorical year | 790.48 |
Of course it is possible that some weird transformation of year is preferable. To assess that possibility I plot the residuals from the linear model against year and superimpose a lowess curve.
> plot(residuals(out.norm)~ jitter(newlakes$year[!is.na(newlakes$SO4)]), ylab='residuals', xlab='year')
> lines(lowess(residuals(out.norm)~ jitter(newlakes$year[!is.na(newlakes$SO4)])), col=2)

Fig. 3 Plot of residuals from model linear in year against the predictor "year"
The plot provides no evidence that form of the linear predictor is inadequate. It's worth noting at this point that the modeled relationship between log(SO4) and year is not statistically significant.
> summary(out.lnorm)
Call:
lm(formula = log(SO4) ~ year, data = newlakes)
Residuals:
Min 1Q Median 3Q Max
-1.50496 -0.52462 -0.08062 0.52028 1.58027
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 41.66739 49.32240 0.845 0.399
year -0.02051 0.02493 -0.822 0.412
Residual standard error: 0.6167 on 166 degrees of freedom
Multiple R-Squared: 0.004057, Adjusted R-squared: -0.001942
F-statistic: 0.6763 on 1 and 166 DF, p-value: 0.4120
The basic structure possessed by these these data is that they consist of repeated measures on individual lakes. Due to local geology and edaphic factors, we should expect measurements coming from the same lake, even separated by a time interval of a year or more, to be more similar to each other than to observations coming from other lakes. This provides the potential for what we've called observational heterogeneity—differing degrees of variability in subsets of observations. Observe that the repeated measures aspect of these data was imposed at the design level. Structure that arises from sampling design must be accounted for in the analysis.
In addition to temporal correlation there is the possibility of spatial structure in these data relating to their geographic proximities. We might expect lakes that are closer to each other to share a similar chemistry. I consider structure of this sort to be inadvertent structure. It arises because lakes are fixed objects that occupy space and hence their varying proximities will be an issue in even the most well-designed random selection scheme. The resulting spatial structure may turn out to be important, but it is not present by design. Hence we'll account for the designed structure first and then deal with the spatial structuring if necessary.
The ideal way to display the temporal structure of these data is in a lattice graph in which we plot the SO4 versus year separately for each lake. Because we've already seen that a log transformation of the SO4 concentration is justified, the most useful lattice graph plots log(SO4) versus year.
> trellis.par.set(col.whitebg())
> xyplot(log(SO4)~year|lake, data=newlakes, panel=function(x,y) {
panel.xyplot(x,y)
panel.abline(lm(y~x),col=2)},
ylab=expression(paste("SO"[4], ' Concentration',sep=' ')),
par.settings=list(axis.text=list(cex=.8)), layout=c(8,6))

Fig. 4 Lattice graph displaying the repeated measures structure of the data set
As Fig. 4 indicates there is a dramatic difference in baseline SO4 concentrations across lakes. This is reflected in the widely different values for the intercepts of the regression lines. In addition there is some evidence for heterogeneity in slopes across the lakes. The natural way to handle structured data such as these is with a multilevel model. Fig. 4 coupled with our early work with the complete pooling model suggests that a linear model relating log(SO4) concentration to continuous time should be adequate.
Unconditional means model
The correct place to begin in fitting a multilevel model is with the unconditional means model. The unconditional means model here includes only an intercept but it allows that intercept to vary among the different lakes. It serves to partition the variance between and among lakes and acts as a benchmark that allows us to assess where most of the variability lies--between lakes or between years for the same lake. The unconditional means model is formulated as follows.
where ![]()
Written as a composite equation it takes the following form.
It's the latter form that matches the syntax used by R's lme function.
> library(nlme)
> lme(log(SO4)~1,random=~1|lake, data=newlakes, method='ML', na.action=na.omit)->unc.mean
The unconditional means model is of interest only for its variance components. From these we can calculate the intraclass correlation coefficient.

I calculate this for the lakes data using output from VarCorr.
> VarCorr(unc.mean)
lake = pdLogChol(1)
Variance StdDev
(Intercept) 0.33272801 0.5768258
Residual 0.03104858 0.1762061
> as.numeric(VarCorr(unc.mean)[1,1])/ (as.numeric(VarCorr(unc.mean)[1,1])+ as.numeric(VarCorr(unc.mean)[2,1]))
[1] 0.9146493
So we see that 91% of the variability in log(SO4) concentrations occurs between lakes rather than within lakes across years. Put another way measurements from the same lake exhibit a correlation of 0.91. This is dramatic evidence that a multilevel model is needed here and argues strongly against using a common pooling model.
Random intercepts model
The next step is to add the predictor year. As we've seen the individual lines in Fig. 4 exhibit a large variability in intercepts. Thus the natural second model to consider is a random intercepts model, a linear population model in which the intercepts of the regression lines for individual lakes are allowed to vary but they still share a common slope.
where ![]()
Written in composite form it looks like the following.

This is the form that matches the syntax used by R's lme function.
> lme(log(SO4)~year,random=~1|lake, data=newlakes, method='ML', na.action=na.omit)->random.ints
> sapply(list(out.lnorm,unc.mean,random.ints),AIC)
[1] 318.31792 73.76328 65.73269
Observe the dramatic decrease in AIC relative to the common pooling model, strong evidence that we need to account for the repeated measures structure of the data set. (The AIC are directly comparable here because both models use log(SO4) as the response.) There is a more modest decrease in AIC relative to the unconditional means model indicating that the inclusion of year has improved the fit of the model somewhat. Next we examine the variance components.
> VarCorr(random.ints)
lake = pdLogChol(1)
Variance StdDev
(Intercept) 0.33345020 0.5774515
Residual 0.02855946 0.1689954
As measured by the relative reduction in level-1 variance the effect of year on accounting for the within lake variability is more modest.
> (as.numeric(VarCorr(unc.mean)[2,1])- as.numeric(VarCorr(random.ints)[2,1]))/ as.numeric(VarCorr(unc.mean)[2,1])
[1] 0.08016856
So we've explained only 8% of the within lake variability by including year as a predictor. Still the predictor year is statistically significant.
> summary(random.ints)
Linear mixed-effects model fit by maximum likelihood
Data: newlakes
AIC BIC logLik
65.73269 78.22854 -28.86634
Random effects:
Formula: ~1 | lake
(Intercept) Residual
StdDev: 0.5774515 0.1689954
Fixed effects: log(SO4) ~ year
Value Std.Error DF t-value p-value
(Intercept) 45.76822 13.897196 119 3.293342 0.0013
year -0.02259 0.007026 119 -3.215169 0.0017
Correlation:
(Intr)
year -1
Standardized Within-Group Residuals:
Min Q1 Med Q3 Max
-4.02214991 -0.47329909 -0.05670254 0.55159957 3.33206296
Number of Observations: 168
Number of Groups: 48
Random slopes and intercepts model
The last model we might consider at this point is a random slopes and intercepts model. The lattice graph of Fig. 4 suggests there is some variability in individual slopes. Unfortunately, and this is frequently the case with mixed models in the frequentist paradigm, this model is very difficult to fit. One of the pernicious features of fitting such models is that it's not always clear that there is a problem.
Recall that the random slopes and intercepts model takes the following form.
where
.
In composite form the equations reduce to the following.
So the following function call fits this equation in R.
> lme(log(SO4)~year,random=~year|lake, data=newlakes, method='ML', na.action=na.omit)->random.slopes
If we look at the variance components of this model we see the following.
> VarCorr(random.slopes)
lake = pdLogChol(year)
Variance StdDev Corr
(Intercept) 3.334502e-01 5.774515e-01 (Intr)
year 1.192613e-14 1.092068e-07 0
Residual 2.855945e-02 1.689954e-01
From the displayed correlation of 0 between the random slopes and intercepts and the estimated variance component of the slopes of approximately 0 we can see that R has converged to a solution in which the slopes don't vary. If we examine the loglikelihoods of the random slopes and intercepts model with that of the random intercepts model, we see that they are identical!
> sapply(list(random.ints,random.slopes),logLik)
[1] -28.86634 -28.86634
It is extremely unlikely that introducing to additional parameters to the model has had no effect on the loglikelihood. Given that we've already seen from our lattice plot that the slopes do vary between lakes out best conclusion is that lme has converged to a local solution (the solution of the random intercepts model) rather than a global solution.
The usual tack when this happens is to try reparameterizing the predictor, perhaps by centering, and if that doesn't work to modify some of the control settings of the maximization algorithm. I tried all of this with very little success. For instance here's an attempt to shift the origin of year to zero.
> lme(log(SO4)~I(year-1976), random=~I(year-1976)|lake, data=newlakes, method='ML', na.action=na.omit)->random.slopes
Error in lme.formula(log(SO4) ~ I(year - 1976), random = ~I(year - 1976) | :
iteration limit reached without convergence (9)
With only 48 lakes and a maximum of 4 measurements per lake, obtaining point estimates of the variance parameters in a problem such as this is difficult. We may have outstripped the ability of our software to fit such a model. Bayesian models tend to have much greater success with data such as these, so a logical step at this point would be to try to fit the model using WinBUGS. But, that was not part of this assignment so I'll leave that as a summer homework exercise! Without evidence to the contrary we'll stick with the random intercepts model as our best model.
The table below summarizes our results. Because all the models being compared have log(SO4) as the response we can use AIC applied directly to the model object to compare them. The AIC values reported here cannot be used to compare these models against models with other probability models for the response. Note: the norm.loglike function we used above does not return the correct AIC when random effects are involved.
Lognormal Model |
Predictor |
# parameters |
loglikelihood (for log y) |
AIC |
| common pooling | year |
3 |
–159.159 |
318.318 |
| unconditional means | — |
3 |
–33.882 |
73.763 |
| random intercepts | year |
4 |
–28.866 |
65.732 |
| random slopes and intercepts | year |
6 |
failed to converge |
— |
Note: One model I don't include in the above list is the separate intercepts model, a model that estimates a separate intercept for each lake. This model would be fit as follows.
# the following model was not considered
> lm(log(SO4)~ year + factor(lake), data=newlakes)
In this model we would end up estimating 45 additional parameters, one intercept for each lake. Not surprisinglyl this model does turn to be far and away the best model in terms of loglikelihood and AIC, but given that we have at most four observations per lake using such a model is clearly an example of overfitting.
A multilevel model returns a population level model from which individual lakes are allowed to deviate. The manner in which the lakes are allowed to deviate depends on how the random effects portion was specified. The population model we obtained is the following.
> fixef(random.ints)
(Intercept) year
45.76822277 -0.02258819
![]()
The population model states that for every one year increment the average log(SO4) concentrations are predicted to fall by 0.023. To put this into more understandable terms we can express this in terms of concentration units. If we exponentiate the model equation we obtain an expression for the geometric mean sulfate concentration.
So in each subsequent year the model predicts the geometric mean sulfate concentration will be 98% of what is was the year before. Because of the way the model was formulated the intercept is currently not interpretable. It actually represents the geometric mean sulfate concentration in year 0 AD!! To fix this we can refit the model centering year at the first measured year 1976.
> lme(log(SO4)~I(year-1976),random=~1|lake, data=newlakes, method='ML', na.action=na.omit)->random.ints2)
> fixef(random.ints2)
(Intercept) I(year - 1976)
1.13395508 -0.02258819
So now we have the following.
So the average lake in 1976 had a geometric mean sulfate concentration of 3.1. Under the normality assumption of the random effects and given the point estimate for the variability of the random effects (τ = 0.57 in the output above), we predict the geometric mean sulfate concentration in 1976 varies between 1 and 9.6 and will decrease by a factor of 0.98 with each subsequent year over the range of the data.
> exp(1.13395508-1.96*0.5774515)
[1] 1.002152
> exp(1.13395508+1.96*0.5774515)
[1] 9.638447
Because we have the geographic locations of the lakes we can use the semivariogram to assess spatial correlation. For question 5 we use the residuals from the complete pooling model and for question 6 the residuals from the random intercepts model. We need to remove the missing values from the original data set so that the residual vector and the latitude and longitude vectors are the same length. I also do the same with a vector of years.
> lakes.reduced<-newlakes[!is.na(newlakes$SO4),]
#assemble data in the form preferred by the geoR package
> newdat1<-data.frame(lakes.reduced$latitude, lakes.reduced$longitude, residuals(out.lnorm))
> newdat2<-data.frame(lakes.reduced$latitude, lakes.reduced$longitude, residuals(random.ints))
#remove years with missing concentrations
> years.reduced<-newlakes$year[!is.na(newlakes$SO4)]
Technically we should convert latitude and longitude measured in degrees to distance units. Since we're only operating on a limited geographic scale, this should not have a big effect so I stick with degrees. I do some initial experimentation to determine how wide to make the bins and what the maximum distance between lakes is.
> library(geoR)
> curdata<-newdat1[years.reduced==1976,]
> geodat1<-as.geodata(curdata)
> variog(geodat1, uvec=seq(0,8,.25), max.dist=8, option='bin')->geodata.v1
> rbind(geodata.v1$n, geodata.v1$u)
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13]
[1,] 20.00 31.0 45.00 56 46.00 60.0 67.00 54 49.00 54.0 53.00 66 50.00
[2,] 0.25 0.5 0.75 1 1.25 1.5 1.75 2 2.25 2.5 2.75 3 3.25
[,14] [,15] [,16] [,17] [,18] [,19] [,20] [,21] [,22] [,23] [,24] [,25] [,26]
[1,] 39.0 49.00 45 33.00 39.0 34.00 27 27.00 16.0 13.00 23 6.00 11.0
[2,] 3.5 3.75 4 4.25 4.5 4.75 5 5.25 5.5 5.75 6 6.25 6.5
[,27] [,28] [,29] [,30]
[1,] 9.00 2 8.00 2.0
[2,] 6.75 7 7.25 7.5
From the output we see that by the time we reach a distance of 5 the number of observations per bin has dropped below 30, a number that's considered a minimum value for obtaining a reasonable estimate of the semivariogram. Thus I should not plot or calculate the semivariogram for distances beyond 5.
I fit and plot a semivariogram separately for each year (because duplicate coordinates are not allowed and it probably makes sense to treat years separately). To facilitate things I write a generic function that calculates and plots the semivariogram separately by year that then I sapply to a list of year values.
> par(mar=c(5.1,4.1,2.1,1.1))
> par(mfrow=c(2,2))
> sapply(c(1976:1978,1981),function(x) {
curdata<-newdat1[years.reduced==x,]
geodat1<-as.geodata(curdata)
variog(geodat1,uvec=seq(0,5,.25), max.dist=5, option='bin')->geodata.v1
plot(geodata.v1)
mtext(side=3,line=.5, paste(x,': Complete Pooling Model Residuals', sep=''), cex=.8)
})

Fig. 5 Semivariograms by year for complete pooling model residuals
Nest I repeat these steps for the random intercepts model.
> sapply(c(1976:1978,1981),function(x) {
curdata<-newdat2[years.reduced==x,]
geodat1<-as.geodata(curdata)
variog(geodat1,uvec=seq(0,5,.25), max.dist=5, option='bin')->geodata.v1
plot(geodata.v1)
mtext(side=3,line=.5, paste(x,': Random Intercepts Model Residuals', sep=''), cex=.8)
})

Fig. 6 Semivariograms by year for random intercepts model residuals
It's pretty clear that while there was serious spatial correlation among the residuals of the complete pooling model (as evidence by the clear increasing trend in the semivariogram plot), this has completely disappeared in the residuals of the random intercepts model. In Fig. 6 we see only random scatter and no clear trend. Thus by accounting for the temporal correlation in the data we have also managed to account for the spatial correlation.
I suggested five different approaches to this problem in the hints. I work through in detail the code for the first approach, the paired t-test, but then just present the code with minimal comments for the remaining four methods.
Read in the data and pull out the unique species names.
> cover<- read.table('http://www.unc.edu/courses/2007spring/enst/562/001/assignments/final/coverclass.txt', header=T)
> unique(cover$CHISSpCode)->species
The function below subsets the data using the name of the current species. It then carries out the paired t-test and assigns the results to an object. The p-value of the test, an estimate of the mean of the paired differences, and a confidence interval for the mean difference are all extracted from this object. I also record the sample size for each species for use in annotating the plot that we'll produce.
> rep.func.t<-function(x) {
cover[cover$CHISSpCode==x,]->name
t.test(name[,4], name[,3], paired=TRUE, conf.int=TRUE, conf.level=.95)->out
n<-length(name[,3])
c(out$estimate,out$p.value,out$conf.int,n)
}
I sapply the function to the list of species names to obtain the results.
> sapply(species,rep.func.t)->out
> temp<-t(out)
> rownames(temp)<-species
> colnames(temp)<-c('estimate','p-value','lower95','upper95','n')
> temp
estimate p-value lower95 upper95 n
ACMI 0.25000000 0.1910542981 -0.14493408 0.6449341 12
ATSE -0.76923077 0.0024154024 -1.20735056 -0.3311110 13
AVXX -0.07692308 0.7762430284 -0.65346398 0.4996178 13
BAPI 0.47826087 0.0082593612 0.13652817 0.8199936 23
BRDI 0.19230769 0.4668980374 -0.34377195 0.7283873 26
CAMA 0.15000000 0.5453609184 -0.35986558 0.6598656 20
CIOC 0.29166667 0.0160893434 0.05940813 0.5239252 24
COGI 0.61538462 0.0051616813 0.22232546 1.0084438 13
DISP 0.77777778 0.0147043609 0.17338851 1.3821670 18
GNLU 0.13333333 0.4985453188 -0.27825004 0.5449167 15
HOBR -0.05882353 0.7498058561 -0.44317570 0.3255286 17
ISME 0.17391304 0.4619848747 -0.30783817 0.6556643 23
MEIN -0.77777778 0.0007769294 -1.17981154 -0.3757440 18
MEPO 0.08333333 0.7544993470 -0.48871327 0.6553799 12
SOOL -0.41666667 0.0091516301 -0.71954808 -0.1137853 24
ASMI 0.05263158 0.7898521784 -0.35614654 0.4614097 19
ATCA 0.04761905 0.7711217886 -0.28923077 0.3844689 21
CAAE 0.38095238 0.1188353147 -0.10668380 0.8685886 21
DAPU -0.06666667 0.8177843203 -0.67570648 0.5423732 15
ERGL 0.33333333 0.2376664725 -0.24621912 0.9128858 15
LODE 0.38461538 0.1368122627 -0.14095024 0.9101810 13
LUAL 1.00000000 0.0016069697 0.47961888 1.5203811 11
MAIN -0.63157895 0.0007944877 -0.96124944 -0.3019085 19
ERGR 0.18181818 0.4405216964 -0.32254736 0.6861837 11
SIBE 0.58333333 0.0674332787 -0.04962508 1.2162917 12
I count the number of tests, extract the p-values, and identify the results by species.
> colnames(out)<-species
> out[2,]->pvalues
> numtests <-length(pvalues)
I set up the four testing criteria: unadjusted, Bonferroni, sequential Bonferroni (Holm method), and FDR.
> alpha<-.05
> alphai<-alpha/numtests
> bonf<-rep(alpha/numtests,numtests)
> holm<-alpha/seq(numtests,1,-1)
> fdr<-(1:numtests)*alpha/numtests
> sort(pvalues)->pvals
> cbind(pvals,bonf,holm,fdr)
pvals bonf holm fdr
MEIN 0.0007769294 0.002 0.002000000 0.002
MAIN 0.0007944877 0.002 0.002083333 0.004
LUAL 0.0016069697 0.002 0.002173913 0.006
ATSE 0.0024154024 0.002 0.002272727 0.008
COGI 0.0051616813 0.002 0.002380952 0.010
BAPI 0.0082593612 0.002 0.002500000 0.012
SOOL 0.0091516301 0.002 0.002631579 0.014
DISP 0.0147043609 0.002 0.002777778 0.016
CIOC 0.0160893434 0.002 0.002941176 0.018
SIBE 0.0674332787 0.002 0.003125000 0.020
CAAE 0.1188353147 0.002 0.003333333 0.022
LODE 0.1368122627 0.002 0.003571429 0.024
ACMI 0.1910542981 0.002 0.003846154 0.026
ERGL 0.2376664725 0.002 0.004166667 0.028
ERGR 0.4405216964 0.002 0.004545455 0.030
ISME 0.4619848747 0.002 0.005000000 0.032
BRDI 0.4668980374 0.002 0.005555556 0.034
GNLU 0.4985453188 0.002 0.006250000 0.036
CAMA 0.5453609184 0.002 0.007142857 0.038
HOBR 0.7498058561 0.002 0.008333333 0.040
MEPO 0.7544993470 0.002 0.010000000 0.042
ATCA 0.7711217886 0.002 0.012500000 0.044
AVXX 0.7762430284 0.002 0.016666667 0.046
ASMI 0.7898521784 0.002 0.025000000 0.048
DAPU 0.8177843203 0.002 0.050000000 0.050
#carry out testsWe could read off the results at this point or we can automate the procedure. For the sequential Bonferroni method we need to identify the first zero (nonsignificant result) and then count all tests up to that point as significant. For the FDR approach we find the last significant result and count all tests up until that point as significant. We can automate this as follows. The vector sig.holm below records the species that yield statistically significant results according to the sequential Bonferroni and the vector sig.fdr does the same for the FDR.
#sequential Bonferroni
> sig.holm<-vector(mode='numeric',length=n)
> if(!any(holm.t)) sig.holm[1:n]<-0 else
if(!any(holm.t==0)) sig.holm[1:n]<-1 else {
min((1:25)[holm.t==0])->min.holm
sig.holm[1:(min.holm-1)]<-1
sig.holm[min.holm:25]<-0 }
#false discovery rate
> sig.fdr<-vector(mode='numeric',length=25)
> if(!any(fdr.t)) sig.fdr[1:n]<-0 else
if(!any(fdr.t==1)) sig.fdr[1:n]<-1 else {
max((1:25)[fdr.t==1])->max.fdr
sig.fdr[(max.fdr+1):25]<-0
sig.fdr[1:max.fdr]<-1 }
Finally I get a list of species that yield significant results according to each of the criteria.
#unadjusted
> names(pvals)[unadjusted]
[1] "MEIN" "MAIN" "LUAL" "ATSE" "COGI" "BAPI" "SOOL" "DISP" "CIOC"
#Bonferroni
> names(pvals)[b.test]
[1] "MEIN" "MAIN" "LUAL"
#sequential Bonferroni
> names(pvals)[sig.holm==1]
[1] "MEIN" "MAIN" "LUAL"
#FDR results
> names(pvals)[sig.fdr==1]
[1] "MEIN" "MAIN" "LUAL" "ATSE" "COGI" "BAPI" "SOOL" "DISP" "CIOC"
We can described the differences between these three approaches as follows.
I produce an error bar graph to display the results. The error bars represent 95% confidence intervals for the mean difference in cover over time. I enhance the requested graph slightly by listing the sample sizes on the right margin and color coding the results of the significance tests. Red confidence intervals correspond to results deemed significant using the false discovery rate protocol while blue sample sizes followed by an asterisk denote results deemed significant using the sequential Bonferroni protocol.
#collect results of significance testing and place in alphabetical order
> newdat<-data.frame(rownames(results),sig.fdr,sig.holm)
> newdat<-newdat[order(newdat[,1]),]
#collect confidence interval results and add to significance test results
> out1<-cbind(t(out),1:25)
> out2<-out1[order(rownames(out1)),]
> out3<-cbind(out2,newdat[,2],newdat[,3])
> out3[,6]<-1:25
#generate graph
> par(mar=c(5.1,5.1,2.1,5.1))
> plot(range(out3[,3:4]),c(.75,25.25), xlab='Mean Change in Cover', type='n', axes=F, ylab='')
> axis(1)
> axis(2,at=out3[,6],labels=rownames(out3),cex.axis=.75,las=2)
> arrows(out3[,3],out3[,6],out3[,4],out3[,6],code=3, angle=90, length=.05,col=(out3[,7]+1),lwd=2)
> points(out3[,1],out3[,6],pch=16,cex=1.2,col='white')
> points(out3[,1],out3[,6],pch=1,cex=1.1,col=(out3[,7]+1))
> mytext<-ifelse(out3[,8]==1,paste('n = ',out3[,5],'*'),paste('n = ',out3[,5]))
> mycol<-ifelse(out3[,8]==1,4,1)
> mtext(side=4,line=.5,at=1:25,text=mytext,col=mycol,cex=.75,las=2)
> abline(v=0,lty=2,col='seagreen')
> box()
> mtext(side=3,line=.5,'Results based on paired t-test')
Pair t-test Multiple Comparison Results
species showing significant differences in cover values
Unadjusted |
Bonferroni |
Sequential Bonferroni |
False Discovery Rate (FDR) |
"MEIN" "MAIN" "LUAL" "ATSE" "COGI" "BAPI" "SOOL" "DISP" "CIOC" |
"MEIN" "MAIN" "LUAL" |
"MEIN" "MAIN" "LUAL" |
"MEIN" "MAIN" "LUAL" "ATSE" "COGI" "BAPI" "SOOL" "DISP" "CIOC" |
![]() |
| Fig. 7 Confidence intervals for species-specific paired t-tests. Confidence intervals in red are significant with FDR = 0.05 (9 species). Results with a blue * (shown in sample sizes on right side) are significant using a sequential Bonferroni adjustment with an experimentwise error of α = .05 (3 species) |
The wilcox.exact test is found in the exactRankTests package. Its syntax is the same as that of the t.test function. As noted in the hints we need to correct the reported confidence interval for COGI. The rest of the code is the same as for the paired t-test. I summarize the significance test results at the end.
> cover<-read.table('http://www.unc.edu/courses/2007spring/enst/562/001/assignments/final/coverclass.txt', header=T)
> unique(cover$CHISSpCode)->species
> library(exactRankTests)
> rep.func<-function(x) {
cover[cover$CHISSpCode==x,]->name
wilcox.exact(name[,4], name[,3], paired=TRUE, conf.int=TRUE, conf.level=.95)->out
n<-length(name[,3])
c(out$estimate,out$p.value,out$conf.int,n)
}
> sapply(species,rep.func)->out
Warning message:
no finite arguments to max; returning -Inf
> colnames(out)<-species
> out[2,]->pvalues
> numtests<-length(pvalues)
> sort(pvalues)->pvals
> unadjusted<- pvals<.05
> alphai<-0.05/numtests
> bonf<-rep(.05/numtests,numtests)
> holm<-.05/seq(numtests,1,-1)
> fdr<-(1:numtests)*.05/numtests
> #carry out tests
> b.test<-(pvals<bonf)
> holm.t<-(pvals<holm)
> fdr.t<-(pvals<fdr)
> cbind(pvals,b.test,holm.t,fdr.t)->results
> results
pvals b.test holm.t fdr.t
MEIN 0.002685547 0 0 0
MAIN 0.003173828 0 0 1
LUAL 0.007812500 0 0 0
ATSE 0.010742188 0 0 0
COGI 0.015625000 0 0 0
BAPI 0.019210815 0 0 0
SOOL 0.019531250 0 0 0
DISP 0.025634766 0 0 0
CIOC 0.039062500 0 0 0
SIBE 0.113281250 0 0 0
CAAE 0.164550781 0 0 0
LODE 0.234375000 0 0 0
ERGL 0.269531250 0 0 0
ACMI 0.375000000 0 0 0
BRDI 0.455734253 0 0 0
ISME 0.478607178 0 0 0
CAMA 0.621337891 0 0 0
ERGR 0.687500000 0 0 0
GNLU 0.726562500 0 0 0
AVXX 1.000000000 0 0 0
HOBR 1.000000000 0 0 0
MEPO 1.000000000 0 0 0
ASMI 1.000000000 0 0 0
ATCA 1.000000000 0 0 0
DAPU 1.000000000 0 0 0
> sig.holm<-vector(mode='numeric',length=n)
> if(!any(holm.t)) sig.holm[1:n]<-0 else
if(!any(holm.t==0)) sig.holm[1:n]<-1 else {
min((1:25)[holm.t==0])->min.holm
sig.holm[1:(min.holm-1)]<-1
sig.holm[min.holm:25]<-0 }
> sig.fdr<-vector(mode='numeric',length=25)
> if(!any(fdr.t)) sig.fdr[1:n]<-0 else
if(!any(fdr.t==1)) sig.fdr[1:n]<-1 else {
max((1:25)[fdr.t==1])->max.fdr
sig.fdr[(max.fdr+1):25]<-0
sig.fdr[1:max.fdr]<-1 }
#unadjusted
> names(pvals)[unadjusted]
[1] "MEIN" "MAIN" "LUAL" "ATSE" "COGI" "BAPI" "SOOL" "DISP" "CIOC"
#Bonferroni
> names(pvals)[b.test]
character(0)
#sequential Bonferroni
> names(pvals)[sig.holm==1]
character(0)
#FDR results
> names(pvals)[sig.fdr==1]
[1] "MEIN" "MAIN"
> newdat<-data.frame(rownames(results),sig.fdr,sig.holm)
> newdat<-newdat[order(newdat[,1]),]
> out1<-cbind(t(out),1:25)
#fix COGI
> cover[cover$CHISSpCode=="COGI",]->cogi
> cogi.diff<-cogi[,4]-cogi[,3]
#Walsh averages
> Wmat<-outer(cogi.diff,cogi.diff,'+')/2
> Walsh<-c(Wmat[lower.tri(Wmat)],diag(Wmat))
> n.cogi<-dim(cogi)[1]
> q1<-n.cogi*(n.cogi+1)/2+1-qsignrank(.975,n.cogi)
> q2<-qsignrank(.975,n.cogi)
> Walsh.sort<-sort(Walsh)
> cogi.median<-median(Walsh)
> cogi.lower95<-Walsh.sort[q1]
> cogi.upper95<-Walsh.sort[q2]
> out1["COGI",1]<-cogi.median
> out1["COGI",3]<-cogi.lower95
> out1["COGI",4]<-cogi.upper95
> out1
(pseudo)median
ACMI 0.50 0.375000000 -1.0 1.0 12 1
ATSE -1.25 0.010742188 -1.5 -1.0 13 2
AVXX -0.25 1.000000000 -1.5 1.0 13 3
BAPI 0.50 0.019210815 0.0 1.0 23 4
BRDI 0.25 0.455734253 -1.0 1.0 26 5
CAMA 0.25 0.621337891 -1.0 1.0 20 6
CIOC 0.50 0.039062500 0.0 1.0 24 7
COGI 0.50 0.015625000 0.0 1.0 13 8
DISP 0.75 0.025634766 0.0 2.0 18 9
GNLU 0.50 0.726562500 -1.0 1.0 15 10
HOBR -0.50 1.000000000 -1.0 1.0 17 11
ISME 0.25 0.478607178 -0.5 1.0 23 12
MEIN -1.25 0.002685547 -2.0 -1.0 18 13
MEPO 0.50 1.000000000 -1.5 1.0 12 14
SOOL -1.25 0.019531250 -1.5 0.0 24 15
ASMI 0.50 1.000000000 -1.0 1.0 19 16
ATCA 0.50 1.000000000 -1.5 1.0 21 17
CAAE 0.75 0.164550781 -0.5 1.5 21 18
DAPU -0.25 1.000000000 -1.5 1.0 15 19
ERGL 0.25 0.269531250 -1.0 1.0 15 20
LODE 0.75 0.234375000 -1.0 1.5 13 21
LUAL 1.25 0.007812500 1.0 2.0 11 22
MAIN -1.25 0.003173828 -1.5 -1.0 19 23
ERGR 0.50 0.687500000 -1.0 1.0 11 24
SIBE 0.75 0.113281250 -1.0 1.5 12 25
> newdat<-data.frame(rownames(results),sig.fdr,sig.holm)
> newdat<-newdat[order(newdat[,1]),]
> out2<-out1[order(rownames(out1)),]
> out3<-cbind(out2,newdat[,2],newdat[,3])
> out3[,6]<-1:25
#generate the graph
> par(mar=c(5.1,5.1,2.1,5.1))
> plot(range(out3[,3:4]), c(.75,25.25), xlab='Median Change in Cover', type='n', axes=F, ylab='')
> axis(1)
> axis(2,at=out3[,6], labels=rownames(out3), cex.axis=.75, las=2)
> arrows(out3[,3],out3[,6],out3[,4],out3[,6],code=3, angle=90, length=.05, col=(out3[,7]+1), lwd=2)
> points(out3[,1],out3[,6],pch=16,cex=1.2,col='white')
> points(out3[,1],out3[,6],pch=1,cex=1.1,col=(out3[,7]+1))
> mytext<-ifelse(out3[,8]==1, paste('n = ', out3[,5],'*'), paste('n = ', out3[,5]))
> mycol<-ifelse(out3[,8]==1,4,1)
> mtext(side=4,line=.5,at=1:25, text=mytext, col=mycol, cex=.75, las=2)
> abline(v=0,lty=2,col='seagreen')
> box()
> mtext(side=3, line=.5, 'Results based on Wilcoxon signed rank test')
Wilcoxon Signed Rank Test Multiple Comparison Results
(species showing significant differences in cover values)
Unadjusted |
Bonferroni |
Sequential Bonferroni |
False Discovery Rate (FDR) |
| "MEIN" "MAIN" "LUAL" "ATSE" "COGI" "BAPI" "SOOL" "DISP" "CIOC" | None |
None |
"MEIN" "MAIN" |
![]() |
| Fig. 8 Confidence intervals for species-specific Wilcoxon signed rank tests. Confidence intervals in red are significant with FDR = 0.05 (2 species). Results with a blue * (shown in sample sizes on right side) are significant with experimentwise error rate of α = .05 using a sequential Bonferroni adjustment (0 species) |
The perm.test function is found in the exactRankTests package. Its syntax is the same as the t.test function. Due to the programming complexity of finding permutation-based confidence intervals I report basic bootstrap confidence intervals for the difference instead. I summarize the significance test results at the end.
> cover<-read.table('http://www.unc.edu/courses/2007spring/enst/562/001/assignments/final/coverclass.txt', header=T)
> unique(cover$CHISSpCode)->species
> library(exactRankTests)
> library(boot)
> rep.func.boot<-function(x) {
cover[cover$CHISSpCode==x,]->name
perm.test(name[,4], name[,3], paired=TRUE)->out
n<-length(name[,3])
pval<-out$p.value
actdiff<-(name[,4]-name[,3])
est<-mean(actdiff)
mean.func<-function(x,indices) mean(x[indices])
boot(actdiff,mean.func,9999)->myboot
boot.ci(myboot,type='basic')->mybootci
c(est,pval,mybootci$basic[4:5],n)
}
> sapply(species,rep.func.boot)->out
> colnames(out)<-species
> out[2,]->pvalues
> numtests<-length(pvalues)
> unadjusted<-(pvals<.05)
> sort(pvalues)->pvals
> alphai<-0.05/numtests
> bonf<-rep(.05/numtests,numtests)
> holm<-.05/seq(numtests,1,-1)
> fdr<-(1:numtests)*.05/numtests
#carry out tests
> b.test<-(pvals<bonf)
> holm.t<-(pvals<holm)
> fdr.t<-(pvals<fdr)
> cbind(pvals,b.test,holm.t,fdr.t)->results
> results
pvals b.test holm.t fdr.t
MEIN 0.002685547 0 0 0
MAIN 0.003173828 0 0 1
LUAL 0.007812500 0 0 0
ATSE 0.010742188 0 0 0
COGI 0.015625000 0 0 0
BAPI 0.019210815 0 0 0
SOOL 0.019531250 0 0 0
DISP 0.025634766 0 0 0
CIOC 0.039062500 0 0 0
SIBE 0.113281250 0 0 0
CAAE 0.164550781 0 0 0
LODE 0.234375000 0 0 0
ERGL 0.269531250 0 0 0
ACMI 0.375000000 0 0 0
BRDI 0.455734253 0 0 0
ISME 0.478607178 0 0 0
CAMA 0.621337891 0 0 0
ERGR 0.687500000 0 0 0
GNLU 0.726562500 0 0 0
AVXX 1.000000000 0 0 0
HOBR 1.000000000 0 0 0
MEPO 1.000000000 0 0 0
ASMI 1.000000000 0 0 0
ATCA 1.000000000 0 0 0
DAPU 1.000000000 0 0 0
> sig.holm<-vector(mode='numeric',length=n)
> if(!any(holm.t)) sig.holm[1:n]<-0 else
if(!any(holm.t==0)) sig.holm[1:n]<-1 else {
min((1:25)[holm.t==0])->min.holm
sig.holm[1:(min.holm-1)]<-1
sig.holm[min.holm:25]<-0 }
> sig.fdr<-vector(mode='numeric',length=25)
> if(!any(fdr.t)) sig.fdr[1:n]<-0 else
if(!any(fdr.t==1)) sig.fdr[1:n]<-1 else {
max((1:25)[fdr.t==1])->max.fdr
sig.fdr[(max.fdr+1):25]<-0
sig.fdr[1:max.fdr]<-1 }
#unadjusted
> names(pvals)[unadjusted]
[1] "MEIN" "MAIN" "LUAL" "ATSE" "COGI" "BAPI" "SOOL" "DISP" "CIOC"
#Bonferroni
> names(pvals)[b.test]
character(0)
#sequential Bonferroni
> names(pvals)[sig.holm==1]
character(0)
#FDR results
> names(pvals)[sig.fdr==1]
[1] "MEIN" "MAIN"
> newdat<-data.frame(rownames(results),sig.fdr,sig.holm)
> newdat<-newdat[order(newdat[,1]),]
> out1<-cbind(t(out),1:25)
> out2<-out1[order(rownames(out1)),]
> out3<-cbind(out2,newdat[,2],newdat[,3])
> out3[,6]<-1:25
#generate the graph
> par(mar=c(5.1,5.1,2.1,5.1))
> plot(range(out3[,3:4]), c(.75,25.25), xlab='Mean Change in Cover', type='n', axes=F, ylab='')
> axis(1)
> axis(2,at=out3[,6], labels=rownames(out3), cex.axis=.75, las=2)
> arrows(out3[,3], out3[,6], out3[,4], out3[,6], code=3, angle=90, length=.05,col=(out3[,7]+1),lwd=2)
> points(out3[,1], out3[,6], pch=16, cex=1.2, col='white')
> points(out3[,1], out3[,6], pch=1, cex=1.1, col=(out3[,7]+1))
> mytext<-ifelse(out3[,8]==1, paste('n = ',out3[,5],'*'), paste('n = ', out3[,5]))
> mycol<-ifelse(out3[,8]==1,4,1)
> mtext(side=4, line=.5, at=1:25, text=mytext, col=mycol, cex=.75, las=2)
> abline(v=0, lty=2, col='seagreen')
> box()
> mtext(side=3, line=.5, 'Results based on permutation test and basic bootstrap interval')
Permutation Test Multiple Comparison Results
(species showing significant differences in cover values)
Unadjusted |
Bonferroni |
Sequential Bonferroni |
False Discovery Rate (FDR) |
| "MEIN" "MAIN" "LUAL" "ATSE" "COGI" "BAPI" "SOOL" "DISP" "CIOC" | None |
None |
"MEIN" "MAIN" |
![]() |
| Fig. 9 Confidence intervals for species-specific permutation tests on the differences. Bootstrap confidence intervals in red correspond to significant permutation tests with FDR = 0.05 (9 species). Results with a blue * (shown in sample sizes on right side) are significant using a sequential Bonferroni adjustment with an experimentwise error rate α = .05 (3 species) |
The randomized block design approach yields results identical to the paired t-test. The only change from that code are the details in locating the statistics needed from the output of the lm function. In addition the confidence intervals are constructed by hand. I summarize the significance test results at the end
################# randomized block design model ###################
> cover<-read.table('http://www.unc.edu/courses/2007spring/enst/562/001/assignments/final/coverclass.txt', header=T)
> unique(cover$CHISSpCode)->species
> rep.func.lm<-function(x) {
cover[cover$CHISSpCode==x,]->acmi
new.acmi<-data.frame(rep(acmi$PlotCode,2), c(acmi$Coverclass.1983, acmi$Coverclass.2002), rep(c(1983,2002), c(dim(acmi)[1], dim(acmi)[1])))
colnames(new.acmi)<-c('Plot', 'Cover', 'Year')
lm(Cover~factor(Year)+ factor(Plot), data=new.acmi)->out5
summary(out5)->out6
p.val<-out6$coefficients[2,4]
est<-out6$coefficients[2,1]
df<-out5$df.residual
se<-out6$coefficients[2,2]
low95<-est+qt(.025,df)*se
hi95<-est+qt(.975,df)*se
n<-dim(acmi)[1]
c(est,p.val,low95,hi95,n)
}
> sapply(species,rep.func.lm)->out
> colnames(out)<-species
> out[2,]->pvalues
> numtests<-length(pvalues)
> sort(pvalues)->pvals
> unadjusted<-(pvals<.05)
> alphai<-0.05/numtests
> bonf<-rep(.05/numtests,numtests)
> holm<-.05/seq(numtests,1,-1)
> fdr<-(1:numtests)*.05/numtests
#carry out tests
> b.test<-(pvals<bonf)
> holm.t<-(pvals<holm)
> fdr.t<-(pvals<fdr)
> cbind(pvals,b.test,holm.t,fdr.t)->results
> cbind(pvals,unadjusted,b.test,holm.t,fdr.t)->results
pvals unadjusted b.test holm.t fdr.t
MEIN 0.0007769294 1 1 1 1
MAIN 0.0007944877 1 1 1 1
LUAL 0.0016069697 1 1 1 1
ATSE 0.0024154024 1 0 0 1
COGI 0.0051616813 1 0 0 1
BAPI 0.0082593612 1 0 0 1
SOOL 0.0091516301 1 0 0 1
DISP 0.0147043609 1 0 0 1
CIOC 0.0160893434 1 0 0 1
SIBE 0.0674332787 0 0 0 0
CAAE 0.1188353147 0 0 0 0
LODE 0.1368122627 0 0 0 0
ACMI 0.1910542981 0 0 0 0
ERGL 0.2376664725 0 0 0 0
ERGR 0.4405216964 0 0 0 0
ISME 0.4619848747 0 0 0 0
BRDI 0.4668980374 0 0 0 0
GNLU 0.4985453188 0 0 0 0
CAMA 0.5453609184 0 0 0 0
HOBR 0.7498058561 0 0 0 0
MEPO 0.7544993470 0 0 0 0
ATCA 0.7711217886 0 0 0 0
AVXX 0.7762430284 0 0 0 0
ASMI 0.7898521784 0 0 0 0
DAPU 0.8177843203 0 0 0 0
> sig.holm<-vector(mode='numeric',length=n)
> if(!any(holm.t)) sig.holm[1:n]<-0 else
if(!any(holm.t==0)) sig.holm[1:n]<-1 else {
min((1:25)[holm.t==0])->min.holm
sig.holm[1:(min.holm-1)]<-1
sig.holm[min.holm:25]<-0 }
> sig.fdr<-vector(mode='numeric',length=25)
> if(!any(fdr.t)) sig.fdr[1:n]<-0 else
if(!any(fdr.t==1)) sig.fdr[1:n]<-1 else {
max((1:25)[fdr.t==1])->max.fdr
sig.fdr[(max.fdr+1):25]<-0
sig.fdr[1:max.fdr]<-1 }
#unadjusted
> names(pvals)[unadjusted]
[1] "MEIN" "MAIN" "LUAL" "ATSE" "COGI" "BAPI" "SOOL" "DISP" "CIOC"
#Bonferroni
> names(pvals)[b.test]
[1] "MEIN" "MAIN" "LUAL"
#sequential Bonferroni
> names(pvals)[sig.holm==1]
[1] "MEIN" "MAIN" "LUAL"
#FDR results
> names(pvals)[sig.fdr==1]
[1] "MEIN" "MAIN" "LUAL" "ATSE" "COGI" "BAPI" "SOOL" "DISP" "CIOC"
> newdat<-data.frame(rownames(results),sig.fdr,sig.holm)
> newdat<-newdat[order(newdat[,1]),]
> out1<-cbind(t(out),1:25)
> out2<-out1[order(rownames(out1)),]
> out3<-cbind(out2,newdat[,2],newdat[,3])
> out3[,6]<-1:25
#generate the graph
> par(mar=c(5.1,5.1,2.1,5.1))
> plot(range(out3[,3:4]), c(.75,25.25), xlab='Mean Change in Cover', type='n', axes=F, ylab='')
> axis(1)
> axis(2,at=out3[,6], labels=rownames(out3), cex.axis=.75, las=2)
> arrows(out3[,3], out3[,6], out3[,4], out3[,6], code=3, angle=90, length=.05, col=(out3[,7]+1), lwd=2)
> points(out3[,1], out3[,6], pch=16, cex=1.2, col='white')
> points(out3[,1], out3[,6], pch=1, cex=1.1, col=(out3[,7]+1))
> mytext<-ifelse(out3[,8]==1, paste('n = ', out3[,5],'*'), paste('n = ', out3[,5]))
> mycol<-ifelse(out3[,8]==1,4,1)
> mtext(side=4,line=.5, at=1:25, text=mytext, col=mycol, cex=.75, las=2)
> abline(v=0,lty=2, col='seagreen')
> box()
> mtext(side=3, line=.5, 'Results based on randomized block design')
Randomized Block Design Multiple Comparison Results
(species showing significant differences in cover values)
Unadjusted |
Bonferroni |
Sequential Bonferroni |
False Discovery Rate (FDR) |
"MEIN" "MAIN" "LUAL" "ATSE" "COGI" "BAPI" "SOOL" "DISP" "CIOC" |
"MEIN" "MAIN" "LUAL" |
"MEIN" "MAIN" "LUAL" |
"MEIN" "MAIN" "LUAL" "ATSE" "COGI" "BAPI" "SOOL" "DISP" "CIOC" |
![]() |
| Fig. 10 Confidence intervals for species-specific t-tests based on a randomized block design. Confidence intervals in red are significant with FDR = 0.05 (9 species). Results with a blue * (shown in sample sizes on right side) are significant using a sequential Bonferroni adjustment with an experimentwise error rate α = .05 (3 species) |
The mixed effects model treats the blocks of the randomized block design as random effects. The only change from that code is in locating the statistics needed from the output of the lme function. The confidence intervals are constructed by hand. I summarize the significance test results at the end.
################# mixed model ###################
> cover<-read.table('http://www.unc.edu/courses/2007spring/enst/562/001/assignments/final/coverclass.txt', header=T)
> unique(cover$CHISSpCode)->species
> library(nlme)
> rep.func.lme<-function(x) {
cover[cover$CHISSpCode==x,]->acmi
new.acmi<-data.frame(rep(acmi$PlotCode,2), c(acmi$Coverclass.1983, acmi$Coverclass.2002), rep(c(1983,2002), c(dim(acmi)[1], dim(acmi)[1])))
colnames(new.acmi)<-c('Plot', 'Cover', 'Year')
lme(Cover~factor(Year),random=~1|Plot, data=new.acmi, method='ML')->out5
summary(out5)->out6
p.val<-out6$tTable[2,5]
est<-out6$tTable[2,1]
df<-out6$tTable[2,3]
se<-out6$tTable[2,2]
low95<-est+qt(.025,df)*se
hi95<-est+qt(.975,df)*se
n<-dim(acmi)[1]
c(est,p.val,low95,hi95,n)
}
> sapply(species,rep.func.lme)->out
> colnames(out)<-species
> out[2,]->pvalues
> numtests<-length(pvalues)
> sort(pvalues)->pvals
> unadjusted<-(pvals<.05)
> alphai<-0.05/numtests
> bonf<-rep(.05/numtests,numtests)
> holm<-.05/seq(numtests,1,-1)
> fdr<-(1:numtests)*.05/numtests
> #carry out tests
> b.test<-(pvals<bonf)
> holm.t<-(pvals<holm)
> fdr.t<-(pvals<fdr)
> cbind(pvals,b.test,holm.t,fdr.t)->results
> results
pvals b.test holm.t fdr.t
MEIN 0.0005888852 1 1 1
MAIN 0.0007944875 1 1 1
LUAL 0.0016069638 1 1 1
ATSE 0.0024154023 0 0 1
COGI 0.0051616859 0 0 1
BAPI 0.0082593609 0 0 1
SOOL 0.0085271585 0 0 1
CIOC 0.0118833699 0 0 1
DISP 0.0147043609 0 0 1
SIBE 0.0429602640 0 0 0
CAAE 0.1188352972 0 0 0
LODE 0.1368122627 0 0 0
ACMI 0.1573816360 0 0 0
ERGL 0.1872034914 0 0 0
GNLU 0.4331929779 0 0 0
ERGR 0.4405216848 0 0 0
ISME 0.4619854920 0 0 0
BRDI 0.4668980315 0 0 0
CAMA 0.5453609093 0 0 0
MEPO 0.7161281594 0 0 0
HOBR 0.7498058540 0 0 0
DAPU 0.7564951971 0 0 0
ATCA 0.7711217816 0 0 0
AVXX 0.7762430425 0 0 0
ASMI 0.7898523960 0 0 0
> sig.holm<-vector(mode='numeric',length=n)
> if(!any(holm.t)) sig.holm[1:n]<-0 else
if(!any(holm.t==0)) sig.holm[1:n]<-1 else {
min((1:25)[holm.t==0])->min.holm
sig.holm[1:(min.holm-1)]<-1
sig.holm[min.holm:25]<-0 }
> sig.fdr<-vector(mode='numeric',length=25)
> if(!any(fdr.t)) sig.fdr[1:n]<-0 else
if(!any(fdr.t==1)) sig.fdr[1:n]<-1 else {
max((1:25)[fdr.t==1])->max.fdr
sig.fdr[(max.fdr+1):25]<-0
sig.fdr[1:max.fdr]<-1 }
#unadjusted
> names(pvals)[unadjusted]
[1] "MEIN" "MAIN" "LUAL" "ATSE" "COGI" "BAPI" "SOOL" "CIOC" "DISP" "SIBE"
#Bonferroni
> names(pvals)[b.test]
[1] "MEIN" "MAIN" "LUAL"
#sequential Bonferroni
> names(pvals)[sig.holm==1]
[1] "MEIN" "MAIN" "LUAL"
#FDR results
> names(pvals)[sig.fdr==1]
[1] "MEIN" "MAIN" "LUAL" "ATSE" "COGI" "BAPI" "SOOL" "CIOC" "DISP"
> newdat<-data.frame(rownames(results),sig.fdr,sig.holm)
> newdat<-newdat[order(newdat[,1]),]
> out1<-cbind(t(out),1:25)
> out2<-out1[order(rownames(out1)),]
> out3<-cbind(out2,newdat[,2],newdat[,3])
> out3[,6]<-1:25
#create graph
> par(mar=c(5.1,5.1,2.1,5.1))
> plot(range(out3[,3:4]), c(.75,25.25), xlab='Mean Change in Cover' ,type='n', axes=F, ylab='')
> axis(1)
> axis(2,at=out3[,6], labels=rownames(out3), cex.axis=.75, las=2)
> arrows(out3[,3], out3[,6], out3[,4], out3[,6], code=3, angle=90, length=.05, col=(out3[,7]+1), lwd=2)
> points(out3[,1], out3[,6], pch=16, cex=1.2, col='white')
> points(out3[,1], out3[,6], pch=1, cex=1.1, col=(out3[,7]+1))
> mytext<-ifelse(out3[,8]==1, paste('n = ', out3[,5],'*'), paste('n = ', out3[,5]))
> mycol<-ifelse(out3[,8]==1,4,1)
> mtext(side=4, line=.5, at=1:25, text=mytext, col=mycol, cex=.75, las=2)
> abline(v=0,lty=2, col='seagreen')
> box()
> mtext(side=3,line=.5, 'Results based on random intercepts model')
Random Effects Model Multiple Comparison Results
(species showing significant differences in cover values)
Unadjusted |
Bonferroni |
Sequential Bonferroni |
False Discovery Rate (FDR) |
"MEIN" "MAIN" "LUAL" "ATSE" "COGI" "BAPI" "SOOL" "CIOC" "DISP" "SIBE" |
"MEIN" "MAIN" "LUAL" |
"MEIN" "MAIN" "LUAL" |
"MEIN" "MAIN" "LUAL" "ATSE" "COGI" "BAPI" "SOOL" "DISP" "CIOC" |
![]() |
| Fig. 11 Confidence intervals for species-specific t-tests from a random effects model with random intercepts for each relevé. Confidence intervals in red are significant with FDR = 0.05 (9 species). Results with a blue * (shown in sample sizes on right side) are significant using a sequential Bonferroni adjustment with an experimentwise error rate of α = .05 (3 species) |
I read in the data and merge files just as was done in Assignment 10.
#load data
> srrs2 <- read.table ("http://www.unc.edu/courses/2007spring/enst/562/001/assignments/assign10/radon.txt", header=T, sep=",")
#get Minnesota
> minn<-srrs2[srrs2$state=="MN",]
#abbreviate county code
> minn$countycode<-substr(as.character(minn$county),1,6)
#get county data
> cty <- read.table ("http://www.unc.edu/courses/2007spring/enst/562/001/assignments/assign10/cty.txt", header=T, sep=",")
#select Minnesota
> minn2<-cty[cty$st=='MN',]
#combine data sets
> merge(minn,minn2,by.x="cntyfips",by.y="ctfips")->merge3
Next I set up the numeric variables for use in WinBUGS.
#log transform activity
> y<-log(ifelse (merge3$activity ==0, .05, merge3$activity))
> J<-length(unique(merge3$county))
> county<-as.numeric(factor(merge3$county))
> x<-merge3$floor
> n<-length(y)
#log transform and select uranium values one per county
> z<-as.vector(tapply(log(merge3$Uppm), factor(merge3$county), function(x) x[1]))
I create the input objects to WinBUGS.
> radon.data<-list("n","J","y","county","x","z")
> radon.inits<-function(){list(a=rnorm(J),b=rnorm(1),a1=rnorm(1),
> sigma.y=runif(1), mu.a=rnorm(1),sigma.a=runif(1))}
> radon.parameters<-c("a","b","sigma.y","mu.a","sigma.a","a1")
The following WinBUGS model file is stored in D:/enst 562/final/radonmodel.txt. This is only a trivial modification of the random intercepts WinBUGS model shown in the notes.
model{
for(i in 1:n) {
y[i]~dnorm(y.hat[i],tau.y)
y.hat[i]<-a[county[i]]+b*x[i]
}
b~dnorm(0,.0001)
tau.y<-pow(sigma.y,-2)
sigma.y~dunif(0,100)
for(j in 1:J){
a[j]~dnorm(a.hat[j],tau.a)
a.hat[j]<-mu.a+a1*z[j]
}
mu.a~dnorm(0,.0001)
a1~dnorm(0,.0001)
tau.a<-pow(sigma.a,-2)
sigma.a~dunif(0,100)
}
I set the R working directory to the directory that contains the model description.
> setwd("D:/enst 562/final")
> library(arm)
I run the model. After a couple of preliminary runs in which the sigma.a chains were not mixing well enough, I let the model run for 1500 iterations throwing away the first 750 iterates. The Rhat statistic is reported to be 1 for all parameters indicating adequate mixing.
> radon.results<-bugs(radon.data, radon.inits, radon.parameters, "radonmodel.txt",
n.chains=3, bugs.directory="D:/Program Files/WinBUGS14/", n.iter=1500, debug=T)
> radon.results
Inference for Bugs model at "radonmodel.txt", fit using winbugs,
3 chains, each with 1500 iterations (first 750 discarded), n.thin = 2
n.sims = 1125 iterations saved
mean sd 2.5% 25% 50% 75% 97.5% Rhat n.eff
a[1] 0.9 0.2 0.6 0.9 1.0 1.0 1.3 1 1100
a[2] 0.9 0.1 0.7 0.8 0.9 0.9 1.0 1 320
a[3] 1.4 0.2 1.1 1.3 1.4 1.5 1.7 1 1100
a[4] 1.1 0.2 0.9 1.0 1.1 1.2 1.5 1 1100
a[5] 1.4 0.1 1.1 1.3 1.4 1.5 1.6 1 520
a[6] 1.7 0.2 1.4 1.6 1.7 1.8 2.0 1 1100
a[7] 1.8 0.1 1.5 1.7 1.8 1.9 2.1 1 390
a[8] 1.7 0.2 1.4 1.6 1.7 1.8 2.1 1 1100
a[9] 1.1 0.1 0.9 1.1 1.2 1.2 1.4 1 870
a[10] 1.5 0.1 1.2 1.4 1.5 1.6 1.8 1 770
a[11] 1.1 0.2 0.8 1.0 1.1 1.2 1.4 1 660
a[12] 1.7 0.2 1.4 1.6 1.7 1.8 2.0 1 1100
a[13] 1.0 0.2 0.7 0.9 1.0 1.1 1.3 1 1100
a[14] 1.8 0.1 1.6 1.7 1.8 1.9 2.1 1 1100
a[15] 1.4 0.2 1.1 1.3 1.4 1.5 1.7 1 1100
a[16] 1.1 0.2 0.7 1.0 1.1 1.2 1.4 1 960
a[17] 1.6 0.2 1.3 1.5 1.6 1.7 1.9 1 340
a[18] 1.0 0.1 0.8 1.0 1.0 1.1 1.3 1 1100
a[19] 1.4 0.1 1.2 1.3 1.4 1.4 1.5 1 680
a[20] 1.7 0.2 1.4 1.6 1.7 1.8 2.0 1 1100
a[21] 1.6 0.1 1.3 1.5 1.6 1.7 1.9 1 820
a[22] 1.5 0.2 1.1 1.3 1.5 1.6 1.8 1 1100
a[23] 1.7 0.2 1.4 1.6 1.7 1.8 2.0 1 1000
a[24] 1.8 0.1 1.5 1.7 1.7 1.9 2.1 1 1100
a[25] 1.7 0.1 1.5 1.6 1.7 1.8 2.0 1 410
a[26] 1.4 0.1 1.2 1.3 1.4 1.4 1.5 1 640
a[27] 1.8 0.2 1.5 1.7 1.8 1.9 2.1 1 1100
a[28] 1.2 0.1 0.9 1.1 1.2 1.3 1.5 1 1100
a[29] 0.9 0.2 0.6 0.8 0.9 1.0 1.3 1 1100
a[30] 1.0 0.1 0.7 0.9 1.0 1.1 1.2 1 1100
a[31] 1.7 0.1 1.5 1.6 1.7 1.8 2.0 1 1100
a[32] 1.4 0.2 1.1 1.3 1.4 1.5 1.7 1 1000
a[33] 1.6 0.2 1.3 1.5 1.6 1.7 1.9 1 1100
a[34] 1.5 0.2 1.2 1.4 1.5 1.6 1.8 1 1100
a[35] 0.8 0.1 0.5 0.7 0.8 0.9 1.1 1 700
a[36] 1.8 0.2 1.5 1.7 1.8 1.9 2.2 1 660
a[37] 0.8 0.1 0.5 0.7 0.8 0.9 1.1 1 1100
a[38] 1.1 0.2 0.8 1.0 1.1 1.2 1.5 1 770
a[39] 1.6 0.1 1.3 1.5 1.6 1.7 1.9 1 1100
a[40] 1.9 0.2 1.6 1.7 1.8 2.0 2.2 1 1100
a[41] 1.8 0.1 1.5 1.7 1.8 1.9 2.1 1 640
a[42] 1.6 0.2 1.2 1.5 1.6 1.7 1.9 1 1100
a[43] 1.5 0.1 1.2 1.4 1.5 1.6 1.8 1 790
a[44] 1.5 0.2 1.1 1.4 1.5 1.6 1.7 1 1100
a[45] 1.4 0.1 1.2 1.4 1.5 1.5 1.7 1 460
a[46] 1.4 0.1 1.1 1.4 1.4 1.5 1.7 1 730
a[47] 1.3 0.2 0.9 1.2 1.3 1.4 1.6 1 1100
a[48] 1.3 0.1 1.1 1.2 1.3 1.4 1.6 1 1100
a[49] 1.7 0.1 1.4 1.6 1.7 1.8 1.9 1 1100
a[50] 1.8 0.2 1.5 1.7 1.8 1.9 2.2 1 1100
a[51] 1.7 0.2 1.4 1.6 1.7 1.8 2.1 1 1100
a[52] 1.8 0.2 1.4 1.7 1.8 1.9 2.1 1 1100
a[53] 1.6 0.2 1.3 1.5 1.6 1.7 1.9 1 310
a[54] 1.5 0.1 1.2 1.4 1.5 1.6 1.7 1 610
a[55] 1.4 0.1 1.1 1.3 1.4 1.5 1.7 1 1100
a[56] 1.4 0.2 1.0 1.3 1.4 1.5 1.7 1 620
a[57] 1.2 0.2 0.9 1.1 1.2 1.3 1.5 1 1100
a[58] 1.8 0.2 1.5 1.7 1.8 1.9 2.1 1 330
a[59] 1.7 0.2 1.3 1.6 1.7 1.8 2.0 1 280
a[60] 1.6 0.2 1.3 1.5 1.6 1.7 1.9 1 1100
a[61] 1.2 0.1 0.9 1.1 1.2 1.2 1.4 1 250
a[62] 1.8 0.2 1.5 1.7 1.8 1.9 2.1 1 880
a[63] 1.7 0.2 1.4 1.6 1.7 1.8 2.1 1 1100
a[64] 1.7 0.1 1.4 1.6 1.7 1.8 2.0 1 1100
a[65] 1.8 0.2 1.4 1.7 1.8 1.9 2.1 1 1100
a[66] 1.4 0.1 1.2 1.3 1.4 1.5 1.7 1 1100
a[67] 1.6 0.1 1.4 1.5 1.6 1.7 1.9 1 1100
a[68] 1.0 0.2 0.7 0.9 1.0 1.1 1.3 1 1100
a[69] 1.6 0.2 1.2 1.5 1.6 1.7 1.9 1 1100
a[70] 0.9 0.1 0.8 0.9 0.9 1.0 1.0 1 910
a[71] 1.5 0.1 1.3 1.4 1.5 1.6 1.7 1 1100
a[72] 1.6 0.1 1.4 1.6 1.6 1.7 1.9 1 1100
a[73] 1.8 0.2 1.5 1.7 1.8 1.9 2.1 1 1100
a[74] 1.6 0.2 1.2 1.5 1.6 1.7 1.9 1 1100
a[75] 1.5 0.2 1.2 1.4 1.5 1.6 1.8 1 330
a[76] 1.8 0.2 1.5 1.7 1.8 1.9 2.2 1 1100
a[77] 1.6 0.1 1.4 1.5 1.6 1.7 1.9 1 740
a[78] 1.0 0.2 0.7 0.9 1.0 1.1 1.4 1 1000
a[79] 1.5 0.2 1.1 1.4 1.5 1.6 1.8 1 1100
a[80] 1.3 0.1 1.2 1.3 1.3 1.4 1.5 1 1100
a[81] 1.7 0.2 1.4 1.6 1.7 1.8 2.1 1 1100
a[82] 1.7 0.2 1.3 1.6 1.7 1.8 2.0 1 1100
a[83] 1.7 0.1 1.4 1.6 1.7 1.8 2.0 1 570
a[84] 1.5 0.1 1.2 1.4 1.5 1.6 1.8 1 1000
a[85] 1.7 0.2 1.3 1.6 1.7 1.8 2.0 1 1100
b -0.7 0.1 -0.8 -0.7 -0.7 -0.6 -0.5 1 430
sigma.y 0.8 0.0 0.7 0.8 0.8 0.8 0.8 1 840
mu.a 1.5 0.0 1.4 1.4 1.5 1.5 1.5 1 930
sigma.a 0.2 0.0 0.1 0.1 0.2 0.2 0.3 1 1100
a1 0.7 0.1 0.5 0.7 0.7 0.8 0.9 1 1100
deviance 2127.3 11.8 2104.1 2119.0 2127.0 2135.0 2150.0 1 1100
For each parameter, n.eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor (at convergence, Rhat=1).
pD = 23.8 and DIC = 2151.1 (using the rule, pD = Dbar-Dhat)
DIC is an estimate of expected predictive error (lower deviance is better).
The results we need are easily accessed from $sims.matrix component of the bugs object.
> myparms<-radon.results$sims.matrix
> colnames(myparms)
> names(myparms)
[1] "a[1]" "a[2]" "a[3]" "a[4]" "a[5]" "a[6]"
[7] "a[7]" "a[8]" "a[9]" "a[10]" "a[11]" "a[12]"
[13] "a[13]" "a[14]" "a[15]" "a[16]" "a[17]" "a[18]"
[19] "a[19]" "a[20]" "a[21]" "a[22]" "a[23]" "a[24]"
[25] "a[25]" "a[26]" "a[27]" "a[28]" "a[29]" "a[30]"
[31] "a[31]" "a[32]" "a[33]" "a[34]" "a[35]" "a[36]"
[37] "a[37]" "a[38]" "a[39]" "a[40]" "a[41]" "a[42]"
[43] "a[43]" "a[44]" "a[45]" "a[46]" "a[47]" "a[48]"
[49] "a[49]" "a[50]" "a[51]" "a[52]" "a[53]" "a[54]"
[55] "a[55]" "a[56]" "a[57]" "a[58]" "a[59]" "a[60]"
[61] "a[61]" "a[62]" "a[63]" "a[64]" "a[65]" "a[66]"
[67] "a[67]" "a[68]" "a[69]" "a[70]" "a[71]" "a[72]"
[73] "a[73]" "a[74]" "a[75]" "a[76]" "a[77]" "a[78]"
[79] "a[79]" "a[80]" "a[81]" "a[82]" "a[83]" "a[84]"
[85] "a[85]" "b" "sigma.y" "mu.a" "sigma.a" "a1"
[91] "deviance"
According to the code the estimate of mean log radon levels for houses without a basement at the county level is given by a[j] + b for j = 1 to 85. Thus to obtain the posterior distributions of these we quantities in WinBUGS we could have simply added these quantities together at each iteration. Of course we still have the results of each of those iterations so we can carry out the addition in R. Thus we need to add b, column 86, to the first 85 columns on an entry by entry basis to obtain the individual county-wide posterior distributions.
> county.post<-myparms[,1:85]+myparms[,86]
Next I use the apply function to get the quantiles needed for the smear graph.
> apply(county.post,2,function(x) c(median(x), quantile(x,.025), quantile(x,.975),quantile(x,.25), quantile(x,.75)))->quants
I transpose the results so the statistics occupy rows rather than columns, exponentiate the logged values to obtain measurements on the original radon scale, append a column of identifiers, and add some column names.
> tquants<-t(quants)
> tquants2<-cbind(1:85,exp(tquants))
> colnames(tquants2)<-c('county','mean','lo95','hi95','lo50','hi50')
> results<-tquants2
Finally I borrow the code from Assignment 8 to produce the smear graph.
> par(mar=c(5.1,7.1,2.1,1.1))
> par(lend=2)
> plot(range(results[,2:6]), c(1,9), type='n', axes=F, ylab='County', xlab='Mean Radon Level', ylim=c(.5,85.5),
xlim=c(0,max(results[,2:6])))
> axis(1,cex.axis=.8)
> axis(2,at=seq(5,85,5), labels=seq(5,85,5), cex.axis=.8, las=2)
#95% confidence interval
> apply(results,1, function(x) segments(x[3], x[1], x[4], x[1], lwd=1, col='grey65'))
#50% confidence interval
> apply(results,1, function(x) segments(x[5], x[1], x[6], x[1], lwd=2, col='grey50'))
> apply(results,1,function(x) points(x[2],x[1],pch=16, col='white', cex=.8))
> apply(results,1,function(x) points(x[2],x[1],pch=1, cex=.9))
> box()
#reset line ends back to the default value.
> par(lend=0)
![]() |
| Fig. 12 Probability smear graph for county-level mean radon levels for homes without basements for counties in Minnesota. Estimates are based on a random effects model treating intercepts random by county. Displayed are 50% and 95% credibility intervals along with the mean of the posterior distributions. |
> manly<- read.table('http://www.unc.edu/courses/2007spring/enst/562/001/assignments/midterm/bycatch.csv', header=T,sep=',')
To fit a Poisson regression mixed model we need to use the lmer function from lme4 package.
> library(lme4)
> lmer(Data~Area+Gear.Type+Time+(1|Season), data=manly, family=poisson, offset=log(Tows), method='Laplace')->out.reff
Next I fit the model from the midterm and also a Poisson regression model without the season effect.
> out.manly<-glm(Data~Season+Area+Gear.Type+Time, data=manly, offset=log(Tows), family=poisson)
> out.noseason<-glm(Data~Area+Gear.Type+Time, data=manly, offset=log(Tows), family=poisson)
The AIC function doesn't work on lmer objects. We can either examine the AIC in the summary output or calculate it ourselves. The random effects model estimates five parameters: 4 regression coefficients plus the variance of the random effects.
> (-2*logLik(out.reff)+2*5)[1]
[1] 78.92763
> sapply(list(out.manly,out.noseason),AIC)
[1] 97.46053 192.57006
The random effects model actually beats the fixed effects model. This is an unusual result and I've never seen it happen before. It's almost always the case that estimating fixed effects for each "group" yields a better fit than assuming the group effects are samples from a normal distribution. That being the case the argument for using the random effects model is that it is more parsimonious and does not overfit the data. Apparently the bad estimate for the 1990-1991 season in the Manly model has really damaged the fit of the Poisson regression model. It's interesting to compare the estimates returned by the two models. I extract the coefficient estimates for season from manly's model and the corresponding random effects from the random effects model.
> coef(out.manly)
(Intercept) Season1990-91 Season1991-92 Season1992-93
-7.32782132 -19.66524028 1.81894771 0.07390532
Season1993-94 Season1994-95 AreaSouth Gear.TypeMid-Water
-0.93191404 -0.13503221 1.82245026 1.91768692
TimeNight
2.17656359
> ranef(out.reff)
An object of class “ranef.lmer”
[[1]]
(Intercept)
1989-90 0.4308524
1990-91 -2.5583053
1991-92 2.2216679
1992-93 0.4873996
1993-94 -0.4696015
1994-95 0.2983646
We can make the two sets of estimates comparable by subtracting off the 1989-1990 random effect from the rest.
> ranef(out.reff)[[1]]-ranef(out.reff)[[1]][1,1]
(Intercept)
1989-90 0.00000000
1990-91 -2.98915771
1991-92 1.79081548
1992-93 0.05654718
1993-94 -0.90045390
1994-95 -0.13248775
> coef(out.manly)[2:6]
Season1990-91 Season1991-92 Season1992-93 Season1993-94 Season1994-95
-19.66524028 1.81894771 0.07390532 -0.93191404 -0.13503221
Now we see that except for the bad estimate in 1990-1991 from Manly's model, the estimates are almost exactly the same.
I set up things for WinBUGS.
> area<-as.numeric(manly$Area)-1
> gear<-as.numeric(manly$Gear.Type)-1
> time<-as.numeric(manly$Time)-1
> ltows<-log(manly$Tows)
> season<-as.numeric(manly$Season)
> y<-manly$Data
> n<-length(y)
> J<-length(unique(season))
> manly.data<-list("n","J","y","season","area","gear","time", "ltows")
> manly.inits<-function() {list(a=rnorm(J), b1=rnorm(1), b2=rnorm(1),b3=rnorm(1),
mu.a=rnorm(1), sigma.a=runif(1))}
> manly.parameters<-c("a", "b1", "b2", "b3", "mu.a", "sigma.a")
> setwd("D:/enst 562/final")
> library(arm)
The WinBUGS model file is shown next. Notice that is a trivial modification of the random intercepts model in Problem 3. Following WinBUGS convention the log link appears in the definition of the derived quantity, log(yhat), and not as one of the arguments of dpois. Observe that the offset appears in the regression equation with a coefficient of 1.
model{
for(i in 1:n) {
y[i]~dpois(y.hat[i])
log(y.hat[i])<-a[season[i]]+ b1*area[i]+ b2*gear[i]+ b3*time[i]+ltows[i]
}
b1~dnorm(0,.0001)
b2~dnorm(0,.0001)
b3~dnorm(0,.0001)
for(j in 1:J){
a[j]~dnorm(a.hat[j],tau.a)
a.hat[j]<-mu.a
}
mu.a~dnorm(0,.0001)
tau.a<-pow(sigma.a,-2)
sigma.a~dunif(0,100)
}
I fit the model using 1500 iterations to improve the mixing. > manly.3c<-bugs(manly.data, manly.inits, manly.parameters, "manlymodel2.txt", n.chains=3, bugs.directory = "D:/Program Files/WinBUGS14/", n.iter=1500, debug=T)
The columns labeled 2.5% and 97.5% provide 95% credibility intervals for each model parameter.
> manly.3c
Inference for Bugs model at "manlymodel2.txt", fit using winbugs,
3 chains, each with 1500 iterations (first 750 discarded), n.thin = 2
n.sims = 1125 iterations saved
mean sd 2.5% 25% 50% 75% 97.5% Rhat n.eff
a[1] -7.6 0.6 -8.8 -8.1 -7.7 -7.2 -6.4 1 670
a[2] -12.5 3.1 -20.4 -13.4 -11.8 -10.7 -9.1 1 410
a[3] -5.8 0.6 -6.9 -6.2 -5.8 -5.4 -4.8 1 740
a[4] -7.6 0.6 -8.9 -8.0 -7.6 -7.2 -6.4 1 1100
a[5] -8.6 0.7 -9.9 -9.0 -8.6 -8.1 -7.3 1 640
a[6] -7.8 0.6 -8.9 -8.2 -7.8 -7.4 -6.7 1 1100
b1 1.9 0.4 1.1 1.6 1.9 2.2 2.8 1 240
b2 2.0 0.5 1.2 1.7 2.0 2.3 2.9 1 97
b3 2.3 0.5 1.4 1.9 2.3 2.6 3.3 1 120
mu.a -8.2 1.8 -11.6 -9.1 -8.2 -7.3 -5.3 1 1100
sigma.a 3.2 2.2 1.0 1.8 2.6 3.8 8.8 1 1100
deviance 89.3 4.3 82.7 86.1 88.6 91.7 99.7 1 1100
For each parameter, n.eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor (at convergence, Rhat=1).
pD = 8.9 and DIC = 98.3 (using the rule, pD = Dbar-Dhat)
DIC is an estimate of expected predictive error (lower deviance is better).
The 95% credibility intervals are shown in the output. If we examine the medians of the posterior distributions we see that they come very close to the estimates returned by lmer.
> manly.3c$median
$a
[1] -7.657 -11.820 -5.841 -7.617 -8.611 -7.772
$b1
[1] 1.893
$b2
[1] 1.983
$b3
[1] 2.282
$mu.a
[1] -8.161
$sigma.a
[1] 2.59
$deviance
[1] 88.61
The fixed effects are obtained directly. To match the a[j] statistics from WinBUGS to the results from lmer I need to add the intercept to the random effects returned from the lmer model.
#fixed effects
> fixef(out.reff)
(Intercept) AreaSouth Gear.TypeMid-Water TimeNight
-7.751557 1.811450 1.887163 2.202805
# a[j] parameters
> ranef(out.reff)[[1]]+fixef(out.reff)[1]
(Intercept)
1989-90 -7.320705
1990-91 -10.309862
1991-92 -5.529889
1992-93 -7.264158
1993-94 -8.221159
1994-95 -7.453192
As we can see the results are very close to the WinBUGS output
| Jack Weiss Phone: (919) 962-5930 E-Mail: jack_weiss@unc.edu Address: Curriculum in Ecology, Box 3275, University of North Carolina, Chapel Hill, 27516 Copyright © 2007 Last Revised--May 10, 2007 URL: http://www.unc.edu/courses/2007spring/enst/562/001/docs/solutions/final.htm |