Final—Solution

Problem 1

Read in the data.

> lakes<- read.table('http://www.unc.edu/courses/2007spring/enst/562/001/assignments/final/lakes.txt', header=T,sep='\t')
> names(lakes)
[1] "Lake" "Latitude" "Longitude" "X1976" "X1977" "X1978"
[7] "X1981"

Question 1

The data are currently in "lake-level" format in which each lake has a single record. We need to put this into "lake-period" format with multiple records for each lake, in this case one record for each measurement occasion. This requires unlisting the records X1976, X1977, X1978, and X1981 stacking them into a single column, replicating the Lake, Latitude, and Longitude variables four times, and adding a new column that lists the year the measurement was taken.

> newlakes<-data.frame(rep(lakes$Lake,4), rep(lakes$Latitude,4), rep(lakes$Longitude,4), unlist(lakes[,4:7]), rep(c(1976:1978,1981), rep(dim(lakes)[1],4)))
> colnames(newlakes)<-c('lake', 'latitude', 'longitude', 'SO4', 'year')
> newlakes[1:10,]
        lake latitude longitude SO4 year
X19761     1     58.0       7.2 6.5 1976
X19762     2     58.1       6.3 5.5 1976
X19763     4     58.5       7.9 4.8 1976
X19764     5     58.6       8.9 7.4 1976
X19765     6     58.7       7.6 3.7 1976
X19766     7     59.1       6.5 1.8 1976
X19767     8     58.9       7.3 2.7 1976
X19768     9     59.1       8.5 3.8 1976
X19769    10     58.9       9.3 8.4 1976
X197610   11     59.4       6.4 1.6 1976

Choosing a probability model for the response

As was discussed in lecture 3 there are a number of guidelines for helping us choose a probability model for the response variable SO4. I repeat some of these here.

  1. Is the quantity being measured discrete or continuous?
  2. Are the possible values of the response constrained to some interval or are they unbounded?
  3. If we group the values of the response variable with respect to the categories of a putative predictor, do the mean and variance of the response show a particular relationship?
  4. Is there some obvious probabilistic mechanism that might have generated the data we obtained?

Clearly SO4 is a continuous variable and, being a concentration, is bounded below by zero (but in principle unbounded above). Continuous distributions were covered in lecture 6 where four distributions were discussed: the normal, lognormal, gamma, and beta distributions. Of these we can immediately reject the beta distribution because the data aren't bounded between 0 and 1. Of the remaining three both the lognormal and gamma are bounded below by zero, while the normal is unbounded. Of course if the data are displaced far enough from zero, then the fact that the normal distribution is unbounded below need not be a problem.

Because the goal is to model SO4 concentration against year the probability model we're seeking must hold at each year separately rather than in the aggregate. Because both the lognormal and gamma distributions tend to be skewed while the normal is symmetric, a histogram of SO4 concentration at each year is a useful display.

> par(mar=c(5.1,3.1,1.1,1.1))
> par(mfrow=c(2,2))
> sapply(c(1976:1978,1981), function(x) hist(newlakes$SO4[newlakes$year==x], xlab=expression("SO"[4]), main=x, col='lightblue'))


Fig. 1  Histograms of SO4 concentration in each year

From Fig. 1 it's pretty clear that the normal distribution will be inappropriate. The actual distributions are skewed and with many values close to zero. Thus a model based on a normal distribution is likely to predict negative concentrations. Both the lognormal and gamma distributions have the same mean-variance relation (quadratic), so the best way to evaluate them is to actually use them in a model and compare the results with AIC. Even though we know a normal model is unlikely to be appropriate I also include it among the candidate models.

I fit each model with year as a linear predictor. The normal and lognormal models can be fit with lm, the lognormal by first log-transforming the response SO4. The gamma distribution is fit with glm and the family=Gamma argument.

> out.norm<-lm(SO4~year,data=newlakes)
> out.gamma<-glm(SO4~year,data=newlakes,family=Gamma)
> out.lnorm<-lm(log(SO4)~year,data=newlakes)

The loglikelihood of the log-transformed model is not directly comparable to models with an untransformed response without additional work. We wrote a function in lecture 22 that carried out the necessary steps and I reproduce it below.

> norm.loglike<-function(data,var,model)
{
t.y<-log(data[,var])
sigma2<-(sum(residuals(model)^2))/dim(data)[1]
loglike<-sum(log(dnorm(t.y, mean=predict(model), sd=sqrt(sigma2))*1/(data[,var])))
out<-list(loglike, c(coef(model), sigma2))
out
}

To get this function to work we need to give it a version of the data set in which the missing values of the response are removed. (These observations were automatically removed by the lm function when the model was fit.) We can do this by subsetting the data set using the !is.na( ) construction. The first argument to the function is the data set, the second is the name of the response, and the third is the model.

> norm.loglike(newlakes[!is.na(newlakes$SO4),], "SO4", out.lnorm)->out.loglike
> out.loglike
[[1]]
[1] -342.0175

[[2]]
(Intercept)        Year
41.66739080 -0.02050562 0.37574603

Each of the models has three estimated parameters (β0, β1, and σ2), so we could just as well have compared loglikelihoods rather than AIC.

> -2*out.loglike[[1]]+2*length(out.loglike[[2]])
[1] 690.0349
> AIC(out.gamma)
[1] 705.6806
> AIC(out.norm)
[1] 786.8488

The lognormal model has the lowest AIC by a sizeable amount. We'll proceed with a lognormal probability model for the response.

Choosing the form of the linear predictor

The variable year has four unique values. Hence there are a limited number of choices for the linear predictor. Without any theory to guide us we can

  1. Fit a linear model (2 parameters)
  2. Fit a quadratic model (3 parameters)
  3. Fit a cubic model (4 parameters) or equivalently treat year as a factor.

Each model was fit assuming a lognormally distributed response. AIC can be used to compare the models without any adjustment because all three models use the same response, log SO4.

> out.lnorm2<-lm(log(SO4)~year+I(year^2),data=newlakes)
> out.lnorm3<-lm(log(SO4)~factor(year),data=newlakes)
> sapply(list(out.lnorm,out.lnorm2,out.lnorm3),AIC)

[1] 318.3179 320.3132 322.3114

Clearly the linear model is to be preferred. It has both the lowest AIC and is simpler.r To see what's going on I graph the data (jittered) and superimpose the three models. For the separate means model I just plot the estimated mean at each year as predicted by the model.

> plot(log(SO4)~jitter(year), data=newlakes, xlim=c(1975.5,1981.5), xlab='year', ylab=expression(log('SO'[4])))
> abline(out.lnorm, col='grey80', lwd=5)
> curve(coef(out.lnorm2)[1]+ coef(out.lnorm2)[2]*x+ coef(out.lnorm2)[3]*x^2,col=2, lty=2, add=TRUE)
> points(c(1976:1978,1981), c(coef(out.lnorm3)[1], coef(out.lnorm3)[1]+ coef(out.lnorm3)[2],
coef(out.lnorm3)[1]+ coef(out.lnorm3)[3], coef(out.lnorm3)[1]+coef(out.lnorm3)[4]), col=3, pch=16, cex=2)
> legend(1979,2.5, c('linear','quadratic','means'), lty=c(1,2,NA), lwd=c(3,1,1), pch=c(NA,NA,16), col=c('grey80',2,3), bty='n', cex=.9)

Fig. 2  Three lognormal models using different versions of the linear predictor for year

Our conclusions remaing the same even if we use a normal or a gamma probability model for the response. A model linear in year is best.

> out.gamma1<-glm(SO4~year+I(year^2),data=newlakes,family=Gamma)
> out.gamma2<-glm(SO4~factor(year),data=newlakes,family=Gamma)
> sapply(list(out.gamma,out.gamma1,out.gamma2),AIC)

[1] 705.6806 707.4164 709.2101
> out.norm1<-lm(SO4~year+I(year^2),data=newlakes)
> out.norm2<-lm(SO4~factor(year),data=newlakes)
> sapply(list(out.norm,out.norm1,out.norm2),AIC)

[1] 786.8488 788.6666 790.4781

The table below summarizes the common pooling models that were fit.

Model
Linear predictor
AIC
lognormal
linear year
690.03
quadratic year
692.03
categorical year
694.03
gamma
linear year
705.68
quadratic year
707.42
categorical year
709.21
normal
linear year
786.85
quadratic year
786.67
categorical year
790.48

Of course it is possible that some weird transformation of year is preferable. To assess that possibility I plot the residuals from the linear model against year and superimpose a lowess curve.

> plot(residuals(out.norm)~ jitter(newlakes$year[!is.na(newlakes$SO4)]), ylab='residuals', xlab='year')
> lines(lowess(residuals(out.norm)~ jitter(newlakes$year[!is.na(newlakes$SO4)])), col=2)

Fig. 3  Plot of residuals from model linear in year against the predictor "year"

The plot provides no evidence that form of the linear predictor is inadequate. It's worth noting at this point that the modeled relationship between log(SO4) and year is not statistically significant.

> summary(out.lnorm)

Call:
lm(formula = log(SO4) ~ year, data = newlakes)

Residuals:
     Min       1Q   Median      3Q     Max
-1.50496 -0.52462 -0.08062 0.52028 1.58027

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 41.66739   49.32240   0.845    0.399
year        -0.02051    0.02493  -0.822    0.412

Residual standard error: 0.6167 on 166 degrees of freedom
Multiple R-Squared: 0.004057, Adjusted R-squared: -0.001942
F-statistic: 0.6763 on 1 and 166 DF, p-value: 0.4120

Question 2

The basic structure possessed by these these data is that they consist of repeated measures on individual lakes. Due to local geology and edaphic factors, we should expect measurements coming from the same lake, even separated by a time interval of a year or more, to be more similar to each other than to observations coming from other lakes. This provides the potential for what we've called observational heterogeneity—differing degrees of variability in subsets of observations. Observe that the repeated measures aspect of these data was imposed at the design level. Structure that arises from sampling design must be accounted for in the analysis.

In addition to temporal correlation there is the possibility of spatial structure in these data relating to their geographic proximities. We might expect lakes that are closer to each other to share a similar chemistry. I consider structure of this sort to be inadvertent structure. It arises because lakes are fixed objects that occupy space and hence their varying proximities will be an issue in even the most well-designed random selection scheme. The resulting spatial structure may turn out to be important, but it is not present by design. Hence we'll account for the designed structure first and then deal with the spatial structuring if necessary.

The ideal way to display the temporal structure of these data is in a lattice graph in which we plot the SO4 versus year separately for each lake. Because we've already seen that a log transformation of the SO4 concentration is justified, the most useful lattice graph plots log(SO4) versus year.

> trellis.par.set(col.whitebg())
> xyplot(log(SO4)~year|lake, data=newlakes, panel=function(x,y) {
panel.xyplot(x,y)
panel.abline(lm(y~x),col=2)},
ylab=expression(paste("SO"[4], ' Concentration',sep=' ')),
par.settings=list(axis.text=list(cex=.8)), layout=c(8,6))

Fig. 4  Lattice graph displaying the repeated measures structure of the data set

Question 3

As Fig. 4 indicates there is a dramatic difference in baseline SO4 concentrations across lakes. This is reflected in the widely different values for the intercepts of the regression lines. In addition there is some evidence for heterogeneity in slopes across the lakes. The natural way to handle structured data such as these is with a multilevel model. Fig. 4 coupled with our early work with the complete pooling model suggests that a linear model relating log(SO4) concentration to continuous time should be adequate.

Unconditional means model

The correct place to begin in fitting a multilevel model is with the unconditional means model. The unconditional means model here includes only an intercept but it allows that intercept to vary among the different lakes. It serves to partition the variance between and among lakes and acts as a benchmark that allows us to assess where most of the variability lies--between lakes or between years for the same lake. The unconditional means model is formulated as follows.

where

Written as a composite equation it takes the following form.

It's the latter form that matches the syntax used by R's lme function.

> library(nlme)
> lme(log(SO4)~1,random=~1|lake, data=newlakes, method='ML', na.action=na.omit)->unc.mean

The unconditional means model is of interest only for its variance components. From these we can calculate the intraclass correlation coefficient.

I calculate this for the lakes data using output from VarCorr.

> VarCorr(unc.mean)
lake = pdLogChol(1)
              Variance    StdDev
(Intercept) 0.33272801 0.5768258
Residual    0.03104858 0.1762061

> as.numeric(VarCorr(unc.mean)[1,1])/ (as.numeric(VarCorr(unc.mean)[1,1])+ as.numeric(VarCorr(unc.mean)[2,1]))
[1] 0.9146493

So we see that 91% of the variability in log(SO4) concentrations occurs between lakes rather than within lakes across years. Put another way measurements from the same lake exhibit a correlation of 0.91. This is dramatic evidence that a multilevel model is needed here and argues strongly against using a common pooling model.

Random intercepts model

The next step is to add the predictor year. As we've seen the individual lines in Fig. 4 exhibit a large variability in intercepts. Thus the natural second model to consider is a random intercepts model, a linear population model in which the intercepts of the regression lines for individual lakes are allowed to vary but they still share a common slope.

where

Written in composite form it looks like the following.

This is the form that matches the syntax used by R's lme function.

> lme(log(SO4)~year,random=~1|lake, data=newlakes, method='ML', na.action=na.omit)->random.ints
> sapply(list(out.lnorm,unc.mean,random.ints),AIC)
[1] 318.31792 73.76328 65.73269

Observe the dramatic decrease in AIC relative to the common pooling model, strong evidence that we need to account for the repeated measures structure of the data set. (The AIC are directly comparable here because both models use log(SO4) as the response.) There is a more modest decrease in AIC relative to the unconditional means model indicating that the inclusion of year has improved the fit of the model somewhat. Next we examine the variance components.

> VarCorr(random.ints)
lake = pdLogChol(1)
              Variance    StdDev
(Intercept) 0.33345020 0.5774515
Residual    0.02855946 0.1689954

As measured by the relative reduction in level-1 variance the effect of year on accounting for the within lake variability is more modest.

> (as.numeric(VarCorr(unc.mean)[2,1])- as.numeric(VarCorr(random.ints)[2,1]))/ as.numeric(VarCorr(unc.mean)[2,1])
[1] 0.08016856

So we've explained only 8% of the within lake variability by including year as a predictor. Still the predictor year is statistically significant.

> summary(random.ints)
Linear mixed-effects model fit by maximum likelihood
Data: newlakes
     AIC      BIC    logLik
65.73269 78.22854 -28.86634

Random effects:
Formula: ~1 | lake
        (Intercept)  Residual
StdDev:   0.5774515 0.1689954

Fixed effects: log(SO4) ~ year
               Value Std.Error  DF   t-value p-value
(Intercept) 45.76822 13.897196 119  3.293342  0.0013
      year  -0.02259  0.007026 119 -3.215169  0.0017
Correlation:
(Intr)
year -1

Standardized Within-Group Residuals:
        Min          Q1         Med         Q3        Max
-4.02214991 -0.47329909 -0.05670254 0.55159957 3.33206296

Number of Observations: 168
Number of Groups: 48

Random slopes and intercepts model

The last model we might consider at this point is a random slopes and intercepts model. The lattice graph of Fig. 4 suggests there is some variability in individual slopes. Unfortunately, and this is frequently the case with mixed models in the frequentist paradigm, this model is very difficult to fit. One of the pernicious features of fitting such models is that it's not always clear that there is a problem.

Recall that the random slopes and intercepts model takes the following form.

where .

In composite form the equations reduce to the following.

So the following function call fits this equation in R.

> lme(log(SO4)~year,random=~year|lake, data=newlakes, method='ML', na.action=na.omit)->random.slopes

If we look at the variance components of this model we see the following.

> VarCorr(random.slopes)
lake = pdLogChol(year)
                Variance       StdDev  Corr
(Intercept) 3.334502e-01 5.774515e-01 (Intr)
year        1.192613e-14 1.092068e-07 0
Residual    2.855945e-02 1.689954e-01

From the displayed correlation of 0 between the random slopes and intercepts and the estimated variance component of the slopes of approximately 0 we can see that R has converged to a solution in which the slopes don't vary. If we examine the loglikelihoods of the random slopes and intercepts model with that of the random intercepts model, we see that they are identical!

> sapply(list(random.ints,random.slopes),logLik)
[1] -28.86634 -28.86634

It is extremely unlikely that introducing to additional parameters to the model has had no effect on the loglikelihood. Given that we've already seen from our lattice plot that the slopes do vary between lakes out best conclusion is that lme has converged to a local solution (the solution of the random intercepts model) rather than a global solution.

The usual tack when this happens is to try reparameterizing the predictor, perhaps by centering, and if that doesn't work to modify some of the control settings of the maximization algorithm. I tried all of this with very little success. For instance here's an attempt to shift the origin of year to zero.

> lme(log(SO4)~I(year-1976), random=~I(year-1976)|lake, data=newlakes, method='ML', na.action=na.omit)->random.slopes
Error in lme.formula(log(SO4) ~ I(year - 1976), random = ~I(year - 1976) | :
iteration limit reached without convergence (9)

With only 48 lakes and a maximum of 4 measurements per lake, obtaining point estimates of the variance parameters in a problem such as this is difficult. We may have outstripped the ability of our software to fit such a model. Bayesian models tend to have much greater success with data such as these, so a logical step at this point would be to try to fit the model using WinBUGS. But, that was not part of this assignment so I'll leave that as a summer homework exercise! Without evidence to the contrary we'll stick with the random intercepts model as our best model.

The table below summarizes our results. Because all the models being compared have log(SO4) as the response we can use AIC applied directly to the model object to compare them. The AIC values reported here cannot be used to compare these models against models with other probability models for the response. Note: the norm.loglike function we used above does not return the correct AIC when random effects are involved.

Lognormal Model
Predictor
# parameters
loglikelihood (for log y)

AIC
(for log y)

common pooling
year
3
–159.159
318.318
unconditional means
3
–33.882
73.763
random intercepts
year
4
–28.866
65.732
random slopes and intercepts
year
6
failed to converge

Note: One model I don't include in the above list is the separate intercepts model, a model that estimates a separate intercept for each lake. This model would be fit as follows.

# the following model was not considered
> lm(log(SO4)~ year + factor(lake), data=newlakes)

In this model we would end up estimating 45 additional parameters, one intercept for each lake. Not surprisinglyl this model does turn to be far and away the best model in terms of loglikelihood and AIC, but given that we have at most four observations per lake using such a model is clearly an example of overfitting.

Question 4

A multilevel model returns a population level model from which individual lakes are allowed to deviate. The manner in which the lakes are allowed to deviate depends on how the random effects portion was specified. The population model we obtained is the following.

> fixef(random.ints)
(Intercept) year
45.76822277 -0.02258819

The population model states that for every one year increment the average log(SO4) concentrations are predicted to fall by 0.023. To put this into more understandable terms we can express this in terms of concentration units. If we exponentiate the model equation we obtain an expression for the geometric mean sulfate concentration.

So in each subsequent year the model predicts the geometric mean sulfate concentration will be 98% of what is was the year before. Because of the way the model was formulated the intercept is currently not interpretable. It actually represents the geometric mean sulfate concentration in year 0 AD!! To fix this we can refit the model centering year at the first measured year 1976.

> lme(log(SO4)~I(year-1976),random=~1|lake, data=newlakes, method='ML', na.action=na.omit)->random.ints2)
> fixef(random.ints2)
(Intercept) I(year - 1976)
1.13395508    -0.02258819

So now we have the following.

So the average lake in 1976 had a geometric mean sulfate concentration of 3.1. Under the normality assumption of the random effects and given the point estimate for the variability of the random effects (τ = 0.57 in the output above), we predict the geometric mean sulfate concentration in 1976 varies between 1 and 9.6 and will decrease by a factor of 0.98 with each subsequent year over the range of the data.

> exp(1.13395508-1.96*0.5774515)
[1] 1.002152
> exp(1.13395508+1.96*0.5774515)
[1] 9.638447

Questions 5–6

Because we have the geographic locations of the lakes we can use the semivariogram to assess spatial correlation. For question 5 we use the residuals from the complete pooling model and for question 6 the residuals from the random intercepts model. We need to remove the missing values from the original data set so that the residual vector and the latitude and longitude vectors are the same length. I also do the same with a vector of years.

> lakes.reduced<-newlakes[!is.na(newlakes$SO4),]
#assemble data in the form preferred by the geoR package
> newdat1<-data.frame(lakes.reduced$latitude, lakes.reduced$longitude, residuals(out.lnorm))
> newdat2<-data.frame(lakes.reduced$latitude, lakes.reduced$longitude, residuals(random.ints))
#remove years with missing concentrations
> years.reduced<-newlakes$year[!is.na(newlakes$SO4)]

Technically we should convert latitude and longitude measured in degrees to distance units. Since we're only operating on a limited geographic scale, this should not have a big effect so I stick with degrees. I do some initial experimentation to determine how wide to make the bins and what the maximum distance between lakes is.

> library(geoR)
> curdata<-newdat1[years.reduced==1976,]
> geodat1<-as.geodata(curdata)
> variog(geodat1, uvec=seq(0,8,.25), max.dist=8, option='bin')->geodata.v1
> rbind(geodata.v1$n, geodata.v1$u)

       [,1] [,2]  [,3] [,4]  [,5] [,6]  [,7] [,8]   [,9] [,10] [,11] [,12] [,13]
[1,]  20.00 31.0 45.00   56 46.00 60.0 67.00   54 49.00   54.0 53.00    66 50.00
[2,]   0.25  0.5  0.75    1  1.25  1.5  1.75    2  2.25    2.5  2.75     3  3.25
      [,14] [,15] [,16]  [,17] [,18] [,19] [,20] [,21] [,22] [,23] [,24] [,25] [,26]
[1,]   39.0 49.00    45 33.00   39.0 34.00    27 27.00  16.0 13.00    23  6.00  11.0
[2,]    3.5  3.75     4  4.25    4.5  4.75     5  5.25   5.5  5.75     6  6.25   6.5
     [,27] [,28] [,29] [,30]
[1,]  9.00     2  8.00   2.0
[2,]  6.75     7  7.25   7.5

From the output we see that by the time we reach a distance of 5 the number of observations per bin has dropped below 30, a number that's considered a minimum value for obtaining a reasonable estimate of the semivariogram. Thus I should not plot or calculate the semivariogram for distances beyond 5.

I fit and plot a semivariogram separately for each year (because duplicate coordinates are not allowed and it probably makes sense to treat years separately). To facilitate things I write a generic function that calculates and plots the semivariogram separately by year that then I sapply to a list of year values.

> par(mar=c(5.1,4.1,2.1,1.1))
> par(mfrow=c(2,2))
> sapply(c(1976:1978,1981),function(x) {
curdata<-newdat1[years.reduced==x,]
geodat1<-as.geodata(curdata)
variog(geodat1,uvec=seq(0,5,.25), max.dist=5, option='bin')->geodata.v1
plot(geodata.v1)
mtext(side=3,line=.5, paste(x,': Complete Pooling Model Residuals', sep=''), cex=.8)
})

Fig. 5 Semivariograms by year for complete pooling model residuals

Nest I repeat these steps for the random intercepts model.

> sapply(c(1976:1978,1981),function(x) {
curdata<-newdat2[years.reduced==x,]
geodat1<-as.geodata(curdata)
variog(geodat1,uvec=seq(0,5,.25), max.dist=5, option='bin')->geodata.v1
plot(geodata.v1)
mtext(side=3,line=.5, paste(x,': Random Intercepts Model Residuals', sep=''), cex=.8)
})

Fig. 6 Semivariograms by year for random intercepts model residuals

It's pretty clear that while there was serious spatial correlation among the residuals of the complete pooling model (as evidence by the clear increasing trend in the semivariogram plot), this has completely disappeared in the residuals of the random intercepts model. In Fig. 6 we see only random scatter and no clear trend. Thus by accounting for the temporal correlation in the data we have also managed to account for the spatial correlation.

Problem 2

I suggested five different approaches to this problem in the hints. I work through in detail the code for the first approach, the paired t-test, but then just present the code with minimal comments for the remaining four methods.

Solution using a paired t-test

Question 1

Read in the data and pull out the unique species names.

> cover<- read.table('http://www.unc.edu/courses/2007spring/enst/562/001/assignments/final/coverclass.txt', header=T)
> unique(cover$CHISSpCode)->species

The function below subsets the data using the name of the current species. It then carries out the paired t-test and assigns the results to an object. The p-value of the test, an estimate of the mean of the paired differences, and a confidence interval for the mean difference are all extracted from this object. I also record the sample size for each species for use in annotating the plot that we'll produce.

> rep.func.t<-function(x) {
cover[cover$CHISSpCode==x,]->name
t.test(name[,4], name[,3], paired=TRUE, conf.int=TRUE, conf.level=.95)->out
n<-length(name[,3])
c(out$estimate,out$p.value,out$conf.int,n)
}

I sapply the function to the list of species names to obtain the results.

> sapply(species,rep.func.t)->out
> temp<-t(out)
> rownames(temp)<-species
> colnames(temp)<-c('estimate','p-value','lower95','upper95','n')
> temp

        estimate      p-value     lower95    upper95  n
ACMI  0.25000000 0.1910542981 -0.14493408  0.6449341 12
ATSE -0.76923077 0.0024154024 -1.20735056 -0.3311110 13
AVXX -0.07692308 0.7762430284 -0.65346398  0.4996178 13
BAPI  0.47826087 0.0082593612  0.13652817  0.8199936 23
BRDI  0.19230769 0.4668980374 -0.34377195  0.7283873 26
CAMA  0.15000000 0.5453609184 -0.35986558  0.6598656 20
CIOC  0.29166667 0.0160893434  0.05940813  0.5239252 24
COGI  0.61538462 0.0051616813  0.22232546  1.0084438 13
DISP  0.77777778 0.0147043609  0.17338851  1.3821670 18
GNLU  0.13333333 0.4985453188 -0.27825004  0.5449167 15
HOBR -0.05882353 0.7498058561 -0.44317570  0.3255286 17
ISME  0.17391304 0.4619848747 -0.30783817  0.6556643 23
MEIN -0.77777778 0.0007769294 -1.17981154 -0.3757440 18
MEPO  0.08333333 0.7544993470 -0.48871327  0.6553799 12
SOOL -0.41666667 0.0091516301 -0.71954808 -0.1137853 24
ASMI  0.05263158 0.7898521784 -0.35614654  0.4614097 19
ATCA  0.04761905 0.7711217886 -0.28923077  0.3844689 21
CAAE  0.38095238 0.1188353147 -0.10668380  0.8685886 21
DAPU -0.06666667 0.8177843203 -0.67570648  0.5423732 15
ERGL  0.33333333 0.2376664725 -0.24621912  0.9128858 15
LODE  0.38461538 0.1368122627 -0.14095024  0.9101810 13
LUAL  1.00000000 0.0016069697  0.47961888  1.5203811 11
MAIN -0.63157895 0.0007944877 -0.96124944 -0.3019085 19
ERGR  0.18181818 0.4405216964 -0.32254736  0.6861837 11
SIBE  0.58333333 0.0674332787 -0.04962508  1.2162917 12

Questions 2–3

I count the number of tests, extract the p-values, and identify the results by species.

> colnames(out)<-species
> out[2,]->pvalues
> numtests <-length(pvalues)

I set up the four testing criteria: unadjusted, Bonferroni, sequential Bonferroni (Holm method), and FDR.

> alpha<-.05
> alphai<-alpha/numtests
> bonf<-rep(alpha/numtests,numtests)
> holm<-alpha/seq(numtests,1,-1)
> fdr<-(1:numtests)*alpha/numtests
> sort(pvalues)->pvals
> cbind(pvals,bonf,holm,fdr)

               pvals  bonf        holm   fdr
   MEIN 0.0007769294 0.002 0.002000000 0.002
   MAIN 0.0007944877 0.002 0.002083333 0.004
   LUAL 0.0016069697 0.002 0.002173913 0.006
   ATSE 0.0024154024 0.002 0.002272727 0.008
   COGI 0.0051616813 0.002 0.002380952 0.010
   BAPI 0.0082593612 0.002 0.002500000 0.012
   SOOL 0.0091516301 0.002 0.002631579 0.014
   DISP 0.0147043609 0.002 0.002777778 0.016
   CIOC 0.0160893434 0.002 0.002941176 0.018
   SIBE 0.0674332787 0.002 0.003125000 0.020
   CAAE 0.1188353147 0.002 0.003333333 0.022
   LODE 0.1368122627 0.002 0.003571429 0.024
   ACMI 0.1910542981 0.002 0.003846154 0.026
   ERGL 0.2376664725 0.002 0.004166667 0.028
   ERGR 0.4405216964 0.002 0.004545455 0.030
   ISME 0.4619848747 0.002 0.005000000 0.032
   BRDI 0.4668980374 0.002 0.005555556 0.034
   GNLU 0.4985453188 0.002 0.006250000 0.036
   CAMA 0.5453609184 0.002 0.007142857 0.038
   HOBR 0.7498058561 0.002 0.008333333 0.040
   MEPO 0.7544993470 0.002 0.010000000 0.042
   ATCA 0.7711217886 0.002 0.012500000 0.044
   AVXX 0.7762430284 0.002 0.016666667 0.046
   ASMI 0.7898521784 0.002 0.025000000 0.048
   DAPU 0.8177843203 0.002 0.050000000 0.050
#carry out tests
> unadjusted<-pvals<alpha
> b.test<- pvals<bonf
> holm.t<- pvals<holm
> fdr.t<- pvals<fdr
> cbind(pvals,unadjusted,b.test,holm.t,fdr.t)->results

            pvals unadjusted b.test holm.t fdr.t
MEIN 0.0007769294          1      1      1     1
MAIN 0.0007944877          1      1      1     1
LUAL 0.0016069697          1      1      1     1
ATSE 0.0024154024          1      0      0     1
COGI 0.0051616813          1      0      0     1
BAPI 0.0082593612          1      0      0     1
SOOL 0.0091516301          1      0      0     1
DISP 0.0147043609          1      0      0     1
CIOC 0.0160893434          1      0      0     1
SIBE 0.0674332787          0      0      0     0
CAAE 0.1188353147          0      0      0     0
LODE 0.1368122627          0      0      0     0
ACMI 0.1910542981          0      0      0     0
ERGL 0.2376664725          0      0      0     0
ERGR 0.4405216964          0      0      0     0
ISME 0.4619848747          0      0      0     0
BRDI 0.4668980374          0      0      0     0
GNLU 0.4985453188          0      0      0     0
CAMA 0.5453609184          0      0      0     0
HOBR 0.7498058561          0      0      0     0
MEPO 0.7544993470          0      0      0     0
ATCA 0.7711217886          0      0      0     0
AVXX 0.7762430284          0      0      0     0
ASMI 0.7898521784          0      0      0     0
DAPU 0.8177843203          0      0      0     0

We could read off the results at this point or we can automate the procedure. For the sequential Bonferroni method we need to identify the first zero (nonsignificant result) and then count all tests up to that point as significant. For the FDR approach we find the last significant result and count all tests up until that point as significant. We can automate this as follows. The vector sig.holm below records the species that yield statistically significant results according to the sequential Bonferroni and the vector sig.fdr does the same for the FDR.

#sequential Bonferroni
> sig.holm<-vector(mode='numeric',length=n)
> if(!any(holm.t)) sig.holm[1:n]<-0 else
if(!any(holm.t==0)) sig.holm[1:n]<-1 else {
min((1:25)[holm.t==0])->min.holm
sig.holm[1:(min.holm-1)]<-1
sig.holm[min.holm:25]<-0 }

#false discovery rate
> sig.fdr<-vector(mode='numeric',length=25)
> if(!any(fdr.t)) sig.fdr[1:n]<-0 else
if(!any(fdr.t==1)) sig.fdr[1:n]<-1 else {
max((1:25)[fdr.t==1])->max.fdr
sig.fdr[(max.fdr+1):25]<-0
sig.fdr[1:max.fdr]<-1 }

Finally I get a list of species that yield significant results according to each of the criteria.

#unadjusted
> names(pvals)[unadjusted]
[1] "MEIN" "MAIN" "LUAL" "ATSE" "COGI" "BAPI" "SOOL" "DISP" "CIOC"
#Bonferroni
> names(pvals)[b.test]
[1] "MEIN" "MAIN" "LUAL"
#sequential Bonferroni
> names(pvals)[sig.holm==1]
[1] "MEIN" "MAIN" "LUAL"
#FDR results
> names(pvals)[sig.fdr==1]
[1] "MEIN" "MAIN" "LUAL" "ATSE" "COGI" "BAPI" "SOOL" "DISP" "CIOC"

Question 4

We can described the differences between these three approaches as follows.

  1. The unadjusted method controls the individual error rate, α, at a fixed nominal level and it uses the same fixed α-level for all the tests. Thus on a test by test basis we are guaranteed a probability of making a mistake of α separately on each test. Although the individual α-rate is the same the experimentwise error rate will generally be larger than α. Thus there is no attempt to control the percentage of Type I errors made among the entire set of tests that is carried out. Thus in this example we can delcare nine results to be statistically significant but in doing so we ignore the fact that the tests were done in concert and hence it is likely that at least some of these significant results are due to chance and are not real.
  2. The Bonferroni-adjusted method attempts to control the experimentwise error rate at a fixed nominal level. This is done by reducing the individual error rate to a value below the nominal rate. The result is that the probability of making a single Type I error for the entire set of tests is held fixed at α. Thus in order to avoid the possibility of making a Type I error in carrying out 25 tests we are only able to declare with certainty that 3 of our tests are statisically significant (even though 6 others yielded p-values less than .05).
  3. The false discovery rate protocol jettisons the Type I error rate perspective and instead tries to keep the number of false discoveries (false positives) to a low level. Thus if the FDR is set to α we are guaranteed on average that no more than a fraction α of the positive results we obtain are likely to be bogus. Thus with nine significant results we anticipate that on average 9 × 0.05 = 0.45 of them are in error.

Question 5

I produce an error bar graph to display the results. The error bars represent 95% confidence intervals for the mean difference in cover over time. I enhance the requested graph slightly by listing the sample sizes on the right margin and color coding the results of the significance tests. Red confidence intervals correspond to results deemed significant using the false discovery rate protocol while blue sample sizes followed by an asterisk denote results deemed significant using the sequential Bonferroni protocol.

#collect results of significance testing and place in alphabetical order
> newdat<-data.frame(rownames(results),sig.fdr,sig.holm)
> newdat<-newdat[order(newdat[,1]),]
#collect confidence interval results and add to significance test results
> out1<-cbind(t(out),1:25)
> out2<-out1[order(rownames(out1)),]
> out3<-cbind(out2,newdat[,2],newdat[,3])
> out3[,6]<-1:25
#generate graph
> par(mar=c(5.1,5.1,2.1,5.1))
> plot(range(out3[,3:4]),c(.75,25.25), xlab='Mean Change in Cover', type='n', axes=F, ylab='')
> axis(1)
> axis(2,at=out3[,6],labels=rownames(out3),cex.axis=.75,las=2)
> arrows(out3[,3],out3[,6],out3[,4],out3[,6],code=3, angle=90, length=.05,col=(out3[,7]+1),lwd=2)
> points(out3[,1],out3[,6],pch=16,cex=1.2,col='white')
> points(out3[,1],out3[,6],pch=1,cex=1.1,col=(out3[,7]+1))
> mytext<-ifelse(out3[,8]==1,paste('n = ',out3[,5],'*'),paste('n = ',out3[,5]))
> mycol<-ifelse(out3[,8]==1,4,1)
> mtext(side=4,line=.5,at=1:25,text=mytext,col=mycol,cex=.75,las=2)
> abline(v=0,lty=2,col='seagreen')
> box()
> mtext(side=3,line=.5,'Results based on paired t-test')

Pair t-test Multiple Comparison Results
species showing significant differences in cover values

Unadjusted
Bonferroni
Sequential Bonferroni
False Discovery Rate (FDR)
"MEIN" "MAIN" "LUAL" "ATSE" "COGI" "BAPI" "SOOL" "DISP" "CIOC"
"MEIN" "MAIN" "LUAL"
"MEIN" "MAIN" "LUAL"
"MEIN" "MAIN" "LUAL" "ATSE" "COGI" "BAPI" "SOOL" "DISP" "CIOC"
Fig. 7  Confidence intervals for species-specific paired t-tests. Confidence intervals in red are significant with FDR = 0.05 (9 species). Results with a blue * (shown in sample sizes on right side) are significant using a sequential Bonferroni adjustment with an experimentwise error of α = .05 (3 species)

Solution using a Wilcoxon signed rank test

The wilcox.exact test is found in the exactRankTests package. Its syntax is the same as that of the t.test function. As noted in the hints we need to correct the reported confidence interval for COGI. The rest of the code is the same as for the paired t-test. I summarize the significance test results at the end.

> cover<-read.table('http://www.unc.edu/courses/2007spring/enst/562/001/assignments/final/coverclass.txt', header=T)
> unique(cover$CHISSpCode)->species
> library(exactRankTests)
> rep.func<-function(x) {
cover[cover$CHISSpCode==x,]->name
wilcox.exact(name[,4], name[,3], paired=TRUE, conf.int=TRUE, conf.level=.95)->out
n<-length(name[,3])
c(out$estimate,out$p.value,out$conf.int,n)
}
> sapply(species,rep.func)->out

Warning message:
no finite arguments to max; returning -Inf


> colnames(out)<-species
> out[2,]->pvalues
> numtests<-length(pvalues)
> sort(pvalues)->pvals
> unadjusted<- pvals<.05
> alphai<-0.05/numtests
> bonf<-rep(.05/numtests,numtests)
> holm<-.05/seq(numtests,1,-1)
> fdr<-(1:numtests)*.05/numtests
> #carry out tests
> b.test<-(pvals<bonf)
> holm.t<-(pvals<holm)
> fdr.t<-(pvals<fdr)
> cbind(pvals,b.test,holm.t,fdr.t)->results
> results
           pvals b.test holm.t fdr.t
MEIN 0.002685547      0      0     0
MAIN 0.003173828      0      0     1
LUAL 0.007812500      0      0     0
ATSE 0.010742188      0      0     0
COGI 0.015625000      0      0     0
BAPI 0.019210815      0      0     0
SOOL 0.019531250      0      0     0
DISP 0.025634766      0      0     0
CIOC 0.039062500      0      0     0
SIBE 0.113281250      0      0     0
CAAE 0.164550781      0      0     0
LODE 0.234375000      0      0     0
ERGL 0.269531250      0      0     0
ACMI 0.375000000      0      0     0
BRDI 0.455734253      0      0     0
ISME 0.478607178      0      0     0
CAMA 0.621337891      0      0     0
ERGR 0.687500000      0      0     0
GNLU 0.726562500      0      0     0
AVXX 1.000000000      0      0     0
HOBR 1.000000000      0      0     0
MEPO 1.000000000      0      0     0
ASMI 1.000000000      0      0     0
ATCA 1.000000000      0      0     0
DAPU 1.000000000      0      0     0


> sig.holm<-vector(mode='numeric',length=n)
> if(!any(holm.t)) sig.holm[1:n]<-0 else
if(!any(holm.t==0)) sig.holm[1:n]<-1 else {
min((1:25)[holm.t==0])->min.holm
sig.holm[1:(min.holm-1)]<-1
sig.holm[min.holm:25]<-0 }


> sig.fdr<-vector(mode='numeric',length=25)
> if(!any(fdr.t)) sig.fdr[1:n]<-0 else
if(!any(fdr.t==1)) sig.fdr[1:n]<-1 else {
max((1:25)[fdr.t==1])->max.fdr
sig.fdr[(max.fdr+1):25]<-0
sig.fdr[1:max.fdr]<-1 }

#unadjusted

> names(pvals)[unadjusted]
[1] "MEIN" "MAIN" "LUAL" "ATSE" "COGI" "BAPI" "SOOL" "DISP" "CIOC"
#Bonferroni
> names(pvals)[b.test]
character(0)
#sequential Bonferroni
> names(pvals)[sig.holm==1]
character(0)
#FDR results
> names(pvals)[sig.fdr==1]
[1] "MEIN" "MAIN"
> newdat<-data.frame(rownames(results),sig.fdr,sig.holm)
> newdat<-newdat[order(newdat[,1]),]

> out1<-cbind(t(out),1:25)

#fix COGI
> cover[cover$CHISSpCode=="COGI",]->cogi
> cogi.diff<-cogi[,4]-cogi[,3]


#Walsh averages

> Wmat<-outer(cogi.diff,cogi.diff,'+')/2
> Walsh<-c(Wmat[lower.tri(Wmat)],diag(Wmat))

> n.cogi<-dim(cogi)[1]
> q1<-n.cogi*(n.cogi+1)/2+1-qsignrank(.975,n.cogi)
> q2<-qsignrank(.975,n.cogi)

> Walsh.sort<-sort(Walsh)
> cogi.median<-median(Walsh)
> cogi.lower95<-Walsh.sort[q1]
> cogi.upper95<-Walsh.sort[q2]

> out1["COGI",1]<-cogi.median
> out1["COGI",3]<-cogi.lower95
> out1["COGI",4]<-cogi.upper95
> out1

(pseudo)median
ACMI  0.50 0.375000000 -1.0  1.0 12  1
ATSE -1.25 0.010742188 -1.5 -1.0 13  2
AVXX -0.25 1.000000000 -1.5  1.0 13  3
BAPI  0.50 0.019210815  0.0  1.0 23  4
BRDI  0.25 0.455734253 -1.0  1.0 26  5
CAMA  0.25 0.621337891 -1.0  1.0 20  6
CIOC  0.50 0.039062500  0.0  1.0 24  7
COGI  0.50 0.015625000  0.0  1.0 13  8
DISP  0.75 0.025634766  0.0  2.0 18  9
GNLU  0.50 0.726562500 -1.0  1.0 15 10
HOBR -0.50 1.000000000 -1.0  1.0 17 11
ISME  0.25 0.478607178 -0.5  1.0 23 12
MEIN -1.25 0.002685547 -2.0 -1.0 18 13
MEPO  0.50 1.000000000 -1.5  1.0 12 14
SOOL -1.25 0.019531250 -1.5  0.0 24 15
ASMI  0.50 1.000000000 -1.0  1.0 19 16
ATCA  0.50 1.000000000 -1.5  1.0 21 17
CAAE  0.75 0.164550781 -0.5  1.5 21 18
DAPU -0.25 1.000000000 -1.5  1.0 15 19
ERGL  0.25 0.269531250 -1.0  1.0 15 20
LODE  0.75 0.234375000 -1.0  1.5 13 21
LUAL  1.25 0.007812500  1.0  2.0 11 22
MAIN -1.25 0.003173828 -1.5 -1.0 19 23
ERGR  0.50 0.687500000 -1.0  1.0 11 24
SIBE  0.75 0.113281250 -1.0  1.5 12 25


> newdat<-data.frame(rownames(results),sig.fdr,sig.holm)
> newdat<-newdat[order(newdat[,1]),]
> out2<-out1[order(rownames(out1)),]
> out3<-cbind(out2,newdat[,2],newdat[,3])
> out3[,6]<-1:25

#generate the graph
> par(mar=c(5.1,5.1,2.1,5.1))
> plot(range(out3[,3:4]), c(.75,25.25), xlab='Median Change in Cover', type='n', axes=F, ylab='')
> axis(1)
> axis(2,at=out3[,6], labels=rownames(out3), cex.axis=.75, las=2)
> arrows(out3[,3],out3[,6],out3[,4],out3[,6],code=3, angle=90, length=.05, col=(out3[,7]+1), lwd=2)
> points(out3[,1],out3[,6],pch=16,cex=1.2,col='white')
> points(out3[,1],out3[,6],pch=1,cex=1.1,col=(out3[,7]+1))
> mytext<-ifelse(out3[,8]==1, paste('n = ', out3[,5],'*'), paste('n = ', out3[,5]))
> mycol<-ifelse(out3[,8]==1,4,1)
> mtext(side=4,line=.5,at=1:25, text=mytext, col=mycol, cex=.75, las=2)
> abline(v=0,lty=2,col='seagreen')
> box()
> mtext(side=3, line=.5, 'Results based on Wilcoxon signed rank test')

Wilcoxon Signed Rank Test Multiple Comparison Results
(species showing significant differences in cover values)

Unadjusted
Bonferroni
Sequential Bonferroni
False Discovery Rate (FDR)
"MEIN" "MAIN" "LUAL" "ATSE" "COGI" "BAPI" "SOOL" "DISP" "CIOC"
None
None
"MEIN" "MAIN"
Fig. 8  Confidence intervals for species-specific Wilcoxon signed rank tests. Confidence intervals in red are significant with FDR = 0.05 (2 species). Results with a blue * (shown in sample sizes on right side) are significant with experimentwise error rate of α = .05 using a sequential Bonferroni adjustment (0 species)

Solution using a permutation test

The perm.test function is found in the exactRankTests package. Its syntax is the same as the t.test function. Due to the programming complexity of finding permutation-based confidence intervals I report basic bootstrap confidence intervals for the difference instead. I summarize the significance test results at the end.

> cover<-read.table('http://www.unc.edu/courses/2007spring/enst/562/001/assignments/final/coverclass.txt', header=T)
> unique(cover$CHISSpCode)->species
> library(exactRankTests)
> library(boot)
> rep.func.boot<-function(x) {
cover[cover$CHISSpCode==x,]->name
perm.test(name[,4], name[,3], paired=TRUE)->out
n<-length(name[,3])
pval<-out$p.value
actdiff<-(name[,4]-name[,3])
est<-mean(actdiff)
mean.func<-function(x,indices) mean(x[indices])
boot(actdiff,mean.func,9999)->myboot
boot.ci(myboot,type='basic')->mybootci
c(est,pval,mybootci$basic[4:5],n)
}


> sapply(species,rep.func.boot)->out
> colnames(out)<-species
> out[2,]->pvalues
> numtests<-length(pvalues)
> unadjusted<-(pvals<.05)
> sort(pvalues)->pvals
> alphai<-0.05/numtests
> bonf<-rep(.05/numtests,numtests)
> holm<-.05/seq(numtests,1,-1)
> fdr<-(1:numtests)*.05/numtests


#carry out tests
> b.test<-(pvals<bonf)
> holm.t<-(pvals<holm)
> fdr.t<-(pvals<fdr)
> cbind(pvals,b.test,holm.t,fdr.t)->results
> results
           pvals b.test holm.t fdr.t
MEIN 0.002685547      0      0     0
MAIN 0.003173828      0      0     1
LUAL 0.007812500      0      0     0
ATSE 0.010742188      0      0     0
COGI 0.015625000      0      0     0
BAPI 0.019210815      0      0     0
SOOL 0.019531250      0      0     0
DISP 0.025634766      0      0     0
CIOC 0.039062500      0      0     0
SIBE 0.113281250      0      0     0
CAAE 0.164550781      0      0     0
LODE 0.234375000      0      0     0
ERGL 0.269531250      0      0     0
ACMI 0.375000000      0      0     0
BRDI 0.455734253      0      0     0
ISME 0.478607178      0      0     0
CAMA 0.621337891      0      0     0
ERGR 0.687500000      0      0     0
GNLU 0.726562500      0      0     0
AVXX 1.000000000      0      0     0
HOBR 1.000000000      0      0     0
MEPO 1.000000000      0      0     0
ASMI 1.000000000      0      0     0
ATCA 1.000000000      0      0     0
DAPU 1.000000000      0      0     0

> sig.holm<-vector(mode='numeric',length=n)
> if(!any(holm.t)) sig.holm[1:n]<-0 else
if(!any(holm.t==0)) sig.holm[1:n]<-1 else {
min((1:25)[holm.t==0])->min.holm
sig.holm[1:(min.holm-1)]<-1
sig.holm[min.holm:25]<-0 }


> sig.fdr<-vector(mode='numeric',length=25)

> if(!any(fdr.t)) sig.fdr[1:n]<-0 else
if(!any(fdr.t==1)) sig.fdr[1:n]<-1 else {
max((1:25)[fdr.t==1])->max.fdr
sig.fdr[(max.fdr+1):25]<-0
sig.fdr[1:max.fdr]<-1 }

#unadjusted
> names(pvals)[unadjusted]
[1] "MEIN" "MAIN" "LUAL" "ATSE" "COGI" "BAPI" "SOOL" "DISP" "CIOC"
#Bonferroni
> names(pvals)[b.test]
character(0)
#sequential Bonferroni
> names(pvals)[sig.holm==1]
character(0)
#FDR results
> names(pvals)[sig.fdr==1]
[1] "MEIN" "MAIN"

> newdat<-data.frame(rownames(results),sig.fdr,sig.holm)
> newdat<-newdat[order(newdat[,1]),]
> out1<-cbind(t(out),1:25)
> out2<-out1[order(rownames(out1)),]
> out3<-cbind(out2,newdat[,2],newdat[,3])
> out3[,6]<-1:25

#generate the graph
> par(mar=c(5.1,5.1,2.1,5.1))
> plot(range(out3[,3:4]), c(.75,25.25), xlab='Mean Change in Cover', type='n', axes=F, ylab='')
> axis(1)
> axis(2,at=out3[,6], labels=rownames(out3), cex.axis=.75, las=2)
> arrows(out3[,3], out3[,6], out3[,4], out3[,6], code=3, angle=90, length=.05,col=(out3[,7]+1),lwd=2)
> points(out3[,1], out3[,6], pch=16, cex=1.2, col='white')
> points(out3[,1], out3[,6], pch=1, cex=1.1, col=(out3[,7]+1))
> mytext<-ifelse(out3[,8]==1, paste('n = ',out3[,5],'*'), paste('n = ', out3[,5]))
> mycol<-ifelse(out3[,8]==1,4,1)
> mtext(side=4, line=.5, at=1:25, text=mytext, col=mycol, cex=.75, las=2)
> abline(v=0, lty=2, col='seagreen')
> box()
> mtext(side=3, line=.5, 'Results based on permutation test and basic bootstrap interval')

Permutation Test Multiple Comparison Results
(species showing significant differences in cover values)

Unadjusted
Bonferroni
Sequential Bonferroni
False Discovery Rate (FDR)
"MEIN" "MAIN" "LUAL" "ATSE" "COGI" "BAPI" "SOOL" "DISP" "CIOC"
None
None
"MEIN" "MAIN"
Fig. 9  Confidence intervals for species-specific permutation tests on the differences. Bootstrap confidence intervals in red correspond to significant permutation tests with FDR = 0.05 (9 species). Results with a blue * (shown in sample sizes on right side) are significant using a sequential Bonferroni adjustment with an experimentwise error rate α = .05 (3 species)

Solution using a randomized block design

The randomized block design approach yields results identical to the paired t-test. The only change from that code are the details in locating the statistics needed from the output of the lm function. In addition the confidence intervals are constructed by hand. I summarize the significance test results at the end

################# randomized block design model ###################
> cover<-read.table('http://www.unc.edu/courses/2007spring/enst/562/001/assignments/final/coverclass.txt', header=T)
> unique(cover$CHISSpCode)->species


> rep.func.lm<-function(x) {
cover[cover$CHISSpCode==x,]->acmi
new.acmi<-data.frame(rep(acmi$PlotCode,2), c(acmi$Coverclass.1983, acmi$Coverclass.2002), rep(c(1983,2002), c(dim(acmi)[1], dim(acmi)[1])))
colnames(new.acmi)<-c('Plot', 'Cover', 'Year')

lm(Cover~factor(Year)+ factor(Plot), data=new.acmi)->out5
summary(out5)->out6
p.val<-out6$coefficients[2,4]
est<-out6$coefficients[2,1]
df<-out5$df.residual
se<-out6$coefficients[2,2]
low95<-est+qt(.025,df)*se
hi95<-est+qt(.975,df)*se
n<-dim(acmi)[1]
c(est,p.val,low95,hi95,n)
}

> sapply(species,rep.func.lm)->out
> colnames(out)<-species
> out[2,]->pvalues
> numtests<-length(pvalues)
> sort(pvalues)->pvals
> unadjusted<-(pvals<.05)
> alphai<-0.05/numtests
> bonf<-rep(.05/numtests,numtests)
> holm<-.05/seq(numtests,1,-1)
> fdr<-(1:numtests)*.05/numtests


#carry out tests

> b.test<-(pvals<bonf)
> holm.t<-(pvals<holm)
> fdr.t<-(pvals<fdr)
> cbind(pvals,b.test,holm.t,fdr.t)->results

> cbind(pvals,unadjusted,b.test,holm.t,fdr.t)->results

            pvals unadjusted b.test holm.t fdr.t
MEIN 0.0007769294          1      1      1     1
MAIN 0.0007944877          1      1      1     1
LUAL 0.0016069697          1      1      1     1
ATSE 0.0024154024          1      0      0     1
COGI 0.0051616813          1      0      0     1
BAPI 0.0082593612          1      0      0     1
SOOL 0.0091516301          1      0      0     1
DISP 0.0147043609          1      0      0     1
CIOC 0.0160893434          1      0      0     1
SIBE 0.0674332787          0      0      0     0
CAAE 0.1188353147          0      0      0     0
LODE 0.1368122627          0      0      0     0
ACMI 0.1910542981          0      0      0     0
ERGL 0.2376664725          0      0      0     0
ERGR 0.4405216964          0      0      0     0
ISME 0.4619848747          0      0      0     0
BRDI 0.4668980374          0      0      0     0
GNLU 0.4985453188          0      0      0     0
CAMA 0.5453609184          0      0      0     0
HOBR 0.7498058561          0      0      0     0
MEPO 0.7544993470          0      0      0     0
ATCA 0.7711217886          0      0      0     0
AVXX 0.7762430284          0      0      0     0
ASMI 0.7898521784          0      0      0     0
DAPU 0.8177843203          0      0      0     0


> sig.holm<-vector(mode='numeric',length=n)
> if(!any(holm.t)) sig.holm[1:n]<-0 else
if(!any(holm.t==0)) sig.holm[1:n]<-1 else {
min((1:25)[holm.t==0])->min.holm
sig.holm[1:(min.holm-1)]<-1
sig.holm[min.holm:25]<-0 }


> sig.fdr<-vector(mode='numeric',length=25)
> if(!any(fdr.t)) sig.fdr[1:n]<-0 else
if(!any(fdr.t==1)) sig.fdr[1:n]<-1 else {
max((1:25)[fdr.t==1])->max.fdr
sig.fdr[(max.fdr+1):25]<-0
sig.fdr[1:max.fdr]<-1 }

#unadjusted
> names(pvals)[unadjusted]
[1] "MEIN" "MAIN" "LUAL" "ATSE" "COGI" "BAPI" "SOOL" "DISP" "CIOC"
#Bonferroni
> names(pvals)[b.test]
[1] "MEIN" "MAIN" "LUAL"
#sequential Bonferroni
> names(pvals)[sig.holm==1]
[1] "MEIN" "MAIN" "LUAL"
#FDR results
> names(pvals)[sig.fdr==1]
[1] "MEIN" "MAIN" "LUAL" "ATSE" "COGI" "BAPI" "SOOL" "DISP" "CIOC"

> newdat<-data.frame(rownames(results),sig.fdr,sig.holm)
> newdat<-newdat[order(newdat[,1]),]
> out1<-cbind(t(out),1:25)
> out2<-out1[order(rownames(out1)),]
> out3<-cbind(out2,newdat[,2],newdat[,3])

> out3[,6]<-1:25

#generate the graph
> par(mar=c(5.1,5.1,2.1,5.1))
> plot(range(out3[,3:4]), c(.75,25.25), xlab='Mean Change in Cover', type='n', axes=F, ylab='')
> axis(1)
> axis(2,at=out3[,6], labels=rownames(out3), cex.axis=.75, las=2)
> arrows(out3[,3], out3[,6], out3[,4], out3[,6], code=3, angle=90, length=.05, col=(out3[,7]+1), lwd=2)
> points(out3[,1], out3[,6], pch=16, cex=1.2, col='white')
> points(out3[,1], out3[,6], pch=1, cex=1.1, col=(out3[,7]+1))
> mytext<-ifelse(out3[,8]==1, paste('n = ', out3[,5],'*'), paste('n = ', out3[,5]))
> mycol<-ifelse(out3[,8]==1,4,1)
> mtext(side=4,line=.5, at=1:25, text=mytext, col=mycol, cex=.75, las=2)
> abline(v=0,lty=2, col='seagreen')
> box()
> mtext(side=3, line=.5, 'Results based on randomized block design')

Randomized Block Design Multiple Comparison Results
(species showing significant differences in cover values)

Unadjusted
Bonferroni
Sequential Bonferroni
False Discovery Rate (FDR)
"MEIN" "MAIN" "LUAL" "ATSE" "COGI" "BAPI" "SOOL" "DISP" "CIOC"
"MEIN" "MAIN" "LUAL"
"MEIN" "MAIN" "LUAL"
"MEIN" "MAIN" "LUAL" "ATSE" "COGI" "BAPI" "SOOL" "DISP" "CIOC"
Fig. 10  Confidence intervals for species-specific t-tests based on a randomized block design. Confidence intervals in red are significant with FDR = 0.05 (9 species). Results with a blue * (shown in sample sizes on right side) are significant using a sequential Bonferroni adjustment with an experimentwise error rate α = .05 (3 species)

Solution using a random effects model

The mixed effects model treats the blocks of the randomized block design as random effects. The only change from that code is in locating the statistics needed from the output of the lme function. The confidence intervals are constructed by hand. I summarize the significance test results at the end.

################# mixed model ###################
> cover<-read.table('http://www.unc.edu/courses/2007spring/enst/562/001/assignments/final/coverclass.txt', header=T)
> unique(cover$CHISSpCode)->species
> library(nlme)
> rep.func.lme<-function(x) {
cover[cover$CHISSpCode==x,]->acmi
new.acmi<-data.frame(rep(acmi$PlotCode,2), c(acmi$Coverclass.1983, acmi$Coverclass.2002), rep(c(1983,2002), c(dim(acmi)[1], dim(acmi)[1])))
colnames(new.acmi)<-c('Plot', 'Cover', 'Year')
lme(Cover~factor(Year),random=~1|Plot, data=new.acmi, method='ML')->out5
summary(out5)->out6
p.val<-out6$tTable[2,5]
est<-out6$tTable[2,1]
df<-out6$tTable[2,3]
se<-out6$tTable[2,2]
low95<-est+qt(.025,df)*se
hi95<-est+qt(.975,df)*se
n<-dim(acmi)[1]
c(est,p.val,low95,hi95,n)
}

> sapply(species,rep.func.lme)->out
> colnames(out)<-species


> out[2,]->pvalues
> numtests<-length(pvalues)
> sort(pvalues)->pvals
> unadjusted<-(pvals<.05)
> alphai<-0.05/numtests
> bonf<-rep(.05/numtests,numtests)
> holm<-.05/seq(numtests,1,-1)
> fdr<-(1:numtests)*.05/numtests
> #carry out tests
> b.test<-(pvals<bonf)
> holm.t<-(pvals<holm)
> fdr.t<-(pvals<fdr)
> cbind(pvals,b.test,holm.t,fdr.t)->results

> results
            pvals b.test holm.t fdr.t
MEIN 0.0005888852      1      1     1
MAIN 0.0007944875      1      1     1
LUAL 0.0016069638      1      1     1
ATSE 0.0024154023      0      0     1
COGI 0.0051616859      0      0     1
BAPI 0.0082593609      0      0     1
SOOL 0.0085271585      0      0     1
CIOC 0.0118833699      0      0     1
DISP 0.0147043609      0      0     1
SIBE 0.0429602640      0      0     0
CAAE 0.1188352972      0      0     0
LODE 0.1368122627      0      0     0
ACMI 0.1573816360      0      0     0
ERGL 0.1872034914      0      0     0
GNLU 0.4331929779      0      0     0
ERGR 0.4405216848      0      0     0
ISME 0.4619854920      0      0     0
BRDI 0.4668980315      0      0     0
CAMA 0.5453609093      0      0     0
MEPO 0.7161281594      0      0     0
HOBR 0.7498058540      0      0     0
DAPU 0.7564951971      0      0     0
ATCA 0.7711217816      0      0     0
AVXX 0.7762430425      0      0     0
ASMI 0.7898523960      0      0     0


> sig.holm<-vector(mode='numeric',length=n)
> if(!any(holm.t)) sig.holm[1:n]<-0 else
if(!any(holm.t==0)) sig.holm[1:n]<-1 else {
min((1:25)[holm.t==0])->min.holm
sig.holm[1:(min.holm-1)]<-1
sig.holm[min.holm:25]<-0 }


> sig.fdr<-vector(mode='numeric',length=25)
> if(!any(fdr.t)) sig.fdr[1:n]<-0 else
if(!any(fdr.t==1)) sig.fdr[1:n]<-1 else {
max((1:25)[fdr.t==1])->max.fdr
sig.fdr[(max.fdr+1):25]<-0
sig.fdr[1:max.fdr]<-1 }

#unadjusted
> names(pvals)[unadjusted]
[1] "MEIN" "MAIN" "LUAL" "ATSE" "COGI" "BAPI" "SOOL" "CIOC" "DISP" "SIBE"
#Bonferroni
> names(pvals)[b.test]
[1] "MEIN" "MAIN" "LUAL"
#sequential Bonferroni
> names(pvals)[sig.holm==1]
[1] "MEIN" "MAIN" "LUAL"
#FDR results
> names(pvals)[sig.fdr==1]
[1] "MEIN" "MAIN" "LUAL" "ATSE" "COGI" "BAPI" "SOOL" "CIOC" "DISP"

> newdat<-data.frame(rownames(results),sig.fdr,sig.holm)
> newdat<-newdat[order(newdat[,1]),]
> out1<-cbind(t(out),1:25)
> out2<-out1[order(rownames(out1)),]
> out3<-cbind(out2,newdat[,2],newdat[,3])
> out3[,6]<-1:25

#create graph
> par(mar=c(5.1,5.1,2.1,5.1))
> plot(range(out3[,3:4]), c(.75,25.25), xlab='Mean Change in Cover' ,type='n', axes=F, ylab='')
> axis(1)
> axis(2,at=out3[,6], labels=rownames(out3), cex.axis=.75, las=2)
> arrows(out3[,3], out3[,6], out3[,4], out3[,6], code=3, angle=90, length=.05, col=(out3[,7]+1), lwd=2)
> points(out3[,1], out3[,6], pch=16, cex=1.2, col='white')
> points(out3[,1], out3[,6], pch=1, cex=1.1, col=(out3[,7]+1))
> mytext<-ifelse(out3[,8]==1, paste('n = ', out3[,5],'*'), paste('n = ', out3[,5]))
> mycol<-ifelse(out3[,8]==1,4,1)
> mtext(side=4, line=.5, at=1:25, text=mytext, col=mycol, cex=.75, las=2)
> abline(v=0,lty=2, col='seagreen')
> box()
> mtext(side=3,line=.5, 'Results based on random intercepts model')

Random Effects Model Multiple Comparison Results
(species showing significant differences in cover values)

Unadjusted
Bonferroni
Sequential Bonferroni
False Discovery Rate (FDR)
"MEIN" "MAIN" "LUAL" "ATSE" "COGI" "BAPI" "SOOL" "CIOC" "DISP" "SIBE"
"MEIN" "MAIN" "LUAL"
"MEIN" "MAIN" "LUAL"
"MEIN" "MAIN" "LUAL" "ATSE" "COGI" "BAPI" "SOOL" "DISP" "CIOC"
Fig. 11  Confidence intervals for species-specific t-tests from a random effects model with random intercepts for each relevé. Confidence intervals in red are significant with FDR = 0.05 (9 species). Results with a blue * (shown in sample sizes on right side) are significant using a sequential Bonferroni adjustment with an experimentwise error rate of α = .05 (3 species)

Problem 3

I read in the data and merge files just as was done in Assignment 10.

#load data
> srrs2 <- read.table ("http://www.unc.edu/courses/2007spring/enst/562/001/assignments/assign10/radon.txt", header=T, sep=",")
#get Minnesota
> minn<-srrs2[srrs2$state=="MN",]
#abbreviate county code
> minn$countycode<-substr(as.character(minn$county),1,6)
#get county data
> cty <- read.table ("http://www.unc.edu/courses/2007spring/enst/562/001/assignments/assign10/cty.txt", header=T, sep=",")
#select Minnesota
> minn2<-cty[cty$st=='MN',]
#combine data sets
> merge(minn,minn2,by.x="cntyfips",by.y="ctfips")->merge3

Next I set up the numeric variables for use in WinBUGS.

#log transform activity
> y<-log(ifelse (merge3$activity ==0, .05, merge3$activity))
> J<-length(unique(merge3$county))
> county<-as.numeric(factor(merge3$county))
> x<-merge3$floor
> n<-length(y)
#log transform and select uranium values one per county
> z<-as.vector(tapply(log(merge3$Uppm), factor(merge3$county), function(x) x[1]))

I create the input objects to WinBUGS.

> radon.data<-list("n","J","y","county","x","z")
> radon.inits<-function(){list(a=rnorm(J),b=rnorm(1),a1=rnorm(1),
> sigma.y=runif(1), mu.a=rnorm(1),sigma.a=runif(1))}
> radon.parameters<-c("a","b","sigma.y","mu.a","sigma.a","a1")

The following WinBUGS model file is stored in D:/enst 562/final/radonmodel.txt. This is only a trivial modification of the random intercepts WinBUGS model shown in the notes.

model{
for(i in 1:n) {
y[i]~dnorm(y.hat[i],tau.y)
y.hat[i]<-a[county[i]]+b*x[i]
}
b~dnorm(0,.0001)
tau.y<-pow(sigma.y,-2)
sigma.y~dunif(0,100)
for(j in 1:J){
a[j]~dnorm(a.hat[j],tau.a)
a.hat[j]<-mu.a+a1*z[j]
}
mu.a~dnorm(0,.0001)
a1~dnorm(0,.0001)
tau.a<-pow(sigma.a,-2)
sigma.a~dunif(0,100)
}

I set the R working directory to the directory that contains the model description.

> setwd("D:/enst 562/final")
> library(arm)

I run the model. After a couple of preliminary runs in which the sigma.a chains were not mixing well enough, I let the model run for 1500 iterations throwing away the first 750 iterates. The Rhat statistic is reported to be 1 for all parameters indicating adequate mixing.

> radon.results<-bugs(radon.data, radon.inits, radon.parameters, "radonmodel.txt", n.chains=3, bugs.directory="D:/Program Files/WinBUGS14/", n.iter=1500, debug=T)
> radon.results

Inference for Bugs model at "radonmodel.txt", fit using winbugs,
3 chains, each with 1500 iterations (first 750 discarded), n.thin = 2
n.sims = 1125 iterations saved
           mean   sd   2.5%    25%    50% 75% 97.5% Rhat n.eff
a[1]        0.9  0.2    0.6    0.9    1.0    1.0    1.3 1 1100
a[2]        0.9  0.1    0.7    0.8    0.9    0.9    1.0 1  320
a[3]        1.4  0.2    1.1    1.3    1.4    1.5    1.7 1 1100
a[4]        1.1  0.2    0.9    1.0    1.1    1.2    1.5 1 1100
a[5]        1.4  0.1    1.1    1.3    1.4    1.5    1.6 1  520
a[6]        1.7  0.2    1.4    1.6    1.7    1.8    2.0 1 1100
a[7]        1.8  0.1    1.5    1.7    1.8    1.9    2.1 1  390
a[8]        1.7  0.2    1.4    1.6    1.7    1.8    2.1 1 1100
a[9]        1.1  0.1    0.9    1.1    1.2    1.2    1.4 1  870
a[10]       1.5  0.1    1.2    1.4    1.5    1.6    1.8 1  770
a[11]       1.1  0.2    0.8    1.0    1.1    1.2    1.4 1  660
a[12]       1.7  0.2    1.4    1.6    1.7    1.8    2.0 1 1100
a[13]       1.0  0.2    0.7    0.9    1.0    1.1    1.3 1 1100
a[14]       1.8  0.1    1.6    1.7    1.8    1.9    2.1 1 1100
a[15]       1.4  0.2    1.1    1.3    1.4    1.5    1.7 1 1100
a[16]       1.1  0.2    0.7    1.0    1.1    1.2    1.4 1  960
a[17]       1.6  0.2    1.3    1.5    1.6    1.7    1.9 1  340
a[18]       1.0  0.1    0.8    1.0    1.0    1.1    1.3 1 1100
a[19]       1.4  0.1    1.2    1.3    1.4    1.4    1.5 1  680
a[20]       1.7  0.2    1.4    1.6    1.7    1.8    2.0 1 1100
a[21]       1.6  0.1    1.3    1.5    1.6    1.7    1.9 1  820
a[22]       1.5  0.2    1.1    1.3    1.5    1.6    1.8 1 1100
a[23]       1.7  0.2    1.4    1.6    1.7    1.8    2.0 1 1000
a[24]       1.8  0.1    1.5    1.7    1.7    1.9    2.1 1 1100
a[25]       1.7  0.1    1.5    1.6    1.7    1.8    2.0 1  410
a[26]       1.4  0.1    1.2    1.3    1.4    1.4    1.5 1  640
a[27]       1.8  0.2    1.5    1.7    1.8    1.9    2.1 1 1100
a[28]       1.2  0.1    0.9    1.1    1.2    1.3    1.5 1 1100
a[29]       0.9  0.2    0.6    0.8    0.9    1.0    1.3 1 1100
a[30]       1.0  0.1    0.7    0.9    1.0    1.1    1.2 1 1100
a[31]       1.7  0.1    1.5    1.6    1.7    1.8    2.0 1 1100
a[32]       1.4  0.2    1.1    1.3    1.4    1.5    1.7 1 1000
a[33]       1.6  0.2    1.3    1.5    1.6    1.7    1.9 1 1100
a[34]       1.5  0.2    1.2    1.4    1.5    1.6    1.8 1 1100
a[35]       0.8  0.1    0.5    0.7    0.8    0.9    1.1 1  700
a[36]       1.8  0.2    1.5    1.7    1.8    1.9    2.2 1  660
a[37]       0.8  0.1    0.5    0.7    0.8    0.9    1.1 1 1100
a[38]       1.1  0.2    0.8    1.0    1.1    1.2    1.5 1  770
a[39]       1.6  0.1    1.3    1.5    1.6    1.7    1.9 1 1100
a[40]       1.9  0.2    1.6    1.7    1.8    2.0    2.2 1 1100
a[41]       1.8  0.1    1.5    1.7    1.8    1.9    2.1 1  640
a[42]       1.6  0.2    1.2    1.5    1.6    1.7    1.9 1 1100
a[43]       1.5  0.1    1.2    1.4    1.5    1.6    1.8 1  790
a[44]       1.5  0.2    1.1    1.4    1.5    1.6    1.7 1 1100
a[45]       1.4  0.1    1.2    1.4    1.5    1.5    1.7 1  460
a[46]       1.4  0.1    1.1    1.4    1.4    1.5    1.7 1  730
a[47]       1.3  0.2    0.9    1.2    1.3    1.4    1.6 1 1100
a[48]       1.3  0.1    1.1    1.2    1.3    1.4    1.6 1 1100
a[49]       1.7  0.1    1.4    1.6    1.7    1.8    1.9 1 1100
a[50]       1.8  0.2    1.5    1.7    1.8    1.9    2.2 1 1100
a[51]       1.7  0.2    1.4    1.6    1.7    1.8    2.1 1 1100
a[52]       1.8  0.2    1.4    1.7    1.8    1.9    2.1 1 1100
a[53]       1.6  0.2    1.3    1.5    1.6    1.7    1.9 1  310
a[54]       1.5  0.1    1.2    1.4    1.5    1.6    1.7 1  610
a[55]       1.4  0.1    1.1    1.3    1.4    1.5    1.7 1 1100
a[56]       1.4  0.2    1.0    1.3    1.4    1.5    1.7 1  620
a[57]       1.2  0.2    0.9    1.1    1.2    1.3    1.5 1 1100
a[58]       1.8  0.2    1.5    1.7    1.8    1.9    2.1 1  330
a[59]       1.7  0.2    1.3    1.6    1.7    1.8    2.0 1  280
a[60]       1.6  0.2    1.3    1.5    1.6    1.7    1.9 1 1100
a[61]       1.2  0.1    0.9    1.1    1.2    1.2    1.4 1  250
a[62]       1.8  0.2    1.5    1.7    1.8    1.9    2.1 1  880
a[63]       1.7  0.2    1.4    1.6    1.7    1.8    2.1 1 1100
a[64]       1.7  0.1    1.4    1.6    1.7    1.8    2.0 1 1100
a[65]       1.8  0.2    1.4    1.7    1.8    1.9    2.1 1 1100
a[66]       1.4  0.1    1.2    1.3    1.4    1.5    1.7 1 1100
a[67]       1.6  0.1    1.4    1.5    1.6    1.7    1.9 1 1100
a[68]       1.0  0.2    0.7    0.9    1.0    1.1    1.3 1 1100
a[69]       1.6  0.2    1.2    1.5    1.6    1.7    1.9 1 1100
a[70]       0.9  0.1    0.8    0.9    0.9    1.0    1.0 1  910
a[71]       1.5  0.1    1.3    1.4    1.5    1.6    1.7 1 1100
a[72]       1.6  0.1    1.4    1.6    1.6    1.7    1.9 1 1100
a[73]       1.8  0.2    1.5    1.7    1.8    1.9    2.1 1 1100
a[74]       1.6  0.2    1.2    1.5    1.6    1.7    1.9 1 1100
a[75]       1.5  0.2    1.2    1.4    1.5    1.6    1.8 1  330
a[76]       1.8  0.2    1.5    1.7    1.8    1.9    2.2 1 1100
a[77]       1.6  0.1    1.4    1.5    1.6    1.7    1.9 1  740
a[78]       1.0  0.2    0.7    0.9    1.0    1.1    1.4 1 1000
a[79]       1.5  0.2    1.1    1.4    1.5    1.6    1.8 1 1100
a[80]       1.3  0.1    1.2    1.3    1.3    1.4    1.5 1 1100
a[81]       1.7  0.2    1.4    1.6    1.7    1.8    2.1 1 1100
a[82]       1.7  0.2    1.3    1.6    1.7    1.8    2.0 1 1100
a[83]       1.7  0.1    1.4    1.6    1.7    1.8    2.0 1  570
a[84]       1.5  0.1    1.2    1.4    1.5    1.6    1.8 1 1000
a[85]       1.7  0.2    1.3    1.6    1.7    1.8    2.0 1 1100
b          -0.7  0.1   -0.8   -0.7   -0.7   -0.6   -0.5 1  430
sigma.y     0.8  0.0    0.7    0.8    0.8    0.8    0.8 1  840
mu.a        1.5  0.0    1.4    1.4    1.5    1.5    1.5 1  930
sigma.a     0.2  0.0    0.1    0.1    0.2    0.2    0.3 1 1100
a1          0.7  0.1    0.5    0.7    0.7    0.8    0.9 1 1100
deviance 2127.3 11.8 2104.1 2119.0 2127.0 2135.0 2150.0 1 1100

For each parameter, n.eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor (at convergence, Rhat=1).

pD = 23.8 and DIC = 2151.1 (using the rule, pD = Dbar-Dhat)
DIC is an estimate of expected predictive error (lower deviance is better).

The results we need are easily accessed from $sims.matrix component of the bugs object.

> myparms<-radon.results$sims.matrix
> colnames(myparms)
> names(myparms)
 
 [1] "a[1]"     "a[2]"    "a[3]"   "a[4]"   "a[5]"    "a[6]"
 [7] "a[7]"     "a[8]"    "a[9]"   "a[10]"  "a[11]"   "a[12]"
[13] "a[13]"    "a[14]"   "a[15]"  "a[16]"  "a[17]"   "a[18]"
[19] "a[19]"    "a[20]"   "a[21]"  "a[22]"  "a[23]"   "a[24]"
[25] "a[25]"    "a[26]"   "a[27]"  "a[28]"  "a[29]"   "a[30]"
[31] "a[31]"    "a[32]"   "a[33]"  "a[34]"  "a[35]"   "a[36]"
[37] "a[37]"    "a[38]"   "a[39]"  "a[40]"  "a[41]"   "a[42]"
[43] "a[43]"    "a[44]"   "a[45]"  "a[46]"  "a[47]"   "a[48]"
[49] "a[49]"    "a[50]"   "a[51]"  "a[52]"  "a[53]"   "a[54]"
[55] "a[55]"    "a[56]"   "a[57]"  "a[58]"  "a[59]"   "a[60]"
[61] "a[61]"    "a[62]"   "a[63]"  "a[64]"  "a[65]"   "a[66]"
[67] "a[67]"    "a[68]"   "a[69]"  "a[70]"  "a[71]"   "a[72]"
[73] "a[73]"    "a[74]"   "a[75]"  "a[76]"  "a[77]"   "a[78]"
[79] "a[79]"    "a[80]"   "a[81]"  "a[82]"  "a[83]"   "a[84]"
[85] "a[85]"    "b"     "sigma.y"  "mu.a"   "sigma.a" "a1"
[91] "deviance"

According to the code the estimate of mean log radon levels for houses without a basement at the county level is given by a[j] + b for j = 1 to 85. Thus to obtain the posterior distributions of these we quantities in WinBUGS we could have simply added these quantities together at each iteration. Of course we still have the results of each of those iterations so we can carry out the addition in R. Thus we need to add b, column 86, to the first 85 columns on an entry by entry basis to obtain the individual county-wide posterior distributions.

> county.post<-myparms[,1:85]+myparms[,86]

Next I use the apply function to get the quantiles needed for the smear graph.

> apply(county.post,2,function(x) c(median(x), quantile(x,.025), quantile(x,.975),quantile(x,.25), quantile(x,.75)))->quants

I transpose the results so the statistics occupy rows rather than columns, exponentiate the logged values to obtain measurements on the original radon scale, append a column of identifiers, and add some column names.

> tquants<-t(quants)
> tquants2<-cbind(1:85,exp(tquants))
> colnames(tquants2)<-c('county','mean','lo95','hi95','lo50','hi50')
> results<-tquants2

Finally I borrow the code from Assignment 8 to produce the smear graph.

> par(mar=c(5.1,7.1,2.1,1.1))
> par(lend=2)
> plot(range(results[,2:6]), c(1,9), type='n', axes=F, ylab='County', xlab='Mean Radon Level', ylim=c(.5,85.5),
xlim=c(0,max(results[,2:6])))
> axis(1,cex.axis=.8)
> axis(2,at=seq(5,85,5), labels=seq(5,85,5), cex.axis=.8, las=2)
#95% confidence interval
> apply(results,1, function(x) segments(x[3], x[1], x[4], x[1], lwd=1, col='grey65'))
#50% confidence interval
> apply(results,1, function(x) segments(x[5], x[1], x[6], x[1], lwd=2, col='grey50'))
> apply(results,1,function(x) points(x[2],x[1],pch=16, col='white', cex=.8))
> apply(results,1,function(x) points(x[2],x[1],pch=1, cex=.9))
> box()
#reset line ends back to the default value.
> par(lend=0)

Fig. 12  Probability smear graph for county-level mean radon levels for homes without basements for counties in Minnesota. Estimates are based on a random effects model treating intercepts random by county. Displayed are 50% and 95% credibility intervals along with the mean of the posterior distributions.

Problem 4

> manly<- read.table('http://www.unc.edu/courses/2007spring/enst/562/001/assignments/midterm/bycatch.csv', header=T,sep=',')

To fit a Poisson regression mixed model we need to use the lmer function from lme4 package.

> library(lme4)
> lmer(Data~Area+Gear.Type+Time+(1|Season), data=manly, family=poisson, offset=log(Tows), method='Laplace')->out.reff

Next I fit the model from the midterm and also a Poisson regression model without the season effect.

> out.manly<-glm(Data~Season+Area+Gear.Type+Time, data=manly, offset=log(Tows), family=poisson)
> out.noseason<-glm(Data~Area+Gear.Type+Time, data=manly, offset=log(Tows), family=poisson)

The AIC function doesn't work on lmer objects. We can either examine the AIC in the summary output or calculate it ourselves. The random effects model estimates five parameters: 4 regression coefficients plus the variance of the random effects.

> (-2*logLik(out.reff)+2*5)[1]
[1] 78.92763
> sapply(list(out.manly,out.noseason),AIC)
[1] 97.46053 192.57006

The random effects model actually beats the fixed effects model. This is an unusual result and I've never seen it happen before. It's almost always the case that estimating fixed effects for each "group" yields a better fit than assuming the group effects are samples from a normal distribution. That being the case the argument for using the random effects model is that it is more parsimonious and does not overfit the data. Apparently the bad estimate for the 1990-1991 season in the Manly model has really damaged the fit of the Poisson regression model. It's interesting to compare the estimates returned by the two models. I extract the coefficient estimates for season from manly's model and the corresponding random effects from the random effects model.

> coef(out.manly)
(Intercept) Season1990-91 Season1991-92 Season1992-93
-7.32782132  -19.66524028    1.81894771    0.07390532
Season1993-94 Season1994-95  AreaSouth Gear.TypeMid-Water
  -0.93191404   -0.13503221 1.82245026         1.91768692
 TimeNight
2.17656359

> ranef(out.reff)
An object of class “ranef.lmer”
[[1]]
        (Intercept)
1989-90   0.4308524
1990-91  -2.5583053
1991-92   2.2216679
1992-93   0.4873996
1993-94  -0.4696015
1994-95   0.2983646

We can make the two sets of estimates comparable by subtracting off the 1989-1990 random effect from the rest.

> ranef(out.reff)[[1]]-ranef(out.reff)[[1]][1,1]
        (Intercept)
1989-90  0.00000000
1990-91 -2.98915771
1991-92  1.79081548
1992-93  0.05654718
1993-94 -0.90045390
1994-95 -0.13248775

> coef(out.manly)[2:6]
Season1990-91 Season1991-92 Season1992-93 Season1993-94 Season1994-95
 -19.66524028    1.81894771    0.07390532   -0.93191404   -0.13503221

Now we see that except for the bad estimate in 1990-1991 from Manly's model, the estimates are almost exactly the same.

Extra Credit

I set up things for WinBUGS.

> area<-as.numeric(manly$Area)-1
> gear<-as.numeric(manly$Gear.Type)-1
> time<-as.numeric(manly$Time)-1
> ltows<-log(manly$Tows)

> season<-as.numeric(manly$Season)
> y<-manly$Data
> n<-length(y)
> J<-length(unique(season))


> manly.data<-list("n","J","y","season","area","gear","time", "ltows")
> manly.inits<-function() {list(a=rnorm(J), b1=rnorm(1), b2=rnorm(1),b3=rnorm(1),
mu.a=rnorm(1), sigma.a=runif(1))}
> manly.parameters<-c("a", "b1", "b2", "b3", "mu.a", "sigma.a")

> setwd("D:/enst 562/final")
> library(arm)

The WinBUGS model file is shown next. Notice that is a trivial modification of the random intercepts model in Problem 3. Following WinBUGS convention the log link appears in the definition of the derived quantity, log(yhat), and not as one of the arguments of dpois. Observe that the offset appears in the regression equation with a coefficient of 1.

model{
 for(i in 1:n) {
 y[i]~dpois(y.hat[i]) 
 log(y.hat[i])<-a[season[i]]+ b1*area[i]+ b2*gear[i]+ b3*time[i]+ltows[i]
 }
 b1~dnorm(0,.0001)
 b2~dnorm(0,.0001)
 b3~dnorm(0,.0001)
for(j in 1:J){
   a[j]~dnorm(a.hat[j],tau.a)
   a.hat[j]<-mu.a
   }
   mu.a~dnorm(0,.0001)
   tau.a<-pow(sigma.a,-2)
   sigma.a~dunif(0,100)
   }
I fit the model using 1500 iterations to improve the mixing.

> manly.3c<-bugs(manly.data, manly.inits, manly.parameters, "manlymodel2.txt", n.chains=3, bugs.directory = "D:/Program Files/WinBUGS14/", n.iter=1500, debug=T)

The columns labeled 2.5% and 97.5% provide 95% credibility intervals for each model parameter.

> manly.3c
Inference for Bugs model at "manlymodel2.txt", fit using winbugs,
3 chains, each with 1500 iterations (first 750 discarded), n.thin = 2
n.sims = 1125 iterations saved
         mean  sd  2.5%   25%   50%   75% 97.5% Rhat n.eff
a[1]     -7.6 0.6  -8.8  -8.1  -7.7  -7.2 -6.4     1   670
a[2]    -12.5 3.1 -20.4 -13.4 -11.8 -10.7 -9.1     1   410
a[3]     -5.8 0.6  -6.9  -6.2  -5.8  -5.4 -4.8     1