Lecture 31 (lab)—Friday, March 30, 2007
What was covered?
R functions and commands demonstrated
- array creates arrays
- as.vector converts objects such as matrices to vectors
- mvrnorm (in MASS package) generates random observations from a multivariate normal distribution
- nCm (in combinat package) calculates 'n choose m', the number of different combinations of size m that can be selected from a set of n elements
- oneway_test (in coin package) carries out two-sample and k-sample statistical randomization tests
- replicate will repeatedly apply a function a specified number of times
- vcov extracts the variance-covariance matrix of the parameters in a regression model
R function options
- alternative= (argument to oneway_test) is used to specify the "direction" of the hypothesis test to be carried out. Possible choices are "two.sided" (the default), "less", or "greater".
- distribution= (argument to oneway_test) is used to specify how to calculate the p-value of the randomization test. Possible choices are "asymptotic" (default), "approximate", or "exact".
- mu= (argument to mvrnorm) is a vector specifying the mean of a multivariate normal distribution
- Sigma= (argument to mvrnorm) is the covariance matrix of a multivariate normal distribution
R packages used
- coin contains functions for carrying out randomization tests such as oneway_test
- combinat contains combinatorial functions such as nCm
- MASS for the mvrnorm function
The parametric bootstrap
- We refit the coral reef disease model from Assignment 7, Question 2. This is a negative binomial regression model in which Prevalence is the response with Coral.Cover and WSSTA and WSSTA2 as the regressors.
> corals<-read.table('http://www.unc.edu/courses/2007spring/enst/562/001/assignments/assign7/corals.txt', header=T,sep=',')
> library(MASS)
>
model<-glm.nb(Prevalence~WSSTA+I(WSSTA^2)+Coral.Cover, data=corals)
- In Assignment 7 you did some preliminary work to assess fit. We'll extend that here by using a parametric bootstrap to carry out what is called predictive simulation. In predictive simulation the regression model is used to repeatedly generate new data. If the model fits the data then you would expect that data generated from the model will resemble the data that were actually collected. To assess fit we look to see if the observed data set appears to be a typical member of the simulation set. There are two steps to the process.
- Incorporating inferential uncertainty in the model.
- Adding predictive uncertainty to the model predictions.
Inferential uncertainty: uncertainty in the coefficient estimates
- Inferential uncertainty addresses the the fact that our parameter estimates are only estimates and thus are expected to vary with repeated sampling. This variability is captured in their standard errors. Furthermore the individual parameters estimated from a single data set are not independent and possess a covariance. This covariance can be extracted from the fitted model with the vcov function.
> vcov(model)
(Intercept) WSSTA I(WSSTA^2) Coral.Cover
(Intercept) 0.1602111601 -1.658806e-02 6.823475e-04 -2.774434e-03
WSSTA -0.0165880645 6.877421e-03 -3.210644e-04 1.278561e-05
I(WSSTA^2) 0.0006823475 -3.210644e-04 1.652645e-05 -1.256437e-06
Coral.Cover -0.0027744338 1.278561e-05 -1.256437e-06 8.723762e-05
- Displayed along the diagonal are the variances of the four model parameter estimates. The off-diagonal entries are the pairwise covariances. The variances tell us the extent to which we should expect the individual parameter estimates to vary. The covariances in turn serve to constrain this variability. For instance, the intercept and the coefficient of WSSTA are negatively correlated (negative covariance). Thus if the estimate of the intercept in a new sample turns out to be larger than the current point estimate (in accordance with its nonzero variance), the estimate of WSSTA would be expected to be smaller. On the other hand, not surprisingly, WSSTA and WSSTA2 have a positive covariance.
- The point estimates can be extracted from the model object with the coef function.
> coef(model)
(Intercept) WSSTA I(WSSTA^2) Coral.Cover
-0.74110690 0.47187658 -0.02387003 0.04742894
- Likelihood theory tells us that the maximum likelihood estimates of the parameters have an asymptotic normal distribution. If we let β denote the vector of parameter estimates and Σ their covariance matrix, then theory tells us

where the distribution shown is the multivariate normal distribution, the multivariate analog of the ordinary univariate normal distribution.
- There is one additional parameter in a negative binomial model, the dispersion, θ. In a typical maximum likelihood estimation problem,
would be an element of β. But the glm.nb function doesn't use a full likelihood approach to estimate the parameters of the negative binomial model. Instead it alternates between estimating β and estimating θ. As a result
is estimated separate from the model coefficients and has its own standard error. From likelihood theory we assume
.
> model$theta
[1] 0.3339442
> model$SE.theta
[1] 0.03734786
- To introduce inferential uncertainty, we use the multivariate and univariate normal distributions to obtain sample values from the distribution of β and θ. To obtain samples of β we use the random multivariate normal function from the MASS package, mvrnorm. To generate values of θ we use the rnorm function in base R. The two lines below generate 1000 random variates from each. The second and third arguments of mvrnorm are the mean, mu, and variance-covariance matrix, Sigma.
> beta<-mvrnorm(1000, coef(model), vcov(model))
> theta<-rnorm(1000, model$theta, sd=model$SE.theta)
- The negative binomial regression model uses β to predict log μ, where μ is the mean of a negative binomial distribution. (Recall that in a negative binomial regression a log link function is typically used.)

or as a single matrix equation

- The matrix of data values is called the design matrix. For the 236 observations in the coral data set we can create the design matrix as follows.
xmat<-cbind(rep(1, dim(corals)[1]) ,corals$WSSTA ,corals$WSSTA^2, corals$Coral.Cover)
- With the design matrix we can obtain the 236 estimated means using the first simulation set of β values as follows.
#obtain negative binomial means for first simulation
>
as.vector(exp(xmat%*%beta[1,]))->test
#display first ten means
> test[1:10]
[1] 0.04169032 1.74834239 3.57797963 6.16198930 10.95517412
[6] 0.10016315 1.58017517 1.74648264 1.87648033 3.98037768
Predictive uncertainty: uncertainty in the individual outcomes
- The last step is to use the predicted means and the simulated dispersion to generate new disease prevalence values. This is done with rnbinom function that generates random values from a negative binomial distribution. Since there are 236 estimated mean values for each simulation, one for each observation in the data set, and 1000 simulations, we need a 1000 × 236 matrix to hold the results. I initialize an array of the appropriate size using the array function.
> n<-length(corals$WSSTA)
> n.sims<-1000
#initialize the array with missing values
> y.rep<-array(NA,c(n.sims,n))
- A loop is not the most efficient way to do things repeatedly in R, but because the logic of the process is more transparent when viewed as a loop, I simulate new disease prevalences using an R for loop. Each run through the loop uses a different vector of β values and a different element of θ, the vector of simulated dispersion values. The fist line of the loop uses the current value of β to generate 236 new mean values. The second line of the loop generates a single observation from a negative binomial distribution for each of the 236 means using the current value of θ.
for(s in 1:n.sims) {
mus<- exp(xmat%*%beta[s,])
y.rep[s,]<-sapply(mus,function(x) rnbinom(1,mu=x, size=theta[s]))
}
Assessing model fit with the parametric bootstrap
- To assess fit, we examine various characteristics of the observed data and see how well they match up against the simulated values. A particularly simple approach is to examine the distribution of simulated values to determine whether the observed data look like a typical member from this distribution.
- There are no hard and fast rules for determining what data features to examine, but obvious choices are those features of the data that are considered crucial to the problem at hand. For the coral reef data these might include the maximum observed prevalence values, overall or for selective parameter combinations, or the fraction of disease-free corals.
- A good starting place is with the overall data distribution. In the next plot I compare a histogram of disease prevalence for three simulated distributions with the distribution observed in the actual data set. I set the bin widths to be 5 in all the plots to make the histograms more comparable.
> par(mfrow=c(2,2))
> sapply(1:3,function(x) hist(y.rep[x,], breaks=seq(-.5, max(y.rep[x,])+5,5)))
#now add the real thing
> hist(corals$Prevalence, breaks=seq(-.5, max(corals$Prevalence)+5,5))

Fig. 1 Histograms of three simulated data sets and the real data set
- As the plots show the three displayed simulated distributions look very much like the distribution of the actual data.
- Next I calculate the maximum disease prevalence for each simulation, display the results in a boxplot, and then overlay the observed maximum prevalence as a red asterisk. To obtain the maximum values I use the apply function to apply the max function separately to each row of the matrix of simulations. I choose to display the boxplot horizontally. The y-coordinate of the midline of the boxplot is y = 1 and I use this to plot the actual maximum in the points function.
#reset the graphics window
> par(mfrow=c(1,1))
#maximum prevalence
> apply(y.rep, 1, max)->max.out
> boxplot(max.out, horizontal=T, xlab='Maximum prevalence')
#is the observed maximum unusual?
> points(max(corals$Prevalence), 1, pch=8, col=2, cex=1.5)

Fig. 2 Maximum disease prevalence for simulated and observed data (*)
- As the plot shows, the observed maximum prevalence is a very typical value from this distribution.
- Next I look at the fraction of zero prevalence values. The generic function sum(x==0)/length(x) used below counts the number of values equal to zero in each simulation and divides by the number of observations in each simulation, 236. This time I display the simulated values in a histogram.
#zero fraction
> apply(y.rep,1, function(x) sum(x==0)/ length(x))->zero.frac
> hist(zero.frac)
> points(sum(corals$Prevalence==0)/ length(corals$Prevalence), 0, pch=8, col=2, cex=1.5)

Fig. 3 Fraction of zero prevalences for simulated and observed data (*)
- Once again the observed zero fraction looks like a fairly typical member of the distribution of simulated values.
- The statistics we've examined so far are all very crude measures in that they are summarizing over the entire set of observations. Because the point of the model was to establish a connection between the temperature metric WSSTA and disease prevalence, it might make more sense to evaluate our statistics conditional on the value of WSSTA. In this spirit I repeat the maximum prevalence calculations but this time calculate the maximum separately for each unique value of WSSTA. I display below the results obtained for for the first two simulations.
#look at maximum value at each unique value of WSSTA
> sapply(1:1000, function(x) tapply(y.rep[x,], corals$WSSTA, max))->max.x
> t(max.x[,1:2])
0 1 2 3 4 5 6 7 8 9 10 11 13 14 15 16 17 18 19 20
[1,] 23 24 191 17 17 29 107 159 85 23 35 339 34 65 101 3 29 40 42 4
[2,] 83 40 19 331 183 246 54 18 328 95 69 27 66 8 38 6 21 19 2 3
21 22 23 24 25 26 27 30
[1,] 0 3 3 0 0 0 0 0
[2,] 0 0 2 0 0 0 0 0
- Next I stack the results of each simulation, currently arranged as the columns of a matrix, into a single vector using the as.vector function. I also add the names of the entries (the WSSTA values) as an additional column in the data frame I create.
#first column are the WSSTA labels; the second column is the maximum prevalences
> data.frame(rep(as.numeric(rownames(max.x)),1000), as.vector(max.x))->max.WSSTA
>
colnames(max.WSSTA)<-c('WSSTA','max.prevalence')
- For plotting purposes I make a factor out of the WSSTA values and arrange the levels in sorted numerical order.
>max.WSSTA$cats<-factor(max.WSSTA$WSSTA, levels=sort(unique(max.WSSTA$WSSTA)))
- Next I produce individual boxplots for each unique value of WSSTA. In order to record the y-coordinates of the boxplots, I assign the return value of the boxplot to a variable.
#save boxplot locations
> boxplot(max.WSSTA$max.prevalence~ max.WSSTA$WSSTA, ylab='WSSTA', xlab='Maximum prevalence', horizontal=T, axes=F)->hmm
> axis(2, cex.axis=.8, las=2, at=sort(unique(hmm$group)), labels=levels(max.WSSTA$cats))
> axis(1)
> box()
- Next I use the plotted locations of the boxes, contained in the group component of the boxplot object, to overlay the observed maximum prevalences at each unique value of WSSTA.
> points(tapply(corals$Prevalence, corals$WSSTA,max), sort(unique(hmm$group)), pch=8, col=2, cex=1.2)

Fig. 5 Maximum prevalence as a function of WSSTA, simulated and observed
- This plot yields a different picture than did Fig. 2. While none of the observed values are truly anomalous we do see that the maximum prevalences at WSSTA values of 5, 7, 10, 11, and 13 are beyond the 3rd quartile of the simulated distributions. This suggests that the postulated quadratic relationship may not be ideal.
- It's worth comparing this approach to goodness of fit with what you did in Assignment 7. There you did something similar except that you used the model estimates to generate simulated prevalences and then calculated a p-value for the observed prevalences at each unique value of WSSTA. That approach differs from what was done here in the following respect. Although both approaches incorporate predictive uncertainty in their assessment of fit, predictive simulation also includes coefficient uncertainty. Thus predictive simulation is a more honest evaluation of model quality.
A randomization test for a two-sample comparison
- As an illustration of how to carry out a randomization test in R we revisit the slugs data set that we've used a number of times. Recall that the data consist of the number of slugs found under rocks in two types of fields, denoted Nursery and Rookery.
> slugs<-read.table( 'http://www.bio.ic.ac.uk/research/mjcraw/statcomp/data/slugsurvey.txt', header=TRUE)
> names(slugs)
[1] "slugs" "field"
> tapply(slugs$slugs,slugs$field,length)
Nursery Rookery
40 40
- In carrying out a randomization test we essentially pretend that the group labels have been lost and we are forced to randomly reassign the group labels to observations. This is then repeated multiple times each time computing a statistic of interest. The distribution of the statistic we obtain is called the randomization distribution, the distribution of the statistic under the assumption that the group labeling doesn't matter. If the same statistic calculated using the true group labels does not look like a typical member of the randomization distribution, then we conclude that the group labels matter and that groups are significantly different from each other with respect to the statistic in question.
- If the number of ways of rearranging the groups is small, then it makes sense to systematically carry out every possible randomization. If the number of possible rearrangements is large, we randomly sample from these randomizations instead. For the slugs data set we have 80 observations assigned to groups of size 40. The number of different ways of assigning 80 observations to two groups of size 40 is quite large. The nCm function from the combinat package calculates combinatorials.
> library(combinat)
> nCm(80,40)
[1] 1.075072e+23
- With a number this large we should take a sample of random assignments rather than try to enumerate the results from all of them.
- In carrying out a randomization test one decision that needs to be made is what statistic to use. An obvious choice for the slug data set is the difference in mean abundance in the two field types. I begin by calculating the observed mean difference.
> tapply(slugs$slugs, slugs$field, mean)->truemean
> truediff<-truemean[1]-truemean[2]
> truediff
Nursery
-1
- I next construct the randomization distribution. Rather than doing this as an explicit loop I make use of one of the built-in looping functions in R, in this case the replicate function. To use it I first write a function that carries out a single randomization and then calculates the difference in means for that random reassignment.
myperm<-function() {
#permute the response
newslugs<-sample(slugs$slugs)
#obtain the test statistic
tapply(newslugs,slugs$field,mean)->mymeans
mymeans[1]-mymeans[2]
}
- Notice that the permutation function contains the same two lines I used above to obtain the true difference in means. The only addition is the first line that uses the sample function of R to randomly permute the slug counts. When the sample function is given a single argument, a vector of data, it randomly permutes the elements of that vector.
- Notice that the choice made was to randomly reorder the values of the response rather than the group labels. This is a good practice to follow. For this example the choice is immaterial, but for more complicated problems the difference can be crucial. The group labels are part of the experimental design and by sampling the labels it is possible to disrupt the design if you're careless in how the sampling is done. This can't happen if it is the response vector that is permuted.
- Although the function, myperm, that I wrote has no arguments, it still needs the empty parentheses as part of its definition.
- The replicate function of R takes two arguments: the number of replicates desired and a set of expressions that are to be repeatedly evaluated. I replicate the function myperm, defined above, 999 times. I first set the value of the random seed so I can reproduce the results if desired.
> set.seed(10)
> out.diff<-replicate(999,myperm())
- Next I append the observed difference in means to the 999 randomized differences. To calculate a one-sided lower-tailed p-value I count the number of simulated differences that are less than or equal to the observed difference.
> alldiffs<-c(out.diff,truediff)
> onetail<-sum(alldiffs<=truediff)/length(alldiffs)
> onetail
[1] 0.028
- To obtain a two-tailed p-value we can either double this value, or add up the upper and lower tail areas separately. For symmetric distributions these calculations will yield the same result, but for asymmetric distributions the results may differ.
#symmetric p-value
> 2*sum(alldiffs<=truediff)/1000
[1] 0.056
#asymmetric p-value
> sum(abs(alldiffs)>=abs(truediff))/1000
[1] 0.056
- The use of the difference in means as a test statistic is not crucial here. A number of other statistics will give the same answer. Such statistics are called equivalent statistics. Two statistics that are equivalent to the differences in means are (1) the sum of the observations in one group and (2) the equal variances independent samples t-statistic. I rewrite my randomization function so that it returns all three of these test statistics.
- The observed value of the t-statistic is contained in the statistic component of the object created by the t.test function. I set var.equal=TRUE to carry out the equal variances version of the test.
myperm<-function() {
newslugs<-sample(slugs$slugs)
tapply(newslugs,slugs$field,mean)->mymeans
t.test(newslugs~slugs$field,var.equal=T)->out
c(mymeans[1]-mymeans[2],sum(newslugs[slugs$field=='Nursery']), out$statistic)
}
- This time I obtain 4999 randomizations, a number large enough to ensure potential significance at the .01 level.
> replicate(4999,myperm())->equiv.stat
- Finally I evaluate the two new statistics on the observed data and append these values to the simulation results. The one-tailed randomization test p-values are all the same verifying the equivalence of these three statistics.
> t.test(slugs$slugs~slugs$field, var.equal=T)->out
> cbind(equiv.stat,c(truediff, sum(slugs$slugs[slugs$field=='Nursery']), out$statistic))->full.set
> sapply(1:3,function(x) sum(full.set[x,]<=full.set[x,5000])/5000)
[1] 0.029 0.029 0.029
- Many types of randomization tests can be performed with the coin package of R. The function that carries out two-sample randomization tests is oneway_test. By default this function returns an asymptotic, large sample approximation to the p-value.
> library(coin)
> oneway_test(slugs~field,data=slugs)
Asymptotic 2-Sample Permutation Test
data: slugs by groups Nursery, Rookery
Z = -1.9397, p-value = 0.05242
alternative hypothesis: true mu is not equal to 0
- The randomization test we carried out above based on a sample of permutations can be obtained from oneway_test with the option distribution='approximate'.
> oneway_test(slugs~field,data=slugs,distribution='approximate')
Approximative 2-Sample Permutation Test
data: slugs by groups Nursery, Rookery
Z = -1.9397, p-value = 0.05
alternative hypothesis: true mu is not equal to 0
- Controlling the number of randomizations for the test requires an additional option as is shown next.
> oneway_test(slugs~field, data=slugs, distribution=approximate(B=4999))
Approximative 2-Sample Permutation Test
data: slugs by groups Nursery, Rookery
Z = -1.9397, p-value = 0.05441
alternative hypothesis: true mu is not equal to 0
- It's also possible to obtain the exact answer with oneway_test. The exact answer is obtained with an algorithm that does not actually involve constructing all possible permutations of the data.
> oneway_test(slugs~field, data=slugs, distribution='exact')
Exact 2-Sample Permutation Test
data: slugs by groups Nursery, Rookery
Z = -1.9397, p-value = 0.05755
alternative hypothesis: true mu is not equal to 0
- To get the lower-tailed test we carried out above requires the use of the alternative argument.
> oneway_test(slugs~field,data=slugs, distribution='exact', alternative='less')
Exact 2-Sample Permutation Test
data: slugs by groups Nursery, Rookery
Z = -1.9397, p-value = 0.02878
alternative hypothesis: true mu is less than 0
A randomization test for one-way analysis of variance [not done in lecture]
- One-way analysis of variance offers no new challenges. The oneway_test function in the coin package will handle it. Consider the following data set that comes from Manly's (1997) book on randomization tests, p. 117. The data are the number of ants eaten by horned lizards during different months of the year.
> X.data <- c(13, 242, 105, 8, 59, 20, 2, 245, 515, 488, 88, 233, 50, 600, 82, 40, 52, 1889, 18, 44, 21, 5, 6, 0)
> Month <- c("Jn", "Jn", "Jn", "J", "J", "J", "J", "J", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "S", "S", "S", "S", "S", "S")
> anova(lm(X.data~factor(Month)))
Analysis of Variance Table
Response: X.data
Df Sum Sq Mean Sq F value Pr(>F)
factor(Month) 3 726695 242232 1.6439 0.2110
Residuals 20 2947024 147351
- Ordinary analysis of variance (normal-based linear regression) is actually inappropriate here because the data are heteroscedastic as the boxplot below clearly shows.
> boxplot(X.data~factor(Month))

Fig. 5 Box plots showing heteroscedastic data
- A lognormal model is a better approach. Because there is one zero response, we need to add a constant to every value so that the logarithm is everywhere defined.
> anova(lm(log(X.data+1)~factor(Month)))
Analysis of Variance Table
Response: log(X.data + 1)
Df Sum Sq Mean Sq F value Pr(>F)
factor(Month) 3 36.808 12.269 6.0804 0.004099 **
Residuals 20 40.357 2.018
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
- A permutation test gives essentially the same answer as the parametric test.
> oneway_test(log(X.data+1)~factor(Month), distribution='approximate')
Approximative K-Sample Permutation Test
data: log(X.data + 1) by groups A, J, Jn, S
maxT = 2.9187, p-value = 0.007
- To do the randomization test by hand we can use the anova F-test as our test statistic. The results are stored in the "F value" component of the anova object.
myperm<-function() {
newdata<-sample(X.data)
anova(lm(log(newdata+1)~factor(Month)))->out
out["F value"][1,1]
}
> replicate(4999,myperm())->results
> anova(lm(log(X.data+1)~factor(Month)))["F value"][1,1]->realguy
> results2<-c(results,realguy)
> sum(results2>=realguy)/length(results2)
[1] 0.0034
Cited References
- Manly, Bryan F. J. 1997. Randomization, Bootstrap and Monte Carlo Methods in Biology. Chapman & Hall: London.
Course Home Page