We use a data set that appears on Michael Crawley's web site for his text Statistical Computing (Crawley 2002) and is also used in The R Book (2007).
Crawley describes these data as follows (Crawley 2002, p. 542). "Count data were obtained on the number of slugs found beneath 40 tiles placed in a stratified random grid over each of two permanent grasslands. This was part of a pilot study in which the question was simply whether the mean slug density differed significantly between the two grasslands." My interpretation of this is that the experiment was extraordinarily simple. Rocks of standard shape and size were laid out in two fields and at some point later in time the fields were revisited and the number of slugs under each rock was recorded. The data we are given are the raw counts, i.e., the individual slug counts for each rock. Thus we have a total of 80 observations, 40 from each field type. A value of 0 means no slugs were found under that rock.We can verify this by tabulating the data by field type.
Crawley then spends a chapter of his textbook trying one statistical test after another to test the hypothesis of no mean difference in slugs between the two field types. In the end his conclusion is ambiguous. Some tests say there is a difference, some say there isn't.
I submit that the problem posed by Crawley is essentially an uninteresting one. Any two populations in nature will typically differ with respect to whatever characteristic we care to measure. Whether that difference can be deemed statistically significant is not a statement about nature, but instead is a statement about the sample size used in the study. With enough data any difference, no matter how small, will achieve "statistical significance". A far more useful approach is to determine some way to characterize the differences in slug distribution that exist and then assess whether that characterization tells us anything interesting about slugs and/or nature in general. Our goal in this exercise is to find a statistical model that fits the data and provides insight into the distribution of slugs.
First we summarize the distribution of slugs under rocks in the two field types.
Nursery Rookery
0 25 9
1 5 9
2 2 8
3 2 5
4 2 2
5 1 4
6 1 1
7 1 0
8 0 1
9 0 1
10 1 0
A bar plot of the two distributions would be useful. One problem with using a bar plot with ordinal data such as count categories is that it treats the categories as nominal, so if there are any gaps (missing categories) they won't show up in the plot unless we insert the zero categories ourselves beforehand. This is not a problem here because each of the eleven count categories are present in at least one of the field types, so the table function has inserted the zero values when needed.
The output of the table function can be used as input to the barplot function to produce a bar plot. When we generate separate frequency distributions for one variable (slugs) by the levels of a second variable (field), the barplot default is to generate a stacked bar plot (Fig. 1a). If we add the beside=T argument we get two separate bar plots, one for each category of field.
(a) | (b) |
Fig. 1 Distribution of slugs counts using (a) a stacked bar plot and (b) separate bar plots |
Because our goal is to superimpose predictions from a model onto the observed distribution of slug counts, the stacked bar plot is of no use to us. Although it appears that there are two separate graphs in Fig 1b in fact this is a single graph with a single set of coordinates on the x-axis. We can see this by assigning a name to the output of the barplot function and printing the object.
The columns contain the x-coordinates of the bars with each column corresponding to a different field type. Notice that the top of the second column continues where the bottom of the first column left off.
Notice that the categories aren't labeled in the separate bar plots. We can add these with the names.arg argument in which we specify a single vector of names for both graphs. When we do so the field type labels under the graphs disappear so I need to add those with the mtext function. To position the labels at the center of each display I use the values in row 6 of the matrix of bar coordinates contained in the object coord1. The cex.names argument can be used to reduce the size of the bar labels so that more are displayed.
Fig. 2 Separate bar plots of the two slug distributions with labels for the bars
When there are only two groups to display in a bar plot, another option is to use a side by side bar plot in which the distributions of the two groups are interwoven. This can be accomplished with barplot and the beside=T option if we first transpose the table of counts with the t function so that the rows of the matrix correspond to the groups and the columns correspond to the separate count categories. There is a legend.text argument which if set to TRUE will add a legend indicating which bar corresponds to which group.
Fig. 3 Side by side bar plots of the two slug distributions with labels
The coordinates of the bars now are contained in a 2 × 11 matrix.
When the goal is to produce separate graphs according to the levels of another variable (such as the field variable here) lattice is usually a better choice than base graphics. In order to use lattice we need to organize the table of counts into a data frame. This can be accomplished with the data.frame function applied directly to the output from the table function.
The count categories have been converted to a factor. It will be useful to have the actual numerical values when we try to add model predictions. If we try to convert them to numbers with the as.numeric function we see that we don't get what we want.
The problem is that factor levels are always numbered starting at 1. The trick to getting the actual numeric labels is to first convert the factor levels to character data with the as.character function and then convert the character data to numbers.
The bar plot function in lattice is called barchart. The default arrangement is a horizontal bar plot so to get a vertical bar plot we need to add the horizontal = F argument. The graph is shown in Fig. 4a.
(a) | (b) |
Fig. 4 Bar plots using the lattice barchart function with (a) default settings and (b) with the origin=0 argument and after first converting Var1 to a factor. |
We see that the graph is not what want.
The following produces the graph shown in Fig. 4b.
We can get almost the same display by using the xyplot function using the type='h' option to draw vertical lines from the plotted points to the x-axis with lwd (line width) set to a large value to produce bars. To get square ends on the bars I use lineend=1. For xyplot we need the numerical version of Var1 rather than the factor version. Fig. 5 shows the result. The differences from barchart are that we don't get the dark borders around the bars and the missing categories are just missing and not indicated by horizontal lines as they are in Fig. 4.
Fig. 5 Bar plots using the lattice xyplot function
I start by fitting a single Poisson distribution to the data that ignores the field type from the which the slug data were obtained. The negative log-likelihood function we need is the same one we used in lecture 7 except modified to use the slug counts rather than the aphid counts we analyzed there. To make it more amenable to fitting the more complicated models that will come next, I write it in the expanded form shown below.
The major difference from the compact verion is the extra line that defines the redundant variable mu that is assigned the value of the argument p. This then becomes the value of the λ parameter of the Poisson distribution in the dpois function.
This is a single mean model, one mean for both field types. The parameter λ in the Poisson distribution is the mean and its maximum likelihood estimate is the sample mean. For the initial estimate required by the nlm function I choose a value close to the sample mean.
$estimate
[1] 1.774999
$gradient
[1] 1.056808e-06
$code
[1] 1
$iterations
[1] 5
The iterations converged (the gradient is approximately zero and code = 1) and as expected the function returns the sample mean as the point estimate of λ.
We can easily generalize this function to fit a separate mean to each field. I begin by defining a 0-1 dummy variable that indicates the field type. By default the numeric coding of factors begins at 1. To get a dummy variable I have to subtract 1 from these values.
I assign the result to the variable z.
Alternatively we could have obtained the dummy variable as follows.
I next rewrite the negative Poisson log-likelihood function using z so that the mean varies by field type.
The first line of the function creates a variable called mu that is defined in terms of the dummy variable z. Since z is a vector, mu will also be a vector. As the value of z changes from observation to observation, the value of mu will also change. For nursery slugs z equals 0. Thus for nursery slugs mu has the value
mu = p[1] + p[2]*0 = p[1]
For rookery slugs z equals 1. Thus for rookery slugs mu has the value
mu = p[1] + p[2]*1 = p[1] + p[2]
In summary we have the following.
So we see that p[1] is the value of the mean λ for nursery slugs. Since the value of λ for rookery slugs is p[1]+p[2] we see that p[2] represents the difference in the value of λ between nursery and rookery slugs. So, p[2] is the mean slug count difference between nursery and rookery slugs.
To obtain starting values for the elements of p I calculate the mean in each field type.
So the sample mean for nursery slugs is 1.275 and the mean difference is 1. I use p = c(1.2,1) as initial estimates.
$estimate
[1] 1.2749997 0.9999997
$gradient
[1] 1.125723e-05 -1.506351e-06
$code
[1] 1
$iterations
[1] 8
The iterations converged (both elements of the gradient vector are near zero and code = 1). The estimates nlm returns are the values we predicted they would be.
The two models we've fit for the mean are the following.
where z is the dummy variable defined above that indicates the rookery slugs. We wish to test
The $minimum component of the nlm object is the negative log-likelihood evaluated at the maximum likelihood estimate. The likelihood ratio statistic is twice the difference in these values for the two models ordered in such a way that the difference is positive.
This statistic has an asymptotic chi-squared distribution with degrees of freedom equal to the difference in the number of parameters estimated in the two models, which is 1 for these two models.
The likelihood ratio test is a one-tailed upper-tailed test. We reject the null hypothesis at α = .05 if . I obtain the p-value for this test next.
So using the likelihood ratio test I reject the null hypothesis and conclude that the Poisson rate constants (means) of slug occurrence are different in the two field types.
The Wald test is an alternative to the likelihood ratio test. In order to use it we only need to fit the second of the two models, but we do need to compute the Hessian matrix to obtain the standard errors of the parameter estimates. For that we need to include hessian=T as an argument of the nlm function.
$estimate
[1] 1.274999 1.000001
$gradient
[1] 1.337493e-07 5.996976e-06
$hessian
[,1] [,2]
[1,] 48.94677 17.58066
[2,] 17.58066 17.58087
$code
[1] 1
$iterations
[1] 9
From lecture 8 we learned that the inverse of the negative of the Hessian yields the variance-covariance matrix of the parameters. Because we're working with the negative log-likelihood, the negative sign has already been taken care of. To carry out matrix inversion we use the solve function of R. The variances occur on the diagonal of the inverted matrix and can be extracted with the diag function. Finally we take their square root to obtain the standard errors of the parameter estimates.
Individual 95% Wald confidence intervals are obtained from the formula
Using R we obtain the following.
The 95% Wald confidence interval for β_{0} is (0.95, 1.63) while that for β_{1} is (0.42, 1.58). Because 0 is not contained in the confidence interval for β_{1}, we reject the null hypothesis and conclude that β_{1} ≠ 0.
As an alternative to constructing confidence intervals we can carry out the Wald test formally. The Wald statistic for our null hypothesis is the following.
We reject H_{0} if . Because 3.356 > 1.96, we reject the null hypothesis and conclude that the Poisson rate constants are different in the two field types. The p-value of the test statistic is just the probability under the null model that . We obtain this by calculating this probability for one tail and then doubling the result.
Observe that this p-value is very close to what we obtained from the likelihood ratio test above.
With only two groups it's easy to calculate the probabilities we need. I do things a bit more elaborately in order to set up a framework for handling the more complicated situation you'll see in the homework assignment. We need the Poisson probabilities for the categories 0 through 10. The Poisson mean varies depending on the field type the slugs inhabit so we'll need to account for this. I elect to add the tail probability, P(X > 10) to the last category, P(X = 10). The slugtable data frame variable Var1 contains the different count categories while Var2 contains the field variable. I evaluate the dpois function on Var1 and use Var2 to change the value of the lambda parameter accordingly.
I add the tail probaility to the individual probabilities only for the last category.
If we sum the probabilities by count category for each rock by field type, we get the predicted counts for that field. Because our model is so simple the predicted probabilities are the same for each rock that comes from the same field type. So a simpler way to get the predicted counts is to multiply the probability distribution for one rock by the total number of rocks, which is 40 for each field type.
In general the situation could be more complicated so I handle this more generally. I tabulate the counts by field type and then use the Var2 variable to select which entry of this table I should use to multiply the individual probabilities.
To add the predicted counts to a lattice bar plot of the observed counts we need to use the subscripts variable in a panel function to select the appropriate observations for each panel. Here's an example where I use the panel.xyplot function to draw the bar chart and then use a panel.points function to superimpose the predicted counts on top of the bars.
In a lattice graph the formula expression Freq~Var1|Var2 defines the x- and y-variables for the panel function. In the panel function x and y correspond to the values of Var1 and Freq respectively for the subset of observations whose value of variable Var2 matches the the value for the panel currently being drawn. For all other variables not included in the formula we have to do the subsetting by panel explicitly ourselves. That's what the subscripts variable is for. It acts like a dummy variable that selects the current panel observations. So in the panel.points function slugtable$pred0.count[subscripts] selects the predicted counts for the current panel while x represents the already subsetted values of Var1.
There's a barchart panel function, panel.barchart, that can be used instead of panel.xyplot to generate the observed counts. Notice that when we use the panel.barchart function with xyplot Var1 does not need to be a factor in order to get the labels on the bars correct.
The two versions of the lattice bar plot are shown in Fig. 6.
(a) | (b) |
Fig. 6 Bar plots with superimposed predictions from a Poisson model. (a) uses the panel.xyplot function to draw the bars while (b) uses the panel.barchart function to draw the bars. |
There are 22 count categories displayed in Fig. 4. Of these only six of the predicted Poisson counts exceed five. So 73% of the expected counts are smaller than five which is far greater than the 20% value that is recommended in order for the chi-squared distribution of the Pearson statistic to hold. It's pretty clear that trying to correct things here by merging categories to meet this criterion would completely distort the distribution. Thus taking a simulation-based approach is the only reasonable option.
To assess the overall fit of the model we need to consider the fit in both field types simultaneously. This is not possible to do with the simulation-based goodness of fit test as implemented by chisq.test because it requires a single model probability vector whose elements sum to one. We have two probability vectors, one for each field type. We could multiply the probabilities by one-half and concatenate the results to meet the "sum to one" criterion, but then the simulated results do not fully correspond to the way the experiment was designed. What we need is a conditional test. Conditional on there being 40 rocks in each field type how likely is it to obtain the observed count distribution if two Poisson models with different rate parameters were truly generating the distributions. A simulation-based test is possible but we'll need to construct it ourselves.
The basic idea behind the simulation-based test is that once we obtain the predicted probabilities for individual categories, including lumping the tail probability into a single category, we essentially have separate multinomial distributions (the multivariate generalization of the binomial distribution) for each field type. Using the model probabilites and a random number function we can generate new count frequencies separately by habitat, compare them to the expected counts, and calculate the Pearson statistic using all 22 categories. Doing this repeatedly yields a null distribution for the Pearson statistic, a set of reasonable values of the Pearson statistic when the model is known to hold. We then examine where the observed value of the Pearson statistic falls in this distribution to determine how similar it is to a typical model result. This is exactly what the chisq.test function does when there is a single population.
I organize the predicted probabilities in a 2 × 11 matrix so that each field type occupies a separate row.
We could also have used the matrix function with the byrow and ncol arguments to accomplish the same thing.
I also tabulate the number of observations by field type.
The first step is to generate random data using the model. The following sapply call generates 80 observations (40 from each field type) using the random multinomial function of R, rmultinom. For each field x rmultinom draws a single multinomial sample of size 40, n[x], based on the probabilities contained in the vector model.p0[x]. Here x indicates the field type, x = 1 and 2.
Each column corresponds to a multinomial realization from one of the field types. I stack these two columns with the as.vector function so in the end I have a single vector.
To get a reasonable null distribution of Pearson statistics I carry this out 9,999 times using the replicate function. I first set the random seed explicitly with the set.seed function so that I can reproduce the results if needed.
From the output of the dim function we see that each replication occupies a separate column of the matrix sim.data. Each column contains a vector of pseudo-observations that are consistent with the model that was used to produce the multinomial probabilities. The Pearson statistic for these data takes the following form.
Each column of sim.data represents one possible set of O_{i} values. The vector slugtable$pred0.count contains the predicted values under the negative binomial model, the E_{i}. I carry out the calculation of the Pearson statistic for a generic column x of sim.data as follows.
Now we need to carry out this operation on the 9999 columns separately. The apply function of R is ideal for this. We can use it to apply a function to the individual rows (2nd argument = 1) or columns (2nd argument = 2) of a matrix. For this problem we want to operate on the columns of the matrix so the following function call is appropriate.
Finally we need to calculate the Pearson statistic using the observed data and then append the result to the vector of simulated Pearson statistics for a total of 10,000 Pearson statistics.
The p-value of the simulation-based test is defined as the total number of Pearson statistics that are as large or larger than the observed Pearson statistic divided by the total number of simulations.
Our p-value is 1 out of 10,000 telling us that the observed value of the Pearson statistic was the larger than any of the simulated values. The null hypothesis of the Pearson goodness of fit test is the following.
H_{0}: fit is adequate
H_{1}: fit is inadequate
Thus we should reject the null hypothesis and conclude that we have a significant lack of fit. The block code of code used to carry out the simulation-based test is repeated below uninterrupted by my commentary.
A compact collection of all the R code displayed in this document appears here.
Jack Weiss Phone: (919) 962-5930 E-Mail: jack_weiss@unc.edu Address: Curriculum in Ecology, Box 3275, University of North Carolina, Chapel Hill, 27516 Copyright © 2010 Last Revised--October 5, 2010 URL: http://www.unc.edu/courses/2008fall/ecol/563/001/docs/lectures/lecture9.htm |