Lecture 29—Friday, March 3, 2006
What was covered?
- Applying the Pearson test to transformed count data
- Analysis of covariance
Terminology Defined
Goodness of Fit Tests for Transformed Count Models
- A question came up about how to carry out the Pearson chi-squared goodness of fit test for count data when the model to be tested is one in which the response variable has been transformed. As an illustration, consider the islands data set of Assignment 6. Suppose you fit the log Arrhenius model to these data. In R this would be done as follows.
> log.arrhenius<-lm(log(sp.richness)~ log(island.area), data=islands)
- This regression model estimates the mean of a normal distribution for each value of log(island.area). It assumes this mean is linearly related to log(island.area). Thus we can visualize the regression model as a series of normal distributions with constant variance whose means track the regression line. Fig. 1 shows the predicted normal distributions for four islands, which from left to right are Machias Seal Island, Cuttyhunk Island, Tuckernuck Island, and Block Island, with corresponding island areas 10, 61, 350, and 2707 respectively. The remaining 18 islands generate similar normal distributions and these would appear at the locations indicated by the rug plot shown at the bottom edge of the figure.
- Suppose we wish to calculate the probability of obtaining a species richness value of 150 using the log-Arrhenius model.
- Under the model each island contributes to this probability. Because the probability distributions for different islands (with different areas) are shifted vertically as their means track the regression line, the probability will be different on different islands. In Fig. 1 we see for example that a richness of 150 is predicted to be a fairly typical value on Machias Seal and Cuttyhunk Islands, an unusual value on Tuckernuck Island, and a value nearly impossible to obtain on Block Island.
- Since the normal probability distribution only applies to the log-transformed data, we actually need to calculate the probability of observing a richness value of log(150).
- Because the normal distribution is a continuous distribution the probability of observing any specific single value is technically zero. On a continuous scale we can interpret an integer value of 150 as any value between 149.5 and 150.5. Thus the probability that X = 150 should be thought of as the probability that 149.5 ≤ X ≤ 150.5, and this is something that we can calculate. Formally we proceed as follows.
- This last calculation is just the difference in the values of two cdfs. Cumulative probabilities can be obtained in R with the pnorm function.
- Using R we would proceed as follows. First obtain the means predicted by the regression model using the fitted function. (The predict function would yield the same answer here because the link function being used is the identity link.)
> means<-fitted(log.arrhenius)
> means
Appledore Island Bear Island Block Island Cuttyhunk Island
5.050165 4.199691 6.434003 5.188721
Fishers Island Gardiners Island Grand Manan Island Gull Rock
6.164149 6.205569 6.964010 4.294147
Horse Island Isle au Haut Kent Island Machias Seal Island
4.294147 6.317777 5.432068 4.594997
Marthas Vineyard Matinicus Rock Mount Desert Island Muskeget Island
6.964010 4.521731 7.185109 5.461491
Nantucket Island Naushon Island Penikese Island Tuckernuck Island
6.891348 6.380507 4.996805 5.762341
Whaleboat Island Wooden Ball Island
5.103115 5.096054
- I next calculate the estimated common value of σ for each of the normal distributions.
> sigma<-sqrt((sum(residuals(log.arrhenius)^2))/dim(islands)[1])
> sigma
[1] 0.3751165
- I define a probability function that computes P(x–.5 ≤ X ≤ x+.5) for any x. Notice that I write the function so that the input argument is sp.richness rather than log(sp.richness).
> prob.func<-function(x, means, sigma)
mean(pnorm(log(x+.5) ,mean=means, sd=sigma)-pnorm(log(x-.5), mean=means, sd=sigma))
- I sum up the expected richness probabilities from 1 to 2000 to see if this exhausts the probability.
> sum(sapply(1:2000,function(x) prob.func(x,means,sigma)))
[1] 0.9884405
- There is only a trivial amount of probability left. To account for it I write a tail probability function to pick up what remains.
> tail.prob<-function(x, means, sigma) mean(1-pnorm(log(x-.5), mean=means, sd=sigma))
- Adding this to what we calculated previously we see that things now do sum to 1.
> sum(sapply(1:2000, function(x) prob.func(x, means, sigma))) + tail.prob(2001, means, sigma)
[1] 1
- I construct a vector of expected probabilities that consists of the probabilities of obtaining richness values of 1 to 2000 followed by the tail probability of what remains.
> exp.probs<-c(sapply(1:2000, function(x) prob.func(x, means, sigma)), tail.prob(2001, means, sigma))
> length(exp.probs)
[1] 2001
- To get the expected frequencies I multiply the expected probabilities by the number of islands, 22. This is equal to the number of rows in the data frame. (The number of rows is the first component of the object returned by the dim function.)
> exp.freqs<-exp.probs*dim(islands)[1]
> sum(exp.freqs)
[1] 22
- In order for the asymptotic distribution to be approximately chi-squared, I need to bin the expected frequencies so that the accumulated frequency in each binned cell exceeds 5. I do the binning starting from left to right by finding the first location where the cumulative sum exceeds 5 using R's cumsum function.
> min((1:2001)[cumsum(exp.freqs)>5])
[1] 114
> sum(exp.freqs[1:114])
[1] 5.032401
- In general I need the positions where the cumulative sums first exceed 5, 10, 15, and 20.
> sapply(c(5,10,15,20),function(x) min((1:2001)[cumsum(exp.freqs)>x]))
[1] 114 219 501 1136
- Actually I should stop at 15 otherwise the final category will be too small. So the expected frequencies are
> Ei<-c(sum(exp.freqs[1:114]), sum(exp.freqs[115:219]), sum(exp.freqs[220:501]), sum(exp.freqs[502:2001]))
[1] 5.032401 4.969453 5.004261 6.993884
- I could play around with the second category in order to get the probability to actually exceed 5 but 4.97 is clearly close enough. I next group the observed counts using the same cut points. I do this with a Boolean expression that defines the observations that should make up each bin. Summing the Boolean expression then gives me the observed counts in these categories.
> sum(islands$sp.richness<=114)
[1] 5
> sum(islands$sp.richness<=219)-sum(islands$sp.richness<=114)
[1] 4
> sum(islands$sp.richness<=501)-sum(islands$sp.richness<=219)
[1] 5
> sum(islands$sp.richness>501)
[1] 8
> Oi<-c(5,4,5,8)
- So the Pearson statistic is
> sum((Oi-Ei)^2/Ei)
[1] 0.3340717
which is quite small.
- But this has all been for naught. The degrees of freedom for this test are n – 1 – p, where n is the number of categories and p is the number of estimated parameters. We have 4 count categories and in the log-Arrhenius model we estimated 3 parameters (the intercept, slope, and variance of the normal distribution). Thus our degrees of freedom are 4–1–3 = 0. Since we have no degrees of freedom left we can't do the test.
- So we need to employ an alternative measure of fit. We can for example report R2 which in this case measures the proportion of the variance of log(species richness) that is explained by the model.
Energy Cost of Echo-Location in Bats: An Example of Analysis of Covariance
- Last time we began the discussion of a study by Speakman and Racey (1991) to assess the energy costs of echolocation in bats. In this study Speakman and Racey (1991) used data they had collected plus data combed from the literature. In total 18 studies provided 20 observations on 4 non-echolocating bats, 12 birds, and 4 echolocating bats.
- Not all of these observations were of different species, but the authors chose to treat data for the same species but coming from different studies as independent observations. Techniques and sample sizes varied from study to study so it is likely that the reported values were not equally precise. When sample sizes (n) were greater than 1, the means were used.
- The measured variables were:
- species: scientific name of animal
- mass: body mass of animal in grams
- type: type of animal: 1non-echolocating bat, 2bird, 3echolocating bat
- energy: in flight energy expenditure measured in Watts (W)
- n: sample size used in the study from which the data were obtained
- As is typical for observational studies, the standard principles of good experimental design could not be followed. In this study body mass was not randomly assigned to the "treatment" groupsanimal types. As a result echolocating bats came small and non-echolocating bats came big. Of course mass is not something that can be randomly assigned to an animal (although perhaps an effort could have been made to find a greater size diversity). In fact non-echolocating bats comprise the megachiroptera a taxonomic group that includes the flying foxes who are some the world's largest bats.
- The bottom line is that body size has the potential of undermining the purpose of this study because body size is related to energy use and at the same time confounded with the variable of interest (animal category type). But there is a solution to this problem. Even though we can't control the bias in the usual way, through randomization and good experimental design, we can still try to control for it after the fact by applying appropriate statistical methods.
- Regression is the standard technique for carrying out post hoc adjustments. What we'll do is correct for the bias by subtracting off (statistically) the effect due to mass. Having adjusted for the effect of body mass we can then use the animal's category to account for whatever unexplained variability remains. We'll do this by fitting a single regression model that includes both body mass and animal category as predictors. Because body mass appears in the model largely to correct for a bias in the sample, we generally refer to it as a covariate (rather than a predictor).
- In order for this method to be successful, we need to be able to make the same correction for all animals. This is possible if in the regression of energy use on body size the same slope works for all three animal groups.
- If the separate regression lines for the three animal groups are parallel, then we can use the vertical distance between the lines to measure the effect of the animal group category on energy use. Distances between the lines reflect differences in energy use between the groups. Because the lines are parallel, the distance is the same for all body masses. Hence the notion of a "category type" effect on energy use is well-defined.
- If on the other hand the lines are not parallel, then the distance between the lines varies. As a result there would be no single effect due to type of animal. The effect would vary depending on the size of the animal.
- So our first step should be to determine if a single slope suffices for all three animal categories.
The Models
- Using the log-transformed allometric equation to described the relationship between energy use and body mass as a starting point, we consider a series of models to assess the manner in which animal type modifies the nature of this relationship. To ensure that the categorical variable type is entered into the regression model as a set of dummy regressors, we declare it as a factor and assign labels to the categories 1, 2, and 3.
> type.factor<-factor(type,labels=c('nonecho bats', 'birds','echo bats'))
>
contrasts(type.factor)
X1 X2
birds echo bats
nonecho bats 0 0
birds 1 0
echo bats 0 1
- So we see that R has used the first numerical category (type = 1) as the reference category. Rather than use R's labels, let me denote the dummy variable R calls birds as x1 and the dummy variable R calls echo bats as x2. To simplify notation further, let y = log(energy) denote the response and z = log(mass) denote the covariate.
Model 1—Coincident Lines
- Model 1 is the coincident lines model. In this model all three groups of animals share exactly the same regression equation.

Model 2—Parallel Lines
- Model 2 is the parallel lines model, also called the analysis of covariance model. In this model the nature of the relationship between log(energy) and log(mass) is the same for all three groups of animals, but the different animal groups have different baseline values.

- To understand what this model says, let's write out the equations separately for nonecho bats, birds, and echo bats using the definitions of x1 and x2 that appear in the output of R's contrasts function above.
- For nonecho bats x1 = 0 and x2 = 0. Thus our equation is
nonecho bats: 
- For birds x1 = 1 and x2 = 0. Thus our equation is
birds: 
- For echo bats x1 = 0 and x2 = 1. Thus our equation is
echo bats: 
- All models have the same slopes. For the intercepts we see the groups differ as follows.
| Group |
Intercept |
| nonecho bats |
β0 |
| birds |
β0 + β2 |
| echo bats |
β0 + β3 |
- Thus a test of β2= 0 is a test of whether the regression lines of log(energy) versus log(mass) have the same intercept for birds and nonecho bats.
- A test of β3= 0 is a test of whether the regression lines of log(energy) versus log(mass) have the same intercept for echo bats and nonecho bats.
Model 3—Non-parallel Lines
- Model 3 is the non-parallel lines model, also known as the full interaction model.

The terms x1z and x2z are called interaction terms. Literally they are the products of the variables x1 with z and x2 with z.
- To understand what this model says, let's write out the equations separately for nonecho bats, birds, and echo bats using the definitions of x1and x2 that appear in the output from the contrasts function above.
- For nonecho bats x1 = 0 and x2 = 0. Thus our equation is
nonecho bats: 
- For birds x1 = 1 and x2 = 0. Thus our equation is
birds: 
- For echo bats x1 = 0 and x2 = 1. Thus our equation is
echo bats: 
- All models have the different slopes and intercepts as shown below.
| Group |
Intercept |
Slope |
| nonecho bats |
β0 |
β1 |
| birds |
β0 + β2 |
β1 + β4 |
| echo bats |
β0 + β3 |
β1 + β5 |
- Thus a test of β4 = 0 is a test of whether the regression lines of log(energy) versus log(mass) have the same slope for birds and nonecho bats.
- A test of β5 = 0 is a test of whether the regression lines of log(energy) versus log(mass) have the same slope for echo bats and nonecho bats.
Testing the Models
Step 1
- Following the protocol described last time, we should being by testing whether model 3 can be reduced to model 2. This is a test of whether the regression line for log(energy) versus log(mass) has the same slope in all three groups.
- In terms of model parameters this is a test of

- To carry out this test we fit the full interaction model and compare it to a model in which the two interaction terms, x1z and x2z, are omitted. In normal theory models this is done with what's called a partial F test, a test that is comparable to the likelihood ratio test in the generalized linear model framework. If we reject H0 we stop and conclude that model 3 is our best model. If we fail to reject H0 then model 3 can be simplified to model 2 and we proceed to the next step.
Step 2
- Having rejected the need for separate slopes, we next test whether we really have three lines, or only one. This is a test of whether the common slope model for log(energy) versus log(mass) also is a common intercept model.
- In terms of model parameters this is a test of

- To carry out this test we fit the parallel lines model and compare it to a model in which the two main effect terms, x1 and x2, are omitted. If we reject H0 we stop and conclude that model 2 is our best model. If we fail to reject H0 then model 2 can be simplified to model 1. The statistical test is once again a partial F test. If model 2 can be simplified to model 1 then we conclude that the study does not support the hypothesis that echo-locating while flying exacts an additional significant energy cost.
Cited reference
- Speakman, J. R. and P. A. Racey. 1991. No cost for echolocation for bats in flight. Nature 350: 421423.
Course Home Page