Lecture 11 (lab 3) —Tuesday, January 31, 2006
What was covered?
- R topics
- recreating data from summary tables using the rep function
- evaluating functions on components of vector data individually with sapply and lapply
- the numerical optimization functions nlm, optim, and optimizer
- the root-finding function uniroot
- the quantile functions qnorm and qchisq
- the probability distribution functions ppois and pchisq
- the Bhat library and the plkhci function
- constructing a loglikelihood function
- graphing a loglikelihood function
- numerically finding a maximum likelihood estimate
- confidence intervals (Wald and profile likelihood) for maximum likelihood estimates
- the Pearson chi-square goodness of fit test for fitting models to count data
- a Monte Carlo-based version of the Pearson chi-squared goodness of fit test
R functions and commands demonstrated
- chisq.test can be used to perform a Monte Carlo goodness of fit test based on the Pearson chi-squared statistic
- lapply is used to apply a vector individually to elements in a vector returning one value for each component of the vector. The return value of the function is a list.
- nlm is the nonlinear minimization function of R
- optim is the numerical optimization function of R. It offers a number of different optimization schemes
- optimize is a special optimization function for functions of only one variable
- plkhci (in Bhat package) is a function for finding profile likelihood confidence intervals
- qchisq is the quantile function of the chi-squared distribution
- qnorm is the quantile function of the standard normal distribution
- rep generates a vector all of whose elements are the same
- sapply is used to apply a vector individually to elements in a vector returning one value for each component of the vector. The return value of the function is a vector.
- uniroot is used to find a root (zero) of a function in an interval
- unlist turns a list into a vector
R function options
- gradtol= (argument to nlm) sets tolerance value for the gradient thus controlling when the iterations of nlm will stop
- hessian= (argument to nlm) takes on values TRUE or FALSE (the default). When TRUE the Hessian is reported evaluated at the minimum value obtained by nlm
- simulate.p.value= (argument to chisq.test) takes on values TRUE or FALSE. When TRUE a Monte Carlo-based p-value is calculated for the observed Pearson chi-squared statistic
- type= (argument to plot) controls what is plotted. We use type='l' in this exercise to plot a line
Data Entry
- The data set for today's exercise is shown below. It appears on p. 130 of Krebs, Charles. 1999. Ecological Methodology. Menlo Park, CA: Addison-Wesley. He doesn't give a source for it.
Table 1—Data for Lab Exercise
# of aphids on a stem |
Number of stems |
0 |
6 |
1 |
8 |
2 |
9 |
3 |
6 |
4 |
6 |
5 |
2 |
6 |
5 |
7 |
3 |
8 |
1 |
9 |
4 |
Total |
50 |
- Our goal today is to fit a Poisson distribution to these data. Since the Poisson distribution is a one-parameter distribution (with pameter λ) the problem reduces to estimating λ. In the future our models will include predictors and we'll try to make λ a function of these predictors. In the parlance of model-building what we're essentially doing today is fitting a Poisson regression model to these data in which the only predictor is an intercept.
- The data above are given in summary form. The raw data would have consisted of a list of stems such that for each stem there would have been a count of how many aphids were found on that stem. The table above groups the stems together based on how many aphids were found. The raw data can be recreated from the grouped data with R's rep function. With rep we can repeat the numbers in the first column—0, 1, 3, …, 9—the exact number of times listed in the second column.
- First let's explore how rep works. The first argument to rep is what we want to repeat; the second argument describes how the repetition is supposed to be done.
- If the second argument is a single number, then the first argument is repeated that many times as a unit.
> rep(4,10)
[1] 4 4 4 4 4 4 4 4 4 4
> rep(0:9,4)
[1] 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4
[36] 5 6 7 8 9
- If the second argument is a vector whose length is the same as the length of the first argument, then the elements of the two vectors are matched up 1-to-1. If each element of the vector in the second argument is the same, then the elements of the first argument are all repeated the same number of times, but individually this time, not as a unit.
> rep(0:9,rep(4,10))
[1] 0 0 0 0 1 1 1 1 2 2 2 2 3 3 3 3 4 4 4 4 5 5 5 5 6 6 6 6 7 7 7 7 8 8 8
[36] 8 9 9 9 9
In specifying the repetition pattern for the second argument I used a shortcut. I wrote rep(4,10) which produces, as shown above, a vector of all 4s that is of length 10.
- Now we're ready to recreate the data. To do this I make the second argument of rep the observed frequencies from the second column of the data table above.
> num.stems<-c(6,8,9,6,6,2,5,3,1,4)
#error check: numbers should sum to 50
> sum(num.stems)
[1] 50
> aphid.data<-rep(0:9,num.stems)
> aphid.data
[1] 0 0 0 0 0 0 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 4 4 4 4 4 4
[36] 5 5 6 6 6 6 6 7 7 7 8 9 9 9 9
Constructing the Loglikelihood
- The loglikelihood for a Poisson model was constructed in Lecture 10 and is shown below

- The quantity inside the parentheses is just the dpois function evaluated at xi and λ. The dpois function is listable. Thus for a fixed value of λ we can obtain the probabilities for all of our observations at once. For example, when λ = 1 we have
> dpois(aphid.data,lambda=1)
[1] 3.678794e-01 3.678794e-01 3.678794e-01 3.678794e-01 3.678794e-01
[6] 3.678794e-01 3.678794e-01 3.678794e-01 3.678794e-01 3.678794e-01
[11] 3.678794e-01 3.678794e-01 3.678794e-01 3.678794e-01 1.839397e-01
[16] 1.839397e-01 1.839397e-01 1.839397e-01 1.839397e-01 1.839397e-01
[21] 1.839397e-01 1.839397e-01 1.839397e-01 6.131324e-02 6.131324e-02
[26] 6.131324e-02 6.131324e-02 6.131324e-02 6.131324e-02 1.532831e-02
[31] 1.532831e-02 1.532831e-02 1.532831e-02 1.532831e-02 1.532831e-02
[36] 3.065662e-03 3.065662e-03 5.109437e-04 5.109437e-04 5.109437e-04
[41] 5.109437e-04 5.109437e-04 7.299195e-05 7.299195e-05 7.299195e-05
[46] 9.123994e-06 1.013777e-06 1.013777e-06 1.013777e-06 1.013777e-06
- Next we need the log of all these values.
> log(dpois(aphid.data,lambda=1))
[1] -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000
[7] -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000
[13] -1.000000 -1.000000 -1.693147 -1.693147 -1.693147 -1.693147
[19] -1.693147 -1.693147 -1.693147 -1.693147 -1.693147 -2.791759
[25] -2.791759 -2.791759 -2.791759 -2.791759 -2.791759 -4.178054
[31] -4.178054 -4.178054 -4.178054 -4.178054 -4.178054 -5.787492
[37] -5.787492 -7.579251 -7.579251 -7.579251 -7.579251 -7.579251
[43] -9.525161 -9.525161 -9.525161 -11.604603 -13.801827 -13.801827
[49] -13.801827 -13.801827
- Finally we need to sum up all these values.
> sum(log(dpois(aphid.data,lambda=1)))
[1] -215.9158
- The last line is what we need for our loglikelihood function. We just need to turn it into a function in which lambda is the sole argument.
> poisson.func<-function(lambda) sum(log(dpois(aphid.data,lambda)))
> poisson.func(1)
[1] -215.9158
> poisson.func(2)
[1] -146.0014
The function seems to work.
Graphing the Loglikelihood
- To graph the loglikelihood, we need to evaluate our function on a whole list of possible values of λ and then plot the results against λ. For example to plot over the range 1 ≤ λ ≤ 6, I would first generate λ in increments of 0.1 with seq(1,6,.1). But when I apply the loglikelihood function to it I can see something is wrong.
> poisson.func(seq(1,6,.1))
[1] -89.3177
It returns a single number. I was expecting it to return a list of numbers one for each value of seq(1,6,.1).
- What if I give it two numbers?
> poisson.func(1:2)
[1] -179.2257
> poisson.func(1)
[1] -215.9158
> poisson.func(2)
[1] -146.0014
As we can see when I give it both 1 and 2 the answer I get is neither the value at 1 nor the value at 2.
- To understand what's happening let's tease apart the function into its components. I repeat the above exercise using just the innermost part of our function, dpois.
> dpois(aphid.data,1)
[1] 3.678794e-01 3.678794e-01 3.678794e-01 3.678794e-01 3.678794e-01
[6] 3.678794e-01 3.678794e-01 3.678794e-01 3.678794e-01 3.678794e-01
[11] 3.678794e-01 3.678794e-01 3.678794e-01 3.678794e-01 1.839397e-01
[16] 1.839397e-01 1.839397e-01 1.839397e-01 1.839397e-01 1.839397e-01
[21] 1.839397e-01 1.839397e-01 1.839397e-01 6.131324e-02 6.131324e-02
[26] 6.131324e-02 6.131324e-02 6.131324e-02 6.131324e-02 1.532831e-02
[31] 1.532831e-02 1.532831e-02 1.532831e-02 1.532831e-02 1.532831e-02
[36] 3.065662e-03 3.065662e-03 5.109437e-04 5.109437e-04 5.109437e-04
[41] 5.109437e-04 5.109437e-04 7.299195e-05 7.299195e-05 7.299195e-05
[46] 9.123994e-06 1.013777e-06 1.013777e-06 1.013777e-06 1.013777e-06
> dpois(aphid.data,2)
[1] 0.1353352832 0.1353352832 0.1353352832 0.1353352832 0.1353352832
[6] 0.1353352832 0.2706705665 0.2706705665 0.2706705665 0.2706705665
[11] 0.2706705665 0.2706705665 0.2706705665 0.2706705665 0.2706705665
[16] 0.2706705665 0.2706705665 0.2706705665 0.2706705665 0.2706705665
[21] 0.2706705665 0.2706705665 0.2706705665 0.1804470443 0.1804470443
[26] 0.1804470443 0.1804470443 0.1804470443 0.1804470443 0.0902235222
[31] 0.0902235222 0.0902235222 0.0902235222 0.0902235222 0.0902235222
[36] 0.0360894089 0.0360894089 0.0120298030 0.0120298030 0.0120298030
[41] 0.0120298030 0.0120298030 0.0034370866 0.0034370866 0.0034370866
[46] 0.0008592716 0.0001909493 0.0001909493 0.0001909493 0.0001909493
> dpois(aphid.data,1:2)
[1] 3.678794e-01 1.353353e-01 3.678794e-01 1.353353e-01 3.678794e-01
[6] 1.353353e-01 3.678794e-01 2.706706e-01 3.678794e-01 2.706706e-01
[11] 3.678794e-01 2.706706e-01 3.678794e-01 2.706706e-01 1.839397e-01
[16] 2.706706e-01 1.839397e-01 2.706706e-01 1.839397e-01 2.706706e-01
[21] 1.839397e-01 2.706706e-01 1.839397e-01 1.804470e-01 6.131324e-02
[26] 1.804470e-01 6.131324e-02 1.804470e-01 6.131324e-02 9.022352e-02
[31] 1.532831e-02 9.022352e-02 1.532831e-02 9.022352e-02 1.532831e-02
[36] 3.608941e-02 3.065662e-03 1.202980e-02 5.109437e-04 1.202980e-02
[41] 5.109437e-04 1.202980e-02 7.299195e-05 3.437087e-03 7.299195e-05
[46] 8.592716e-04 1.013777e-06 1.909493e-04 1.013777e-06 1.909493e-04
If you compare the output from the three function calls. you'll see that in the last call R alternates its use of λ = 1 and λ = 2. For the first observation it uses 1, for the second it uses 2, for the third it uses 1, etc. Thus R is trying to match up the entries in each vector argument and when it runs out of entries in the second vector it just recycles the vector over and over again. Of course when we take the log of the results and then sum everything up we get total gibberish.
- The solution is to force R to use each element of the λ vector one at a time. The sapply function will do this.
> poisson.func(1)
[1] -215.9158
> poisson.func(2)
[1] -146.0014
> sapply(1:2,poisson.func)
[1] -215.9158 -146.0014
Now we got the right answer. There is a sister function to sapply that instead of returning a vector returns a list. It's called lapply.
> lapply(1:2,poisson.func)
[[1]]
[1] -215.9158
[[2]]
[1] -146.0014
A list is a data object in which the different components can be of different types (character, numeric, etc.) and of different lengths. We can even convert a list into a vector using the unlist function.
> unlist(lapply(1:2,poisson.func))
[1] -215.9158 -146.0014
- So now we're ready to do the plot. To obtain the y-coordinates of the graph we need to use sapply. I specify type='l' in the plot function so that plotted points are connected by line segments thus generating a curve. I use xlab and ylab to add intelligible labels
> plot(seq(1,6,.1), sapply(seq(1,6,.1), poisson.func), type='l', xlab=expression(lambda), ylab='loglikelihood')
- From the graph it would appear that the loglikelihood takes on its maximum value somewhere in the interval 3 ≤ λ ≤ 4. We can get a better estimate by gradually zooming in on it. I start by plotting only in the interval [3, 4] using increments of .01.
> plot(seq(3,4,.01), sapply(seq(3,4,.01), poisson.func), type='l', xlab=expression(lambda), ylab='loglikelihood')
- Based on the plot we should now focus on the interval [3.4, 3.5]. I use increments of .001 this time.
> plot(seq(3.4,3.5,.001), sapply(seq(3.4,3.65,.001), poisson.func), type='l', xlab=expression(lambda), ylab='loglikelihood')

Fig. 2 Zooming in on the maximum likelihood estimate
- From this last graph we would estimate the MLE (maximum likelihood estimate) to be around 3.46. In fact that's the exact value. From lecture 10 we learned that the MLE for λ in a Poisson distribution is
.
> mean(aphid.data)
[1] 3.46
Maximizing the loglikelihood numerically
- Clearly we could continue to zoom in on the MLE graphically and obtain an estimate that is as accurate as we please. It is also clear that the process is tedious and unwieldy. A far more efficient (but more dangerous approach) is to obtain the estimate using numerical optimization routines. R provides two numerical optimization functions that are relevant here, nlm and optim. I illustrate both.
nlm
- Figure 3 shows the help screen for nlm
> ?nlm

Fig. 3 Help screen for nlm
- From the help screen we see that there are only two arguments that are required, f and p, the rest have default values.
- Observe that nlm carries out minimization, not maximization. Thus in order for us to be able to use nlm we need to change the way we're formulating our problem. Since maximizing f is equivalent to minimizing –f, we need to reformulate our objective function so that it returns the negative loglikelihood rather than the loglikelihood.
poisson.negloglik<-function(lambda) -sum(log(dpois(aphid.data, lambda=lambda)))
- To use nlm we enter poisson.negloglik as the first argument and then an initial estimate for the parameter. Since our graphical analysis indicated that the MLE is near 3, I choose 3 for the second argument.
> nlm(poisson.negloglik,3)
$minimum
[1] 124.1764
$estimate
[1] 3.459998
$gradient
[1] 6.571497e-08
$code
[1] 1
$iterations
[1] 4
- The output tells us the following.
- The value of the negative loglikelihood at the minimum is 124.1764.
- The estimate of the MLE of λ is 3.459998.,
- The value of the gradient at the MLE is very small. This is what we want. The gradient is the same as the score, the derivative of the loglikelihood. In our calculus derivation of the MLE we differentiated the loglikelihood, set the result equal to zero, and then solved for the MLE. Thus the value of the score (gradient) at the MLE should be zero. The fact that the value of the gradient shown above is very close to zero encourages us to believe that nlm has found a solution (although there's no guarantee that it's the global minimum that we seek).
- The help screen tell us that code=1 means "relative gradient is close to zero, current iterate is probably solution."
- It took 4 iterations of the numerical algorithm to reach convergence.
- We can obtain a more accurate estimate of the MLE by specifying a lower value for gradtol than the default value of 1E-6.
> nlm(poisson.negloglik,3,gradtol=1E-12)
$minimum
[1] 124.1764
$estimate
[1] 3.46
$gradient
[1] 1.991984e-09
$code
[1] 2
$iterations
[1] 5
- Notice that this time we get the exact answer, 3.46. The gradient is smaller and it took one more iteration to reach convergence. The code = 2 means, according to the help screen, "successive iterates within tolerance, current iterate is probably solution." Apparently the value of the estimate was no longer changing so the method stopped.
optim
- The R function optim also does numerical optimization. Figure 4 shows the help screen for optim.
> ?optim

Fig. 4 Help screen for optim
- From the help screen we see that only two arguments are required—par and fn—the rest have default values. The argument order, the initial guess followed by the function, is the reverse from what is required by nlm. From the description of the fn argument in the help screen it is apparently possible by specifying an additional argument to get optim to maximize a function rather than minimize it. The default action is minimization.
> optim(3,poisson.negloglik)
$par
[1] 3.460547
$value
[1] 124.1764
$counts
function gradient
20 NA
$convergence
[1] 0
$message
NULL
Warning message:
one-diml optimization by Nelder-Mead is unreliable: use optimize in: optim(3, poisson.negloglik)
- Although optim returns the right answer and according to the help screen convergence = 0 indicates successful convergence, we are given a warning message telling us to try another function, called optimize, that is more appropriate when the objective function has only one argument.
optimize
- The help screen for optimize is shown in Fig. 5.
> ?optimize
Fig. 5 Help screen for optimize
- optimize has two required arguments—a function to minimize (the default action) followed by an interval in which to look for the value that yields the minimum. I specify the interval (3, 4).
> optimize(poisson.negloglik,c(3,4))
$minimum
[1] 3.459989
$objective
[1] 124.1764
- The answer agrees with what we obtained with nlm and optim.
Constructing confidence intervals for the MLE
- I consider two different ways to construct confidence intervals for maximum likelihood estimates: Wald confidence intervals and profile likelihood (likelihood ratio) confidence intervals.
Wald confidence intervals
- Wald confidence intervals are the easiest to construct. They depend on the asymptotic normality of the maximum likelihood estimator. A 95% confidence interval for any normally distributed parameter
takes the following form.

Here
is the 0.975 quantile of a standard normal distribution.
- As will be discussed in lecture 12, the standard error for a (scalar) maximum likelihood estimator can be obtained by taking the square root of the reciprocal of the negative of the Hessian evaluated at the MLE. (Note: In R we don't take minus the Hessian because we have already introduced the minus sign into this problem by working with the negative loglikelihood.) Both optim and nlm will return minus the Hessian evaluated at the MLE if requested. In what follows I use nlm.
- With nlm we need to add the argument hessian=TRUE. I assign the output from nlm to an object I call out so that I can access the various components of the output directly.
> out<-nlm(poisson.negloglik,3,hessian=TRUE)
> out
$minimum
[1] 124.1764
$estimate
[1] 3.459998
$gradient
[1] 6.571497e-08
$hessian
[,1]
[1,] 14.44799
$code
[1] 1
$iterations
[1] 4
- To access the elements of out, a list, I use list $ notation.
> out$estimate
[1] 3.459998
> out$hessian
[,1]
[1,] 14.44799
- The .975 quantile of a standard normal distribution is obtained with the qnorm function.
> qnorm(.975)
[1] 1.959964
- Finally I put all the pieces together.
> #lower bound
> out$estimate-qnorm(.975)*sqrt(1/out$hessian)
[,1]
[1,] 2.944361
> #upper bound
> out$estimate+qnorm(.975)*sqrt(1/out$hessian)
[,1]
[1,] 3.975636
- So our 95% Wald confidence interval rounded to two decimal places is (2.94, 3.98). We are 95% confident that the true value of the population parameter λ lies in this interval.
Profile likelihood confidence intervals
- The profile likelihood confidence interval (also called the likelihood ratio confidence interval) derives from the asymptotic chi-squared distribution of the likelihood ratio statistic. It is generally considered to be more accurate than the Wald confidence interval when the sample size is small.
- We'll go into further detail in lecture 13, but the basic rationale for it goes as follows. Suppose λ is a scalar parameter and we wish to test whether
where
is some specific value of interest. The likelihood ratio statistic for this test is the following.

where
is the MLE of λ. We reject
at α = .05 if
where
is the .95 quantile of a chi-squared distribution with one degree of freedom.
- Similarly we would fail to reject
if it turns out
. This last inequality defines all those values of
for which we would fail to reject the null hypothesis. By failing to reject the null hypothesis this says that these values are reasonable candidates for λ and hence should be included in our 95% confidence interval for λ. Thus we have

- We can depict what's going more clearly in a graph. I first calculate the lower limit for the loglikelihood defined by the inequality above. The function qchisq returns the quantile of a chi-squared distribution.
> lower.limit<- -out$minimum-.5*qchisq(.95,1)
> lower.limit
[1] -126.0971
- Then I add the lower limit as a horizontal line on our loglikelihood graph (Fig. 6).
> plot(seq(2,5,.1), sapply(seq(2,5,.1), poisson.func), type='l', xlab=expression(lambda), ylab='loglikelihood')
> abline(h=lower.limit,col=4,lty=2)
- As Fig. 6 shows the loglikelihoods that satisfy our inequality (vertical arrow at top of graph) define a corresponding set of values for the parameter λ (horizontal arrow at the bottom of the graph). These values for λ comprise the profile likelihood confidence interval. The boundaries of this interval are defined by the values of λ where the blue horizontal lower limit line intersects the graph of the loglikelihood.
- We could estimate the limits graphically, but it is far simpler to use numerical methods. Let H represent the horizontal line in Fig. 6. We seek the values of λ such that

where
. Thus we seek the roots (zeros) of the function f .
- Finding roots is a standard mathematical operation and there are many good routines for carrying this out. In R there is the function uniroot.
> ?uniroot

Fig 7 Help screen for uniroot
- The documentation reveals that the syntax is similar to that of optimize. The required arguments are the function whose roots are desired and an interval over which to search. Since there are two roots we'll have to run uniroot twice with different intervals to search in each time. First I create the function,
described above. Then I run uniroot twice each time using intervals that bracket the locations of the roots as determined from the graph in Fig. 5 above.
> root.function<-function(lambda) poisson.func(lambda)-lower.limit
> uniroot(root.function,c(2.5,3.5) )
$root
[1] 2.96967
$f.root
[1] -0.0002399496
$iter
[1] 6
$estim.prec
[1] 6.103516e-05
> uniroot(root.function,c(3.5,4.5))
$root
[1] 4.00152
$f.root
[1] -8.254986e-05
$iter
[1] 6
$estim.prec
[1] 6.103516e-05
- So to two decimal a 95% profile likelihood confidence interval for λ is (2.97, 4.00). This compares favorably with the Wald confidence interval we found earlier, (2.94, 3.98).
Profile likelihood confidence intervals using the Bhat package
- It turns out there is a function in the Bhat package that can be used to calculate profile likelihood confidence intervals. The function is called plkhci.
> library(Bhat)
> ?plkhci

Fig 8 Help documentation for plkhci
- The usage instructions are fairly complicated and it is probably best to imitate one of the examples that appears at the end of the documentation. There are three required arguments.
- The first argument is a list consisting of four components: a label containing the names of the parameters as used in the loglikelihood function, initial estimates for the MLEs, an estimated lower bound for the confidence interval, and an estimated upper bound for the confidence interval.
- The second argument is the name of the negative loglikelihood function. (Note: negative loglikelihood!).
- The third argument is the name of the parameter for which to calculate the confidence interval (matching the name that appears in the label, which is the first component of the list that is the first argument to the function).
- I create the list that is the first argument to plkhci and then evaluate the function.
> x.in<-list(label='lambda',est=3.4, low=2, up=5)
> plkhci(x.in,poisson.negloglik,'lambda')
At the end of quite a bit of output we find the calculated confidence interval:
[1] 2.966524 4.005384
This matches our hand calculations above.
Checking for lack of fit
Graphical Analysis
- Having estimated the Poisson model and obtained a confidence interval for the parameter, we lastly examine how well the model fits the observed data. There are a number of ways this can be done. A good place to start is by calculating the count frequencies predicted by the model and comparing these to the observed frequencies graphically. We developed the protcol for this in lab 2 when we plotted Poisson random variates using a bar plot and then superimposed the theoretical values on top of the bars as points connected by line segments.
- I begin by calculating the expected values under a Poisson model using the estimated value of λ. I generate expected values from 0 to 9 for the plot because those are the only observed values we have. (Ideally I should lump all the expected frequencies from the tail of the distribution, k ≥ 9, into this last category. We'll do this later when we formally test for lack of fit.) The object out used to obtain the MLE is the output from the nlm fit above.
> lambdahat<-out$estimate
> exp<-dpois(0:9,lambdahat)*50
> max(exp)
[1] 10.84896
> max(table(aphid.data))
[1] 9
- Notice that the expected values have a larger maximum than the observed values. To ensure that the expected values (which will be added to the graph with points and lines commands) are displayed completely I add a ylim option to the barplot to extend the y-axis to include the maximum expected count. I obtain the coordinates of the bars as the return value of the barplot function.
> coords<-barplot(table(aphid.data), ylim=c(0,11))
> points(coords,exp,pch=16,cex=1.5)
> lines(coords,exp)
> legend(coords[7],11, c('observed','expected'), pch=c(15,16), col=c('gray60',1), lty=c(NA,1), cex=c(.9,.9), bty='n')
> box()
Notice in the legend statement I specified NA, the R code for missing, as the first element of the vector entry for lty. This suppresses the drawing of a line for the first entry in the legend. I use the filled square symbol, pch=15, to represent the bars in the legend.
- Based on the plot the fit doesn't look very good. We can test this formally using the Pearson chi-squared goodness of fit test.
Goodness of Fit Test
- The Pearson chi-squared test compares the observed category frequencies to those predicted by a model (referred to as the expected frequencies) using the following formula.

where m is the number of categories.
- It turns out the Pearson chi-squared statistic has a chi-squared distribution.
The degrees of freedom are the number of categories, m, minus one minus the number of estimated parameters, p, used in obtaining the expected frequencies. The null hypothesis of this test is that the fit is adequate. We reject the null hypothesis at level α if the observed value of our test statistic exceeds the 1 – α quantile of a chi-squared distribution with m – 1 – p degrees of freedom.
Reject if
- The chi-squared distribution of
is an asymptotic result. For it to hold the fitted values, the expected frequencies obtained using the model, should be large. A general rule is that expected cell counts should be 5 or larger, although with many categories Agresti (2002) notes that an expected cell frequency as small as 1 is okay as long as no more than 20% of the expected counts are less than 5. The difficulty in applying this rule when fitting a model such as the Poisson model to data is that the theoretical count distribution is infinite (although the probabilities after a certain point are essentially zero). So the 20% guideline is not well-defined.
- For our data quite a few of the expected counts are small. In the code below I calculate the expected frequencies,
, for counts ranging from 0 to 12. From the output we see that the expected frequencies exceed 5 only for k = 1, 2, 3, 4, and 5.
> dpois(0:12,lambdahat)*50
[1] 1.571490812 5.437355498 9.406620322 10.848963362 9.384348629
[6] 6.493966013 3.744851867 1.851025857 0.800568284 0.307773876
[11] 0.106489708 0.033495837 0.009657961
- What to do in such a situation is not entirely clear. The standard recommendation, cited for example in Sokal and Rohlf (1995) is to pool categories. They write, p. 702, "Whenever classes with expected frequencies of less than five occur, expected and observed frequencies for those classes are generally pooled with an adjacent class to obtain a joint class with an expected frequency
." The number of categories m is thereby reduced in the degrees of freedom of the chi-squared distribution. Not everyone agrees with this. Here's a sampling from the literature.
- These authors recommend pooling but don't give an absolute criterion: "…it is insightful to compare predicted relative frequencies
with actual relative frequencies
. These are given in Table 5.6 where counts of five or more are grouped into the one cell to prevent cell sizes from getting too small." Cameron & Trivedi (1998), p. 157.
- This author uses a frequency of 1 as the cutoff: "It is also recommended that adjacent groups at the bottom of the table be combined in order to avoid having any expected frequencies less than 1." Le (1998), p. 209.
- This author says never pool: "We note here that several of the fitted values are small (say, less than 5), and it is a common practice to pool over categories, to obtain larger expected values. This is somewhat arbitrary and potentially dangerous, and should be avoided if possible." Morgan (2000), p. 17.
- My comments on this are the following.
- Pooling is unavoidable. Since tail probabilities for discrete count models run forever, it is always necessary to at least pool the observations in the tail. It is not clear though where this should begin without additional guidelines, but a sensible approach would be to add the expected tail probabilities to the last category for which the observed counts are also nonzero. My rationale for this is that otherwise every additional cell in which the observed count is zero will add the value of the expected count to the chi-squared statistic, no matter how small the expected count, but at the same time add 1 to the degrees of freedom. When the observed count is 0 the generic term of the Pearson chi-squared test becomes the following.
If there are many such cases the effective result will be to dilute a significant lack of fit that may be occurring if we just looked at the nonzero cells. (Generally speaking, a chi-squared random variable does not become significant until it exceeds twice its degrees of freedom. Each of these terms adds one degree of freedom to the test statistic. If at the same time the expected frequency is less than 2 then these terms will reduce rather than increase the lack of fit as measured by the test statistic.) So I think pooling these cells makes sense.
- There is no doubt that when the expected cell counts are small the asymptotic chi-squared distribution is a bad approximation to the true distribution of the test statistic.
- The reason you carry out the Pearson test is to provide evidence that your model fits the data. If you fail to find a significant lack of fit and you chose to pool (or not) then you need to investigate whether pooling had an effect on the result you obtained. This is the usual dichotomy of model falsification versus model confirmation. We can falsify models conclusively, but we can't unequivocably confirm them. If the manner in which the pooling was done was arbitrary, then pooling in other ways is a good check on your results. The bottom line is that you need to convince a skeptic that you did your best to falsify your model but were unable to do so.
- Randomization tests may be an option when there are small expected cell counts. I discuss this approach in the last section of this document.
Goodness of Fit Test by Pooling
- Using 5 as a cut-off we see that we should combine the first two expected counts corresponding to k = 0 and k = 1. Turning to the right hand tail of the distribution, k = 6 is the first expected count to drop under 5. If we calculate the sum of the expected counts greater than 6 we see that these don't exceed 5 either. So we should combine all category counts for k ≥ 6
> #P(X = 6)
> dpois(6,lambdahat)*50
[1] 3.744852
> #P(X > 6)
> (1-ppois(6,lambdahat))*50
[1] 3.112403
- Note: ppois(6,lambdahat)) = P(X ≤ 6). Therefore, 1–ppois(6,lambdahat)) = P(X > 6)
- So, I construct the expected counts as follows.
> expected<-c(sum(dpois(0:1,lambdahat)), dpois(2:5,lambdahat), 1-ppois(5,lambdahat))*50
> expected
[1] 7.008846 9.406620 10.848963 9.384349 6.493966 6.857255
- I next group the observed counts in exactly the same way: pool the first two, keep the next four separate, and pool the remainder. I use length(rawcounts) to obtain the location of the last value rather than entering the number myself.
> rawcounts<-table(aphid.data)
> observed<-c(sum(rawcounts[1:2]), rawcounts[3:6], sum(rawcounts[7:length(rawcounts)]))
> observed
2 3 4 5
14 9 6 6 2 13
- I add appropriate labels for the new categories.
> names(observed)<-c('0-1',2:5,'>5')
> observed
0-1 2 3 4 5 >5
14 9 6 6 2 13
- I generate a bar plot of our new observed and expected categories.
> coords<-barplot(observed,ylim=c(0,15))
> points(coords,expected,pch=16,cex=1.5)
> lines(coords,expected)
> legend(coords[2],15, c('observed','expected'), pch=c(15,16), col=c('gray60',1), lty=c(NA,1), cex=c(.9,.9), bty='n')
> box()
- Finally I carry out the Pearson chi-squared test.
> pearson<-sum((observed-expected)^2/expected)
> pearson
[1] 18.99147
> df<-length(observed)-1-1
> df
[1] 4
> p.val<-1-pchisq(pearson,df)
> p.val
[1] 0.0007889844
> qchisq(.95,df)
[1] 9.487729
- The p-value is quite small indicating a significant lack of fit. The critical value for our test is 9.49 and the observed value of our test statistic is 18.99. As a general rule for a chi-squared test, if the test statistic is more than twice its degrees of freedom, the p-value will be significant.
Goodness of Fit Test—Randomization Test Version
- The R function chisq.test can be used to perform the Pearson chi-square goodness of fit test. The help screen is shown below (Fig. 11).
> ?qchisq.test

Fig. 11 Help screen for chisq.test
- The entries relevant to us are x and p. The x= argument should contain the observed counts. The p= argument should contain the expected probabilities (not the expected counts). I try it for our grouped data, first constructing the expected probabilities.
> expprob<-c(sum(dpois(0:1,lambdahat)),dpois(2:5,lambdahat),1-ppois(5,lambdahat))
> chisq.test(observed,p=expprob)
Chi-squared test for given probabilities
data: observed
X-squared = 18.9915, df = 5, p-value = 0.001929
- Observe that the value of the test statistic is correct, but the p-value is wrong (as are the degrees of freedom). That's because the function doesn't know we estimated a parameter in obtaining the expected probabilities. To make this function more useful we should change it so that it takes an additional argument, call it PARMS, that records the number of estimated parameters. The code of chisq.test has an internal variable called PARAMETER and a line in the function that reads: PARAMETER <- length(x) - 1. (To see the code of chisq.test just type its name at the prompt in the console and hit return.) This line should then read: PARAMETER <- length(x) - 1 - PARMS
- Next suppose I try using the function on the unpooled data. I redefine the expected probabilities so all categories are distinct, except I pool the tail probabilities into the last category.
> expprob2<-c(dpois(0:8,lambdahat),1-ppois(8,lambdahat))
> #check that all cells are accounted for
> sum(expprob2)
[1] 1
> chisq.test(rawcounts,p=expprob2)
Chi-squared test for given probabilities
data: rawcounts
X-squared = 48.5686, df = 9, p-value = 1.999e-07
Warning message:
Chi-squared approximation may be incorrect in: chisq.test(rawcounts, p = expprob2)
- Once again the degrees of freedom are wrong, they should be 8. Assuming the chi-squared distribution is appropriate we can calculate the p-value using the correct degrees of freedom as follows.
> out.bad<-chisq.test(rawcounts,p=expprob2)
Warning message:
Chi-squared approximation may be incorrect in: chisq.test(rawcounts, p = expprob2)
> names(out.bad)
[1] "statistic" "parameter" "p.value" "method" "data.name" "observed" "expected" "residuals"
> out.bad["statistic"]
$statistic
X-squared
48.56862
> out.bad["parameter"]
$parameter
df
9
> 1-pchisq(as.numeric(out.bad["statistic"]), as.numeric(out.bad["parameter"])-1)
[1] 7.69037e-08
- Note: the as.numeric function is necessary in the last step because the entries of out.bad are actually character values.
- Also notice the warning message that occurs because it detects that some of the expected cell counts are less than 5.
- None of this would be interesting except for the fact that chisq.test has an additional option it calls simulate.p.value. Setting this argument to TRUE causes R to carry out a Monte Carlo simulation to obtain the p-value of the observed value of the test statistic
.
This works as follows.
- The expected probability vector p is used to randomly generate new data. Let the ten components of p be labeled p0, p1, ... , p9. At each trial we generate a new observation. The new observation has probability p0 of being a 0, probability p1 of being a 1, etc. This is easy to accomplish with a uniform random number generator.
- We repeat this as many times as there are observations, for our data set 50 times, each time obtaining a count from 0 to 9.
- These simulated data become our new set of raw counts. We next calculate
using these simulated data as the observed values. The expected values are obtained, as usual, by multiplying the components of p by 50. Denote this value by
.
- We now repeat the whole process generating a new set of 50 simulated observations once again calculating
using the same set of expected values.
- This is done a total of B times (set by default to 2000). So in the end we have 2000 values of
.
- Lastly calculate
using the observed data. Denote this value by
.
- Now count up the number of
that are greater than or equal to
. Add 1 to this number (to include
in the count) and divide the result by B + 1. This is the simulation-based p-value.
- The beauty of this approach is that it avoids the whole issue of whether the asymptotic chi-squared distribution is appropriate or not, because it doesn't use it! Thus the issue of pooling doesn't arise, because it's not necessary.
- So, I redo the goodness of fit test this time setting simulation.p.value=TRUE.
> chisq.test(rawcounts,p=expprob2,simulate.p.value=TRUE)
Chi-squared test for given probabilities with simulated p-value (based on 2000 replicates)
data: rawcounts
X-squared = 48.5686, df = NA, p-value = 0.0004998
- Observe that the reported p-value is equal to
meaning that the observed value of the test statistic was larger than any of the simulated values. With 2000 simulations we have accuracy to 3 decimal places. Thus the best we can say is that the true p-value is less than 0.001. To get a more accurate result we can increase the number of simulations by specifying our own value for the argument B. I try 20,000 simulations.
> chisq.test(rawcounts,p=expprob2,simulate.p.value=TRUE,B=20000)
Chi-squared test for given probabilities with simulated p-value (based on 20000 replicates)
data: rawcounts
X-squared = 48.5686, df = NA, p-value = 0.00015
- The reported p-value is equal to
meaning that there were two values of
as extreme as
. We now have 4-decimal place accuracy so when rounded to four decimal places we can conclude p < .0002.
- Notice that this result is not far from what we obtained with pooling (p = 0.0007889844) using the chi-squared distribution, but it is very different from the p-value obtained without grouping (p = 7.69037e-08). It's pretty clear that using the chi-squared distribution for the distribution of
with ungrouped data is inappropriate. Grouping on the other hand has performed reasonably well.
- In any case it is clear, both graphically and analytically, that the Poisson model does not provide an adequate fit to these data.
Cited References
- Agresti, Alan. 2002. Categorical Data Analysis. Wiley: New York.
- Cameron, A. Colin and Pravin K. Trivedi. 1998. Regression Analysis of Count Data. Cambridge University Press: New York.
- Krebs, Charles. 1999. Ecological Methodology. Addison-Wesley: Menlo Park, CA.
- Le, Chap T. 1998. Applied Categorical Data Analysis. Wiley: New York.
- Morgan, Byron J. T. 2000. Applied Stochastic Modelling. Arnold Press: London.
- Sokal, Robert R. and F. James Rohlf. 1995. Biometry. W. H. Freeman: New York.
Course Home Page