Assignment 3 — Solution

Question 1

Let Xi = number of utterances for bird i. Without loss of generality, let’s assume that the birds are numbered in the order they appear in the data table, i.e., first come the 132 birds who quit after one utterance, followed by the 52 birds who quit after two utterances, etc. So the probability we seek is the following joint probability.

Since this sample of 250 birds is a random sample, each of these observations is independent of each other. Thus this joint probability factors into a product of individual probabilities.

Now by assumption these individual probabilities don’t depend on which bird we happen to be considering, they depend only on the number of utterances. Thus we can drop the subscripts.

For the last term we note that

Using the geometric probability model we can replace each of these generic expressions with their corresponding probability expressions. Thus we have

Question 2

From the hint we know that in its various "geom" functions R counts the number of failures until the first success rather than the total number of trials. If we let Y be the total number of failures, then the probability we need, using the results from above, is the following.

Thus the likelihood is the following.

and the loglikelihood is

This can be efficiently programmed as follows. In "Method 1" I use the tabulated values which allows me to essentially translate the line immediately above into R code.

> #Method 1: using tabulated counts
> song.counts<-c(132,52,34,9,7,5,5,6)
>
loglik<-function(p) sum(song.counts[1:7]*log(dgeom(0:6,p)))+ song.counts[8]*log(1-pgeom(6,p))

"Method 2" below first creates the raw data from the tabulated counts. In the rep line I replicate the number of failures (rather than the total number of trials) for compatibility with R's "geom" functions.

> #Method 2: using raw data
> raw.data<-rep(0:7,song.counts)
> loglik.2<-function(p) sum(log(dgeom(raw.data[raw.data<=6],p)))+
          sum(log(1-pgeom(raw.data[raw.data==7]-1,p)))

The utterances are coded as 7 in the last category, but I need to calculate the probability as P(X > 6). Thus in the function definition I use the 7 to select these last observations but then I decrease their actual values by one when I calculate the probability.

Question 3

> plot(seq(0,1,.01),sapply(seq(0,1,.01),loglik), type='l', xlab='p', ylab='loglikelihood')

From the plot it's clear we should zoom in on the interval (0.4, 0.6).

> plot(seq(.4,.6,.01), sapply(seq(.4,.6,.01), loglik), type='l', xlab='p', ylab='loglikelihood')

    

Fig. 1   Zooming in on the MLE graphically

From the plot we would estimate p = 0.48 to two decimal places.

Question 4

> negloglik<-function(p) -loglik(p)
> nlm(negloglik,0.48,hessian=TRUE)->out
> out

$minimum
[1] 356.9039
$estimate
[1] 0.4728682
$gradient
[1] -1.932676e-08
$hessian
         [,1]
[1,] 2070.006
$code
[1] 1
$iterations
[1] 3

The gradient is small and code = 1 indicates convergence. The estimated MLE and minimum value of the negative loglikelihood match what we saw graphically. So we estimate  = 0.473 to three decimal places.

Question 5

Wald 95% confidence interval:

> out$estimate-qnorm(.975)*sqrt(1/out$hessian)

          [,1]
[1,] 0.4297895

> out$estimate+qnorm(.975)*sqrt(1/out$hessian)

          [,1]
[1,] 0.5159469

So a 95% Wald confidence interval is (0.430, 0.516). I use the plkhci function from the Bhat package to obtain the profile likelihood confidence interval.

> library(Bhat)
> x.in<-list(label='p',est=0.47,low=0.4,up=0.6)
> plkhci(x.in,negloglik,'p')->profile.ci
> profile.ci

[1] 0.4299107 0.5160957

So a 95% profile likelihood confidence interval is (0.430, 0.516), identical to the Wald interval to three decimal places.

Question 6

> exp.prob<-c(dgeom(0:6, out$estimate), 1-pgeom(6,out$estimate))
> Ei<-exp.prob*250
> Ei

[1] 118.217054 62.315967 32.848727 17.315608
[5]   9.127607  4.811452  2.536269  2.827316

> coords<-barplot(song.counts,ylim=c(0,135))
> points(coords,Ei,pch=16,cex=1.5)
> lines(coords,Ei)
> legend(coords[5],125, c('observed','expected'), pch=c(15,16), col=c('gray60',1), lty=c(NA,1), cex=c(.9,.9), bty='n')
> box()

Fig. 2   Goodness of fit

Question 7

> Ei

[1] 118.217054  62.315967  32.848727  17.315608   9.127607   4.811452   2.536269
[8]   2.827316

Category 6 is really close to 5. I try two different pooling methods. One leaves category 6 alone and pools the last two categories. The other pools the last three categories.

Method 1 (merge only two cells):

> Ei.merge1<-c(Ei[1:6],sum(Ei[7:8]) )
> Oi.merge1<-c(song.counts[1:6],sum(song.counts[7:8]))
> pearson1<-sum((Oi.merge1-Ei.merge1)^2/Ei.merge1)
> pearson1

[1] 13.77496

> 1-pchisq(pearson1,length(Oi.merge1)-1-1)

[1] 0.01710393

Method 2 (merge three cells):

> Ei.merge2<-c(Ei[1:5],sum(Ei[6:8]))
> Oi.merge2<-c(song.counts[1:5],sum(song.counts[6:8]))
> pearson2<-sum((Oi.merge2-Ei.merge2)^2/Ei.merge2)
> pearson2

[1] 11.17910

> 1-pchisq(pearson2,length(Oi.merge2)-1-1)

[1] 0.02462328

Randomization Test

I run the test three times and then try it with 20,000 randomizations.

> chisq.test(song.counts,p=exp.prob,simulate.p.value=TRUE)

        Chi-squared test for given probabilities with simulated p-value
        (based on 2000 replicates)
data:  song.counts
X-squared = 13.8053, df = NA, p-value = 0.05897

> chisq.test(song.counts,p=exp.prob,simulate.p.value=TRUE)

        Chi-squared test for given probabilities with simulated p-value
        (based on 2000 replicates)
data:  song.counts
X-squared = 13.8053, df = NA, p-value = 0.05947

> chisq.test(song.counts,p=exp.prob,simulate.p.value=TRUE)

        Chi-squared test for given probabilities with simulated p-value
        (based on 2000 replicates)
data:  song.counts
X-squared = 13.8053, df = NA, p-value = 0.06097

> chisq.test(song.counts,p=exp.prob,simulate.p.value=TRUE,B=20000)

        Chi-squared test for given probabilities with simulated p-value
        (based on 20000 replicates)
data:  song.counts
X-squared = 13.8053, df = NA, p-value = 0.0601

            Our results are ambiguous. The fit isn’t bad but it’s not ideal. The two Pearson chi-square tests in which we pool cells yielded significant results; the randomization test just misses being significant. At this point we might tentatively accept the model, but a skeptic would be justified in questioing the fit. We probably should try to refine it in some way, perhaps treating p as heterogeneous and by using bird characteristics to let p vary among the birds. But it's worth noting that what is essentially a null model nearly fits the data. This tells us that if there is "anything going on here" it's not likely to be very much.

Course Home Page


Jack Weiss
Phone: (919) 962-5930
E-Mail: jack_weiss@unc.edu
Address: Curriculum in Ecology, Box 3275, University of North Carolina, Chapel Hill, 27516
Copyright © 2006
Last Revised--Feb 14, 2006
URL: http://www.unc.edu/courses/2006spring/ecol/145/001/docs/solutions/assign3.htm