Midterm—Solution

Problem 1

Read in the data.

> crabs<- read.table('http://www.unc.edu/courses/2006spring/ecol/145/001/data/midterm/crabs.txt', header=TRUE)

Questions 1 and 2

The basic strategy here is to divide the continuous variable carapace width into groups and within each group calculate the mean and variance of the response variable, the number of satellite males. In doing this we need to balance the need for having an adequate amount of data to calculate the means and variances, and an adequate number of groupings for estimating the regression models. Since results may be sensitive to the way we choose the groupings I write a function that lets me specify the number of groupings, produces a scatter plot of the means and variances for those groupings, and adds various regression curves to the plot. I use the length argument of the seq function to determine the number of groups (the value for length needs to be one more than the desired number of groups) with cutpoints obtained from the quantile function. The function takes one argument, the desired number of categories.

> Varfunction<-function(numcats) {
widthcuts<-cut(crabs$width, quantile(crabs$width, seq(0, 1, length=numcats+1)), include.lowest=TRUE)
vars<-tapply(crabs$num.satellites, widthcuts, var)
means<-tapply(crabs$num.satellites, widthcuts, mean)
plot(means, vars, xlim=c(0, max(means)), ylim=c(0, max(vars)), xlab='Mean # of satellite males', ylab='Variance in # of satellite males')
abline(c(0, coef(nb1.rel)), col=2, lty=2, lwd=2)
abline(c(0, 1), col=3, lty=1, lwd=2)
nb2.rel<-lm(vars~offset(means) + I(means^2) - 1)
quad.func<-function(x) x + coef(nb2.rel)*x^2
lines(seq(0, 6, .1), quad.func(seq(0, 6, .1)), col=1, lty=1, lwd=2)
legend(0, max(vars), c('Poisson', 'quasi-Poisson (NB1)', 'negative binomial (NB2)'), col=c(3, 2, 1), lwd=c(2, 2, 2), lty=c(1, 2, 1), cex=c(.8, .8, .8), bty='n')
mtext(paste(numcats,' categories of carapace width', sep=''), side=3, line=.5, cex=.9)
nb1.rel<-lm(vars~means-1)
#add R2 to output
c(summary(nb1.rel)$r.squared,summary(nb2.rel)$r.squared)
}

I illustrate the results using two different choices for the number of categories.

> Varfunction(9)
> Varfunction(11)

     

Question 3

Based on the graph we see that the variance does increase with the mean and the rate of increase is greater than what is predicted by a Poisson model. Of the negative binomial models, the NB1 model seems quite reasonable here. There's nothing to suggest the parabolic pattern predicted by the NB2 model is a better choice. As mentioned in class, the R2 for models without intercepts are not comparable with the R2 for models with intercepts, but they can be used to compare models of the same type against each other.

> cbind(sapply(8:12,Varfunction))->r2.out
> rownames(r2.out)<-c('NB1','NB2')
> colnames(r2.out)<-8:12
> r2.out

            8         9        10        11        12
NB1 0.8835496 0.8711384 0.8683304 0.8836334 0.8436923
NB2 0.8447387 0.8175356 0.8094694 0.8044818 0.7879245

From the output we see that the R2 in each grouping is higher for the NB1 model than for the NB2 model. In truth either one is clearly a better choice than the Poisson. Since we did not learn how to fit an NB1 model in this class, I will proceed with the NB2 model.

Question 4

Model 1—no predictors

> negloglike1<-function(data,p)
- sum(log(dnbinom(data$num.satellites, mu=p[1], size=p[2])))
> model1<-nlm(function(p) negloglike1(mycrabs,p),c(2,2))
> model1

$minimum
[1] 383.7046

$estimate
[1] 2.9190737 0.7577588

$gradient
[1] -2.920965e-07 1.358558e-05

$code
[1] 1

$iterations
[1] 9

Model 2—carapace width as a predictor

> negloglike2<-function(data,p)
- sum(log(dnbinom(data$num.satellites, mu=exp(p[1]+p[2]*data$width), size=p[3])))
> model2<-nlm(function(p) negloglike2(mycrabs, p), c(0, .1, .7))
> model2

$minimum
[1] 375.6455

$estimate
[1] -4.0504244 0.1919944 0.9045750

$gradient
[1] 4.069843e-07 1.182343e-05 3.979039e-07

$code
[1] 1

$iterations
[1] 32

Question 5

We know , where K is the number of estimated parameters. The element labeled $minimum in the output from nlm is the negative loglikelihood of the model while the length of $estimate is K. Thus we calculate AIC as follows.

> -2*(-model1$minimum)+2*length(model1$estimate)
[1] 771.4092
> -2*(-model2$minimum)+2*length(model2$estimate)
[1] 757.291

Since the AIC of the second model, 757.3, is much smaller than that of the first, 771.4, we should prefer the second model that includes carapace width.

Question 6

I fit the models as negative binomial regression models.

> library(MASS)
> glm.nb(num.satellites~1,data=mycrabs)->crab.null
> glm.nb(num.satellites~width,data=mycrabs)->crab.out

We can obtain a likelihood ratio test of the two models using R's anova function.

> anova(crab.null, crab.out)
Likelihood ratio tests of Negative Binomial Models

Response: num.satellites
  Model    theta Resid. df 2 x log-lik.   Test  df LR stat.      Pr(Chi)
1     1 0.757759       172    -767.4092
2 width 0.904568       171    -751.2910 1 vs 2   1 16.11827 5.950707e-05

Alternatively we can use the Wald test reported in the output of the summary function.

> summary(crab.out, corr=FALSE)

Call:
glm.nb(formula = num.satellites ~ width, data = mycrabs, init.theta = 0.904568080033868,
link = log)

Deviance Residuals:
    Min      1Q  Median     3Q    Max
-1.7798 -1.4110 -0.2502 0.4770 2.0177

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) -4.05251    1.17143 -3.459  0.000541 ***
      width  0.19207    0.04406  4.360  1.30e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for Negative Binomial(0.9046) family taken to be 1)

    Null deviance: 213.05 on 172 degrees of freedom
Residual deviance: 195.81 on 171 degrees of freedom
AIC: 757.29

Both tests agree; the p-values for the respective tests are highly significant. Carapace width is a significant predictor of the number of satellite males.

Question 7

To carry out the goodness of fit test we need the predicted values for each crab as determined by its covariate pattern. This is easily obtained using the fitted function applied to the best negative binomial regression model, the one that includes carapace width.

> fitted(crab.out)->mu.vals

With the individual means we can obtain a probability distribution for each crab. If I then sum this up over all the crabs I get the expected frequencies. I carry this out below for counts 0 to 14 and then lump the remaining counts into a final tail probability category. The sapply is needed here so that the function is applied separately to each desired count.

> exp<-c(sapply(0:14, function(x) sum(dnbinom(x, mu=mu.vals, size=crab.out$theta))), sum(1-pnbinom(14, mu=mu.vals, size=crab.out$theta)))
> exp

[1] 50.4831317 33.0465902 23.0054531 16.4408921 11.9555982 8.8133485 6.5725880 4.9519279
[9]  3.7655803  2.8878813  2.2322659  1.7382086  1.3628484 1.0754905 0.8539286 3.8142667

The observed counts are obtained with the table function.

> table(mycrabs$num.satellites)->obs
> obs

 0  1 2  3  4  5  6 7 8 9 10 11 12 14 15
62 16 9 19 19 15 13 4 6 3  3  1  1  1  1

Observe that there are no data for count category 0, so I insert a zero in this spot.

> newobs<-c(obs[1:13],0,obs[14:15])
> newobs

 0  1 2  3  4  5  6 7 8 9 10 11 12    14 15
62 16 9 19 19 15 13 4 6 3  3  1  1  0  1  1

Next we need to merge expected frequencies to achieve a minimum count of about 5. I proceed from the tail inward.

> sum(exp[14:16])
[1] 5.743686
> sum(exp[11:13])
[1] 5.333323
> sum(exp[9:10])
[1] 6.653462

I leave the 8th category alone since 4.95 is close enough to 5. I concatenate these into a vector.

> Ei<-c(exp[1:8],sum(exp[9:10]),sum(exp[11:13]),sum(exp[14:16]))
> Ei

[1] 50.483132 33.046590 23.005453 16.440892 11.955598 8.813348 6.572588 4.951928 6.653462
[10] 5.333323  5.743686

Next I repeat the exact same merging process with the observed values.

> Oi<-c(newobs[1:8],sum(newobs[9:10]),sum(newobs[11:13]),sum(newobs[14:16]))
> Oi

 0  1 2  3  4  5  6 7
62 16 9 19 19 15 13 4 9 5 2

Finally I carry out the Pearson chi-square goodness of fit test.

> pearson<-sum((Oi-Ei)^2/Ei)
> pearson

[1] 38.59571
> 1-pchisq(pearson,length(Oi)-1-3)
[1] 2.333998e-06

Since the reported p-value is small, we have a significant lack of fit. We conclude that the negative binomial regression model with carapace width as the sole predictor does not fit the data.

Problem 2

Read in the data.

> hurricane<- read.table('http://www.unc.edu/courses/2006spring/ecol/145/001/data/midterm/hurricanes.csv', sep=',', header=TRUE)

Question 1

The hint suggests using the method of moments estimates as the initial estimates in the maximum likelihood estimation. From lecture we know the mean and variance of a gamma distribution are given by

Plugging eqn [1] into eqn [2] allows me to solve for β.

Using this result in eqn [1] allows me to solve for α.

Using these formulas I can find initial estimates for α and β.

> alpha.init<-(mean(hurricane$Precipitation))^2/var(hurricane$Precipitation)
> alpha.init

[1] 1.589584
> beta.init<-mean(hurricane$Precipitation)/var(hurricane$Precipitation)
> beta.init

[1] 0.2181248

I use R's dgamma function to construct the negative loglikelihood. Using the hint I parameterize the shape parameter as exp(p[1]). The initial condition then is log(alpha.init). The rate argument of dgamma corresponds to our β. (Alternatively the scale parameter is 1 over β.)

> negloglike.gamma<- function(data, p) -sum(log(dgamma(data$Precipitation, shape=exp(p[1]), rate=p[2])))
> nlm(function(p) negloglike.gamma(hurricane, p), c(log(alpha.init), beta.init))->gamma.out

> gamma.out
$minimum
[1] 102.3594

$estimate
[1] 0.7826200 0.3001292

$gradient
[1] 3.339551e-06 -2.714273e-05

$code
[1] 1

$iterations
[1] 7

From the output we obtain α and β.

> #alpha
> exp(gamma.out$estimate[1])

[1] 2.187195
> #beta
> gamma.out$estimate[2]
[1] 0.3001292

Question 2

I display the fitted gamma density superimposed on a histogram of the data.

> hist(hurricane$Precipitation, freq=FALSE, xlab='Precipitation', ylim=c(0,.1), main='Gamma Distribution Fit to Hurricane Data')
> lines(seq(0,35,.1), dgamma(seq(0,35,.1), shape=exp(gamma.out$estimate[1]), rate=gamma.out$estimate[2]), col=2)

Question 3

There are only two things that we obtained in our fit of the gamma distribution to the data in question 1 that are relevant here: the negative loglikelihood and the estimates. I can obtain the loglikelihood from the GLIM as follows.

> glm(Precipitation~1,data=hurricane,family=Gamma)->out.gamma
> logLik(out.gamma)

'log Lik.' -102.4119 (df=2)
> gamma.out$minimum
[1] 102.3594

So except for the sign and a slight difference at the first decimal place they are the same. I can't obtain α and β but I can show that both approaches return the same mean. Since glm uses the inverse link for the gamma distribution, I obtain the mean of the gamma distribution as follows.

> 1/coef(out.gamma)
(Intercept)
7.2875

Using the results from Question 1, I can also obtain the mean by dividing α by β.

> exp(gamma.out$estimate[1])/gamma.out$estimate[2]
[1] 7.287513

Problem 3

Read in the data.

> gall<- read.table('http://www.unc.edu/courses/2006spring/ecol/145/001/data/midterm/gallfly.txt',
header=TRUE)

Question 1

I subset the data set so that we're only looking at eggs in 1935. I then replicate things to regenerate the raw data from the summary table.

> eggs.1935<-gall[gall$Type=='Eggs' & gall$Year=='1935',]
> raw.data<-rep(eggs.1935$Number,eggs.1935$Freq)

Following the hints I opt to fit a truncated Poisson distribution, a distribution that is Poisson except for the absence of a zero category.

> trunc.Pois<-function(x,lambda) dpois(x,lambda)/(1-dpois(0,lambda))
> negloglike<- function(data,p) -sum(log(trunc.Pois(data,p)))

I use the mean as an initial guess for the parameter λ.

> mean(raw.data)
[1] 3.02027
> nlm(function(p) negloglike(raw.data,p),3)->out.trunc
> out.trunc
$minimum
[1] 276.5876

$estimate
[1] 2.844621

$gradient
[1] -2.755625e-05

$code
[1] 1

$iterations
[1] 4

Checking the fit.

> #Goodness of fit
> exp<-c(trunc.Pois(1:11,out.trunc$estimate),1-sum(trunc.Pois(1:11,out.trunc$estimate)))
> exp.counts<-exp*sum(eggs.1935[,1])
> exp.counts

[1] 25.995857407 36.974178517 35.059172585 24.932513006 14.184709084 6.725019781 2.732875891
[8] 0.971749455 0.307139858 0.087369643 0.022593955 0.006820816

> Ei<-c(exp.counts[1:5],sum(exp.counts[6:12]))
> Ei

[1] 25.99586 36.97418 35.05917 24.93251 14.18471 10.85357
> Oi<-c(eggs.1935$Freq[1:5],sum(eggs.1935$Freq[6:12]))
> Oi

[1] 29 38 36 23 8 14
> pearson<-sum((Oi-Ei)^2/Ei)
> pearson

[1] 4.159417
> 1-pchisq(pearson,length(Oi)-1-1)
[1] 0.3848608

So there is no evidence of a significant lack of fit. Interestingly, the randomization test on the ungrouped data does yield a significant lack of fit.

> chisq.test(eggs.1935$Freq,p=exp,simulate.p.value=TRUE)

Chi-squared test for given probabilities with simulated p-value (based on 2000
replicates)

data: eggs.1935$Freq
X-squared = 152.9483, df = NA, p-value = 0.002999

Question 2

This question was interpreted two different ways by different people. One interpretation was to check if a truncated Poisson also fits the gall-cells of 1935. This would just repeat the analysis of Question 1 with different data.

> gall.1935<-gall[gall$Type=='Gall-cells' & gall$Year=='1935',]
> raw.data<-rep(gall.1935$Number,gall.1935$Freq)
> mean(raw.data)
[1] 2.283296
> nlm(function(p) negloglike(raw.data,p),3)->out.trunc
> out.trunc$estimate
[1] 1.962466
> #Goodness of fit
> exp<-c(trunc.Pois(1:11,out.trunc$estimate),1-sum(trunc.Pois(1:11,out.trunc$estimate)))
> exp.counts<-exp*sum(gall.1935[,1])
> Ei<-c(exp.counts[1:6],sum(exp.counts[7:12]))
> Ei
[1] 284.254752 278.920093 182.457034 89.516416 35.134578 11.491734 4.225393
> Ei<-c(exp.counts[1:5],sum(exp.counts[6:12]))
> Oi<-c(gall.1935$Freq[1:5],sum(gall.1935$Freq[6:12]))
> pearson<-sum((Oi-Ei)^2/Ei)
> pearson
[1] 6.884697
> 1-pchisq(pearson,length(Oi)-1-1)
[1] 0.1421084

The lack of fit test is not significant. The truncated Poisson distribution appears to provide an adequate fit.

The second interpretation, which is what I intended, is to address whether a common distribution works for 1935 gall-cell counts and egg counts. To do this I fit a truncated Poisson model in which egg and gall-cell counts have a common mean and compare it to a model in which we allow egg and gall-cell counts to have separate means.

> gall1935<-gall[gall$Year==1935,]
> raw.count<-rep(gall1935$Number,gall1935$Freq)
> raw.type<-rep(gall1935$Type,gall1935$Freq)
> newdata<-cbind(raw.count,raw.type)

The variable raw.type above records whether the raw count is of eggs or gall-cells. For the initial conditions I obtain the means.

> tapply(newdata[,1],newdata[,2],mean)
       1        2
3.020270 2.283296

> mean(newdata[,1])
[1] 2.388781

I use my negloglike function from Question 1 to estimate the common mean model.

> nlm(function(p) negloglike(newdata[,1],p),2.3)->model2a
> model2a
$minimum
[1] 1626.559

$estimate
[1] 2.094706

$gradient
[1] 1.389399e-05

$code
[1] 1

$iterations
[1] 5

I write a new negloglike function for separate means mmodel by including a dummy variable that records whether an observation is an egg or a gall cell.

> negloglike1935<- function(type,data,p) {
type.dummy<-as.numeric(type)-1
mylambda<-p[1]+p[2]*type.dummy
-sum(log(trunc.Pois(data,mylambda)))
}
> nlm(function(p) negloglike1935(newdata[,2],newdata[,1],p),c(3,-1))->model2b
> model2b

$minimum
[1] 1608.328

$estimate
[1] 2.8446144 -0.8821483

$gradient
[1] 2.525828e-05 2.478373e-05

$code
[1] 1

$iterations
[1] 5

The loglikelihoods are quite different. Formally, we can do a likelihood ratio test.

> LR.stat<-2*(model2a$minimum-model2b$minimum)
> LR.stat

[1] 36.46320
> 1-pchisq(LR.stat,1)
[1] 1.555763e-09

This tests whether the parameters that are found in one model but not the other can be set equal to zero. The parameter not in both is the one identifying the type, egg or gall-cell. Since we reject the hypothesis that this parameter is zero we conclude that there is a need to account for whether an observation corresponds to an egg or gall-cell.

Alternatively, we can compare AIC values.

> #model2a AIC
> -2*(-model2a$minimum)+2*length(model2a$estimate)
[1] 3255.118
> #model2b AIC
> -2*(-model2b$minimum)+2*length(model2b$estimate)
[1] 3220.655

The separate means model has the lowest AIC. The weight of evidence leads us to conclude that eggs and gall-cells have truncated Poisson distributions with different values for the parameter λ.

Question 3

I repeat the methodology of Question 2 but this time I examine whether we need separate means for each year or whether a common mean will suffice. Since we decided egg and gall-cells have different distributions in Question 2, I do this twice: once for eggs and once for gall-cells.

Analysis for eggs

> gall.eggs<-gall[gall$Type=='Eggs',]
> count.egg<-rep(gall.eggs$Number,gall.eggs$Freq)
> year.egg<-rep(gall.eggs$Year,gall.eggs$Freq)
>
>
> neglogliketype<- function(year,data,p) {
+ year.dummy<-(year==1936)
+ mylambda<-p[1]+p[2]*year.dummy
+ -sum(log(trunc.Pois(data,mylambda)))
+ }

Obtain initial values for estimates using means.

> mean(count.egg)
[1] 3.025424
> tapply(count.egg,year.egg,mean)
    1935     1936
3.020270 3.034091

> nlm(function(p) negloglike(count.egg,p),3)->model3a
> nlm(function(p) neglogliketype(year.egg,count.egg,p),c(3,0))->model3b
> model3a
$minimum
[1] 443.1378

$estimate
[1] 2.850508

$gradient
[1] 1.375964e-06

$code
[1] 1

$iterations
[1] 3

> model3b
$minimum
[1] 443.1357

$estimate
[1] 2.84462054 0.01578324

$gradient
[1] 4.188390e-05 5.712764e-05

$code
[1] 1

$iterations
[1] 5

Likelihood ratio test

> -2*(-model3a$minimum+model3b$minimum)
[1] 0.004220353
> 1-pchisq(-2*(-model3a$minimum+model3b$minimum),1)
[1] 0.9482025
> #one year AIC
> -2*(-model3a$minimum)+2*length(model3a$estimate)
[1] 888.2756
> #two year AIC
> -2*(-model3b$minimum)+2*length(model3b$estimate)
[1] 890.2714

AIC and the LR.test agree. We don't need separate parameters for separate years.

Analysis for gall-cells

> gall.gall<-gall[gall$Type=='Gall-cells',]
> count.gall<-rep(gall.gall$Number,gall.gall$Freq)
> year.gall<-rep(gall.gall$Year,gall.gall$Freq)

Obtain initial values for estimates using means.

> mean(count.gall)
[1] 2.29617
> tapply(count.gall,year.gall,mean)
    1935     1936
2.283296 2.335640

> nlm(function(p) negloglike(count.gall,p),2)->model3c
> nlm(function(p) neglogliketype(year.gall,count.gall,p),c(2,.1))->model3d
> model3c
$minimum
[1] 1780.905

$estimate
[1] 1.978739

$gradient
[1] 5.975235e-06

$code
[1] 1

$iterations
[1] 3

> model3d
$minimum
[1] 1780.716

$estimate
[1] 1.96246603 0.06592922

$gradient
[1] 0.0003044832 0.0000959517

$code
[1] 1

$iterations
[1] 4

Likelihood ratio test

> -2*(-model3c$minimum+model3d$minimum)
[1] 0.3782449
> 1-pchisq(-2*(-model3c$minimum+model3d$minimum),1)
[1] 0.5385441
> #one year AIC
> -2*(-model3c$minimum)+2*length(model3c$estimate)
[1] 3563.811
> #two year AIC
> -2*(-model3d$minimum)+2*length(model3d$estimate)
[1] 3565.433

Our results for gall-cells concur with those for eggs. We find no evidence for a difference between years.

Course Home Page


Jack Weiss
Phone: (919) 962-5930
E-Mail: jack_weiss@unc.edu
Address: Curriculum in Ecology, Box 3275, University of North Carolina, Chapel Hill, 27516
Copyright © 2006
Last Revised--March 23, 2006
URL: http://www.unc.edu/courses/2006spring/ecol/145/001/docs/solutions/midterm.htm