Assignment 8—Solution

Setting up the data

I read in the data and create the presence-absence variable.

> crabs<- read.table('http://www.unc.edu/courses/2006spring/ecol/145/001/data/midterm/crabs.txt', header=TRUE)
> pres.abs<-crabs$num.satellites>0

Question 1

To examine the linearity of logit, I divide the predictor, carapace width, into categories and calculate the empirical logit in each category. I then plot the empirical logit against the midpoint of each interval, fit a linear regression, and superimpose a lowess curve.

> width.cuts<-cut(crabs$width,quantile(crabs$width,seq(0,1,.1)),include.lowest=TRUE)
> tapply(pres.abs,width.cuts,sum)->num.successes
> tapply(pres.abs,width.cuts,length)->ni
> emp.logit<-log((num.successes+0.5)/(ni-num.successes+1/2))
> midpoint<-(quantile(crabs$width,seq(0,1,.1))[1:10] + quantile(crabs$width,seq(0,1,.1))[2:11])/2
> plot(midpoint,emp.logit,xlab='width',ylab='logit')
> abline(lm(emp.logit~midpoint),col=2)
> lines(lowess(emp.logit~midpoint),col=4,lty=2)
> mtext('Linearity of the logit',side=3,line=.5)

Fig. 1 shows that the linearity assumption looks quite good. We can also examine the relationship on the probability scale by plotting the presence-absence data versus width and superimposing a lowess curve. The lowess curve should be approximately S-shaped.

> plot(crabs$width, pres.abs, xlab='width', ylab='probability')
> lines(lowess(pres.abs~crabs$width), col=2)

    

Fig. 1 Graph of data on a logit and probability scale

Question 2

One of the requested significance tests is a likelihood ratio test that compares a logistic regression model with width as the predictor to a model with only an intercept. The test can be obtained by applying the anova function to a model with width as a predictor.

> glm(pres.abs~width,data=crabs,family=binomial)->out
> #LR test
> anova(out,test='Chisq')


Analysis of Deviance Table
Model: binomial, link: logit
Response: pres.abs
Terms added sequentially (first to last)

      Df Deviance Resid. Df Resid. Dev P(>|Chi|)
NULL                    172    225.759
width  1   31.306       171    194.453 2.204e-08

The second requested significance test is a Wald test. Wald tests are obtained by applying the summary function to the model and examining the coefficient table that is produced.

> summary(out)

Call:
glm(formula = pres.abs ~ width, family = binomial, data = crabs)

Deviance Residuals:
    Min      1Q Median     3Q    Max
-2.0281 -1.0458 0.5480 0.9066 1.6941

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) -12.3508     2.6287  -4.698 2.62e-06 ***
width         0.4972     0.1017   4.887 1.02e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 225.76 on 172 degrees of freedom
Residual deviance: 194.45 on 171 degrees of freedom
AIC: 198.45

Number of Fisher Scoring iterations: 4

Both the likelihood ratio test and the Wald test indicate a highly significant linear relationship between width and the log odds of satellite males being present.

By exponentiating the coefficient estimate of width we can obtain an estimate of the odds ratio.

> exp(coef(out)[2])
width
1.644162

The model predicts that each one centimeter increase in width increases the odds of a satellite male being present by 1.64.

Question 3

Fig. 2  Fitted logistic regression model

I illustrate two different ways to generate the plot. The result is shown in Fig. 2.

> #method 1--connecting fitted values
> plot(crabs$width, pres.abs, xlab='width', ylab='probability')
> temp<-cbind(crabs$width,fitted(out))
> sorted.temp<-temp[order(temp[,1]),]
> lines(sorted.temp[,1], sorted.temp[,2], col=2)

> #method 2--superimpose regression function
> plot(crabs$width, pres.abs, xlab='width', ylab='probability')
> prob.func<-function(x) 1/(1+exp(-coef(out)[1] - coef(out)[2]*x))
> lines(seq(21,34,.1), prob.func(seq(21,34,.1)), col=2)

Question 4

Method 1: Group by intervals of the predictor. I start by trying deciles.

> width.cuts<-cut(crabs$width, quantile(crabs$width, seq(0,1,.1)), include.lowest=TRUE)
> tapply(fitted(out),width.cuts,sum)->np
> fail<-ni-np
> Ei<-rbind(fail,np)
> Ei

     [21,23.7] (23.7,24.5] (24.5,25.1] (25.1,25.7] (25.7,26.1] (26.1,26.7] (26.7,27.4]
fail 13.610632  10.375685     7.448845    8.017449    5.904548    5.701006    3.940702
np    5.389368   7.624315     7.551155   10.982551   10.095452   12.298994   12.059298
    (27.4,28.2] (28.2,29]   (29,33.5]
fail  3.100147   2.982941  0.9180457
np   12.899853  19.017059 13.0819543

There are far too many small categories. I try grouping the last four categories. To maximize the degrees of freedom for the test adjust the cut-offs manually so that I can obtain two categories from these four when grouped.

> quantile(crabs$width,seq(0,1,.1))
   0%   10%   20%   30%   40%   50%   60%   70%   80%   90%  100%
21.00 23.70 24.50 25.06 25.70 26.10 26.70 27.44 28.20 29.00 33.50

> new.cat<-c(quantile(crabs$width,seq(0,1,.1))[1:7],27.7,33.6)
> new.width.cuts<-cut(crabs$width,new.cat,include.lowest=TRUE)
> tapply(fitted(out),new.width.cuts,sum)->np
> tapply(fitted(out),new.width.cuts,length)->new.ni
> fail<-new.ni-np

> rbind(fail,np)->Ei
> table(pres.abs,new.width.cuts)->Oi
> sum((Oi-Ei)^2/Ei)

[1] 3.186008

The degrees of freedom are the number of categories minus 2, the number of estimated parameters.

> 1-pchisq(sum((Oi-Ei)^2/Ei),dim(Oi)[2]-2)
[1] 0.7851649

We fail to find a significant lack of fit.

Method 2: Hosmer-Lemeshow test

> grouped.logit<-cut(predict(out),quantile(predict(out),seq(0,1,.1)),include.lowest=TRUE)
> table(pres.abs,grouped.logit)->Oi2
> rbind(tapply(fitted(out), grouped.logit, length)-tapply(fitted(out), grouped.logit,sum), tapply(fitted(out), grouped.logit,sum))->Ei2
> sum((Oi2-Ei2)^2/Ei2)

[1] 4.385541
> 1-pchisq(sum((Oi2-Ei2)^2/Ei2),8)
[1] 0.8207722

The Hosmer-Lemeshow test also fails to find a lack of fit.

Method 3: Goodness of fit test from Design package.

> library(Design)
> out.harrell<-lrm(pres.abs~width,data=crabs,x=TRUE,y=TRUE)
> residuals.lrm(out.harrell,type='gof')

Sum of squared errors Expected value|H0        SD         Z
           33.3562220        33.0342322 0.3031117 1.0622807
        P
0.2881083

The p-value exceeds .05. So we fail to find any evidence of lack of fit using any of the tests.

Course Home Page


Jack Weiss
Phone: (919) 962-5930
E-Mail: jack_weiss@unc.edu
Address: Curriculum in Ecology, Box 3275, University of North Carolina, Chapel Hill, 27516
Copyright © 2006
Last Revised--April 7, 2006
URL: http://www.unc.edu/courses/2006spring/ecol/145/001/docs/solutions/assign8.htm