I read in the data and create the presence-absence variable.
> crabs<- read.table('http://www.unc.edu/courses/2006spring/ecol/145/001/data/midterm/crabs.txt', header=TRUE)
> pres.abs<-crabs$num.satellites>0
Question 1
To examine the linearity of logit, I divide the predictor, carapace width, into categories and calculate the empirical logit in each category. I then plot the empirical logit against the midpoint of each interval, fit a linear regression, and superimpose a lowess curve.
> width.cuts<-cut(crabs$width,quantile(crabs$width,seq(0,1,.1)),include.lowest=TRUE)
> tapply(pres.abs,width.cuts,sum)->num.successes
> tapply(pres.abs,width.cuts,length)->ni
> emp.logit<-log((num.successes+0.5)/(ni-num.successes+1/2))
> midpoint<-(quantile(crabs$width,seq(0,1,.1))[1:10] + quantile(crabs$width,seq(0,1,.1))[2:11])/2
> plot(midpoint,emp.logit,xlab='width',ylab='logit')
> abline(lm(emp.logit~midpoint),col=2)
> lines(lowess(emp.logit~midpoint),col=4,lty=2)
> mtext('Linearity of the logit',side=3,line=.5)
Fig. 1 shows that the linearity assumption looks quite good. We can also examine the relationship on the probability scale by plotting the presence-absence data versus width and superimposing a lowess curve. The lowess curve should be approximately S-shaped.
> plot(crabs$width, pres.abs, xlab='width', ylab='probability')
> lines(lowess(pres.abs~crabs$width), col=2)

Fig. 1 Graph of data on a logit and probability scale
Question 2
One of the requested significance tests is a likelihood ratio test that compares a logistic regression model with width as the predictor to a model with only an intercept. The test can be obtained by applying the anova function to a model with width as a predictor.
> glm(pres.abs~width,data=crabs,family=binomial)->out
> #LR test
> anova(out,test='Chisq')
Analysis of Deviance Table
Model: binomial, link: logit
Response: pres.abs
Terms added sequentially (first to last)
Df Deviance Resid. Df Resid. Dev P(>|Chi|)
NULL 172 225.759
width 1 31.306 171 194.453 2.204e-08
The second requested significance test is a Wald test. Wald tests are obtained by applying the summary function to the model and examining the coefficient table that is produced.
> summary(out)
Call:
glm(formula = pres.abs ~ width, family = binomial, data = crabs)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.0281 -1.0458 0.5480 0.9066 1.6941
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -12.3508 2.6287 -4.698 2.62e-06 ***
width 0.4972 0.1017 4.887 1.02e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 225.76 on 172 degrees of freedom
Residual deviance: 194.45 on 171 degrees of freedom
AIC: 198.45
Number of Fisher Scoring iterations: 4
Both the likelihood ratio test and the Wald test indicate a highly significant linear relationship between width and the log odds of satellite males being present.
By exponentiating the coefficient estimate of width we can obtain an estimate of the odds ratio.
> exp(coef(out)[2])
width
1.644162
The model predicts that each one centimeter increase in width increases the odds of a satellite male being present by 1.64.
Question 3
Fig. 2 Fitted logistic regression model
I illustrate two different ways to generate the plot. The result is shown in Fig. 2.
> #method 1--connecting fitted values
> plot(crabs$width, pres.abs, xlab='width', ylab='probability')
> temp<-cbind(crabs$width,fitted(out))
> sorted.temp<-temp[order(temp[,1]),]
> lines(sorted.temp[,1], sorted.temp[,2], col=2)
> #method 2--superimpose regression function
> plot(crabs$width, pres.abs, xlab='width', ylab='probability')
> prob.func<-function(x) 1/(1+exp(-coef(out)[1] - coef(out)[2]*x))
> lines(seq(21,34,.1), prob.func(seq(21,34,.1)), col=2)
Question 4
Method 1: Group by intervals of the predictor. I start by trying deciles.
> width.cuts<-cut(crabs$width, quantile(crabs$width, seq(0,1,.1)), include.lowest=TRUE)
> tapply(fitted(out),width.cuts,sum)->np
> fail<-ni-np
> Ei<-rbind(fail,np)
> Ei
[21,23.7] (23.7,24.5] (24.5,25.1] (25.1,25.7] (25.7,26.1] (26.1,26.7] (26.7,27.4]
fail 13.610632 10.375685 7.448845 8.017449 5.904548 5.701006 3.940702
np 5.389368 7.624315 7.551155 10.982551 10.095452 12.298994 12.059298
(27.4,28.2] (28.2,29] (29,33.5]
fail 3.100147 2.982941 0.9180457
np 12.899853 19.017059 13.0819543
There are far too many small categories. I try grouping the last four categories. To maximize the degrees of freedom for the test adjust the cut-offs manually so that I can obtain two categories from these four when grouped.
> quantile(crabs$width,seq(0,1,.1))
0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
21.00 23.70 24.50 25.06 25.70 26.10 26.70 27.44 28.20 29.00 33.50
> new.cat<-c(quantile(crabs$width,seq(0,1,.1))[1:7],27.7,33.6)
> new.width.cuts<-cut(crabs$width,new.cat,include.lowest=TRUE)
> tapply(fitted(out),new.width.cuts,sum)->np
> tapply(fitted(out),new.width.cuts,length)->new.ni
> fail<-new.ni-np
> rbind(fail,np)->Ei
> table(pres.abs,new.width.cuts)->Oi
> sum((Oi-Ei)^2/Ei)
[1] 3.186008
The degrees of freedom are the number of categories minus 2, the number of estimated parameters.
> 1-pchisq(sum((Oi-Ei)^2/Ei),dim(Oi)[2]-2)
[1] 0.7851649
We fail to find a significant lack of fit.
Method 2: Hosmer-Lemeshow test
> grouped.logit<-cut(predict(out),quantile(predict(out),seq(0,1,.1)),include.lowest=TRUE)
> table(pres.abs,grouped.logit)->Oi2
> rbind(tapply(fitted(out), grouped.logit, length)-tapply(fitted(out), grouped.logit,sum), tapply(fitted(out), grouped.logit,sum))->Ei2
> sum((Oi2-Ei2)^2/Ei2)
[1] 4.385541
> 1-pchisq(sum((Oi2-Ei2)^2/Ei2),8)
[1] 0.8207722
The Hosmer-Lemeshow test also fails to find a lack of fit.
Method 3: Goodness of fit test from Design package.
> library(Design)
> out.harrell<-lrm(pres.abs~width,data=crabs,x=TRUE,y=TRUE)
> residuals.lrm(out.harrell,type='gof')
Sum of squared errors Expected value|H0 SD Z
33.3562220 33.0342322 0.3031117 1.0622807
P
0.2881083
The p-value exceeds .05. So we fail to find any evidence of lack of fit using any of the tests.
| Jack Weiss Phone: (919) 962-5930 E-Mail: jack_weiss@unc.edu Address: Curriculum in Ecology, Box 3275, University of North Carolina, Chapel Hill, 27516 Copyright © 2006 Last Revised--April 7, 2006 URL: http://www.unc.edu/courses/2006spring/ecol/145/001/docs/solutions/assign8.htm |