I read in the data and create the presence-absence variable.
> crabs<- read.table('http://www.unc.edu/courses/2006spring/ecol/145/001/data/midterm/crabs.txt', header=TRUE)
> pres.abs<-crabs$num.satellites>0
Question 1
Fig. 1 Restricted cubic spline plot of weight
There are four variables that are candidates for predictors.
> names(crabs)
[1] "color" "spine" "width" "num.satellites" "weight"
In assignment 8 we learned that width was linear in the logit. I examine weight next. I use the rcspline.plot function from the Design library to regress the logit against a restricted cubic spline transformation of weight.
> library(Design)
> rcspline.plot(y=pres.abs, x=crabs$weight, nk=5, m=20)
> mtext(side=1, line=2.5, 'weight')
Fig. 1 shows that the linearity assumption looks quite good. The Wald test for linearity is far from being statistically significant, so we have no basis for rejecting the assumption of linearity based on the data. The LR statistic and association Wald test both indicate a significant relationship between the log odds of a satellite male being present and weight.
The remaining variables that are available, spine and color, are categorical variables. It's worth noting that technically these are ordinal categorical variables in that the categories are arranged monotonically. For color the categories are ordered in levels of increasing "darkness" of color. For spine the categories are ordered in increasing damage to the two spines. The actual numeric values assigned to these categories have no meaning. All we really know is the order of the categories not how much "bigger" one category is than another. We will investigate later whether the monotonicity of the categories can be used to our advantage, but for the moment will will just treat these as generic categorical variables and enter them into the model as factors. Note: It is important not to enter them as numeric variables because that would treat the totally arbitrary categorical labels as being somehow meaningful.
I use the default dummy coding scheme generated by R for the factors.
> glm(pres.abs~width+weight+spine.f+color.f, data=crabs, family=binomial)->out1
> summary(out1)
Call:
glm(formula = pres.abs ~ width + weight + spine.f + color.f,
family = binomial, data = crabs)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.1977 -0.9424 0.4849 0.8491 2.1198
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -8.0650134 3.9285518 -2.053 0.0401 *
width 0.2631279 0.1952986 1.347 0.1779
weight 0.0008258 0.0007038 1.173 0.2407
spine.f2 -0.0959809 0.7033698 -0.136 0.8915
spine.f3 0.4002868 0.5027043 0.796 0.4259
color.f3 -0.1029024 0.7825912 -0.131 0.8954
color.f4 -0.4888642 0.8531183 -0.573 0.5666
color.f5 -1.6086658 0.9355326 -1.720 0.0855 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 225.76 on 172 degrees of freedom
Residual deviance: 185.20 on 165 degrees of freedom
AIC: 201.2
Number of Fisher Scoring iterations: 4
From the summary output it would appear nothing is statistically significant. The tests of the categorical variables are in fact tests of specific contrasts, so a more honest picture can be obtained using likelihood ratio tests comparing models in which either color or spine are omitted.
> glm(pres.abs~width+weight+color.f, data=crabs, family=binomial)->out1.nospine
> glm(pres.abs~width+weight+spine.f, data=crabs, family=binomial)->out1.nocolor
> anova(out1.nospine, out1, test='Chisq')
Analysis of Deviance Table
Model 1: pres.abs ~ width + weight + color.f
Model 2: pres.abs ~ width + weight + spine.f + color.f
Resid. Df Resid. Dev Df Deviance P(>|Chi|)
1 167 186.211
2 165 185.202 2 1.009 0.604
> anova(out1.nocolor,out1,test='Chisq')
Analysis of Deviance Table
Model 1: pres.abs ~ width + weight + spine.f
Model 2: pres.abs ~ width + weight + spine.f + color.f
Resid. Df Resid. Dev Df Deviance P(>|Chi|)
1 168 192.798
2 165 185.202 3 7.596 0.055
So, categorical spine and categorical color do not significantly improve the model when the other variables are already present.
Question 2
The troubling thing is that we already know from Assignment 8 is that there is a highly significant relationship between presence-absence of satellite males and the width of the female. Yet when we fit the full model, this relationship goes undetected. The explanation must lie in the presence of the other variables. The summary function reports a variable-added-last test, i.e., it tests width when the other three variables are already in the model. We can get some insight into this by applying the anova function to the model. The anova function does a variables added in order series of likelihood ratio tests.
> anova(out1,test='Chisq')
Analysis of Deviance Table
Model: binomial, link: logit
Response: pres.abs
Terms added sequentially (first to last)
Df Deviance Resid. Df Resid. Dev P(>|Chi|)
NULL 172 225.759
width 1 31.306 171 194.453 2.204e-08
weight 1 1.561 170 192.892 0.212
spine.f 2 0.094 168 192.798 0.954
color.f 3 7.596 165 185.202 0.055
Now we see that width is highly significant as we've seen before. That's because the test being reported is for width when nothing else is in the model. The test for weight is a test when width is already present. Suppose we reverse the order of weight and width in the model formula.
> glm(pres.abs~weight+width+spine.f+color.f, data=crabs, family=binomial)->out1.5
> anova(out1.5,test='Chisq')
Analysis of Deviance Table
Model: binomial, link: logit
Response: pres.abs
Terms added sequentially (first to last)
Df Deviance Resid. Df Resid. Dev P(>|Chi|)
NULL 172 225.759
weight 1 30.021 171 195.737 4.273e-08
width 1 2.845 170 192.892 0.092
spine.f 2 0.094 168 192.798 0.954
color.f 3 7.596 165 185.202 0.055
When weight is added first its seen to be a statistically significant addition to the model, but now width is not. Clearly the presence of width and weight in the same model is messing things up.
Question 3
I drop weight entirely and refit the model
> glm(pres.abs~width+spine.f+color.f, data=crabs, family=binomial)->out2
> summary(out2)
Call:
glm(formula = pres.abs ~ width + spine.f + color.f, family = binomial,
data = crabs)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.1206 -0.9723 0.5076 0.8750 2.1158
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -11.09953 2.97706 -3.728 0.000193 ***
width 0.45624 0.10779 4.233 2.31e-05 ***
spine.f2 -0.05782 0.70308 -0.082 0.934453
spine.f3 0.37703 0.50191 0.751 0.452540
color.f3 -0.14340 0.77838 -0.184 0.853830
color.f4 -0.52405 0.84685 -0.619 0.536030
color.f5 -1.66833 0.93285 -1.788 0.073706 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 225.76 on 172 degrees of freedom
Residual deviance: 186.61 on 166 degrees of freedom
AIC: 200.61
Number of Fisher Scoring iterations: 4
Now width is statistically significant. It was the presence of weight alone that was causing the trouble. I check the correlation of width and weight.
> cor(crabs$width,crabs$weight)
[1] 0.8868715
They're highly correlated. This of course makes perfect sense. Wider crabs are bigger crabs and hence are heavier crabs. Width and weight are essentially measuring the same thing and clearly both should not be in the same model.
Question 4
From what we've seen clearly width should be in the model. All that remains is to decide whether spine.f, color.f, or both should be in the model with width. We can assess this parsimoniously by fitting two additional models: a model with width and spine.f, and a model with width and color.f and compare these to a model with just width.
> glm(pres.abs~width+spine.f, data=crabs, family=binomial)->out3.spine
> glm(pres.abs~width+color.f, data=crabs, family=binomial)->out3.color
> glm(pres.abs~width, data=crabs, family=binomial)->out0.width
> anova(out3.spine,out0.width,test='Chisq')
Analysis of Deviance Table
Model 1: pres.abs ~ width + spine.f
Model 2: pres.abs ~ width
Resid. Df Resid. Dev Df Deviance P(>|Chi|)
1 169 194.425
2 171 194.453 -2 -0.028 0.986
> anova(out3.color,out0.width,test='Chisq')
Analysis of Deviance Table
Model 1: pres.abs ~ width + color.f
Model 2: pres.abs ~ width
Resid. Df Resid. Dev Df Deviance P(>|Chi|)
1 168 187.457
2 171 194.453 -3 -6.996 0.072
So neither test is significant at the 5% level, although the test for color is close. If we are strict "5-percenters", we'd stop with a model that contains only width.
Question 5
Following the implied "blessing" I use the stepAIC function starting with the main effects model, choosing a model with all possible interactions as the maximal model, and a model with only an intercept as the minimal model. The main effects model was called out2 above.
> library(MASS)
> stepAIC(out2, scope=list(upper=~width*spine.f*color.f, lower=~1))
Start: AIC= 200.61
pres.abs ~ width + spine.f + color.f
Df Deviance AIC
- spine.f 2 187.46 197.46
<none> 186.61 200.61
+ width:color.f 3 181.64 201.64
- color.f 3 194.43 202.43
+ spine.f:color.f 6 177.60 203.60
+ width:spine.f 2 186.41 204.41
- width 1 208.83 220.83
Step: AIC= 197.46
pres.abs ~ width + color.f
Df Deviance AIC
<none> 187.46 197.46
- color.f 3 194.45 198.45
+ width:color.f 3 183.08 199.08
+ spine.f 2 186.61 200.61
- width 1 212.06 220.06
Call: glm(formula = pres.abs ~ width + color.f, family = binomial, data = crabs)
Coefficients:
(Intercept) width color.f3 color.f4 color.f5
-11.38519 0.46796 0.07242 -0.22380 -1.32992
Degrees of Freedom: 172 Total (i.e. Null); 168 Residual
Null Deviance: 225.8
Residual Deviance: 187.5 AIC: 197.5
Following the logic of the iterations we see at the first step spine.f is dropped because it yields a model with the lowest AIC. At the next iteration this model is retained because dropping or adding any term permitted by the marginality principle causes the AIC to rise. The final model includes width and color.f.
Interpreting the parameter estimates
Since the coefficient of width is positive we see increases in width crease the probability of a satellite male being present. Formally,
> glm(pres.abs~width+color.f, data=crabs, family=binomial)->out3
> coef(out3)
(Intercept) width color.f3 color.f4 color.f5
-11.38519276 0.46795598 0.07241694 -0.22379766 -1.32991913
> exp(coef(out3)[2])
width
1.596727
The odds go up by a factor of 1.6 for every 1 cm increase in female carapace width. The color variable is a bit more complicated to interpret. The coding scheme is given below.
> contrasts(color.f)
3 4 5
2 0 0 0
3 1 0 0
4 0 1 0
5 0 0 1
The color "medium light" is the baseline color. Our generic model for the log odds of a satellite male being present is the following.
![]()
where
are the dummy variables for color. Using the contrasts above we can write equations for each of the four colored females.

If we have two females of exactly the same width but different colors we interpret the color coefficients as a difference of log odds of a satellite male being present where the comparison female is a medium light female.
A similar relationship holds for medium dark and dark females, once again with medium light females as the reference group. Thus each coefficient is the log odds ratio of satellite male being present for two crabs of the same width where one is medium, medium dark, or dark, and the other is medium light in color.
> exp(coef(out3)[3])
color.f3
1.075104
So the ratio of odds of a satellite male being present is 1.08 for medium females versus medium light females of the same size, roughly the same.
> exp(coef(out3)[4])
color.f4
0.7994769
So the ratio of odds of a satellite male being present is 0.8 for medium dark females versus medium light females of the same size, a decrease.
> exp(coef(out3)[5])
color.f5
0.2644987
So the ratio of odds of a satellite male being present is 0.26 for dark females versus medium light females of the same size, a sizable decrease.
Question 6
There are just two three nested models to consider in addition to the model of question 5.
Model 0: intercept only
Model 1: width only
Model 2: color.f only
Model 3: width + color.f
Because color is a categorical variable with four categories, the proper way to test it is to use a likelihood ratio test. The univariate Wald tests reported in the summary table do not test the construct color, but instead the contrast specified in the individual dummy variables.
> model0<-glm(pres.abs~1,data=crabs,family=binomial)
> model1<-glm(pres.abs~width,data=crabs,family=binomial)
> model2<-glm(pres.abs~color.f,data=crabs,family=binomial)
> model3<-glm(pres.abs~width+color.f,data=crabs,family=binomial)
> anova(model2,model3,test='Chisq')
Analysis of Deviance Table
Model 1: pres.abs ~ color.f
Model 2: pres.abs ~ width + color.f
Resid. Df Resid. Dev Df Deviance P(>|Chi|)
1 169 212.061
2 168 187.457 1 24.604 7.041e-07
> anova(model1,model3,test='Chisq')
Analysis of Deviance Table
Model 1: pres.abs ~ width
Model 2: pres.abs ~ width + color.f
Resid. Df Resid. Dev Df Deviance P(>|Chi|)
1 171 194.453
2 168 187.457 3 6.996 0.072
> anova(model0,model3,test='Chisq')
Analysis of Deviance Table
Model 1: pres.abs ~ 1
Model 2: pres.abs ~ width + color.f
Resid. Df Resid. Dev Df Deviance P(>|Chi|)
1 172 225.759
2 168 187.457 4 38.301 9.71e-08
Thus using significance testing with α = .05, we would retain width but drop color.f from the model.
Question 7
Fig. 2 Crab probability models
The equations we need for the four different "colored" females are given above as log odds. To convert these to a probability scale I use the relationship
Thus we define the following functions.
> med.light<-function(x) 1/(1+exp(-(coef(out3)[1] + coef(out3)[2]*x)))
> others<-function(x,k) 1/(1+exp(-(coef(out3)[1] + coef(out3)[2]*x + coef(out3)[k])))
where k = 3, 4, or 5.
> plot(crabs$width, pres.abs, col=crabs$color-1, cex=.8, xlab='width', ylab='probability')
> lines(seq(21,34,.1), med.light(seq(21,34,.1)), col=1)
> lines(seq(21,34,.1), others(seq(21,34,.1),3), col=2)
> lines(seq(21,34,.1), others(seq(21,34,.1),4), col=3)
> lines(seq(21,34,.1), others(seq(21,34,.1),5), col=4)
> legend(30,.5, c('medium light', 'medium', 'medium dark', 'dark'), col=1:4, cex=rep(.8,4), lty=rep(1,4), bty='n')
> mtext(side=3, line=.5, text='Probability of satellite male being present\nas a function of female width and color')
Question 8
The only sensible test to carry out is some variant of the Hosmer-Lemeshow test. With multiple predictors partitioning the data according to levels of the predictors has too many possibilities. Following what we've done before, I divide the estimated logits into deciles and use these to partition the observed and expected frequencies of zeros and ones.
> y.cuts<-cut(predict(out3), quantile(predict(out3), seq(0,1,.1)), include.lowest=TRUE)
> y.1<-tapply(fitted(out3),y.cuts,sum)
> y.n<-tapply(fitted(out3),y.cuts,length)
> y.0<-y.n-y.1
> Ei<-rbind(y.0,y.1)
> Ei
[-2.89,-0.782] (-0.782,-0.222] (-0.222,0.154] (0.154,0.433]
y.0 14.624349 9.849763 9.122673 6.984058
y.1 4.375651 6.150237 8.877327 10.015942
(0.433,0.714] (0.714,1] (1,1.36] (1.36,1.74] (1.74,2.12] (2.12,4.36]
y.0 6.022493 5.048826 4.100546 2.990691 2.139013 1.117588
y.1 10.977507 11.951174 12.899454 15.009309 14.860987 15.882412
Five of the 20 categories are too small. Rather than pool the existing categories I'm going to redefine the endpoints slightly. If I extend the right hand endpoint of the category (1, 1.36] I should be able to bring this one up to 5 and then I'll pool the rest.
> new.cat<-c(quantile(predict(out3), seq(0,1,.1))[1:7],1.56,4.4)
> y.cuts<-cut(predict(out3), new.cat,include.lowest=TRUE)
> y.1<-tapply(fitted(out3),y.cuts,sum)
> y.n<-tapply(fitted(out3),y.cuts,length)
> y.0<-y.n-y.1
> Ei<-rbind(y.0,y.1)
> Ei
[-2.89,-0.782] (-0.782,-0.222] (-0.222,0.154] (0.154,0.433]
y.0 14.624349 9.849763 9.122673 6.984058
y.1 4.375651 6.150237 8.877327 10.015942
(0.433,0.714] (0.714,1] (1,1.56] (1.56,4.4]
y.0 6.022493 5.048826 5.372703 4.975136
y.1 10.977507 11.951174 18.627297 40.024864
That's not perfect but now there are only two out of 16 categories that are too small, which is under the 20% value. I finish the analysis.
> Oi<-table(pres.abs,y.cuts)
> sum((Oi-Ei)^2/Ei)
[1] 2.245275
> 1-pchisq(sum((Oi-Ei)^2/Ei),dim(Oi)[2]-2)
[1] 0.8958155
So we fail to find evidence of lack of fit (p-value is large). As a second goodness of fit test I try the variant of the Hosmer-Lemeshow test in the Design library.
> library(Design)
> out.h<-lrm( pres.abs ~ width+color.f,data=crabs,x=TRUE,y=TRUE)
> residuals.lrm(out.h,type='gof')
Sum of squared errors Expected value|H0 SD
31.6457412 31.5393772 0.3184185
Z P
0.3340382 0.7383507
The reported p-value of 0.738 agrees with our result above. We have no evidence of lack of fit.
Question 9
Part 1
I obtain sensitivity and specificity from the ROCR library and look for the spot where the graphs cross.
> library(ROCR)
> pred<-prediction(fitted(out3), pres.abs)
> performance(pred, 'tpr', 'tnr')->testy
> tempmat<-cbind(testy@x.values[[1]], testy@y.values[[1]], testy@alpha.values[[1]])
> tempmat[abs(tempmat[,1]-tempmat[,2])<.02,]->submat
> submat
[,1] [,2] [,3]
[1,] 0.6935484 0.6756757 0.6467219
[2,] 0.6774194 0.6936937 0.6395257
Fig. 3 Maximizing sensitivity and specificity
If I average the values of c (3rd column) on either side of the changeover that will be a close approximation.
> mean(submat[,3])
[1] 0.6431238
To see if this looks reasonable, I plot sensitivity and specificity and indicate this cut-off with a vertical line.
> plot(testy@alpha.values[[1]], testy@x.values[[1]], type='n', xlab='c',
ylab='sensitivity or specificity')
> lines(testy@alpha.values[[1]], testy@y.values[[1]])
> lines(testy@alpha.values[[1]], testy@x.values[[1]],col=2)
> legend(.8,.8, c('sensitivity', 'specificity'), lty=c(1,1), col=1:2, cex=c(.9,.9), bty='n')
> abline(v=mean(submat[,3]), col=4, lty=2)
> text(.65, .2,paste('c = ', round(mean(submat[,3]), 3)), pos=2, col=4, cex=.9)
Part 2
Area under the curve is part of the output from the lrm function from the Design package. I examine it below.
> out.h
Logistic Regression Model
lrm(formula = pres.abs ~ width + color.f, data = crabs, x = TRUE,
y = TRUE)
Frequencies of Responses
FALSE TRUE
62 111
Obs Max Deriv Model L.R. d.f. P C Dxy
173 4e-05 38.3 4 0 0.771 0.543
Gamma Tau-a R2 Brier
0.546 0.251 0.272 0.183
Coef S.E. Wald Z P
Intercept -11.38519 2.8735 -3.96 0.0001
width 0.46796 0.1055 4.43 0.0000
color.f=3 0.07242 0.7399 0.10 0.9220
color.f=4 -0.22380 0.7771 -0.29 0.7733
color.f=5 -1.32992 0.8525 -1.56 0.1188
The coefficient labeled C, the concordance index, is also the AUC. The reported value is 0.771. Thus if we were to pair all observed ones with all observed zeros in all possible ways and then for each pair guess which of the two is a "one" based on which one has the higher predicted probability according to the model, then we would be right 77.1% of the time.
Part 3
I use the cv.binary function in the DAAG package. Following the hint I add the generated variables to the the data file and refit the model.
> crabs$pres.abs<-pres.abs
> crabs$color.f<-color.f
> out3a<-glm(pres.abs~width+color.f,data=crabs,family=binomial)
> cv.binary(out3a)
Fold: 8 4 5 3 6 10 7 2 9 1
Internal estimate of accuracy = 0.734
Cross-validation estimate of accuracy = 0.705
Fig. 4 ROC curve
What's reported is not the AUC but a number that is close in spirit. It reports the fraction of times the predicted probability when rounded to zero or one, matches the observed value. So we see that this 0.734 for full data set, but only 0.705 under cross-validation. Note: we can easily check this interpretation as follows.
> sum(round(fitted(out3))==pres.abs)/length(pres.abs)
[1] 0.734104
Part 4
I use the ROC function from the Epi package to generate the ROC curve.
> library(Epi)
> attach(crabs)
> ROC(form=pres.abs~width+color.f, plot="ROC")
Part 5
To superimpose the two ROC curves I extract the true positive rate 'TPR' and false positive rate 'FPR' from the two models in question using functions from the ROCR package.
> pred<-prediction(fitted(out3), pres.abs)
> performance(pred, 'tpr', 'fpr')->testy
> pred2<-prediction(fitted(out0.width), pres.abs)
> performance(pred2, 'tpr', 'fpr')->testy2
> plot(testy)
> par(new=TRUE)
> plot(testy2, col=2, axes=FALSE, xlab='', ylab='')
> legend(0,1,c('color and width','just width'), col=1:2, lty=c(1,1), bty='n', cex=c(.9,.9))
Alternatively I can draw the curves myself using the type='s' option.
> par(new=FALSE)
> plot(testy@x.values[[1]], testy@y.values[[1]], type='s', xlab='False positive rate',
ylab='True positive rate')
> par(new=TRUE)
> plot(testy2@x.values[[1]], testy2@y.values[[1]], type='s', xlab='False positive rate',
ylab='True positive rate', col=2)
> legend(0,1,c('color and width','just width'), col=1:2, lty=c(1,1), bty='n', cex=c(.9,.9))
> par(new=FALSE)
Notice this last plot is more honest because it is a true step function.
Fig. 5 ROC curves for two models—two versions of the same plot
Question 10
The evidence for the ordinal nature of the relationship is somewhat weak. It appears, except for the transition from medium light to medium, that as the crab gets darker it is less likely to have satellite males present. This follows from the fact β2 and β3 are negative with β3 < β2. On the other hand β1 > 0 but only slightly. The the intercepts in the logistic regression model, which are
have estimates that rank in the following order.
Question 11
The evidence for the ordinal nature of the relationship is somewhat weak. It appears, except for the transition from medium light to medium, that as the crab gets darker it is less likely to have satellite males present. This follows from the fact β2 and β3 are negative with β3 < β2. On the other hand β1 > 0
> color.o<-ordered(crabs$color)
> contrasts(color.o)
.L .Q .C
2 -0.6708204 0.5 -0.2236068
3 -0.2236068 -0.5 0.6708204
4 0.2236068 -0.5 -0.6708204
5 0.6708204 0.5 0.2236068
> model.linear<-glm(pres.abs~width+color.o,data=crabs,family=binomial)
> summary(model.linear)
Call:
glm(formula = pres.abs ~ width + color.o, family = binomial,
data = crabs)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.1124 -0.9848 0.5243 0.8513 2.1413
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -11.75552 2.74185 -4.287 1.81e-05 ***
width 0.46796 0.10554 4.434 9.26e-06 ***
color.o.L -0.95837 0.58032 -1.651 0.0986 .
color.o.Q -0.58927 0.47472 -1.241 0.2145
color.o.C -0.09867 0.33738 -0.292 0.7699
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 225.76 on 172 degrees of freedom
Residual deviance: 187.46 on 168 degrees of freedom
AIC: 197.46
Number of Fisher Scoring iterations: 4
Since none of the contrasts are significant (although the linear trend is closest to being significant), we don't find any evidence for a linear, quadratic or cubic trend. A positive result here would have been a bit suspicious anyway since we have no reason to believe that the color categories are equally spaced.
Question 12
Helmert contrasts the current level of the categorical variable with the mean of all preceding categories.
> contrasts(color.f)<-'contr.helmert'
> contrasts(color.f)
[,1] [,2] [,3]
2 -1 -1 -1
3 1 -1 -1
4 0 2 -1
5 0 0 3
> model.helmert<-glm(pres.abs~width+color.f,data=crabs,family=binomial)
> summary(model.helmert)
Call:
glm(formula = pres.abs ~ width + color.f, family = binomial,
data = crabs)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.1124 -0.9848 0.5243 0.8513 2.1413
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -11.75552 2.74185 -4.287 1.81e-05 ***
width 0.46796 0.10554 4.434 9.26e-06 ***
color.f1 0.03621 0.36995 0.098 0.922
color.f2 -0.08667 0.16742 -0.518 0.605
color.f3 -0.31986 0.13966 -2.290 0.022 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 225.76 on 172 degrees of freedom
Residual deviance: 187.46 on 168 degrees of freedom
AIC: 197.46
Number of Fisher Scoring iterations: 4
This time the last contrast is significant. From the coding we see that this is a contrast of the last group (darkest crabs) with the mean of the three lighter groups. The negative coefficient means that for a fixed size darker female crabs are less likely to have satellite males than other female crabs.
Question 13
Based on the Helmert contrasts we saw that the only contrast close to being significant was the "dark" category against the rest. I try dichotomizing "dark" versus everyone else.
> dark<-crabs$color==5
> dark
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[13] FALSE TRUE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE
[25] TRUE TRUE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
[37] FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[49] FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
[61] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[73] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE
[85] FALSE TRUE FALSE TRUE TRUE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
[97] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[109] FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE FALSE FALSE FALSE
[121] FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE
[133] FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[145] FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE
[157] FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE
[169] FALSE FALSE FALSE TRUE FALSE
> model.dark<-glm(pres.abs~width+factor(dark),data=crabs,family=binomial)
> summary(model.dark)
Call:
glm(formula = pres.abs ~ width + factor(dark), family = binomial,
data = crabs)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.0821 -0.9932 0.5274 0.8606 2.1553
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -11.6790 2.6925 -4.338 1.44e-05 ***
width 0.4782 0.1041 4.592 4.39e-06 ***
factor(dark)TRUE -1.3005 0.5259 -2.473 0.0134 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 225.76 on 172 degrees of freedom
Residual deviance: 187.96 on 170 degrees of freedom
AIC: 193.96
Number of Fisher Scoring iterations: 4
So now color is statistically significant and based on the sign of the coefficient we see that darker females are less likely to have satellite males. Of course this has been a fishing expedition. It would have been nice if we could have justified this last step with our knowledge of the visual perception of horseshoe crabs, that, e.g., they are less attracted to dark objects or that dark objects are more difficult for them to perceive. As a final step I compare the AIC of the all the color models that have been fit.
> sapply(list(out3,model.linear,model.helmert,model.dark),AIC)
[1] 197.4570 197.4570 197.4570 193.9579
Notice that changing how the contrasts are defined does not change the AIC.
| Jack Weiss Phone: (919) 962-5930 E-Mail: jack_weiss@unc.edu Address: Curriculum in Ecology, Box 3275, University of North Carolina, Chapel Hill, 27516 Copyright © 2006 Last Revised--April 12, 2006 URL: http://www.unc.edu/courses/2006spring/ecol/145/001/docs/solutions/assign9.htm |