Assignment 9—Solution

Setting up the data

I read in the data and create the presence-absence variable.

> crabs<- read.table('http://www.unc.edu/courses/2006spring/ecol/145/001/data/midterm/crabs.txt', header=TRUE)
> pres.abs<-crabs$num.satellites>0

Question 1

Fig. 1   Restricted cubic spline plot of weight

There are four variables that are candidates for predictors.

> names(crabs)
[1] "color" "spine" "width" "num.satellites" "weight"

In assignment 8 we learned that width was linear in the logit. I examine weight next. I use the rcspline.plot function from the Design library to regress the logit against a restricted cubic spline transformation of weight.

> library(Design)
> rcspline.plot(y=pres.abs, x=crabs$weight, nk=5, m=20)
> mtext(side=1, line=2.5, 'weight')

Fig. 1 shows that the linearity assumption looks quite good. The Wald test for linearity is far from being statistically significant, so we have no basis for rejecting the assumption of linearity based on the data. The LR statistic and association Wald test both indicate a significant relationship between the log odds of a satellite male being present and weight.

The remaining variables that are available, spine and color, are categorical variables. It's worth noting that technically these are ordinal categorical variables in that the categories are arranged monotonically. For color the categories are ordered in levels of increasing "darkness" of color. For spine the categories are ordered in increasing damage to the two spines. The actual numeric values assigned to these categories have no meaning. All we really know is the order of the categories not how much "bigger" one category is than another. We will investigate later whether the monotonicity of the categories can be used to our advantage, but for the moment will will just treat these as generic categorical variables and enter them into the model as factors. Note: It is important not to enter them as numeric variables because that would treat the totally arbitrary categorical labels as being somehow meaningful.

I use the default dummy coding scheme generated by R for the factors.

> glm(pres.abs~width+weight+spine.f+color.f, data=crabs, family=binomial)->out1
> summary(out1)

Call:
glm(formula = pres.abs ~ width + weight + spine.f + color.f,
    family = binomial, data = crabs)

Deviance Residuals:
    Min      1Q Median     3Q    Max
-2.1977 -0.9424 0.4849 0.8491 2.1198

Coefficients:
              Estimate Std. Error z value Pr(>|z|)
(Intercept) -8.0650134  3.9285518  -2.053   0.0401 *
width        0.2631279  0.1952986   1.347   0.1779
weight       0.0008258  0.0007038   1.173   0.2407
spine.f2    -0.0959809  0.7033698  -0.136   0.8915
spine.f3     0.4002868  0.5027043   0.796   0.4259
color.f3    -0.1029024  0.7825912  -0.131   0.8954
color.f4    -0.4888642  0.8531183  -0.573   0.5666
color.f5    -1.6086658  0.9355326  -1.720   0.0855 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 225.76 on 172 degrees of freedom
Residual deviance: 185.20 on 165 degrees of freedom
AIC: 201.2

Number of Fisher Scoring iterations: 4    

From the summary output it would appear nothing is statistically significant. The tests of the categorical variables are in fact tests of specific contrasts, so a more honest picture can be obtained using likelihood ratio tests comparing models in which either color or spine are omitted.

> glm(pres.abs~width+weight+color.f, data=crabs, family=binomial)->out1.nospine
> glm(pres.abs~width+weight+spine.f, data=crabs, family=binomial)->out1.nocolor
> anova(out1.nospine, out1, test='Chisq')

Analysis of Deviance Table

Model 1: pres.abs ~ width + weight + color.f
Model 2: pres.abs ~ width + weight + spine.f + color.f
  Resid. Df Resid. Dev Df Deviance P(>|Chi|)
1       167    186.211
2       165    185.202  2    1.009     0.604

> anova(out1.nocolor,out1,test='Chisq')
Analysis of Deviance Table

Model 1: pres.abs ~ width + weight + spine.f
Model 2: pres.abs ~ width + weight + spine.f + color.f
  Resid. Df Resid. Dev Df Deviance P(>|Chi|)
1       168    192.798
2       165    185.202  3    7.596     0.055

So, categorical spine and categorical color do not significantly improve the model when the other variables are already present.

Question 2

The troubling thing is that we already know from Assignment 8 is that there is a highly significant relationship between presence-absence of satellite males and the width of the female. Yet when we fit the full model, this relationship goes undetected. The explanation must lie in the presence of the other variables. The summary function reports a variable-added-last test, i.e., it tests width when the other three variables are already in the model. We can get some insight into this by applying the anova function to the model. The anova function does a variables added in order series of likelihood ratio tests.

> anova(out1,test='Chisq')
Analysis of Deviance Table
Model: binomial, link: logit
Response: pres.abs
Terms added sequentially (first to last)

        Df Deviance Resid. Df Resid. Dev P(>|Chi|)
NULL                      172    225.759
width    1   31.306       171    194.453 2.204e-08
weight   1    1.561       170    192.892 0.212
spine.f  2    0.094       168    192.798 0.954
color.f  3    7.596       165    185.202 0.055

Now we see that width is highly significant as we've seen before. That's because the test being reported is for width when nothing else is in the model. The test for weight is a test when width is already present. Suppose we reverse the order of weight and width in the model formula.

> glm(pres.abs~weight+width+spine.f+color.f, data=crabs, family=binomial)->out1.5
> anova(out1.5,test='Chisq')

Analysis of Deviance Table
Model: binomial, link: logit
Response: pres.abs
Terms added sequentially (first to last)

        Df Deviance Resid. Df Resid. Dev P(>|Chi|)
NULL                      172    225.759
weight   1   30.021       171    195.737 4.273e-08
width    1    2.845       170    192.892 0.092
spine.f  2    0.094       168    192.798 0.954
color.f  3    7.596       165    185.202 0.055

When weight is added first its seen to be a statistically significant addition to the model, but now width is not. Clearly the presence of width and weight in the same model is messing things up.

Question 3

I drop weight entirely and refit the model

> glm(pres.abs~width+spine.f+color.f, data=crabs, family=binomial)->out2
> summary(out2)

Call:
glm(formula = pres.abs ~ width + spine.f + color.f, family = binomial,
    data = crabs)

Deviance Residuals:
    Min      1Q Median     3Q    Max
-2.1206 -0.9723 0.5076 0.8750 2.1158

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept) -11.09953    2.97706  -3.728 0.000193 ***
width         0.45624    0.10779   4.233 2.31e-05 ***
spine.f2     -0.05782    0.70308  -0.082 0.934453
spine.f3      0.37703    0.50191   0.751 0.452540
color.f3     -0.14340    0.77838  -0.184 0.853830
color.f4     -0.52405    0.84685  -0.619 0.536030
color.f5     -1.66833    0.93285  -1.788 0.073706 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 225.76 on 172 degrees of freedom
Residual deviance: 186.61 on 166 degrees of freedom
AIC: 200.61

Number of Fisher Scoring iterations: 4

Now width is statistically significant. It was the presence of weight alone that was causing the trouble. I check the correlation of width and weight.

> cor(crabs$width,crabs$weight)
[1] 0.8868715

They're highly correlated. This of course makes perfect sense. Wider crabs are bigger crabs and hence are heavier crabs. Width and weight are essentially measuring the same thing and clearly both should not be in the same model.

Question 4

From what we've seen clearly width should be in the model. All that remains is to decide whether spine.f, color.f, or both should be in the model with width. We can assess this parsimoniously by fitting two additional models: a model with width and spine.f, and a model with width and color.f and compare these to a model with just width.

> glm(pres.abs~width+spine.f, data=crabs, family=binomial)->out3.spine
> glm(pres.abs~width+color.f, data=crabs, family=binomial)->out3.color
> glm(pres.abs~width, data=crabs, family=binomial)->out0.width
> anova(out3.spine,out0.width,test='Chisq')

Analysis of Deviance Table

Model 1: pres.abs ~ width + spine.f
Model 2: pres.abs ~ width
  Resid. Df Resid. Dev Df Deviance P(>|Chi|)
1       169    194.425
2       171    194.453 -2   -0.028     0.986

> anova(out3.color,out0.width,test='Chisq')
Analysis of Deviance Table

Model 1: pres.abs ~ width + color.f
Model 2: pres.abs ~ width
  Resid. Df Resid. Dev Df Deviance P(>|Chi|)
1       168    187.457
2       171    194.453 -3   -6.996     0.072

So neither test is significant at the 5% level, although the test for color is close. If we are strict "5-percenters", we'd stop with a model that contains only width.

Question 5

Following the implied "blessing" I use the stepAIC function starting with the main effects model, choosing a model with all possible interactions as the maximal model, and a model with only an intercept as the minimal model. The main effects model was called out2 above.

> library(MASS)
> stepAIC(out2, scope=list(upper=~width*spine.f*color.f, lower=~1))

Start: AIC= 200.61
pres.abs ~ width + spine.f + color.f

                  Df Deviance    AIC
- spine.f          2   187.46 197.46
<none>                 186.61 200.61
+ width:color.f    3   181.64 201.64
- color.f          3   194.43 202.43
+ spine.f:color.f  6   177.60 203.60
+ width:spine.f    2   186.41 204.41
- width            1   208.83 220.83

Step: AIC= 197.46
pres.abs ~ width + color.f

                Df Deviance    AIC
<none>               187.46 197.46
- color.f        3   194.45 198.45
+ width:color.f  3   183.08 199.08
+ spine.f        2   186.61 200.61
- width          1   212.06 220.06

Call: glm(formula = pres.abs ~ width + color.f, family = binomial, data = crabs)

Coefficients:
(Intercept)   width color.f3 color.f4 color.f5
  -11.38519 0.46796  0.07242 -0.22380 -1.32992

Degrees of Freedom: 172 Total (i.e. Null); 168 Residual
Null Deviance: 225.8
Residual Deviance: 187.5 AIC: 197.5

Following the logic of the iterations we see at the first step spine.f is dropped because it yields a model with the lowest AIC. At the next iteration this model is retained because dropping or adding any term permitted by the marginality principle causes the AIC to rise. The final model includes width and color.f.

Interpreting the parameter estimates

Since the coefficient of width is positive we see increases in width crease the probability of a satellite male being present. Formally,

> glm(pres.abs~width+color.f, data=crabs, family=binomial)->out3
> coef(out3)

 (Intercept)      width   color.f3    color.f4    color.f5
-11.38519276 0.46795598 0.07241694 -0.22379766 -1.32991913

> exp(coef(out3)[2])
   width
1.596727

The odds go up by a factor of 1.6 for every 1 cm increase in female carapace width. The color variable is a bit more complicated to interpret. The coding scheme is given below.

> contrasts(color.f)
  3 4 5
2 0 0 0
3 1 0 0
4 0 1 0
5 0 0 1

The color "medium light" is the baseline color. Our generic model for the log odds of a satellite male being present is the following.

where are the dummy variables for color. Using the contrasts above we can write equations for each of the four colored females.

If we have two females of exactly the same width but different colors we interpret the color coefficients as a difference of log odds of a satellite male being present where the comparison female is a medium light female.

A similar relationship holds for medium dark and dark females, once again with medium light females as the reference group. Thus each coefficient is the log odds ratio of satellite male being present for two crabs of the same width where one is medium, medium dark, or dark, and the other is medium light in color.

> exp(coef(out3)[3])
color.f3
1.075104

So the ratio of odds of a satellite male being present is 1.08 for medium females versus medium light females of the same size, roughly the same.

> exp(coef(out3)[4])
color.f4
0.7994769

So the ratio of odds of a satellite male being present is 0.8 for medium dark females versus medium light females of the same size, a decrease.

> exp(coef(out3)[5])
color.f5
0.2644987

So the ratio of odds of a satellite male being present is 0.26 for dark females versus medium light females of the same size, a sizable decrease.

Question 6

There are just two three nested models to consider in addition to the model of question 5.

Model 0: intercept only
Model 1: width only
Model 2: color.f only
Model 3: width + color.f

Because color is a categorical variable with four categories, the proper way to test it is to use a likelihood ratio test. The univariate Wald tests reported in the summary table do not test the construct color, but instead the contrast specified in the individual dummy variables.

> model0<-glm(pres.abs~1,data=crabs,family=binomial)
> model1<-glm(pres.abs~width,data=crabs,family=binomial)
> model2<-glm(pres.abs~color.f,data=crabs,family=binomial)
> model3<-glm(pres.abs~width+color.f,data=crabs,family=binomial)
> anova(model2,model3,test='Chisq')

Analysis of Deviance Table

Model 1: pres.abs ~ color.f
Model 2: pres.abs ~ width + color.f
  Resid. Df Resid. Dev Df Deviance P(>|Chi|)
1       169    212.061
2       168    187.457  1   24.604 7.041e-07

> anova(model1,model3,test='Chisq')
Analysis of Deviance Table

Model 1: pres.abs ~ width
Model 2: pres.abs ~ width + color.f
  Resid. Df Resid. Dev Df Deviance P(>|Chi|)
1       171    194.453
2       168    187.457  3    6.996     0.072

> anova(model0,model3,test='Chisq')
Analysis of Deviance Table

Model 1: pres.abs ~ 1
Model 2: pres.abs ~ width + color.f
  Resid. Df Resid. Dev Df Deviance P(>|Chi|)
1       172    225.759
2       168    187.457  4   38.301 9.71e-08

Thus using significance testing with α = .05, we would retain width but drop color.f from the model.

Question 7

Fig. 2   Crab probability models

The equations we need for the four different "colored" females are given above as log odds. To convert these to a probability scale I use the relationship

Thus we define the following functions.

> med.light<-function(x) 1/(1+exp(-(coef(out3)[1] + coef(out3)[2]*x)))
> others<-function(x,k) 1/(1+exp(-(coef(out3)[1] + coef(out3)[2]*x + coef(out3)[k])))

where k = 3, 4, or 5.

> plot(crabs$width, pres.abs, col=crabs$color-1, cex=.8, xlab='width', ylab='probability')
> lines(seq(21,34,.1), med.light(seq(21,34,.1)), col=1)
> lines(seq(21,34,.1), others(seq(21,34,.1),3), col=2)
> lines(seq(21,34,.1), others(seq(21,34,.1),4), col=3)
> lines(seq(21,34,.1), others(seq(21,34,.1),5), col=4)
> legend(30,.5, c('medium light', 'medium', 'medium dark', 'dark'), col=1:4, cex=rep(.8,4), lty=rep(1,4), bty='n')
> mtext(side=3, line=.5, text='Probability of satellite male being present\nas a function of female width and color')

Question 8

The only sensible test to carry out is some variant of the Hosmer-Lemeshow test. With multiple predictors partitioning the data according to levels of the predictors has too many possibilities. Following what we've done before, I divide the estimated logits into deciles and use these to partition the observed and expected frequencies of zeros and ones.

> y.cuts<-cut(predict(out3), quantile(predict(out3), seq(0,1,.1)), include.lowest=TRUE)
> y.1<-tapply(fitted(out3),y.cuts,sum)
> y.n<-tapply(fitted(out3),y.cuts,length)
> y.0<-y.n-y.1
> Ei<-rbind(y.0,y.1)
> Ei

    [-2.89,-0.782] (-0.782,-0.222] (-0.222,0.154] (0.154,0.433]
y.0      14.624349        9.849763       9.122673      6.984058
y.1       4.375651        6.150237       8.877327     10.015942
    (0.433,0.714] (0.714,1]  (1,1.36] (1.36,1.74] (1.74,2.12] (2.12,4.36]
y.0      6.022493  5.048826  4.100546    2.990691    2.139013    1.117588
y.1     10.977507 11.951174 12.899454   15.009309   14.860987   15.882412

Five of the 20 categories are too small. Rather than pool the existing categories I'm going to redefine the endpoints slightly. If I extend the right hand endpoint of the category (1, 1.36] I should be able to bring this one up to 5 and then I'll pool the rest.

> new.cat<-c(quantile(predict(out3), seq(0,1,.1))[1:7],1.56,4.4)
> y.cuts<-cut(predict(out3), new.cat,include.lowest=TRUE)
> y.1<-tapply(fitted(out3),y.cuts,sum)
> y.n<-tapply(fitted(out3),y.cuts,length)
> y.0<-y.n-y.1
> Ei<-rbind(y.0,y.1)
> Ei

    [-2.89,-0.782] (-0.782,-0.222] (-0.222,0.154] (0.154,0.433]
y.0      14.624349        9.849763       9.122673      6.984058
y.1       4.375651        6.150237       8.877327     10.015942
    (0.433,0.714] (0.714,1]  (1,1.56] (1.56,4.4]
y.0      6.022493  5.048826  5.372703   4.975136
y.1     10.977507 11.951174 18.627297  40.024864

That's not perfect but now there are only two out of 16 categories that are too small, which is under the 20% value. I finish the analysis.

> Oi<-table(pres.abs,y.cuts)
> sum((Oi-Ei)^2/Ei)

[1] 2.245275
> 1-pchisq(sum((Oi-Ei)^2/Ei),dim(Oi)[2]-2)
[1] 0.8958155

So we fail to find evidence of lack of fit (p-value is large). As a second goodness of fit test I try the variant of the Hosmer-Lemeshow test in the Design library.

> library(Design)
> out.h<-lrm( pres.abs ~ width+color.f,data=crabs,x=TRUE,y=TRUE)
> residuals.lrm(out.h,type='gof')

Sum of squared errors Expected value|H0        SD
           31.6457412        31.5393772 0.3184185
        Z         P
0.3340382 0.7383507

The reported p-value of 0.738 agrees with our result above. We have no evidence of lack of fit.

Question 9

Part 1

I obtain sensitivity and specificity from the ROCR library and look for the spot where the graphs cross.

> library(ROCR)
> pred<-prediction(fitted(out3), pres.abs)
> performance(pred, 'tpr', 'tnr')->testy
> tempmat<-cbind(testy@x.values[[1]], testy@y.values[[1]], testy@alpha.values[[1]])
> tempmat[abs(tempmat[,1]-tempmat[,2])<.02,]->submat
> submat

           [,1]     [,2]       [,3]
[1,] 0.6935484 0.6756757 0.6467219
[2,] 0.6774194 0.6936937 0.6395257

Fig. 3    Maximizing sensitivity and specificity

If I average the values of c (3rd column) on either side of the changeover that will be a close approximation.

> mean(submat[,3])
[1] 0.6431238

To see if this looks reasonable, I plot sensitivity and specificity and indicate this cut-off with a vertical line.

> plot(testy@alpha.values[[1]], testy@x.values[[1]], type='n', xlab='c',
ylab='sensitivity or specificity')
> lines(testy@alpha.values[[1]], testy@y.values[[1]])
> lines(testy@alpha.values[[1]], testy@x.values[[1]],col=2)
> legend(.8,.8, c('sensitivity', 'specificity'), lty=c(1,1), col=1:2, cex=c(.9,.9), bty='n')
> abline(v=mean(submat[,3]), col=4, lty=2)
> text(.65, .2,paste('c = ', round(mean(submat[,3]), 3)), pos=2, col=4, cex=.9)

Part 2

Area under the curve is part of the output from the lrm function from the Design package. I examine it below.

> out.h

Logistic Regression Model

lrm(formula = pres.abs ~ width + color.f, data = crabs, x = TRUE,
   y = TRUE)

Frequencies of Responses
FALSE TRUE
   62  111

Obs Max Deriv Model L.R. d.f. P     C   Dxy
173     4e-05       38.3    4 0 0.771 0.543
Gamma Tau-a    R2 Brier
0.546 0.251 0.272 0.183

               Coef   S.E. Wald Z      P
Intercept -11.38519 2.8735  -3.96 0.0001
width       0.46796 0.1055   4.43 0.0000
color.f=3   0.07242 0.7399   0.10 0.9220
color.f=4  -0.22380 0.7771  -0.29 0.7733
color.f=5  -1.32992 0.8525  -1.56 0.1188

The coefficient labeled C, the concordance index, is also the AUC. The reported value is 0.771. Thus if we were to pair all observed ones with all observed zeros in all possible ways and then for each pair guess which of the two is a "one" based on which one has the higher predicted probability according to the model, then we would be right 77.1% of the time.

Part 3

I use the cv.binary function in the DAAG package. Following the hint I add the generated variables to the the data file and refit the model.

> crabs$pres.abs<-pres.abs
> crabs$color.f<-color.f
> out3a<-glm(pres.abs~width+color.f,data=crabs,family=binomial)
> cv.binary(out3a)

Fold: 8 4 5 3 6 10 7 2 9 1
Internal estimate of accuracy = 0.734
Cross-validation estimate of accuracy = 0.705

Fig. 4     ROC curve

What's reported is not the AUC but a number that is close in spirit. It reports the fraction of times the predicted probability when rounded to zero or one, matches the observed value. So we see that this 0.734 for full data set, but only 0.705 under cross-validation. Note: we can easily check this interpretation as follows.

> sum(round(fitted(out3))==pres.abs)/length(pres.abs)
[1] 0.734104

Part 4

I use the ROC function from the Epi package to generate the ROC curve.

> library(Epi)
> attach(crabs)
> ROC(form=pres.abs~width+color.f, plot="ROC")

Part 5

To superimpose the two ROC curves I extract the true positive rate 'TPR' and false positive rate 'FPR' from the two models in question using functions from the ROCR package.

> pred<-prediction(fitted(out3), pres.abs)
> performance(pred, 'tpr', 'fpr')->testy
> pred2<-prediction(fitted(out0.width), pres.abs)
> performance(pred2, 'tpr', 'fpr')->testy2
> plot(testy)
> par(new=TRUE)
> plot(testy2, col=2, axes=FALSE, xlab='', ylab='')
> legend(0,1,c('color and width','just width'), col=1:2, lty=c(1,1), bty='n', cex=c(.9,.9))

Alternatively I can draw the curves myself using the type='s' option.

> par(new=FALSE)
> plot(testy@x.values[[1]], testy@y.values[[1]], type='s', xlab='False positive rate',
ylab='True positive rate')
> par(new=TRUE)
> plot(testy2@x.values[[1]], testy2@y.values[[1]], type='s', xlab='False positive rate',
ylab='True positive rate', col=2)
> legend(0,1,c('color and width','just width'), col=1:2, lty=c(1,1), bty='n', cex=c(.9,.9))
> par(new=FALSE)

Notice this last plot is more honest because it is a true step function.

         

Fig. 5  ROC curves for two models—two versions of the same plot

Question 10

The evidence for the ordinal nature of the relationship is somewhat weak. It appears, except for the transition from medium light to medium, that as the crab gets darker it is less likely to have satellite males present. This follows from the fact β2 and β3 are negative with β3 < β2. On the other hand β1 > 0 but only slightly. The the intercepts in the logistic regression model, which are have estimates that rank in the following order.

Question 11

The evidence for the ordinal nature of the relationship is somewhat weak. It appears, except for the transition from medium light to medium, that as the crab gets darker it is less likely to have satellite males present. This follows from the fact β2 and β3 are negative with β3 < β2. On the other hand β1 > 0

> color.o<-ordered(crabs$color)
> contrasts(color.o)

          .L   .Q         .C
2 -0.6708204  0.5 -0.2236068
3 -0.2236068 -0.5  0.6708204
4  0.2236068 -0.5 -0.6708204
5  0.6708204  0.5  0.2236068

> model.linear<-glm(pres.abs~width+color.o,data=crabs,family=binomial)
> summary(model.linear)

Call:
glm(formula = pres.abs ~ width + color.o, family = binomial,
data = crabs)

Deviance Residuals:
    Min      1Q Median     3Q    Max
-2.1124 -0.9848 0.5243 0.8513 2.1413

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept) -11.75552    2.74185  -4.287 1.81e-05 ***
width         0.46796    0.10554   4.434 9.26e-06 ***
color.o.L    -0.95837    0.58032  -1.651 0.0986 .
color.o.Q    -0.58927    0.47472  -1.241 0.2145
color.o.C    -0.09867    0.33738  -0.292 0.7699
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 225.76 on 172 degrees of freedom
Residual deviance: 187.46 on 168 degrees of freedom
AIC: 197.46

Number of Fisher Scoring iterations: 4

Since none of the contrasts are significant (although the linear trend is closest to being significant), we don't find any evidence for a linear, quadratic or cubic trend. A positive result here would have been a bit suspicious anyway since we have no reason to believe that the color categories are equally spaced.

Question 12

Helmert contrasts the current level of the categorical variable with the mean of all preceding categories.

> contrasts(color.f)<-'contr.helmert'
> contrasts(color.f)

  [,1] [,2] [,3]
2   -1   -1   -1
3    1   -1   -1
4    0    2   -1
5    0    0    3

> model.helmert<-glm(pres.abs~width+color.f,data=crabs,family=binomial)
> summary(model.helmert)

Call:
glm(formula = pres.abs ~ width + color.f, family = binomial,
data = crabs)

Deviance Residuals:
    Min      1Q Median     3Q    Max
-2.1124 -0.9848 0.5243 0.8513 2.1413

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept) -11.75552    2.74185 -4.287 1.81e-05 ***
width         0.46796    0.10554  4.434 9.26e-06 ***
color.f1      0.03621    0.36995  0.098 0.922
color.f2     -0.08667    0.16742 -0.518 0.605
color.f3     -0.31986    0.13966 -2.290 0.022 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 225.76 on 172 degrees of freedom
Residual deviance: 187.46 on 168 degrees of freedom
AIC: 197.46

Number of Fisher Scoring iterations: 4

This time the last contrast is significant. From the coding we see that this is a contrast of the last group (darkest crabs) with the mean of the three lighter groups. The negative coefficient means that for a fixed size darker female crabs are less likely to have satellite males than other female crabs.

Question 13

Based on the Helmert contrasts we saw that the only contrast close to being significant was the "dark" category against the rest. I try dichotomizing "dark" versus everyone else.

> dark<-crabs$color==5
> dark

  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [13] FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE
 [25]  TRUE  TRUE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE
 [37] FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [49] FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE
 [61] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [73] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE
 [85] FALSE TRUE  FALSE  TRUE  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE
 [97] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[109] FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE
[121] FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE
[133] FALSE TRUE  FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[145] FALSE TRUE  FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE
[157] FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE
[169] FALSE FALSE FALSE  TRUE FALSE

> model.dark<-glm(pres.abs~width+factor(dark),data=crabs,family=binomial)
> summary(model.dark)

Call:
glm(formula = pres.abs ~ width + factor(dark), family = binomial,
data = crabs)

Deviance Residuals:
    Min      1Q Median     3Q    Max
-2.0821 -0.9932 0.5274 0.8606 2.1553

Coefficients:
                 Estimate Std. Error z value Pr(>|z|)
(Intercept)      -11.6790     2.6925  -4.338 1.44e-05 ***
width              0.4782     0.1041   4.592 4.39e-06 ***
factor(dark)TRUE  -1.3005     0.5259  -2.473 0.0134 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 225.76 on 172 degrees of freedom
Residual deviance: 187.96 on 170 degrees of freedom
AIC: 193.96

Number of Fisher Scoring iterations: 4

So now color is statistically significant and based on the sign of the coefficient we see that darker females are less likely to have satellite males. Of course this has been a fishing expedition. It would have been nice if we could have justified this last step with our knowledge of the visual perception of horseshoe crabs, that, e.g., they are less attracted to dark objects or that dark objects are more difficult for them to perceive. As a final step I compare the AIC of the all the color models that have been fit.

> sapply(list(out3,model.linear,model.helmert,model.dark),AIC)
[1] 197.4570 197.4570 197.4570 193.9579

Notice that changing how the contrasts are defined does not change the AIC.

Course Home Page


Jack Weiss
Phone: (919) 962-5930
E-Mail: jack_weiss@unc.edu
Address: Curriculum in Ecology, Box 3275, University of North Carolina, Chapel Hill, 27516
Copyright © 2006
Last Revised--April 12, 2006
URL: http://www.unc.edu/courses/2006spring/ecol/145/001/docs/solutions/assign9.htm