Lecture 39 (lab 10)—March 28, 2006
What was covered?
- R topic covered
- Accessing the "slots" of the new S language objects
- Building the systematic component of a habitat suitability model
(multiple logistic regression)
- Checking the structural form of a logistic regression model
- Stepwise methods (Ack!)
- Principle of marginality
- Model calibration
- Sensitivity and specificity
- ROC curves
- Area under the curve (AUC)
- Cross-validation of logistic regression models
R functions and commands demonstrated
- cv.binary (from DAAG) gives cross-validation measures of predictive accuracy for regression with a binary response
- cv.glm (from boot) a cross-validation function for generalized linear models
- performance (from ROCR) calculates model calibration statistics from a prediction object
- prediction (from ROCR) creates a prediction object from which calibration statistics can be calculated
- rcspline.plot (from Design) plots a restricted cubic spline function for a specified logistic regression model
- ROC (from Epi) produces a ROC curve plus diagnostics for a logistic regression model
- stepAIC (from MASS) does stepwise regression using AIC for model assessment
R packages used
- boot contains the cv.glm function for cross-validation
- DAAG contains the frogs data set and the cv.binary function
- Design contains the lrm and rcspline.plot functions
- Epi contains the ROC function
- MASS contains the stepAIC function
- ROCR contains the prediction and performance functions
Multiple logistic regression
- Reload the frogs data from last time.
> #obtain data
> library(DAAG)
> data(frogs)
> names(frogs)
[1] "pres.abs" "northing" "easting" "altitude" "distance"
[6] "NoOfPools" "NoOfSites" "avrain" "meanmin" "meanmax" )
- Last time we examined the relationship between the presence-absence, pres.abs, of frogs and the variable meanmin, the average minimum spring temperature. This time we fit a multiple logistic regression model using all the available variables in the data set. I begin by fitting a model in which all variables are included (excluding northing and easting) in their raw form.
> fullmodel<-glm(pres.abs~altitude + distance + NoOfPools + NoOfSites + avrain + meanmin + meanmax, data=frogs, family=binomial)
> summary(fullmodel)
Call:
glm(formula = pres.abs ~ altitude + distance + NoOfPools + NoOfSites +
avrain + meanmin + meanmax, family = binomial, data = frogs)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.7215 -0.7590 -0.2237 0.8320 2.6789
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.105e+02 1.388e+02 0.796 0.42587
altitude -3.086e-02 4.076e-02 -0.757 0.44901
distance -4.800e-04 2.055e-04 -2.336 0.01949 *
NoOfPools 2.986e-02 9.276e-03 3.219 0.00129 **
NoOfSites 4.364e-02 1.061e-01 0.411 0.68077
avrain -1.140e-02 5.995e-02 -0.190 0.84920
meanmin 4.899e+00 1.564e+00 3.133 0.00173 **
meanmax -5.660e+00 5.049e+00 -1.121 0.26224
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 279.99 on 211 degrees of freedom
Residual deviance: 198.74 on 204 degrees of freedom
AIC: 214.74
Number of Fisher Scoring iterations: 6
- Having fit the full model it is not always clear what to do next. When using significance testing to select models one standard recommendation is to carry out model reduction in such a way that the number of tests you perform is minimized. In this spirit I try dropping en masse all variables that don't meet 0.05 significance level. This removes altitude, NoOfSites, avrain, and meanmax. To assess whether doing this was a "good move", I carry out a likelihood ratio test (actually an analysis of deviance test) comparing the full model to the reduced model using the anova function with the test='Chisq' option.
# fit reduced model
> model1<-glm(pres.abs~distance+NoOfPools+meanmin, data=frogs, family=binomial)
# compare full and reduced models
> anova(fullmodel,model1,test='Chisq')
Analysis of Deviance Table
Model 1: pres.abs ~ altitude + distance + NoOfPools + NoOfSites + avrain +
meanmin + meanmax
Model 2: pres.abs ~ distance + NoOfPools + meanmin
Resid. Df Resid. Dev Df Deviance P(>|Chi|)
1 204 198.738
2 208 216.104 -4 -17.365 0.002
- Whoops. The test says we went too far. The full model is significantly "better" than the reduced model. I try a different reduced model, this time dropping only the three least significant variables (rather than all four). The result is that meanmax is added back to the model.
> model2<-glm(pres.abs~distance+NoOfPools+meanmin+meanmax, data=frogs, family=binomial)
> anova(fullmodel, model2, test='Chisq')
Analysis of Deviance Table
Model 1: pres.abs ~ altitude + distance + NoOfPools + NoOfSites + avrain +
meanmin + meanmax
Model 2: pres.abs ~ distance + NoOfPools + meanmin + meanmax
Resid. Df Resid. Dev Df Deviance P(>|Chi|)
1 204 198.738
2 207 199.627 -3 -0.889 0.828
- Now we've clearly got it right. The LR test says that the three variables included in the full model, but not in model2 are unnecessary (they don't significantly improve the fit in the sense of increasing the likelihood, or equivalently, decreasing the deviance). So what happened? Let's look at the coefficient summary table.
> summary(model2)
Call:
glm(formula = pres.abs ~ distance + NoOfPools + meanmin + meanmax,
family = binomial, data = frogs)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.7457 -0.7637 -0.2000 0.8570 2.9552
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 14.0074032 4.9264854 2.843 0.004465 **
distance -0.0005138 0.0001875 -2.741 0.006132 **
NoOfPools 0.0285643 0.0089138 3.204 0.001353 **
meanmin 5.6230647 1.2140742 4.632 3.63e-06 ***
meanmax -2.3717579 0.6246508 -3.797 0.000146 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 279.99 on 211 degrees of freedom
Residual deviance: 199.63 on 207 degrees of freedom
AIC: 209.63
Number of Fisher Scoring iterations: 6
- Observe that meanmax which was not significant at the .05 level in the full model is now highly significant. How can this be? When this kind of thing happens it usually means that there are redundant variables in the data set—variables that are appear to be conceptually different but which in fact are essentially measuring the same thing. Statistically this often, but not always, manifests itself in the two variables being highly correlated. Highly correlated variables are essentially interchangeable in a model and including them both at the same time dilutes their individual effects.
- So with which variable is meanmax highly correlated? The variables that are no longer in model2 are altitude, NoOfSites, and avrain. If we examine pair-wise correlations of meanmax with each of these variables we see the following.
> cor(frogs$meanmax, cbind(frogs$altitude, frogs$avrain, frogs$NoOfSites))
[,1] [,2] [,3]
[1,] -0.996557 -0.8186997 0.1576344
Observe that the correlation of meanmax with either altitude or avrain is quite high (and negative). This suggests that we could replace meanmax with either of these two variables and not change things very much. I try this with each variable in turn.
> model2.1<-glm(pres.abs~distance + NoOfPools + meanmin + avrain, data=frogs, family=binomial)
> model2.2<-glm(pres.abs~distance + NoOfPools + meanmin + altitude, data=frogs, family=binomial)
> summary(model2.1)
Call:
glm(formula = pres.abs ~ distance + NoOfPools + meanmin + avrain,
family = binomial, data = frogs)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.7866 -0.7884 -0.2708 0.8342 2.8787
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.036e+01 4.844e+00 -4.204 2.62e-05 ***
distance -5.346e-04 1.779e-04 -3.005 0.002654 **
NoOfPools 2.763e-02 8.704e-03 3.174 0.001501 **
meanmin 2.515e+00 4.926e-01 5.106 3.30e-07 ***
avrain 7.966e-02 2.320e-02 3.435 0.000594 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 279.99 on 211 degrees of freedom
Residual deviance: 202.78 on 207 degrees of freedom
AIC: 212.78
Number of Fisher Scoring iterations: 6
> summary(model2.2)
Call:
glm(formula = pres.abs ~ distance + NoOfPools + meanmin + altitude,
family = binomial, data = frogs)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.7468 -0.7798 -0.2253 0.8882 3.1010
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -5.427e+01 1.428e+01 -3.801 0.000144 ***
distance -5.187e-04 1.878e-04 -2.761 0.005756 **
NoOfPools 2.707e-02 8.717e-03 3.105 0.001900 **
meanmin 6.039e+00 1.412e+00 4.277 1.90e-05 ***
altitude 2.235e-02 6.336e-03 3.527 0.000420 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 279.99 on 211 degrees of freedom
Residual deviance: 201.76 on 207 degrees of freedom
AIC: 211.76
Number of Fisher Scoring iterations: 6
- Observe that in each model all individual Wald tests are significant. In particular we conclude that each of altitude, avrain, and meanmax are significant additions to a model already containing distance, NoOfPools and meanmin. So which of these variables should we choose since each variable when added by itself (but not with the others) is statistically significant? We could compare them using AIC.
> sapply(list(model2,model2.1,model2.2),AIC)
[1] 209.6275 212.7838 211.7601
Based on AIC model2, the model in which meanmax is added, rates best.
- It's worth noting that the converse of our conclusion above relating correlation to statistical significance is not true, i.e., just because a variable is a significant contributor to a model does not imply that it is uncorrelated with the rest of the variables in the model. For example, model2 contains both meanmin and meanmax as significant contributors, yet it's rather obvious that mean springtime minimum temperatures and mean springtime maximum temperatures must be related. In fact they are.
> cor(frogs$meanmin, frogs$meanmax)
[1] 0.9462741
We'll examine later whether keeping such highly correlated variables in the same model is an asset or a liability.
Assessing the structural form of the model
The proper structural form for meanmin
> library(Design)
> rcspline.plot(y=frogs$pres.abs, x=frogs$meanmin, nk=5, m=20)
> mtext(side=1, line=2.5, 'meanmin')
- The first two arguments of rcspline.plot are self-explanatory: the response variable and the predictor for the logistic regression model.
- The third argument, nk, is the number of knots. Knots are anchor points for the cubic spline. Two will automatically be located at the minimum and maximum x-values. The remaining locations are estimated. Values of 3, 4, or 5 are typical choices for nk (the default is 5). If a model fails to converge with a specified value you will need to reduce it.
- The last argument, m, is not required. It specifies the number of observations to use in calculating grouped empirical logits. This is useful for assessing model fit.
The structural form for meanmax
- I next examine the structural form of the meanmax predictor.
> rcspline.plot(y=frogs$pres.abs, x=frogs$meanmax, nk=5, m=20)
> mtext(side=1, line=2.5, 'meanmax')
- The output suggests a quadratic model is appropriate for meanmax also. Notice that three of the empirical logits plot outside of the confidence band and that the confidence band is very wide (compared to meanmin). The fit of a model containing only a spline function of meanmax is not particularly good.
- I try adding quadratic terms for meanmin and meanmax to model2 that we obtained above.
> model2a<-glm(pres.abs~distance + NoOfPools + meanmin + meanmax + I(meanmin^2) + I(meanmax^2), data=frogs, family=binomial)
> summary(model2a)
Call:
glm(formula = pres.abs ~ distance + NoOfPools + meanmin
+ meanmax + I(meanmin^2) + I(meanmax^2),
family = binomial, data = frogs)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.9675 -0.7414 -0.2758 0.7764 2.8252
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.112e+02 6.673e+01 -1.666 0.095623 .
distance -5.295e-04 1.914e-04 -2.767 0.005659 **
NoOfPools 3.173e-02 9.576e-03 3.314 0.000921 ***
meanmin -5.460e+00 9.722e+00 -0.562 0.574382
meanmax 1.793e+01 1.158e+01 1.549 0.121424
I(meanmin^2) 1.640e+00 1.578e+00 1.040 0.298567
I(meanmax^2) -7.144e-01 4.196e-01 -1.702 0.088673 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 279.99 on 211 degrees of freedom
Residual deviance: 195.01 on 205 degrees of freedom
AIC: 209.01
Number of Fisher Scoring iterations: 6
- The individual Wald tests, which are variable-added-last tests, argue that adding quadratic terms is not necessary. I carry out a likelihood ratio test to test simultaneously whether the coefficients of one or more of the quadratic term is different from zero.
> anova(model2, model2a, test='Chisq')
Analysis of Deviance Table
Model 1: pres.abs ~ distance + NoOfPools + meanmin + meanmax
Model 2: pres.abs ~ distance + NoOfPools + meanmin + meanmax + I(meanmin^2) + I(meanmax^2)
Resid. Df Resid. Dev Df Deviance P(>|Chi|)
1 207 199.627
2 205 195.012 2 4.616 0.099
- This test is not significant either. Finally I examine the AIC values.
> sapply(list(model2,model2a),AIC)
[1] 209.6275 209.0118
So the AIC values are barely distinguishable from each other. So, contrary to what we saw in the spline plots the statistical evidence for including quadratic terms is fairly weak. Why the discrepancy? One problem is that in fitting the spline models we are treating the variables in isolation. We've already seen that meanmax and meanmin are highly correlated.
- There is one additional possibility to consider. It's possible that the presence of the other variable in the model makes the quadratic term unnecessary. I try fitting models quadratic in meanmin but with meanmax absent, and vice versa.
> model2b<-glm(pres.abs ~ distance + NoOfPools + meanmin +I(meanmin^2), family = binomial, data = frogs)
> model2c<-glm(pres.abs ~ distance + NoOfPools + meanmax +I(meanmax^2), family = binomial, data = frogs)
> sapply(list(model2,model2b,model2c),AIC)
[1] 209.6275 213.8477 222.8593
The model linear in both meanmin and meanmax still ranks best. So contrary to our conclusions in lab 9 and the spline plots, we don't need quadratic terms.
The structural form for distance
- The spline plot suggests that an appropriate structural form for distance is logarithmic (more accurately the negative of a logarithm that is then translated upward). Observe that setting the number of knots to 5 led in estimation problems so it was necessary to reduce nk to 4.) The linearity lack of fit test is highly significant. When I log-transform distance and retry the spline plot, the displayed spline plot is much-improved and linearity lack of fit test is no longer significant (Fig. 3). Observe that most of the empirical logits are outside the confidence bands though.
> rcspline.plot(y=frogs$pres.abs,x=frogs$distance,nk=5,m=20)
singular information matrix in lrm.fit (rank= 4 ). Offending variable(s):
Error in array(x, c(length(x), 1), if (!is.null(names(x))) list(names(x), :
attempt to set an attribute on NULL
> rcspline.plot(y=frogs$pres.abs,x=frogs$distance,nk=4,m=20)
> mtext(side=1,line=2.5,'distance')
#log transform
> rcspline.plot(y=frogs$pres.abs,x=log(frogs$distance),nk=4,m=20)
> mtext(side=1,line=2.5,'log(distance)')

Fig. 3 Spline plots for untransformed and log-transformed distance
- I next use AIC to compare a full model with untransformed distance and one with log-transformed distance.
> model3<-glm(pres.abs~log(distance) + NoOfPools + meanmin + meanmax, data=frogs, family=binomial)
> sapply(list(model2,model3),AIC)
[1] 209.6275 205.1332
- Log-transforming distance yields a sizeable reduction in AIC.
The structural form for NoOfPools
- The spline plot shown below suggests that the structural form for NoOfPools is rather complicated, although the test of linearity lack of fit is not significant. In the Master's thesis from which these data are derived, the author used a log transformation, so I try that also.
> rcspline.plot(y=frogs$pres.abs, x=frogs$NoOfPools, nk=5, m=20)
> mtext(side=1, line=2.5, 'NoOfPools')
#check log transform since used in thesis
> rcspline.plot(y=frogs$pres.abs, x=log(frogs$NoOfPools), nk=5, m=20)
> mtext(side=1, line=2.5, 'log(NoOfPools)')

Fig. 4 Spline plot for untransformed and log-transformed NoOfPools
- There is not much evidence that the log-transformation helped. If I fit a full model with log-transformed NoOfPools I can compare the result using AIC to a model in which NoOfPools is untransformed.
> model3b<-glm(pres.abs~log(distance) + log(NoOfPools) + meanmin + meanmax, data=frogs, family=binomial)
> sapply(list(model3,model3b), AIC)
[1] 205.1332 207.6561
- So the AIC indicates the model with log-transformed NoOfPools is worse. Notice that this contradicts the conclusions derived from the printed AIC values displayed in Fig 4 where the log-transformed variable yields the lower AIC. Once again what's relevant is how the variables behave in the full model, not by themselves.
Including interaction terms—stepwise methods
- If you read the description of the model-building process in virtually any journal article presenting a new habitat suitability model, you'll discover that the authors probably used some automatic variable selection routine for screening candidate models. This is the case even though such a protocol would never be recommended by a statistician. For logistic regression the standard automatic variable selection protocol is typically stepwise regression. Here's a representative sample of commentary by statisticians on building models in this way.
- "Stepwise variable selection has been a very popular technique for many years, but if this procedure had just been proposed as a statistical method, it would most likely be rejected because it violates every principle of statistical estimation and hypothesis testing." (Harrell, 2001, p. 56)
- "These three procedures [forward selection, backward elimination, and stepwise regression] have been shown to fit complicated models to completely random data, and although widely used they have no theoretical basis." (Davison, 2003, p. 400).
- "An unfocused search through many possible models (sometimes referred to as a 'fishing expedition') increases the likelihood of capitalizing on chance and finding a model which represents only a spurious relationship." (Judd and McClelland, 1989, p. 204)
- "The data analyst knows more than the computer." (Henderson and Velleman, 1981)
- So I present the following description of automatic variable selection with a degree of trepidation. There is an automatic variable selection program in the MASS package called stepAIC. While it shares many of the flaws of other automatic variable selection routines, it is better than these in two important ways.
- It uses AIC rather than significance testing to select models. However, it's worth noting that Burnham and Anderson (2003) would strongly object to this use of AIC.
- It obeys the principle of marginality, something virtually no other stepwise program does.
- The principle of marginality can be summarized as follows.
- In models involving integer powers, higher power terms are always marginal to lower power terms. The principle of marginality applied here holds that if a higher power term is in the model, all lower power terms must also be in the model. Thus if x3 is a term in the model, then x2, x and an intercept must also be in the model.
- The same holds for interactions. If the interaction term x1x2 is in the model, then both x1 and x2 must also be in the model.
- Factor variables are added and dropped en masse. It is never the case that significant dummy variables are retained and nonsignificant dummy variables are dropped if the dummy variables derive from the same categorical construct.
- The stepAIC function adheres to all three principles of marginality. Having said that in every other respect it is a totally mindless procedure. I illustrate its use next.
- stepAIC requires three inputs.
- A model to start from. It is to this model that terms will be added and dropped.
- An upper bound in model complexity. This will typically list the highest order interaction that one wishes to consider as important.
- A lower bound on model complexity—how simple you are willing to let the model become.
- In what follows I use model3, our four variable main effects model as the starting point. I let the four-factor interaction of all four variables be the most complicated terms to consider. I consider a model with only an intercept to be the simplest model. I specify this information using the scope argument as shown below. I suppress much of the voluminous output that is produced and list only the first step of model building and the last step where the final model is displayed.
> library(MASS)
>
stepAIC(model3, scope=list(upper=~log(distance)*NoOfPools*meanmin*meanmax, lower=~1))
Start: AIC= 205.13
pres.abs ~ log(distance) + NoOfPools + meanmin + meanmax
Df Deviance AIC
+ log(distance):meanmin 1 189.05 201.05
+ log(distance):meanmax 1 191.55 203.55
+ log(distance):NoOfPools 1 192.03 204.03
+ meanmin:meanmax 1 192.79 204.79
<none> 195.13 205.13
+ NoOfPools:meanmax 1 194.94 206.94
+ NoOfPools:meanmin 1 194.97 206.97
- NoOfPools 1 205.34 213.34
- log(distance) 1 209.73 217.73
- meanmax 1 211.42 219.42
- meanmin 1 219.97 227.97
<output deleted>
Call: glm(formula = pres.abs ~ log(distance) + NoOfPools + meanmin + meanmax + log(distance):meanmin + meanmin:meanmax, family = binomial, data = frogs)
Coefficients:
(Intercept) log(distance) NoOfPools meanmin
-50.71310 3.20271 0.03128 25.38805
meanmax log(distance):meanmin meanmin:meanmax
0.82240 -1.22059 -0.86433
Degrees of Freedom: 211 Total (i.e. Null); 205 Residual
Null Deviance: 280
Residual Deviance: 183 AIC: 197
- At the beginning of the output we see the reported history of the model-fitting process. First stepAIC tried adding the interaction of log(distance) and meanmin to the baseline model. Reported on that line is the AIC for the model in which this term has been added. The next line shows the effect of adding the interaction of log(distance) and meanmax to the baseline model. Each new model obtained by adding a variable not currently in the model is considered in turn. At the end of the first step stepAIC also experiments with dropping variables that are currently in the model.
- Observe that at no time did stepAIC try to add a 3-factor or 4-factor interaction to the model. That's because there are no two-factor interactions yet in the model and by the principle of marginality the component two-factor interactions are required to be present in the model before a 3-factor interaction is considered.
- The first stage ends by selecting the model with the lowest AIC, in this case the baseline model plus the interaction term log(distance):meanmin. If this model is different from the baseline model, the new model replaces the baseline model. The procedure then continues by attempting to add and delete variables to this new model. If at any point the model with the lowest AIC is the current model, then the algorithm stops and the current model statistics are displayed.
- Observe that at the final stage all of the original main effects are still in the model but two additional two-factor interactions have been added. The final AIC of 197 is 8 units lower than the AIC of model3.
- My feeling is that if you use such a protocol to select a model you should be able to concoct biologically reasonable interpretations for any of the new effects that have been added.
- Here you would need to interpret the two new interactions, log(distance) with meanmin as well as meanmin with meanmax. For example, the model says that meanmin when log(distance) equals 0 has a positive effect on the presence of Corroboree frogs in the habitat (coefficient of 25.38805) but with each additional increase in log(distance) this effect is diminished by –1.2209. Since the largest value of log(distance) in the data set is about 10, this means the effect of meanmin when the nearest extant population is at maximum distance is diminished by roughly half from what it would be at log(distance) = 0. This latter value, by the way, does not occur in the data set.
- Perhaps the meanmax interaction with meanmin is more interesting. It would appear as if this interaction can potentially change sign of the effect of meanmax. But in fact since meanmin ranges only between 2 and 4 in the data set, the effect of meanmax on the logit will always be negative over the available range of meanmin.
- I fit this final model with the two interaction terms and examine coefficient estimates and significance tests.
> model4<-glm(formula = pres.abs ~ log(distance) + NoOfPools + meanmin + meanmax + log(distance):meanmin + meanmin:meanmax, family = binomial, data = frogs)
> summary(model4)
Call:
glm(formula = pres.abs ~ log(distance) + NoOfPools + meanmin +
meanmax + log(distance):meanmin + meanmin:meanmax, family = binomial,
data = frogs)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.2805 -0.6141 -0.2764 0.6753 2.4552
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -50.71310 24.52294 -2.068 0.038641 *
log(distance) 3.20271 1.29544 2.472 0.013425 *
NoOfPools 0.03128 0.00954 3.279 0.001043 **
meanmin 25.38805 6.77238 3.749 0.000178 ***
meanmax 0.82240 1.65155 0.498 0.618517
log(distance):meanmin -1.22059 0.40565 -3.009 0.002621 **
meanmin:meanmax -0.86433 0.37292 -2.318 0.020462 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 279.99 on 211 degrees of freedom
Residual deviance: 183.03 on 205 degrees of freedom
AIC: 197.03
Number of Fisher Scoring iterations: 5
- Observe that even though the predictor meanmax is not significant, the principle of marginality states that it must remain in the model because its interaction with meanmin is currently in the model.
- Because I have no basis for interpreting the interactions that have been included, I stick with model3 in most of the model calibration calculations that follow.
Goodness of Fit
- When there are many regressors in a model, goodness of fit tests analogous to the Hosmer-Lemeshow test are the only reasonable options. Since we found the Hosmer-Lemeshow test to not be significant when there was only a single predictor meanmin (albeit quadratic in meanmin) it's hard to imagine that there would be any lack of fit with this more complicated model. I follow the protocol outlined in lab 9 and use deciles to cut the estimated logits
> y.cuts<-cut(predict(model3), quantile(predict(model3), seq(0,1,.1)), include.lowest=TRUE)
> y.1<-tapply(fitted(model3),y.cuts,sum)
> y.n<-tapply(fitted(model3),y.cuts,length)
> y.0<-y.n-y.1
> Oi<-table(frogs$pres.abs,y.cuts)
> Ei<-rbind(y.0,y.1)
> Oi
y.cuts
[-5.95,-3.84] (-3.84,-2.53] (-2.53,-1.87] (-1.87,-1.25] (-1.25,-0.686] (-0.686,-0.186]
0 22 17 20 18 17 13
1 0 4 1 3 4 8
y.cuts
(-0.186,0.277] (0.277,0.647] (0.647,1.4] (1.4,6.99]
0 12 9 2 3
1 9 12 19 19
> Ei
[-5.95,-3.84] (-3.84,-2.53] (-2.53,-1.87] (-1.87,-1.25] (-1.25,-0.686] (-0.686,-0.186]
y.0 21.7878765 20.0209707 18.854612 17.396798 14.913822 12.535826
y.1 0.2121235 0.9790293 2.145388 3.603202 6.086178 8.464174
(-0.186,0.277] (0.277,0.647] (0.647,1.4] (1.4,6.99]
y.0 10.35678 8.158018 5.987189 2.988109
y.1 10.64322 12.841982 15.012811 19.011891
- As we can see the expected frequencies, Ei, are small in 5 of the 20 cells. This is probably too many for the asymptotic chi-squared distribution to be reliable. I could switch to quintiles but that would lead to a gigantic drop in degrees of freedom, so I elect to merge categories. Unfortunately, because of the way the degrees of freedom for the Hosmer-Lemeshow test are defined I can't just merge the Y = 1 categories. I have to merge the corresponding Y = 0 categories too. So, I elect to combine the first four categories.
> Ei.new<-cbind(Ei[,1]+Ei[,2]+Ei[,3]+Ei[,4],Ei[,5:10])
> Ei.new
(-1.25,-0.686] (-0.686,-0.186] (-0.186,0.277] (0.277,0.647] (0.647,1.4] (1.4,6.99]
y.0 78.060257 14.913822 12.535826 10.35678 8.158018 5.987189 2.988109
y.1 6.939743 6.086178 8.464174 10.64322 12.841982 15.012811 19.011891
- So now we have only one category, out of 14, that is too small. That's within the 20% margin. I merge the observed categories in the same way and carry out the Pearson test. The degrees of freedom for the test is the number of categories (now 7) minus 2 yielding 5.
> Oi.new<-cbind(Oi[,1]+Oi[,2]+Oi[,3]+Oi[,4],Oi[,5:10])
> sum((Oi.new-Ei.new)^2/Ei.new)
[1] 5.596732
> 1-pchisq(sum((Oi.new-Ei.new)^2/Ei.new),5)
[1] 0.3474554
- So the lack of fit is not significant. As an alternative goodness of fit test I also carry out the modified Hosmer-Lemeshow test that is in the Design package.
> out.h<-lrm( pres.abs ~ log(distance) + NoOfPools + meanmin + meanmax,data=frogs,x=TRUE,y=TRUE)
> residuals.lrm(out.h,type='gof')
Sum of squared errors Expected value|H0 SD Z
31.14031539 32.31699406 0.51671399 -2.27723399
P
0.02277226
The results from this test disagree with our conclusions above. The p-value suggests that there may be lack of fit.
- If I switch to the model that includes the two 2-factor interactions, the model selected by stepAIC, our conclusions about lack of fit change.
> out.h2<-lrm( pres.abs ~ log(distance) + NoOfPools + meanmin + meanmax+log(distance):meanmin+meanmin:meanmax, data=frogs, x=TRUE, y=TRUE)
> residuals.lrm(out.h2,type='gof')
Sum of squared errors Expected value|H0 SD Z
29.2178830 29.4623453 0.4771416 -0.5123474
P
0.6084079
- Redoing the Hosmer-Lemeshow test with this new model, I obtain the following.
> y.cuts<-cut(predict(model4), quantile(predict(model4), seq(0,1,.1)),
include.lowest=TRUE)
> y.1<-tapply(fitted(model4),y.cuts,sum)
> y.n<-tapply(fitted(model4),y.cuts,length)
> y.0<-y.n-y.1
> Oi<-table(frogs$pres.abs,y.cuts)
> Ei<-rbind(y.0,y.1)
> Ei
[-4.19,-3.33] (-3.33,-2.7] (-2.7,-2.09] (-2.09,-1.58] (-1.58,-0.876]
y.0 21.3972477 20.0584396 19.146099 18.113064 16.233116
y.1 0.6027523 0.9415604 1.853901 2.886936 4.766884
(-0.876,-0.294](-0.294,0.191] (0.191,1.01] (1.01,1.67] (1.67,8.3]
y.0 13.345059 10.60849 7.546274 4.535015 2.017197
y.1 7.654941 10.39151 13.453726 16.464985 19.982803
- Once again we need to merge categories, albeit a few more than last time. I'm left with six categories and hence 6–2 = 4 degrees of freedom.
> Ei.new<-cbind(Ei[,1]+Ei[,2]+Ei[,3]+Ei[,4], Ei[,5:8], Ei[,9]+Ei[,10])
> Ei.new
(-1.58,-0.876] (-0.876,-0.294] (-0.294,0.191] (0.191,1.01]
y.0 78.71485 16.233116 13.345059 10.60849 7.546274 6.552212
y.1 6.28515 4.766884 7.654941 10.39151 13.453726 36.447788
> Oi.new<-cbind(Oi[,1]+Oi[,2]+Oi[,3]+Oi[,4], Oi[,5:8], Oi[,9]+Oi[,10])
> 1-pchisq(sum((Oi.new-Ei.new)^2/Ei.new),4)
[1] 0.3352193
- So even though the two-factor interactions are difficult to interpret and were obtained in an unmotivated, unscientific way, the goodness of fit tests suggest we may need to use this model instead of the main effects model.
Model Calibration
Sensitivity and specificity
- Various functions in the ROCR package can be used to compute model sensitivity and specificity as well as draw ROC curves. The process begins by creating what's called a prediction object. To do this the prediction function in the ROCR package is supplied the fitted values from the model and a vector of presence/absence information.
> library(ROCR)
> pred<-prediction(fitted(model3), frogs$pres.abs)
- The performance function is then used to extract the statistics of interest. For sensitivity we need 'tpr', for true positive rate, and specificity or 'tnr', for true negative rate.
> performance(pred, 'tpr', 'tnr')->testy
- If you print the contents of the performance object testy you'll see that the elements occupy locations called "slots" and the elements of the slots are actually list elements. To access the elements of slots you need to use R's @ notation and to extract the list elements you need the double bracket notation. From the printout we see the slots are called x.values, y.values, and alpha.values referring respectively to "True negative rate", "True positive rate", and "Cutoff". To access the cutoffs, for example, I would need to specify
testy@alpha.values[[1]].
The following code plots specificity and sensitivity against the cutoffs.
> plot(testy@alpha.values[[1]], testy@x.values[[1]], type='n', xlab='c',
ylab='sensitivity or specificity')
> lines(testy@alpha.values[[1]], testy@y.values[[1]])
> lines(testy@alpha.values[[1]], testy@x.values[[1]],col=2)
> legend(.8,.8, c('sensitivity', 'specificity'), lty=c(1,1), col=1:2, cex=c(.9,.9), bty='n')
- The value of c where the sensitivity and specificity functions cross is the value of c that maximizes sensitivity and specificity simultaneously. We can estimate this value of c as follows.
> tempmat<-cbind(testy@x.values[[1]], testy@y.values[[1]], testy@alpha.values[[1]])
> tempmat[abs(tempmat[,1]-tempmat[,2])<.02,]->submat
> submat
[,1] [,2] [,3]
[1,] 0.7894737 0.7721519 0.4381193
[2,] 0.7819549 0.7721519 0.4374489
[3,] 0.7819549 0.7848101 0.4350874
[4,] 0.7819549 0.7974684 0.4332186
- From the output above we see that the specificity is constant at the point where the graphs cross (between rows 2 and 3). Thus we can use linear interpolation to estimate this value. (Otherwise I'd just use the midpoint of the interval containing the crossing point.)
> submat[2,3] + (submat[2,1] - submat[2,2])/(submat[3,2] - submat[2,2]) * (submat[3,3] - submat[2,3])
[1] 0.4356201
- So the sensitivity and specificity are jointly maximized when c = 0.4356201 is chosen as the cut-off. We would now use this cut-off in the usual way. If the estimated probability is greater than this value we should predict a corroboree frog to be present, otherwise we would predict it to be absent.
ROC curves
- A number of functions in different packages can be used to produce ROC curves in R. One choice is to use the ROCR package. To produce a ROC curve we need to extract the sensitivity ('tpr') and 1 – specificity, which is the false positive rate 'fpr'. Again we use the performance function to obtain these statistics from the variable pred that was created above. If we then plot this performance object we get a ROC curve.
> perf <- performance(pred,"tpr","fpr")
> plot(perf)
- The ROC function in the Epi package also produces ROC curves. It adds quite a bit of extra detail to the graph.
> library(Epi)
> attach(frogs)
> ROC(form=pres.abs~log(distance)+NoOfPools+meanmin+meanmax, plot="ROC")

Fig. 6 ROC curves from the ROCR and Epi packages, respectively
- The point indicated on the ROC curve produced by the ROC function of Epi is the place where the sum of specificity and sensitivity is maximized. Note that this is a different criterion than the one we considered above and not surprisingly yields a different answer. The value denoted by lr.eta, 0.398, is the cut-off where this sum is maximized.
- PV+ is the predictive value positive, the probability that the Corroboree frog is present at a site given that the test says it should be there. PV– is the predictive value negative, the probability that the Corroboree frog is absent at a site given that the test says it should not be there. In terms of the confusion matrix shown below we have

| |
Observed |
|
| Yi = 1 |
Yi = 0 |
Predicted
(using decision rule) |
 |
A |
C |
A+C |
 |
B |
D |
B+D |
| |
A+B |
C+D |
A+B+C+D |
You should contrast this with the definition of sensitivity and specificity given in Lecture 37.
Area under the curve (AUC)
- The ROC function from the Epi package prints the AUC statistic as part of the printed output on the ROC curve graph shown above. The value reported in Fig. 6 is 0.856.
- We can also obtain the AUC as part of the output from the lrm function in the Design library.
> lrm(pres.abs~log(distance) + NoOfPools + meanmin + meanmax, data=frogs)
Logistic Regression Model
lrm(formula = pres.abs ~ log(distance) + NoOfPools + meanmin +
meanmax, data = frogs)
Frequencies of Responses
0 1
133 79
Obs Max Deriv Model L.R. d.f. P C Dxy Gamma
212 1e-05 84.85 4 0 0.856 0.712 0.712
Tau-a R2 Brier
0.334 0.45 0.147
Coef S.E. Wald Z P
Intercept 19.64372 5.289504 3.71 0.0002
distance -0.80747 0.223489 -3.61 0.0003
NoOfPools 0.02728 0.009059 3.01 0.0026
meanmin 5.50919 1.228085 4.49 0.0000
meanmax -2.40129 0.634164 -3.79 0.0002
- In the output the value labeled C is the concordance coefficient. Recall from Lecture 37 that numerically the concordance coefficient is equal to AUC.
Cross-validation
- One of the problems with the statistics described so far is that the same data set is being used to fit the model as is being used to test the model. One would suspect that the statistics one obtains in this way will be much larger than would be obtained if brand new data were used. Since ideally one would like to use all the data to build the model, this doesn't leave any data to test the model. A way to get around this is cross-validation.
- In cross-validation a data set is repeatedly divided into parts where one part is used to fit the model and the other part is used to test the model. A popular way of doing this is 10-fold cross-validation. Here the data set is randomly divided into 10 parts, called folds. Nine of the folds are combined and used for fitting the model and the remaining fold is used for testing the model. This is then repeated a total of 10 times so that each fold gets to be used once for testing. At each iteration a statistic of interest is calculated and the results are averaged over the 10 runs.
- The cv.binary function in the DAAG package carries out 10-fold cross-validation on a specified model. It calculates what it calls an "estimate of accuracy". This is just the fraction of predicted probabilities that when rounded, round to the observed value 0 or 1.
> cv.binary(model3)
Fold: 8 10 3 5 2 1 9 7 4 6
Internal estimate of accuracy = 0.792
Cross-validation estimate of accuracy = 0.778
- Here we see that the cross-validation estimate is fairly close, albeit smaller, than the estimate obtained when the same data are used to build and test the model (what's called the internal estimate of accuracy in the output).
- The boot package also has a cross-validation function called cv.glm. This function requires that we first specify a cost function. For a cost function I employ the same "estimate of accuracy" statistic used by the cv.binary function but now formulated as a cost (making it an estimate of inaccuracy). The cost function is one of the arguments to cv.glm along with the data set, the model, and the number of folds desired, K. The cross-validation misclassification error is contained in the returned list component $delta.
> cost<-function(r,pi=0) mean(abs(r-pi)>0.5)
> cv.glm(frogs,model3,cost,K=10)->out
> names(out)
[1] "call" "K" "delta" "seed"
> out$delta
1 1
0.2169811 0.2215646
- The second value in out$delta is a bias-corrected version of the first component. Using either one we can expect to be inaccurate in our predictions 22% of the time with new data. This essentially matches the results obtained with the cv.binary function.
Cross-validation for ROC curves
- We can also obtain a cross-validation estimate of the ROC curve using the functions from the ROCR package. For this we actually need both the test data and the predictions that are obtained with each of the individual folds of the K-fold cross-validation. To obtain these quantities I modify the cv.binary function so that it returns the folds and predictions rather than just the summary statistics. I also add an argument that allows specification of a seed to reset the random stream so that the same folds can be obtained repeatedly if desired. In the listing below new code and code that I've altered from the original cv.binary function is shown in red. I also deleted the last 15 lines of the cv.binary function since they're not needed here.
my.cvfunc<-function (obj = frogs.glm, rand = NULL, nfolds = 10, print.details = TRUE, seed=NULL)
{
data <- obj$data
m <- dim(data)[1]
if (is.null(seed))
{
if (is.null(rand))
rand <- sample(nfolds, m, replace = TRUE)
}
else {
set.seed(seed)
if (is.null(rand))
rand <- sample(nfolds, m, replace = TRUE)
}
form <- formula(obj)
yvar <- all.vars(form)[1]
obs <- data[, yvar]
ival <- unique(rand)
fam <- obj$family$family
hat <- predict(glm(form, data, family = fam), type = "response")
cvhat <- rep(0, length(rand))
if (print.details)
cat("\nFold: ")
my.out<-vector("list",nfolds)
my.y<-vector("list",nfolds)
for (i in ival) {
if (print.details)
cat("", i)
if (i%%20 == 0)
cat("\n")
here <- i != rand
i.glm <- glm(form, data = data[here, ], family = fam)
cvhat[!here] <- predict(i.glm, newdata = data[!here,
], family = fam, type = "response")
my.out[[i]]<-cvhat[!here]
my.y[[i]]<-data[!here,yvar]
}
list(my.out,my.y)
}
> testit<-my.cvfunc(model3,seed=10)
- Now when I create the performance object with the performance function it generates the TPR and FPR values for each cross-validation fold. When I plot this performance object I request that the average value and boxplots be displayed.
> pred<-prediction(testit[[1]], testit[[2]])
> perf<-performance(pred,"tpr", "fpr")
> plot(perf,col="grey82",lty=3)
> plot(perf,lwd=1,avg="vertical", spread.estimate="boxplot", add=TRUE)
> abline(0,1,col=4,lty=3)
> mtext('Model 3',side=3,line=.5, font=2)
- I next repeat this for model4.
> testit<-my.cvfunc(model4,seed=10)
> pred<-prediction(testit[[1]], testit[[2]])
> perf<-performance(pred,"tpr","fpr")
> plot(perf,col="grey82", lty=3)
> plot(perf,lwd=1, avg="vertical", spread.estimate="boxplot", add=TRUE)
> abline(0,1,col=4,lty=3)
> mtext('Model 4',side=3,line=.5,font=2)
- The results are shown in Fig 7. Each plot is little bit dishonest in that it ignores the fact that observations in one box plot are not independent of observations in the other box plots. The effect is to make the variability look larger than it really is, because the tendency is to treat the bottom of end of the whiskers as defining a ROC curve, when in fact the bottom edge could just as well be coming from many different ROC curves. Having said this it does appear that the cross-validation ROC curves are consistently higher for Model 4 than for Model 3.

Fig. 7 Cross-validation ROC curves for models 3 and 4
Cited References
- Burnham, K. P. and Anderson, D. R. 2002. Model Selection and Multimodel Inference. Springer-Verlag, New York.
- Davison, A. C. 2003. Statistical Models. Cambridge University Press, Cambridge, United Kingdom.
- Harrell, Frank E. 2001. Regression Modeling Strategies. Springer-Verlag, New York.
- Henderson, H. and Velleman, P. F. 1981. Building multiple regression models interactively. Biometrics 37: 391–411.
- Judd, C. M. and McClelland, G. H. 1989. Data Analysis: A Model-Comparison Approach. Harcourt Brace Jovanovich, San Diego, CA.
Course Home Page