Due Date: Friday, April 7, 2006
The file crabs.txt contains the data analyzed by Brockmann (1996). This is a space-delimited text file in which the variable names appear in the first row.
In a study of nesting horseshoe crabs each female horseshoe crab had a male crab resident in her nest. The study investigated factors affecting whether the female crab had any other males, called satellites, residing nearby. Explanatory variables are the female crab's color, spine condition, weight, and carapace width. The variable num.satellites records the number of satellite males.
This assignment is a continuation of Assignment 8. In Assignment 8 you investigated the relationship between the presence/absence of satellite males and the width of the female crab using a logistic regression model. In this assignment you will use a multiple logistic regression model to investigate the relationship between the presence/absence of satellite males and all the remaining variables in the crabs data set. The available variables are the female crab's color, spine condition, weight, and carapace width.
The goal is to find a "best" model relating the presence/absence of satellite males to the condition of the female as measured by these four variables.
Question 1 Fit a logistic regression model that uses all the variables as main effects, i.e., you need not consider the possibility of variable interactions at this point. Think long and hard about the variables color and spine before you blindly include them in the model.
Question 2 In light of your analysis in Assignment 8 is there anything troubling about your answer to Question 1? What do you think may be going on?
Question 3 Refit the model of Question 1 but this time without the weight variable. Examine the output from the summary function and answer Question 1 again for this new model. What has changed? Explain why this change has occurred.
Question 4 In light of your answer to Question 3, we will no longer include weight among the set of predictors. Using the remaining three variables find a best main effects logistic regression model for these data. Be sure to justify the steps you go through in declaring this model to be best.
Question 5 Now consider a model that includes interactions among the three variables. You have my blessing at this point to use an automated variable selection routine if you wish. What model do you come up with? Interpret the parameter estimates of all the predictors that occur in your final model.
Question 6 If you compare the model you obtained in Question 5 against various nested simpler models using appropriate significance tests, which model would you conclude is best?
Question 7 Graph your final model from Question 5 on a probability scale.
Hint 1: If you've done everything correctly so far then you should be able to interpret your final model as a set of models. Each model in the set predicts the presence/absence of satellite males for a different kind of female crab. Graph the models for these different females on the same graph. In the end you should have multiple logistic curves plotted together in which the different curves are distinguished by different line types and/or colors. Be sure to include a legend for this plot.
Hint 2: A nice touch might be to use different colors or symbols for the presence/absence values based on the different types of female crabs that are recognized by your final model.
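The kind of graph described in Hints 1 and 2 can be sketched as follows. This is only an illustration on simulated data: the data frame toy, the two-level factor group, the helper plot_group_curves, and all numeric values are hypothetical stand-ins, and the form of your own final model may differ.

```r
# Simulated stand-in data; names and values are hypothetical, not the crabs data
set.seed(1)
n <- 150
width <- runif(n, 21, 34)
group <- factor(sample(c("A", "B"), n, replace = TRUE))
eta <- -12 + 0.5 * width - 1.2 * (group == "B")
pres.abs <- rbinom(n, 1, plogis(eta))
toy <- data.frame(width, group, pres.abs)

# Main-effects logistic regression: common slope, separate intercept per group
fit <- glm(pres.abs ~ width + group, family = binomial, data = toy)

# Draw one fitted curve per factor level on the probability scale
plot_group_curves <- function(fit, dat) {
  lev <- levels(dat$group)
  x <- seq(min(dat$width), max(dat$width), length.out = 100)
  # Observed 0/1 values, colored by group (Hint 2)
  plot(dat$width, dat$pres.abs, col = as.integer(dat$group),
       xlab = "width", ylab = "P(satellites present)")
  for (k in seq_along(lev)) {
    nd <- data.frame(width = x, group = factor(lev[k], levels = lev))
    lines(x, predict(fit, newdata = nd, type = "response"), lty = k, col = k)
  }
  legend("bottomright", legend = lev, lty = seq_along(lev),
         col = seq_along(lev), bty = "n")
}

if (interactive()) plot_group_curves(fit, toy)
```

Using predict with type = "response" returns fitted probabilities directly, so no manual back-transformation from the logit scale is needed.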
Question 8 Using an appropriate goodness of fit test, test the fit of your final model.
Question 9 In this question we explore model calibration using the model obtained in Question 5.
1. Suppose you decide to use your model to predict whether a female has satellite males nearby. You will predict satellite males to be present if $\hat{p}_i > c$, where $\hat{p}_i$ is the probability for female i obtained from the logistic regression model and c is a cutoff value to be chosen. What value of c should you use if your goal is to maximize both the specificity and sensitivity of your decision rule?
2. Obtain the value of AUC, area under the curve, for your logistic regression model. Interpret the number you obtain.
3. Carry out a 10-fold cross-validation and report the AUC you would expect to obtain with new data.
Hint 3: Both of the cross-validation functions we considered require that all variables used in the model are also part of the data set. You need to add any variable you created (presence-absence, factor variables, etc.) to the data set. So if you created a variable pres.abs you will need to add it to crabs as follows:
crabs$pres.abs <- pres.abs
Repeat this for any other variable used in the model that is not in the original data set before doing the cross-validation.
4. Plot the ROC curve for your logistic regression model.
5. Do a second ROC plot but this time include both the ROC curve for the final model obtained in Question 5 and the ROC curve for the final model obtained in Assignment 8. Use different colors for the two curves. What can you conclude from this plot?
Hint 4: Ordinarily the plot function replaces the current plot in the graph window with the new plot. This behavior can be overridden by changing the setting of the graph parameter new to TRUE using the par function. If you issue the command par(new=TRUE), the next high-level plotting command is drawn on top of the current graph rather than replacing it. This will allow you to add a second ROC curve using the plot command. The setting applies only to the next plot call, so reissue it before each additional overlay. Issuing the command par(new=FALSE) when you are done ensures the setting is off.
Hint 5: There are two ways you can generate the ROC curve. One way is to use the plot function and just plot different performance objects, resetting new to TRUE between runs and using a different color for the second plot call. The second choice is to extract the fpr and tpr from the performance object for each model using the slot notation like you did to plot sensitivity and specificity. These then become just ordinary x- and y-variables in a call of the plot function. To get the stairstep pattern preferred for a ROC curve, set type='s' in the plot function. You will still need to reset new to TRUE between plot calls. You will not want to use the ROC function from Epi unless you also turn off all the extra text it prints on the graph.
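The second approach in Hint 5 can be sketched in base R as follows. This sketch does not use a performance object: it computes the false and true positive rates by hand so that it runs without extra packages. The data are simulated, and the names roc_points, auc, plot_two_roc, fit.big, and fit.small are hypothetical.

```r
# Simulated stand-in data (hypothetical values, not the crabs data)
set.seed(2)
n <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
y <- rbinom(n, 1, plogis(-0.3 + 1.2 * x1 + 0.8 * x2))

fit.big <- glm(y ~ x1 + x2, family = binomial)
fit.small <- glm(y ~ x1, family = binomial)

# False and true positive rates at every observed cutoff
roc_points <- function(p, y) {
  cuts <- sort(unique(p), decreasing = TRUE)
  tpr <- sapply(cuts, function(cc) mean(p[y == 1] >= cc))
  fpr <- sapply(cuts, function(cc) mean(p[y == 0] >= cc))
  data.frame(fpr = c(0, fpr), tpr = c(0, tpr))
}

# Area under the ROC curve by the trapezoidal rule
auc <- function(roc) {
  sum(diff(roc$fpr) * (head(roc$tpr, -1) + tail(roc$tpr, -1)) / 2)
}

roc.big <- roc_points(fitted(fit.big), y)
roc.small <- roc_points(fitted(fit.small), y)

# Overlay two stairstep ROC curves using par(new = TRUE)
plot_two_roc <- function(r1, r2) {
  plot(r1$fpr, r1$tpr, type = "s", col = "blue",
       xlim = c(0, 1), ylim = c(0, 1),
       xlab = "False positive rate", ylab = "True positive rate")
  par(new = TRUE)  # the next plot call is drawn over the current graph
  plot(r2$fpr, r2$tpr, type = "s", col = "red",
       xlim = c(0, 1), ylim = c(0, 1),
       axes = FALSE, xlab = "", ylab = "")
  abline(0, 1, lty = 3)  # reference line for a model with no discrimination
  legend("bottomright", legend = c("larger model", "smaller model"),
         col = c("blue", "red"), lty = 1, bty = "n")
}

if (interactive()) plot_two_roc(roc.big, roc.small)
```

Both plot calls use the same xlim and ylim, and the second suppresses its axes and labels. Without this the two curves would be drawn on different scales and the axis annotations would be overprinted.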
Question 10 Since the variable color represents shades of darkness, it might be treated as an ordinal variable. What evidence do you have from your logistic regression results to suggest that perhaps the log odds of a satellite male being present has an ordinal relationship to the variable color?
Question 11 Thus far in this course we have not discussed regression models with ordinal predictors, but we will correct that omission now. We can carry out a test for trend with respect to the categories of a categorical variable by declaring that variable to be ordinal using the ordered function of R. When an ordered factor appears in a model, R by default uses orthogonal polynomial contrasts to decompose it into linear, quadratic, cubic, etc. components. The components are identified in the output by the suffixes .L, .Q, .C, etc. The assumption being made is that although the values of the categories are not known, the spacing between the categories is equal. Fit a logistic regression model using width and ordinal color as predictors. Is there any evidence for linear, quadratic, or cubic trends?
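To see what the polynomial coding looks like, the following minimal sketch builds an ordered factor and prints its contrast matrix. The color codes here are made up, and four categories are assumed only because the question mentions trends up to cubic.

```r
# Hypothetical color codes with four categories (not the crabs data)
color <- c(1, 2, 2, 3, 4, 3, 1, 4)
color.ord <- ordered(color)

# Polynomial contrasts are R's default coding for ordered factors
cp <- contrasts(color.ord)
print(round(cp, 3))

# The columns are orthonormal: their cross-product is the identity matrix
print(round(crossprod(cp), 10))
```

The column names .L, .Q, and .C are the suffixes that appear on the coefficient names when color.ord is used as a predictor.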
Question 12 A second way of handling ordinal data is to use Helmert contrasts. We discussed and interpreted this coding scheme in Lecture 26. Unlike ordered factors, here we assume the categories are ordered, but not necessarily equally spaced. Declare color to be a factor with Helmert contrasts and fit the model again using width along with this new color factor variable as predictors. Based on the output from the summary function, what can you conclude about color and its effect on the presence/absence of satellite males this time? Interpret the color results as best you can.
Hint 6: As explained in Lecture 26, if color.f is the factor version of the variable color then you can change its default contrasts to Helmert as follows:
contrasts(color.f) <- 'contr.helmert'
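As a quick check that the contrasts were changed, the sketch below applies the command from Hint 6 to a made-up four-level factor and prints the resulting coding matrix. The color codes are hypothetical.

```r
# Hypothetical color codes with four categories (not the crabs data)
color <- c(1, 2, 2, 3, 4, 3, 1, 4)
color.f <- factor(color)

# Replace the default treatment contrasts with Helmert contrasts
contrasts(color.f) <- 'contr.helmert'

# Each column contrasts one level with the levels that precede it
print(contrasts(color.f))
```

In R's version of this coding, the first column contrasts level 2 with level 1, the second contrasts level 3 with levels 1 and 2, and the third contrasts level 4 with levels 1 through 3.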
Question 13 Based on what you observed in Question 12 dichotomize color into two groups. Fit a logistic regression model that includes width and dichotomized color as predictors. How does this model compare to your model of Question 5?
Brockmann, H. J. 1996. Satellite male groups in horseshoe crabs, Limulus polyphemus. Ethology 102: 1–21.
Jack Weiss
Phone: (919) 962-5930
E-Mail: jack_weiss@unc.edu
Address: Curriculum in Ecology, Box 3275, University of North Carolina, Chapel Hill, 27516
Copyright © 2006
Last Revised: April 8, 2006
URL: http://www.unc.edu/courses/2006spring/ecol/145/001/docs/assignments/assign9.htm