Problem 1
Data
The file crabs.txt contains the data analyzed by Brockmann (1996). This is a space-delimited text file in which the variable names appear in the first row.
Background
In a study of nesting horseshoe crabs each female horseshoe crab had a male crab resident in her nest. The study investigated factors affecting whether the female crab had any other males, called satellites, residing nearby. Explanatory variables are the female crab's color, spine condition, weight, and carapace width. The response outcome of interest for each female crab is her number of satellites. For this problem we will use the female's width alone as a predictor. The variable width in the file crabs.txt records a female's carapace width in centimeters and the variable num.satellites records the number of satellite males.
The Problem
- Estimate how the variance in the number of satellite males varies with the mean using carapace width to group your data.
- Superimpose in a scatter plot of the variance versus the mean potential models for the mean-variance relationship drawn from appropriate probability models.
- Based on your answer to question 2, what probability model might be most appropriate as a probability generating mechanism for these data?
- Fit your probability model to the data. Obtain your answer by minimizing the negative loglikelihood of your data using your chosen probability model. I would like you to fit two models.
- A model without predictors.
- A model in which you allow the mean number of satellites to vary as a function of the female's carapace width.
- Compare the two models you found in question 4 using AIC.
- Refit the two models of question 4 using an appropriate generalized linear model. Use a significance test to assess whether female carapace width is a significant predictor of the number of satellite males.
- Hard: Check the fit of the best of the two models you've fit.
Hints
- In fitting model 4.2 I would recommend parameterizing your loglikelihood so that the mean is the exponential of your linear predictor. This should assist in getting convergence.
- If model 4.2 turns out to be the best (hint) you will need to include the female's carapace width in your calculation of the expected frequencies. The easiest way to do this is to use the model results from question 6 and make use of either the fitted function (returns the means) or the predict function (returns the values of the linear predictor) applied to your model object. Then use these values in calculating the expected probabilities for each observed value of carapace width.
- Goodness of fit. We haven't done one like this. It's far more complicated than anything we've handled thus far. Hopefully these hints will get you on the right track.
- Each observed value of carapace width will generate a set of probabilities. Let X be a random variable giving the number of satellite males. Then since there are 173 female crabs in the data set you will, e.g., obtain 173 estimates of P(X = 0). They won't all be different because some females have the same carapace width. To obtain the expected probability that X = 0 you will need to average these 173 values.
- You will need to obtain the average separately for all possible values of X. The simplest way to do this is with the sapply function. The first argument to sapply should be the different values of X at which you wish to compute the expected probability. The second argument should be a generic function which for a specific value of X calculates the average probability under your probability model for all 173 crabs.
- Don't forget to include the tail probability when you do the expected probabilities.
- When you tabulate the observed frequency counts for the number of satellite males, some of the intermediate categories may be missing. Be sure to deal with that before you do the goodness of fit test.
Problem 2
Data
The file hurricanes.csv contains the data analyzed by Haggard et al. (1973). This is a comma-delimited text file in which the variable names appear in the first row.
Background
Although hurricanes generally strike only the eastern and southern coastal regions of the United States, they do occasionally sweep inland before completely dissipating. The U.S. Weather Bureau confirms that in the period from 1900 to 1969 a total of 36 hurricanes moved as far as the Appalachians. The file hurricanes.csv lists the maximum 24-hour precipitation recorded for those 36 storms during the time they were over the mountains.
The Problem
- Fit a gamma distribution to the hurricane precipitation data.
- Superimpose the distribution estimated in question 1 on top of a histogram of the data.
- Use a generalized linear model to fit the same distribution you did in question 1. Find one thing from the output of the glm fit (you may need to use a function to extract it) that matches one thing from your output of fitting the distribution directly in question 1.
Hints
- In question 1 you could muck around guessing values for the parameters, or you can be systematic about it. Recall that the scale and shape parameters occur in formulas for the mean and variance of the gamma distribution. You can solve these equations for scale and shape in terms of the mean and variance and use these as your initial estimates.
- In question 1 if you get an error message that says something to the effect that the shape parameter must be strictly positive, try replacing your shape parameter by the exponential of your shape parameter in your negative loglikelihood function.
- Also recall that the rate parameter in R's various gamma functions is in fact what we called the scale parameter. What R calls the scale parameter is the reciprocal of ours.
- In question 2 to superimpose a probability density function on top of a histogram you will need to specify the option probability=TRUE (or equivalently freq=FALSE) as an argument to the hist function. This scales the vertical axis as density. Otherwise the scales won't match when you try to overlay the estimated gamma density function.
- In question 3 to see options for the family argument of glm, type help('family').
- For the "one thing" I ask for in question 3, you might start with the output you get in question 1, which is pretty limited. You should be able to extract something from the glm output that comes close to matching something in this output.
Problem 3
Data
The file gallfly.txt contains the data appearing in Finney and Varley (1955). This is a space-delimited text file in which the variable names appear in the first row.
Background
The female knapweed gall-fly (Urophora jaceana) lays its eggs in batches in the unopened flowers of the black knapweed (Centaurea nemoralis). The second instar larva hatches from the egg and produces a hard gall-cell. Numbers of eggs and gall-cells (at different dates) were counted in flower heads in two years 1935 and 1936 (Finney and Varley, 1955). The table below summarizes the results. The numbers in the table reflect the number of plants showing the counts given in column 1.
| Number |
1935 |
1936 |
Eggs |
Gall-cells |
Eggs |
Gall-cells |
1 |
29 |
287 |
22 |
90 |
2 |
38 |
272 |
18 |
96 |
3 |
36 |
196 |
18 |
57 |
4 |
23 |
79 |
11 |
26 |
5 |
8 |
29 |
9 |
10 |
6 |
5 |
20 |
6 |
4 |
7 |
5 |
2 |
3 |
5 |
8 |
2 |
0 |
0 |
0 |
9 |
1 |
1 |
1 |
1 |
10 |
0 |
0 |
0 |
0 |
11 |
0 |
0 |
0 |
0 |
12 |
1 |
0 |
0 |
0 |
The Problem
1. Using only the egg data from 1935, find a distribution that is appropriate for these data, estimate its parameter(s), and check its fit.
2. Does the same distribution work for the 1935 gall-cell counts?
3. Is there any evidence of a change in distribution of eggs and gall-cells between 1935 and 1936? Based on your answer to question 2 you may want to answer this question separately for eggs and gall-cells.
Hints
- Observe that no data for the zero category were obtained. An appropriate distribution is one in which zero counts are not possible. While we never explicitly fit such a distribution in class, one such distribution was discussed in a different context in lecture 18. I'm not really interested in you doing an exhaustive search to find the best model for these data. I just want you to fit something that is appropriate. If you're still stuck on what to use, check out the title of the paper that I reference as the source of these data.
- Depending on how you evaluate the fit in question 1, you may find the result less than ideal. Since this exam is already long enough you have my blessing to continue with this distribution regardless of how the fit turns out. (That's assuming you chose a distribution that is at least appropriate in the first place!)
- In answering questions 2 and 3 the methodology that you used in Assignment 5 would be appropriate.