Due Date: Monday, May 8, 2006
Instructions: Submit your answers to the questions along with whatever code you used to obtain your answers.
The file problem1.csv contains the data for this problem. This is a comma-delimited text file in which the variable names appear in the first row.
Three shells were collected from one stream (Boone) and four shells from a second stream (Buffalo). Multiple sections of each shell were then analyzed and a carbon isotope composition obtained for each shell section. The goal is to determine if there is a systematic difference in the carbon isotope composition of shells from the two different streams.
The file WaterUsageData.csv contains the amount of water used over time during the production of landscape trees. This is a comma-delimited text file in which the variable names appear in the first row.
These are the same data you analyzed in Assignment 11. See the description contained there. The goal this time is to determine if water usage varies with respect to tree SPECIES (1 and 2) and tree AGE (1 and 2), or some combination of the two.
Find the most parsimonious model that accounts for all the relevant differences in water usage among tree species (SPECIES) and tree age classes (AGE).
The file problem3.csv contains the data for this problem. This is a comma-delimited text file in which the variable names appear in the first row.
Dengue fever is endemic in many parts of the world and is spreading. The data here consist of the daily admission records of hospitals and clinics throughout Sri Lanka over a many year period. There are only two variables.
The goal of the analysis is to determine if the appearance of diseased individuals at clinics is in any way related to the lagged temperature variable.
The file problem4.csv contains the data for this problem. This is a comma-delimited text file in which the variable names appear in the first row.
The North Carolina Division of Marine Fisheries (NCDMF) permits the mechanical harvesting of the hard clam Mercenaria mercenaria during an annual midwinter "season". The harvesting, called clam kicking, involves the use of a 17- to 45-foot specially outfitted boat that uses its propeller to backwash clams into a trawl that is dragged along the bottom. Prompted by declining catches and complaints about the highly disruptive nature of this method of harvesting, the NCDMF in Fall 2001 instituted a rotation plan thereby creating a part-time Marine Protected Area (MPA) in a portion of Core Sound. According to the plan a portion of Core Sound is closed for two years (leaving the remainder of the Sound open to fishing). The hope is that the temporary hiatus in fishing would permit the clam stock biomass to replenish itself and even export larvae to areas open to fishing while at the same time allowing the sea bottom to recover.
To assess whether part-time MPAs are a viable strategy, a study was begun in Spring 2001 to assess the impact of this rotation plan on the hard clam population in Core Sound. Samples were taken in the Spring and Fall of each year for a three-year period. The Spring sample immediately followed the closing of the midwinter clam-kicking season while the Fall sample immediately preceded the opening of the next season. The first two samples were taken before the rotation plan was put into place and thus occurred at a time when all of Core Sound was open to clam-kicking. Samples were also taken in the Spring and Fall of 2002 and 2003 during which a portion of Core Sound was closed to clam-kicking. Hereafter these two treatment areas will be denoted CC (Core "control" for the portion of Core Sound that remained open to fishing during all three years) and CE (Core "experimental" for the portion of Core Sound that was open during the first year of sampling but then was closed for the final four sample periods).
A commercial clam-kicking vessel was leased to in order to obtain the samples. During each sampling period twelve or more five-minute trawls were carried out in each treatment area and the total number of legal-sized clams obtained from each trawl was recorded. This information can be found in the variable LEGAL.CLAMS. The location of the trawl is recorded as TREATMENT (CE or CC). An effort was made to sample as much of the harvestable area as possible. Because the equipment and sampling protocol were kept the same throughout the study, it is assumed that the individual trawls are suitable replicates of the two treatment areas for that sampling period.
The goal is to develop a suitable model of the number of clams harvested during the course of this study such that the model adequately describes the data and at the same time addresses the question of whether the NCDMF rotation plan is a viable means of protecting the clam population.
1. Begin by plotting the data over time. Your plot should exhibit the following features.
2. Choose an appropriate probability generating mechanism for these data. I don't expect you to do an exhaustive search here but you do need to do some kind of preliminary analysis to justify your choice of probability distribution.
3. The most complicated model imaginable would be one that constructed separate models for each treatment area at each sampling period. If k parameters are required in each of these models this would mean estimating a total of 12k parameters. This is excessive and would exhaust most of the available degrees of freedom for carrying out a goodness of fit test. Based on the following observations drawn from an inspection of your plot in Question 1 we will try to reduce this number.
Observations
Using some of these observations begin by fitting a model that includes the following:
- an intercept,
- a treatment effect,
- a seasonality effect,
- a linear effect due to date (treating the sample dates as equally spaced and coded, e.g., 0, 1, 2, 3, 4, 5),
- a date by treatment interaction, and
- a season by treatment interaction.
Next use the observations listed above to try to simplify this model as much as possible. Carry out your simplifications by fitting a sequence of models in which various terms are eliminated. Stop when either significance testing or an appropriate information theoretic criterion dictates that further simplification is unwarranted. Do not use an automatic model building program for this!
4. Carry out an appropriate goodness of fit test for your final model. Does the model fit?
5. Redo your plot from Question 1 except do not connect the treatment means by line segments. Instead superimpose your final model from Question 3 on the plot as two lines representing the predictions for the two treatments. Based on your plot how well does it appear that the model predicts the observed means?
6. What are your conclusions? Does the rotation plan appear to work?
Suppose the data set is clams and I've computed the expected probabilities using my model for count categories 0 to 1000 and have stored these in the vector p.actual. As an illustration, suppose I decide to have a minimum expected frequency of at least 7 in each group (I'm not necessarily recommending this value!) Dividing 7 by the number of observations yields the minimum expected probability in each cell. The function is used as follows.
> get.breaks(7/dim(clams)[1],p.actual)
[1] -1 6 10 13 16 19 22 25 28 31 35 39 43 48 53 59 66 74 83
[20] 95 110 131 164The first reported value –1 is just a filler. The remaining reported values are the last count that should be placed in that group. Thus the first group should consist of counts x = 0 through x = 6, the second group counts x = 7 through x = 10, etc. For example, x = 0 to x = 6 corresponds to the first 7 observations of p.actual. I sum them and then multiply by the total number of observations.
> sum(p.actual[1:7])
[1] 0.03904472
> sum(p.actual[1:7])*dim(clams)[1]
[1] 7.223273So we see that sum of the expected frequencies do exceed 7 as desired. If I drop one observation we obtain a total less than 7.
> sum(p.actual[1:6])*dim(clams)[1]
[1] 5.404796The sum of the expected frequencies for the next group also exceeds 7.
> sum(p.actual[8:11])*dim(clams)[1]
[1] 9.085657Now of course you don't want to use the values in this way. It would be too tedious. Just use the cut function with these as the category endpoints (plus one additional value for the upper limit) to create groups. Finally use tapply to sum the values of p.actual in each group. You can then repeat this with the observed counts except you'll first need to fill in zeros for the count categories that don't occur. One elegant way to do this is create a vector of zeros where there is a zero for each possible count. Use merge to combine this vector with the vector of observed counts, merging by the count category number. This will create missing values, NA, for counts that were not observed. Finally use ifelse to replace these with zeros.
Jack Weiss |