Final Exam

Due Date: Monday, May 8, 2006

Instructions: Submit your answers to the questions along with whatever code you used to obtain your answers.

Problem 1

Data 

The file problem1.csv contains the data for this problem. This is a comma-delimited text file in which the variable names appear in the first row.

Background 

Three shells were collected from one stream (Boone) and four shells from a second stream (Buffalo). Multiple sections of each shell were then analyzed and a carbon isotope composition obtained for each shell section. The goal is to determine if there is a systematic difference in the carbon isotope composition of shells from the two different streams.

The Problem

  1. Fit two different models such that when these two models are compared in an appropriate fashion, the question posed above can be answered statistically. Carry out a statistical significance test using the two models to answer the researcher's question.
  2. Carry out a second statistical test that answers this same question but only requires you to fit a single model. Explain why this test and the test in question 1 are yielding different conclusions.
  3. Produce a single graph that best summarizes this experiment. Choose a graph that displays as much information as possible without becoming overly busy and confusing.
  4. Finally, as hard as this might be for you to do, I would like you to carry out an incorrect statistical analysis that addresses the researcher's question. The analysis I'm looking for is not just incorrect but egregiously incorrect. Answer the following two questions about your incorrect analysis.
    • What's wrong with it?
    • Why are its conclusions so different from the correct analyses you did in questions 1 and 2?

Hint

  1. I'm not trying to be deliberately obscure in question 4. The analysis I'm thinking of is perhaps the most obviously incorrect analysis one might come up with. This incorrect analysis has been done so often in ecology that a famous paper published in the 1980s, a paper which to date has been cited nearly 3000 times, coined a somewhat contentious term to describe exactly this sort of bogus analysis.

Problem 2

Data 

The file WaterUsageData.csv contains the amount of water used over time during the production of landscape trees. This is a comma-delimited text file in which the variable names appear in the first row.

Background 

These are the same data you analyzed in Assignment 11. See the description contained there. The goal this time is to determine if water usage varies with respect to tree SPECIES (1 and 2) and tree AGE (1 and 2), or some combination of the two.

The Problem

Find the most parsimonious model that accounts for all the relevant differences in water usage among tree species (SPECIES) and tree age classes (AGE).

Hint

  1. A systematic approach to deciding which effects to model as random and how to include level-2 predictors in the model is preferable to just fitting a lot of models. One way to be systematic is to follow the approach I took in class when I analyzed the kestrel data set. There I fit a number of preliminary individual models using the nlsList function that then became input to the intervals function. The plot that was produced was especially useful in quantifying parameter variability among level-2 units. If you choose to go this route you would replace nlsList with lmList.

Problem 3

Data 

The file problem3.csv contains the data for this problem. This is a comma-delimited text file in which the variable names appear in the first row.

Background 

Dengue fever is endemic in many parts of the world and is spreading. The data here consist of the daily admission records of hospitals and clinics throughout Sri Lanka over a many year period. There are only two variables.

  1. The variable DISEASE records whether any new cases of dengue fever were diagnosed on a particular day (DISEASE=0) or not (DISEASE=1). Notice that this is the reverse of our usual coding scheme. So if you want to model the probability of disease you will want to subtract these values from 1.
  2. The column labeled temp is a smoothing (average over adjacent records) of the daily high temperature for the day ten days prior to the date of the current disease record. Dengue fever is transmitted by a mosquito and ten days corresponds roughly to the minimal developmental time in going from egg to adult. In fact a number of different lag times and smoothing windows were considered. What is reported here is what was determined to be the optimal choice. It is generally thought that temperature may be one of the controlling factors in the mosquito's life cycle. The large amount of missing data in the temp column reflects the absence of suitable temperature data for calculating the smoothed ten day lag.

The goal of the analysis is to determine if the appearance of diseased individuals at clinics is in any way related to the lagged temperature variable.

The Problem

  1. Build the best model you can that relates the disease presence-absence variable (DISEASE) to the temperature variable (temp).
  2. Carry out whatever tests you deem appropriate to demonstrate the quality of your best model.

Hint

  1. Be sure to demonstrate that you have correctly specified the structural relationship between the variables in your model.

Problem 4

Data 

The file problem4.csv contains the data for this problem. This is a comma-delimited text file in which the variable names appear in the first row.

Background 

The North Carolina Division of Marine Fisheries (NCDMF) permits the mechanical harvesting of the hard clam Mercenaria mercenaria during an annual midwinter "season". The harvesting, called clam kicking, involves the use of a 17- to 45-foot specially outfitted boat that uses its propeller to backwash clams into a trawl that is dragged along the bottom. Prompted by declining catches and complaints about the highly disruptive nature of this method of harvesting, the NCDMF in Fall 2001 instituted a rotation plan thereby creating a part-time Marine Protected Area (MPA) in a portion of Core Sound. According to the plan a portion of Core Sound is closed for two years (leaving the remainder of the Sound open to fishing). The hope is that the temporary hiatus in fishing would permit the clam stock biomass to replenish itself and even export larvae to areas open to fishing while at the same time allowing the sea bottom to recover.

To assess whether part-time MPAs are a viable strategy, a study was begun in Spring 2001 to assess the impact of this rotation plan on the hard clam population in Core Sound. Samples were taken in the Spring and Fall of each year for a three-year period. The Spring sample immediately followed the closing of the midwinter clam-kicking season while the Fall sample immediately preceded the opening of the next season. The first two samples were taken before the rotation plan was put into place and thus occurred at a time when all of Core Sound was open to clam-kicking. Samples were also taken in the Spring and Fall of 2002 and 2003 during which a portion of Core Sound was closed to clam-kicking. Hereafter these two treatment areas will be denoted CC (Core "control" for the portion of Core Sound that remained open to fishing during all three years) and CE (Core "experimental" for the portion of Core Sound that was open during the first year of sampling but then was closed for the final four sample periods).

A commercial clam-kicking vessel was leased to in order to obtain the samples. During each sampling period twelve or more five-minute trawls were carried out in each treatment area and the total number of legal-sized clams obtained from each trawl was recorded. This information can be found in the variable LEGAL.CLAMS. The location of the trawl is recorded as TREATMENT (CE or CC). An effort was made to sample as much of the harvestable area as possible. Because the equipment and sampling protocol were kept the same throughout the study, it is assumed that the individual trawls are suitable replicates of the two treatment areas for that sampling period.

The goal is to develop a suitable model of the number of clams harvested during the course of this study such that the model adequately describes the data and at the same time addresses the question of whether the NCDMF rotation plan is a viable means of protecting the clam population.

The Problem

1.  Begin by plotting the data over time. Your plot should exhibit the following features.

2.  Choose an appropriate probability generating mechanism for these data. I don't expect you to do an exhaustive search here but you do need to do some kind of preliminary analysis to justify your choice of probability distribution.

3.  The most complicated model imaginable would be one that constructed separate models for each treatment area at each sampling period. If k parameters are required in each of these models this would mean estimating a total of 12k parameters. This is excessive and would exhaust most of the available degrees of freedom for carrying out a goodness of fit test. Based on the following observations drawn from an inspection of your plot in Question 1 we will try to reduce this number.

Observations

Using some of these observations begin by fitting a model that includes the following:

Next use the observations listed above to try to simplify this model as much as possible. Carry out your simplifications by fitting a sequence of models in which various terms are eliminated. Stop when either significance testing or an appropriate information theoretic criterion dictates that further simplification is unwarranted. Do not use an automatic model building program for this!

4.  Carry out an appropriate goodness of fit test for your final model. Does the model fit?

5.  Redo your plot from Question 1 except do not connect the treatment means by line segments. Instead superimpose your final model from Question 3 on the plot as two lines representing the predictions for the two treatments. Based on your plot how well does it appear that the model predicts the observed means?

6.  What are your conclusions? Does the rotation plan appear to work?

Hints

  1. For your plot in Question 1 you will probably want to place your legend outside of the plotting region. By default all plotting is clipped to the plot region. This can be expanded to include the entire figure region by issuing the command par(xpd=TRUE). You should issue this just before specifying the legend and then resetting xpd to FALSE after the legend is drawn. When plotting outside the plot region the x- and y-coordinate system just continues in the natural fashion.
  2. The text function can be used to place text at various points inside the plotting region.
  3. Ideally Question 6 would be addressed with a formal analysis. For the exam I'm just asking you to eyeball your graph from Question 5 and interpret it in light of the research question of interest. I will include a more formal analysis as part of the solution to this problem on the class web site.
  4. Question 4: You carried out an appropriate goodness of fit test that could be applied to this problem in the Midterm. One difference here is that you will have a lot categories to combine in groups (essentially all of them) to meet the minimum expected frequency criteria. To simplify things I've written a function to assist you. This function takes two arguments: (1) a minimum probability value you wish to attain in each group, and (2) a vector of expected probabilities for the values 0 to n. I demonstrate its use below.

Suppose the data set is clams and I've computed the expected probabilities using my model for count categories 0 to 1000 and have stored these in the vector p.actual. As an illustration, suppose I decide to have a minimum expected frequency of at least 7 in each group (I'm not necessarily recommending this value!) Dividing 7 by the number of observations yields the minimum expected probability in each cell. The function is used as follows.

> get.breaks(7/dim(clams)[1],p.actual)
 [1] -1   6  10  13 16 19 22 25 28 31 35 39 43 48 53 59 66 74 83
[20] 95 110 131 164

The first reported value –1 is just a filler. The remaining reported values are the last count that should be placed in that group. Thus the first group should consist of counts x = 0 through x = 6, the second group counts x = 7 through x = 10, etc. For example, x = 0 to x = 6 corresponds to the first 7 observations of p.actual. I sum them and then multiply by the total number of observations.

> sum(p.actual[1:7])
[1] 0.03904472
> sum(p.actual[1:7])*dim(clams)[1]
[1] 7.223273

So we see that sum of the expected frequencies do exceed 7 as desired. If I drop one observation we obtain a total less than 7.

> sum(p.actual[1:6])*dim(clams)[1]
[1] 5.404796

The sum of the expected frequencies for the next group also exceeds 7.

> sum(p.actual[8:11])*dim(clams)[1]
[1] 9.085657

Now of course you don't want to use the values in this way. It would be too tedious. Just use the cut function with these as the category endpoints (plus one additional value for the upper limit) to create groups. Finally use tapply to sum the values of p.actual in each group. You can then repeat this with the observed counts except you'll first need to fill in zeros for the count categories that don't occur. One elegant way to do this is create a vector of zeros where there is a zero for each possible count. Use merge to combine this vector with the vector of observed counts, merging by the count category number. This will create missing values, NA, for counts that were not observed. Finally use ifelse to replace these with zeros.

Course Home Page


Jack Weiss
Phone: (919) 962-5930
E-Mail: jack_weiss@unc.edu
Address: Curriculum in Ecology, Box 3275, University of North Carolina, Chapel Hill, 27516
Copyright © 2006
Last Revised--May 4, 2006
URL: http://www.unc.edu/courses/2006spring/ecol/145/001/docs/assignments/final
.htm