Final Exam (100 pts)

Due Date: Monday, May 7, 2007

Instructions: Submit your answers to the questions along with whatever code you used to obtain your answers.

Problem 1 (30 pts)

Data 

The file lakes.txt contains the data shown in Table 1.1, pp 8–9 of Manly’s book, sulfate (SO4) concentrations over several years for a number of lakes in Norway. It is a tab-delimited text file.

Background 

These data are used in various places throughout Manly's book.

The Problem

  1. In this first part ignore the structure of the data set and find the best model you can that relates SO4 concentration to year. List all models you considered and provide a statistical argument(s) for why your best model took the form it did.
  2. Produce a graph that accurately displays the structure of the data set. Your graphical display should include for each lake both a scatter plot of the data and a superimposed regression line (curve).
  3. Find the best model you can that relates SO4 concentration to year but this time also correctly accounts for the structure in the data set.
  4. Interpret the best model you found in part 3. What does the model actually say? Be as specific as possible.
  5. Produce suitable graph(s) to demonstrate that in addition to being structurally invalid your best model in part 1 failed to adequately account for the spatial correlation in the data set.
  6. Produce suitable graph(s) to demonstrate that your best model in part 3 has adequately accounted for the spatial correlation in the data set.

Note: You may wish to do parts 5 and 6 simultaneously by placing the appropriate graphs side by side for comparison.

Hints

  1. There are two forms of structure to these data—a deliberate structure that is imposed by the way the data were collected and a second structure that I would call incidental. The problem sets out to account for the deliberate structure directly and then at the very end checks to see if the incidental structure has also been inadvertently accounted for. To understand what the deliberate structure here is ask yourself the following question. Looking at the data why does it seem unlikely that a random sample of lakes was obtained at each time period? Since it appears that a random sample was only taken at the beginning of the study what's the structure?
  2. The graph I'm referring to in part 2 is much like the one you produced for the radon data in Assignment 10.
  3. I'm not asking you to fit any formal models in parts 5 and 6. The empirical estimate should be enough to answer the question.

Problem 2 (35 pts)

Data 

The file coverclass.txt is a space-delimited text file that contains the data for this problem. There are four variables:

The file scinames.txt lists the species codes again and provides the Latin binomials that correspond to the species codes. It is a tab-delimited text file.

Background 

The data consist of plant species cover values taken on two occasions on each of 28 fixed transects (technically relevés) on one of the Channel Islands off the coast of California. The first occasion was in 1983 before the island became a national park and the second occasion was in 2002 after a long period of feral animal removal. There were 25 plant species that provided suitable amounts of data for analysis and the cover values of these 25 species for all those transects on which they occurred are what appear in the file. Plant cover values are ordinal measures of areal cover more or less on a log base 2 scale. Not all species were found on each transect so sample sizes vary widely among the species.

The Problem

  1. Treat the different transects as replicates and carry out separate statistical tests for each species to determine if its cover value has changed significantly between 1983 and 2002. In the end you should have carried out 25 tests, one for each of the 25 species.
  2. Report unadjusted, Bonferroni-adjusted, and sequential Bonferroni adjusted results. Carry out the adjustments so that the experiment-wise error rate is held fixed at 0.05.
  3. Carry out an additional adjustment so that the false discovery rate is held fixed at 0.05.
  4. Explain conceptually the difference between unadjusted, Bonferroni-adjusted, and false discovery rate approaches to the multiple comparison problem. In light of your explanation interpret your answers in parts 2 and 3.
  5. Concoct a single graphical display that illustrates the change in cover value for each species as well as a measure of reliability for this change. Ideally the picture should be commensurate with the way in which you chose to answer part 1.

Hints

  1. There are at least four distinct methods you could use to answer part 1: a parametric method, a nonparametric method, a randomization-based method, and a model-based method (actually many different model-based methods). I really don't care which of these you use. Most people I've talked to seem to be using the parametric method to solve this problem. I suspect that if you were to use the parametric method for these data as your analysis in a publication most reviewers would probably reject it and tell you to use either methods 2 or 3 instead. Method 2 is just as easy to use here as the parametric method but you're probably unfamiliar with it. I haven't run through all the methods yet myself, but I'm going to guess that you would get basically the same results with any of these methods.
  2. The best way to deal with part 1 is to write a function that can then be used repeatedly for the different species. Here's a possible model for the function.
    • The first line of the function subsets the data so that only the current species is represented in the data.
    • The second line then carries out the desired test on the cover values for that species.
    • The third line of the function then collects the results you need to save from that test as the components of a vector.
  3. Outside of the function create a vector that contains all the unique species codes. By sapplying your function on this species code vector you can obtain the test results for all species as the columns of a matrix.
  4. Although detailed code for carrying out what's requested in questions 3 and 4 appears in the online notes for the last lecture on this topic, you could just as easily figure it out yourself.
  5. Either one of the graphical displays used from Assignment 8 would be appropriate for part 5. (If you use the approach from Question 1 in Assignment 8 you should not include the connecting line segments.) I recommend putting the species names on the y-axis and then rotating them so that they are written horizontally. The finished plot should be done in a way that a reader can look at the graph and immediately determine which species yielded statistically significant results (unadjusted) and which did not.
  6. A nice addition to the graphical display would be to report the sample size available for each species. This could appear on the right side of the graph opposite the corresponding species name that appears on the left side. [Bonus points maybe?]
  7. [added 4/4/07] There's too much going on in this problem for you to get bogged down in the first part so I've added an extensive hint on how to obtain p-values and confidence intervals for the individual species.

Problem 3 (25 pts)

Data 

The data for this problem are the same data you used in Assignment 10. The file radon.txt contains level-1 data and cty.txt contains level-2 data for this problem.

Background 

This problem revisits Assignment 10 from a Bayesian perspective.

The Problem

  1. Refit the model in Question 5 from Assignment 10 but this time as a Bayesian model using WinBUGS.
  2. Using the posterior distributions returned from your WinBUGS run, obtain point estimates, 50% confidence (credibility) intervals, and 95% confidence (credibility) intervals for mean radon levels (not log radon levels) in homes without basements separately by county for every county used in the analysis.
  3. Display the results from question 2 as a probability smear graph in which the counties are displayed separately but all appear in the same graph and can be readily compared.

Hints

Problem 4 (10 pts)

Data 

The file bycatch.csv contains the data shown in Manly's Table 3.12. This is the same data set you analyzed in Problem 2 of the midterm.

Background 

This problem revisits Problem 2 from the midterm (also question 1 on Assignment 9). In trying to simulate data using Manly’s bycatch model, it became clear in Assignment 9 that treating season as a categorical predictor of bycatch is a bad idea because doing so produces some highly unstable and unreliable point estimates. A possible solution (and arguably a more natural approach) is to include season as a random effect rather than a fixed effect. To do so we fit a model in which the intercept of the model is allowed to vary randomly across the different seasons.

The Problem

  1. Refit the bycatch model from the midterm as a random intercepts model in which the intercept is allowed to vary by season. The basic form of the model should be identical to what you did on the midterm except for the different way in which season is being handled.
  2. Provide statistical evidence that the random intercepts model is a good compromise between Manly’s approach and a model that ignores season altogether.

Hints

  1. The lme function of the nlme package does not fit Poisson regression mixed models. For that you need a different function from a different package. Look at the last section of the notes for Lecture 36 to see what I'm talking about.
  2. Except for a new term that accounts for the random intercepts for each season, the remainder of the syntax is virtually identical to what you specified in the glm function when you fit the Poisson regression model on the midterm.

Extra Credit (10 pts)

Data 

The data are the same as in Problem 4.

The Problem

Fit the model from Problem 4 on this exam as a Bayesian model. Report Bayesian credibility intervals for all estimated parameters.

Hints

  1. The function in WinBUGS that replaces dnorm is dpois and it has a single parameter called lambda, just like in R.
  2. Don't forget that the Poisson model uses a log link. So in your WinBUGS code the regression model will be for log(y.hat[i]), not y.hat[i]. Be sure to specify it as such in your code.
  3. This problem is really just a trivial modification of the random intercepts code you used in the previous problem. If you have even a vague understanding of this WinBUGS stuff, you should find this bonus question to be a 10-point gift.

Course Home Page


Jack Weiss
Phone: (919) 962-5930
E-Mail: jack_weiss@unc.edu
Address: Curriculum in Ecology, Box 3275, University of North Carolina, Chapel Hill, 27516
Copyright © 2007
Last Revised--May 3, 2007
URL: http://www.unc.edu/courses/2007spring/enst/562/001/docs/assignments/final.htm