Final Exam (100 pts)
Due Date: Monday, May 7, 2007
Instructions: Submit your answers to the questions along with whatever code you used to obtain your answers.
Problem 1 (30 pts)
Data
The file lakes.txt contains the data shown in Table 1.1, pp 8–9 of Manly’s book, sulfate (SO4) concentrations over several years for a number of lakes in Norway. It is a tab-delimited text file.
Background
These data are used in various places throughout Manly's book.
The Problem
- In this first part ignore the structure of the data set and find the best model you can that relates SO4 concentration to year. List all models you considered and provide a statistical argument(s) for why your best model took the form it did.
- Produce a graph that accurately displays the structure of the data set. Your graphical display should include for each lake both a scatter plot of the data and a superimposed regression line (curve).
- Find the best model you can that relates SO4 concentration to year but this time also correctly accounts for the structure in the data set.
- Interpret the best model you found in part 3. What does the model actually say? Be as specific as possible.
- Produce suitable graph(s) to demonstrate that in addition to being structurally invalid your best model in part 1 failed to adequately account for the spatial correlation in the data set.
- Produce suitable graph(s) to demonstrate that your best model in part 3 has adequately accounted for the spatial correlation in the data set.
Note:
You may wish to do parts 5 and 6 simultaneously by placing the appropriate graphs side by side for comparison.
Hints
- There are two forms of structure to these data—a deliberate structure that is imposed by the way the data were collected and a second structure that I would call incidental. The problem sets out to account for the deliberate structure directly and then at the very end checks to see if the incidental structure has also been inadvertently accounted for. To understand what the deliberate structure here is ask yourself the following question. Looking at the data why does it seem unlikely that a random sample of lakes was obtained at each time period? Since it appears that a random sample was only taken at the beginning of the study what's the structure?
- The graph I'm referring to in part 2 is much like the one you produced for the radon data in Assignment 10.
- I'm not asking you to fit any formal models in parts 5 and 6. The empirical estimate should be enough to answer the question.
Problem 2 (35 pts)
Data
The file coverclass.txt is a space-delimited text file that contains the data for this problem. There are four variables:
- PlotCode—an identifier of the relevé from which the data were collected,
- CHSSpCode—a 4-letter code identifying the species
- Coverclass.1983—the cover class value for that species in 1983
- Coverclass.2022—the cover class value for that species in 2002
The file
scinames.txt lists the species codes again and provides the Latin binomials that correspond to the species codes. It is a tab-delimited text file.
Background
The data consist of plant species cover values taken on two occasions on each of 28 fixed transects (technically relevés) on one of the Channel Islands off the coast of California. The first occasion was in 1983 before the island became a national park and the second occasion was in 2002 after a long period of feral animal removal. There were 25 plant species that provided suitable amounts of data for analysis and the cover values of these 25 species for all those transects on which they occurred are what appear in the file. Plant cover values are ordinal measures of areal cover more or less on a log base 2 scale. Not all species were found on each transect so sample sizes vary widely among the species.
The Problem
- Treat the different transects as replicates and carry out separate statistical tests for each species to determine if its cover value has changed significantly between 1983 and 2002. In the end you should have carried out 25 tests, one for each of the 25 species.
- Report unadjusted, Bonferroni-adjusted, and sequential Bonferroni adjusted results. Carry out the adjustments so that the experiment-wise error rate is held fixed at 0.05.
- Carry out an additional adjustment so that the false discovery rate is held fixed at 0.05.
- Explain conceptually the difference between unadjusted, Bonferroni-adjusted, and false discovery rate approaches to the multiple comparison problem. In light of your explanation interpret your answers in parts 2 and 3.
- Concoct a single graphical display that illustrates the change in cover value for each species as well as a measure of reliability for this change. Ideally the picture should be commensurate with the way in which you chose to answer part 1.
Hints
- There are at least four distinct methods you could use to answer part 1: a parametric method, a nonparametric method, a randomization-based method, and a model-based method (actually many different model-based methods). I really don't care which of these you use. Most people I've talked to seem to be using the parametric method to solve this problem. I suspect that if you were to use the parametric method for these data as your analysis in a publication most reviewers would probably reject it and tell you to use either methods 2 or 3 instead. Method 2 is just as easy to use here as the parametric method but you're probably unfamiliar with it. I haven't run through all the methods yet myself, but I'm going to guess that you would get basically the same results with any of these methods.
- The best way to deal with part 1 is to write a function that can then be used repeatedly for the different species. Here's a possible model for the function.
- The first line of the function subsets the data so that only the current species is represented in the data.
- The second line then carries out the desired test on the cover values for that species.
- The third line of the function then collects the results you need to save from that test as the components of a vector.
- Outside of the function create a vector that contains all the unique species codes. By sapplying your function on this species code vector you can obtain the test results for all species as the columns of a matrix.
- Although detailed code for carrying out what's requested in questions 3 and 4 appears in the online notes for the last lecture on this topic, you could just as easily figure it out yourself.
- Either one of the graphical displays used from Assignment 8 would be appropriate for part 5. (If you use the approach from Question 1 in Assignment 8 you should not include the connecting line segments.) I recommend putting the species names on the y-axis and then rotating them so that they are written horizontally. The finished plot should be done in a way that a reader can look at the graph and immediately determine which species yielded statistically significant results (unadjusted) and which did not.
- A nice addition to the graphical display would be to report the sample size available for each species. This could appear on the right side of the graph opposite the corresponding species name that appears on the left side. [Bonus points maybe?]
- [added 4/4/07] There's too much going on in this problem for you to get bogged down in the first part so I've added an extensive hint on how to obtain p-values and confidence intervals for the individual species.
Problem 3 (25 pts)
Data
The data for this problem are the same data you used in Assignment 10. The file radon.txt contains level-1 data and cty.txt contains level-2 data for this problem.
Background
This problem revisits Assignment 10 from a Bayesian perspective.
The Problem
- Refit the model in Question 5 from Assignment 10 but this time as a Bayesian model using WinBUGS.
- Using the posterior distributions returned from your WinBUGS run, obtain point estimates, 50% confidence (credibility) intervals, and 95% confidence (credibility) intervals for mean radon levels (not log radon levels) in homes without basements separately by county for every county used in the analysis.
- Display the results from question 2 as a probability smear graph in which the counties are displayed separately but all appear in the same graph and can be readily compared.
Hints
- I recommend that you start with the common pooling model, get that to work, and then gradually complexify things. Fit the separate intercepts model next and then the random intercepts model, gradually modifying your code as you go. Only after you get the random intercepts model to work would I try to add the log(uranium) variable as a level-2 predictor.
- In these preliminary models in which log(uranium) is not used, do not include it as part of the input data vector. WinBUGS expects to use every bit of data you give it and will complain if you give it things it doesn't use.
- Be careful when you create the county variable for your model. There are lots of county indicators in this data set but most of them are screwed up because of trailing spaces and the like. The county indicator you should use is one that has only 85 unique values. Once you find one that fills the bill in this regard, turn it into a factor, and then convert the factor into a numeric variable with the as.numeric function. The county variable needed in WinBUGS is one in which the counties are numbered 1 through 85 consecutively. This is necessary because the variable county is used in the code as an index variable.
- The log uranium variable needs to be a numeric vector with 85 values, one value per county. If you follow my example for doing this in the online notes using the tapply function the object you obtain turns out to be an array, not a vector. Convert it into a vector with the as.vector function. If you don't do this WinBUGS will give you an error message when it tries to read in the data.
- Pay close attention to the log window and whatever else is printed by WinBUGS. It's your key to figuring out what's going wrong and believe me things will go wrong.
- For part 2 keep in mind that what WinBUGS returns to you is a sample from the posterior distribution of each parameter. Thus obtaining credibility intervals for parameters is incredibly easy. There are no formulas to use. Just extract the desired quantiles from the posterior distribution. Now in part 2 of this question I ask for the credibility interval for the mean radon levels for houses without basements. In the code the mean of the log(radon) concentrations of houses is given by y.hat and depending on the house it is either a single parameter or the sum of two parameters. From the code, the formula for y.hat is the following.

- The second equation gives the mean log radon levels in each county for homes without basements. WinBUGS returns a vector of samples from the posterior distributions of a[j] and b separately. If we add these two vectors together component-wise and take the .025 and .975 quantiles of this sum we obtain a 95% credibility interval for the mean log radon levels in county j. Finally exponentiate the result to obtain a 95% credibility interval for radon levels for that county.
- Repeat this process for each of the 85 counties. The a[j] appear as columns in the $sims.matrix component returned by the bugs function. So the simplest way to get all the credibility intervals at once is to write a function that is a function of j where j corresponds to a column number in $sims.matrix.
- The function should first add b to a[j] to obtain the posterior distribution of the log mean.
- Next it should obtain the desired upper and lower quantiles from
this distribution using the quantile function.
- Finally it should exponentiate the result to get the credibility intervals for the geometric mean radon levels.
- These three steps can be done in one line. Repeat this to obtain a 50% interval, 95% interval, and the median for each county.
- Finally use the smear graph code from Assignment 8 to plot the results for part 3. With 85 counties it may not be realistic to try to get county labels in the picture.
Problem 4 (10 pts)
Data
The file bycatch.csv contains the data shown in Manly's Table 3.12. This is the same data set you analyzed in Problem 2 of the midterm.
Background
This problem revisits Problem 2 from the midterm (also question 1 on Assignment 9). In trying to simulate data using Manly’s bycatch model, it became clear in Assignment 9 that treating season as a categorical predictor of bycatch is a bad idea because doing so produces some highly unstable and unreliable point estimates. A possible solution (and arguably a more natural approach) is to include season as a random effect rather than a fixed effect. To do so we fit a model in which the intercept of the model is allowed to vary randomly across the different seasons.
The Problem
- Refit the bycatch model from the midterm as a random intercepts model in which the intercept is allowed to vary by season. The basic form of the model should be identical to what you did on the midterm except for the different way in which season is being handled.
- Provide statistical evidence that the random intercepts model is a good compromise between Manly’s approach and a model that ignores season altogether.
Hints
- The lme function of the nlme package does not fit Poisson regression mixed models. For that you need a different function from a different package. Look at the last section of the notes for Lecture 36 to see what I'm talking about.
- Except for a new term that accounts for the random intercepts for each season, the remainder of the syntax is virtually identical to what you specified in the glm function when you fit the Poisson regression model on the midterm.
Extra Credit (10 pts)
Data
The data are the same as in Problem 4.
The Problem
Fit the model from Problem 4 on this exam as a Bayesian model. Report Bayesian credibility intervals for all estimated parameters.
Hints
- The function in WinBUGS that replaces dnorm is dpois and it has a single parameter called lambda, just like in R.
- Don't forget that the Poisson model uses a log link. So in your WinBUGS code the regression model will be for log(y.hat[i]), not y.hat[i]. Be sure to specify it as such in your code.
- This problem is really just a trivial modification of the random intercepts code you used in the previous problem. If you have even a vague understanding of this WinBUGS stuff, you should find this bonus question to be a 10-point gift.
Course Home Page