Assignment 10
Due Date
Friday, April 20, 2007
Data
The file radon.txt contains level-1 data and cty.txt contains level-2 data for this problem.
The Problem
In this exercise we carry out some of the analyses described in Price et al. (1996), Lin et al. (1999), and Gelman and Hill (2006). Their object was to develop a model to predict indoor radon concentration for houses in the state of Minnesota. The authors argue that indoor radon measurements are lognormally distributed so we'll be working with log-transformed radon measurements coupled with a normal error distribution.
The authors carried out a 2-level hierarchical analysis using the county in which a home is located to define the data structure. In truth these data are only weakly structured. The data were obtained as a simple random sample (actually a stratified random sample but the stratification is not related to the grouping variable), not a cluster sample, and so the presumed structure is not a result of sampling. Furthermore we don't have repeated measurements of the same observational unit. Instead the concern is that radon observations taken from the same county are likely to be correlated due to their spatial proximity and possession of a similar geology. The authors measured variables at both the county and individual home level and used a multilevel model to account for differential variation at each level.
Question 1
- Read in the data from the file radon.txt. Subset the data set so that you are only working with observations from the state of Minnesota. Use the variable state (not state2) to select these observations. The variable state assigns homes from Indian reservations to a special category and thus they will not be part of our analysis.
- The variable activity records home radon levels in picoCuries per liter. There are a few cases where a value of zero was reported for activity. The actual radon levels are probably not truly zero but are just too small to be detected by the instrument used. Such measurements are called left-censored. While there are formal ways of dealing with censored data, for this analysis replace the zero values with 0.05, a value halfway between 0 and the smallest reported nonzero measurement, 0.10. Having made this change, log-transform the radon values.
- The variable floor records the lowest living area in the home at which radon measurements were made: 0 for basement and 1 for first floor. Generate a lattice graph that plots log radon levels against floor level separately for each county in Minnesota. Each panel should show both a scatter plot of the data and a line segment that connects the mean log radon levels for homes with basements to the mean log radon levels of homes without basements.
Hints:
- Convert the floor variable to a factor and plot log radon levels against this factor. This will set the tick marks on the x-axis so that only the two levels of the factor are displayed.
- The county names are too long to display completely in the panel strips even when their size is reduced. If we use the first six letters of each county name we can uniquely identify each county. Use the substr function in R to pull off the first six letters of each county name (contained in the variable county) to create a county abbreviation variable. Use this county abbreviation variable as your grouping variable in the lattice graph.
- There are a total of 85 counties in the data set after removing the Indian reservations so a sensible display is to arrange the panels in 17 columns and 5 rows. This can be accomplished with the layout argument of the xyplot function by specifying layout=c(17,5).
- Not all counties have sample observations for both values of the variable floor. This prevents us from using the panel.abline function and lm to generate a regression line as was done in class. Instead proceed as follows. Within the panel function use tapply to obtain the mean of the y-variable for the different levels of the x-variable. The means will be given the names 0 and 1 by tapply (or just one of these values when both levels of floor are not present in the sample from that county). In each panel the tick marks on the x-axis are displayed at x = 1 and x = 2. Use the panel.lines function to draw the line segments. For its x-argument, extract the names from the tapply object using the names function, convert the result to numeric values using the as.numeric function, and add 1 to the result to yield values of 1 and 2 instead of 0 and 1. For the y-argument of panel.lines, use the means calculated by tapply.
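The data preparation and the tapply hint above can be sketched as follows. This is only an illustration under stated assumptions: the data frame toy and its values are invented stand-ins for the Minnesota subset of radon.txt, the helper name mean_segment and the column names log.radon, cty6, and floor.f are hypothetical, and the panel function is assumed to receive x as the floor factor with levels 0 and 1, as the hint describes.

```r
library(lattice)  # recommended package shipped with standard R installations

# Hypothetical toy data; the values are invented for illustration only.
toy <- data.frame(
  county   = c("AITKIN", "AITKIN", "AITKIN", "BLUE EARTH", "BLUE EARTH", "BECKER"),
  floor    = c(0, 0, 1, 0, 1, 1),
  activity = c(2.2, 0.0, 1.0, 0.9, 0.4, 3.1)
)

# Replace the censored zeros with 0.05, then log-transform.
toy$activity[toy$activity == 0] <- 0.05
toy$log.radon <- log(toy$activity)

# Six-letter county abbreviation and floor converted to a factor.
toy$cty6    <- substr(toy$county, 1, 6)
toy$floor.f <- factor(toy$floor)

# Hypothetical helper: end points of the mean segment for a single panel.
# x is assumed to be the floor factor (levels "0" and "1"), y the log radon values.
mean_segment <- function(x, y) {
  means <- tapply(y, x, mean)        # named by the factor levels
  means <- means[!is.na(means)]      # drop a level that is absent from this panel
  list(x = as.numeric(names(means)) + 1,  # "0","1" -> tick positions 1, 2
       y = as.numeric(means))
}

# Build the trellis object; three toy counties, so one row of three panels.
p <- xyplot(log.radon ~ floor.f | cty6, data = toy, layout = c(3, 1),
            panel = function(x, y) {
              panel.xyplot(x, y)               # scatter plot of the data
              seg <- mean_segment(x, y)
              panel.lines(seg$x, seg$y)        # segment joining the means
            })
```

Calling print(p) draws the graph. A county with only one floor level yields a single mean, so no segment appears in its panel.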
Question 2
Fit an unconditional means multilevel model to these data using log radon level as the response and the data structured by county. Obtain and identify the variance components and calculate the intraclass correlation coefficient. Is the evidence very strong that these data are structured?
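For reference, the unconditional means model and the intraclass correlation coefficient can be written as below. The notation is an assumption introduced here, not taken from the assignment: y_{ij} is the log radon level of home i in county j, \tau^2 is the between-county (intercept) variance, and \sigma^2 is the within-county (residual) variance.

```latex
y_{ij} = \beta_0 + u_{0j} + \varepsilon_{ij}, \qquad
u_{0j} \sim N(0, \tau^2), \qquad
\varepsilon_{ij} \sim N(0, \sigma^2)

\rho = \frac{\tau^2}{\tau^2 + \sigma^2}
```

Under this model \rho is both the share of the total variance that lies between counties and the correlation between two homes in the same county, so a value near zero indicates weak structure.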
Question 3
- Add floor as a level-1 predictor to the unconditional means model and fit a random intercepts model. Provide two separate pieces of evidence that the variable floor should be retained in the model.
- Offer a statistical argument (not a philosophical one) why intercepts should be treated as random across counties rather than fixed for these data. (Hint: compare the random intercept model to a complete pooling model using an appropriate statistic.)
- Calculate an R² that is appropriate for the random intercepts model.
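The assignment does not say which R² is intended. One commonly used choice for a level-1 predictor, offered here only as a sketch, is the proportional reduction in the level-1 residual variance. The symbols are assumptions: \sigma^2_{0} denotes the residual variance estimated from the unconditional means model and \sigma^2_{1} the residual variance estimated from the random intercepts model that includes floor.

```latex
R^2_{\varepsilon} = \frac{\sigma^2_{0} - \sigma^2_{1}}{\sigma^2_{0}}
```

The same proportional-reduction form can be applied to the intercept variance when the added predictor is measured at the county level.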
Question 4
Extend the model of Question 3 to the random slopes and intercepts model. Offer statistical evidence to show that random "slopes" are not necessary in this model.
Question 5
- Read the second data set cty.txt of county data into R. Subset it so that you are only working with the data for Minnesota counties. (Use the variable st to subset the data.)
- Merge the county-level Minnesota data to the household-level Minnesota data set created in Question 1. The key fields that link up the two data sets are cntyfips in the household data set and ctfips in the county data set. (Note: There are two counties in the county data set that do not appear in the household data set.)
- Add log(Uppm), a measurement of soil uranium levels at the county level, to the model from Question 3, the random intercepts model. Argue that this group-level variable should be retained in the model.
- Calculate an R² that is appropriate for quantifying the importance of log(Uppm) as a predictor.
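The merge step in this question can be sketched with base R's merge function, which accepts differently named key columns through its by.x and by.y arguments. The two small data frames and the column name log.U are hypothetical stand-ins with invented values; only the key names cntyfips and ctfips and the variable Uppm come from the assignment.

```r
# Hypothetical stand-ins for the two Minnesota subsets; values are invented.
homes    <- data.frame(cntyfips  = c(1, 1, 3),
                       log.radon = c(0.8, -0.1, 1.2))
counties <- data.frame(ctfips = c(1, 3, 5),
                       Uppm   = c(0.50, 0.43, 0.89))

# Default merge is an inner join: county 5 has no homes, so it drops out.
both <- merge(homes, counties, by.x = "cntyfips", by.y = "ctfips")

# County-level predictor on the log scale, repeated for every home in a county.
both$log.U <- log(both$Uppm)
```

Each home keeps its own row and picks up the uranium value of its county, which is the layout a model with a group-level predictor expects.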
Cited References
Gelman, A. and J. Hill. 2006. Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press: New York.
Lin, C., A. Gelman, P. N. Price, and D. H. Krantz. 1999. Analysis of local decisions using hierarchical modeling, applied to home radon measurement and remediation. Statistical Science 14(3): 305–337.
Price, P. N., A. V. Nero, and A. Gelman. 1996. Bayesian prediction of mean indoor radon concentrations in Minnesota counties. Health Physics 71(6): 922–936.