## Lecture 34—Monday, April 9, 2007

### What was covered?

• The multiple testing problem—an example
• Analysis of structured data
• Examples of structured data
• The challenge of structured data
• Approaches to analyzing structured data

### The multiple testing problem—An example

• I compare the various approaches we discussed last time for minimizing the problem of inflated Type I error from carrying out many statistical tests in the course of a study.
• In the example that follows there are ten p-values obtained from ten separate statistical tests carried out on the same data set. These might be, for example, ten t-tests carried out on ten different responses all of which were measured on the same experimental units. I'm trying to make the situation as unambiguous as possible so that even the harshest critics of multiple testing corrections would agree that something needs to be done.
• The p-values are entered in increasing order. The variable numtests below records how many tests were carried out. Observe that without any correction for multiple testing we would declare the results from three of the tests statistically significant at α = .05.

# The ordered p-values
> pvals<-c(.01,.013,.014,.19,.35,.5,.63,.67,.75,.81)
> numtests<-length(pvals)

• For comparison I set up the threshold criteria for the three adjustments we discussed, the Bonferroni correction, the sequential Bonferroni correction (both Holm and Hochberg), and the false discovery rate criterion (FDR). These are
• Bonferroni: all p-values are compared to .
• Sequential Bonferroni: each p-value is compared in succession to its corresponding threshold value, . Comparisons are continued until the stopping rule kicks in. The order of comparison varies in the two versions as does the stopping rule.
• False discovery rate: each p-value is compared to its threshold, . All comparisons are made; there is no stopping rule.
• Keep in mind that α means something different in these three cases. For the various Bonferroni corrections α denotes the desired experimentwise type I error. In the FDR approach, α denotes the desired false discovery rate.

#three different thresholds, family-wise alpha=.05
> bonf<-rep(.05/numtests,numtests)
> holm<-.05/seq(numtests,1,-1)
> fdr<-(1:numtests)*.05/numtests
> cbind(pvals,bonf,holm,fdr)

pvals  bonf        holm   fdr
[1,] 0.010 0.005 0.005000000 0.005
[2,] 0.013 0.005 0.005555556 0.010
[3,] 0.014 0.005 0.006250000 0.015
[4,] 0.190 0.005 0.007142857 0.020
[5,] 0.350 0.005 0.008333333 0.025
[6,] 0.500 0.005 0.010000000 0.030
[7,] 0.630 0.005 0.012500000 0.035
[8,] 0.670 0.005 0.016666667 0.040
[9,] 0.750 0.005 0.025000000 0.045
[10,] 0.810 0.005 0.050000000 0.050

• I next compare each p-value against its threshold and explain the results below. A value of 0 means the tested condition is FALSE, while a value of 1 means it's TRUE.

#carry out tests
> b.test<- pvals<bonf
> holm.t<- pvals<holm
> fdr.t<- pvals<fdr
> cbind(pvals,b.test,holm.t,fdr.t)

pvals b.test holm.t fdr.t
[1,] 0.010      0      0     0
[2,] 0.013      0      0     0
[3,] 0.014      0      0     1
[4,] 0.190      0      0     0
[5,] 0.350      0      0     0
[6,] 0.500      0      0     0
[7,] 0.630      0      0     0
[8,] 0.670      0      0     0
[9,] 0.750      0      0     0
[10,] 0.810      0      0     0

• For the straight Bonferroni test we compare each p-value against the same threshold. As shown in the above output, none of the p-values are smaller than this threshold. Therefore we conclude that none of our tests is statistically significant at α = .05 (experimentwise).
• The sequential Bonferroni test has two flavors.
• For the Holm procedure we start with the smallest p-value and compare this value against its threshold. We stop our comparisons when the condition first tests FALSE and declare all previous TRUE outcomes as corresponding to statistically significant results. From the column labeled holm.t we see that the condition never tests TRUE. Hence none of the test results are statistically significant.
• For the Hochberg procedure we start with the largest p-value, perform each test in succession stopping when we first obtain a significant result (the condition first evaluates to TRUE). We then declare the test corresponding to that p-value statistically significant and all smaller p-values statistically significant (even though we don't test them). The same column labeled holm.t can be used for this. Notice that we never obtain a TRUE result. Hence none of the tests are statistically significant.
• For the False Discovery Rate approach we carry out all ten comparisons. We record the location in the list of the largest p-value to yield a TRUE condition (the 3rd p-value in the output above). We then declare this p-value and all smaller p-values (regardless of what their test showed) as statistically significant. So in our example we would declare the first three p-values to be statistically significant. By controlling the false discovery rate at 5% we are guaranteed that on average roughly 5% of our significant results are false. Since we have three significant results and 5% of 3 is a number much smaller than 1, we feel pretty confident that all three of our tests are statistically significant.

### Data sets with structure

• Examples of structured data sets
1. Structure can arise due to sampling constraints. In a cluster sample observations occur in groups and we select the groups rather than the individual observations. We may collect information on both the individuals and the clusters. For example in the midterm question households were obtained by sampling bomas. The midterm question only looked at variables measured at the household level, family size and landholdings, but other information in fact was collected at the boma level too.
2. Repeated measures data (also called longitudinal data). Here we have a random sample of individuals, sites, etc. that are then followed over time. Typically measurements are made on these units at regular time intervals. Typically there will be time invariant measurements also too measured on the individual or site. For instance, in a weight loss study we might measure blood pressure and the weight of individuals at each time period, but in addition we would probably have individual-level measurements such as gender, diet program, etc. that do not change during the course of the study. The goal here might be to relate blood pressure to body weight while controlling for other factors.
3. Structure can also arise by accident, not by design. Natural groupings may exist that were not part of the original design. Often such structure arises because the observations have a spatial dimension so that observational units that are close spatially tend to be more similar. For instance we may have radon measurements in houses scattered throughout a state. Measurements were taken at the lowest room level--either the basement or the first floor depending on the particulars of the house. In addition we might have geological information at the county level, for instance the average uranium content of soil. The goal is to understand and predict radon levels in unmeasured homes controlling for features such as geology.
• All of these are examples of multilevel data, more specifically 2-level data. In such data there is a natural hierarchy with one level nested in another. Typically measurements are made at both levels.
• The level-1 units are the units at the lowest level. In our examples these are households, measurement times, and homes respectively. Level-1 variables in our examples are family size, weight, and location of lowest room respectively.
• The level-2 units in our examples contain the level-1 units. These are bomas, people, and counties respectively. In the latter two examples there were variables measured exclusively at level 2, gender and weight loss program in the repeated measures study, uranium levels by county in the radon study.
• It is possible to have additional levels beyond two. Also it is not required that the levels be nested. In this course we will focus exclusively on nested, 2-level data sets.

### The relevance of data structure to statistical analysis

• What all these structured data sets have in common is observational heterogeneity. Using simpler language, the structure we've outlined causes different sets of observations to exhibit differing degrees of variability. We would expect level-1 observations occurring in the same level-2 unit to exhibit a different degree of variability than level-1 observations occurring in different level-2 units.
• For instance in a repeated measures study observations made on the same person would be expected to be more similar than observations coming from different people.
• In the radon study households from the same county may share a similar underlying geology and hence exhibit more similar radon levels.
• As the examples illustrate another way of viewing observational heterogeneity is in terms of correlation. The presence of observational heterogeneity means that some sets of observations are more or less correlated than others.
• Recall that all of the basic statistical procedures we've studied assume that observations in our sample are independent of each other. This was basic for instance to how we constructed the likelihood. The likelihood of the data is the joint probability of the obtaining the data we got. Our first step in writing down the likelihood was to re-express the joint probability of the data as a product of individual probabilities. In order to carry out this step we had to assume that the observations were independent.
• Thus the presence of observational heterogeneity (correlation) invalidates most of what we've done so far in this course.

### How should structured data be analyzed?

1. We could ignore the structure and proceed as if the observations were homogeneous and independent. Thus if we have n units (level-1) from each of m different clusters (level-2 units), we would treat this as a single random sample of size mn. Such an approach is almost always invalid. A simple example will illustrate the problem.
• Suppose we wish to study the chlorophyll content of oak leaves. For various reasons we decide we need a sample of 50 oak leaves for the analysis. Here are two possible sample designs we might employ.
• Take a sample of 50 leaves from a single oak tree, or
• Take a sample of 1 leaf from 50 different oak trees.
• Clearly if our goal is to say something about oak leaves in nature the second design would seem preferable. If on the other hand we cared a lot about a specific oak tree, then the first design might make sense. Now if it turns out that this single tree is highly representative of oak trees in general then we might also be able to say something about oak trees in general with the first design, but doing so requires additional assumptions. The point is that a sample size of 50 means something different in these two cases, and we'd be hesitant to treat the second design as being equivalent to the first.
• In truth the ideal design might actually be somewhere between these two extremes. If we felt that a single leaf might not be representative of a tree we might choose to take a sample of trees and then a sample of leaves from each tree. This would yield a structured hierarchical data set akin to a cluster sample.
• The problem of ignoring data structure is so rampant in the literature that a term was coined to describe this mistake and its consequences: pseudo-replication. To treat correlated data as if it were independent, structured data as if it had no structure, and heterogeneous data as if it were homogeneous is to be guilty of pseudo-replication.
2. We could aggregate everything to the highest level, level 2. In this approach we would average all the level-1 variables so that everything becomes a level-2 variable. This is called complete pooling. In our oak leaf example this would amount to averaging the chlorophyll content of leaves coming from the same tree and then treating the sample of ten trees as being the real sample.
• One problem with this approach is that it treats the level-1 units as being perfectly correlated. By reducing everything to level 2, we are throwing away information contained in the variability present at level 1.
• Complete pooling is a valid approach but it can lead to spurious conclusions if the goal is to understand causation. First by averaging away one of the levels it becomes impossible to detect cross-level interactions. In particular, individual effects occurring at level 1 may not translate into measurable effects when averaged at level 2. This can clearly happen when extremes at level-1 have opposing effects on the response.
3. At the opposite extreme from complete pooling is the unpooled approach. This is essentially equivalent to fitting models at level 1 separately for each of the level 2 units. This can be done by actually fitting separate models or by adding indicator categorical variable to the model that reference the identity of the level-2 units.
• If we take the second approach of adding categorical indicators of the level-2 unit to a single model then it is not possible to also include variables measured at level 2. One variable is a multiple of the other.
• The second approach can also become unwieldy if the number of level-2 units is large (producing lots and lots of coefficient estimates) and if some of the level-2 units do not have enough data to support estimating a separate level-1 model there can be estimation problems.
• If we include the level-2 variables in a single model without indicators then we have to hope that most of the variability in the level 2 units is accounted for by the level-2 predictors that we happened to have measured.
4. A fourth approach is the approach we will take in this course. It is essentially a hybrid of the complete pooling approach with the unpooled approach. It is variously called a multilevel model, a random effects model, a random coefficients model, a mixed model, a mixed effects model, or a hierarchical model. We'll go into the details next time.