Lecture 40—Wednesday, March 29, 2006
What was covered?
- The common pooling and unpooled regression models for structured data
- The random slopes and intercepts multilevel model
Terminology defined
Some inelegant approaches to analyzing structured data
- We begin a discussion of the mixed effect (multilevel) model approach to analyzing hierarchically structured data. To make things concrete I will specifically focus on one of the structured data examples we discussed in lecture 38. In example 2 we considered the problem of estimating a common regression model using data obtained from many different published studies, a problem generally referred to as meta-analysis.
- Recall the basic scenario. Data were obtained on the pelagic larval duration times (PLD) of 74 marine invertebrate and vertebrate species at various temperatures (T). Thus each study provided observations of the form (T, PLD) with different numbers of ordered pairs coming from different studies (species). For this exposition I will assume that a linear relationship of the form

$$PLD = \beta_0 + \beta_1 T + \varepsilon$$

is a reasonable one. The question of interest is whether the same values of β0 and β1 work for all species or whether different species require different values for the intercept and/or slope. (Note: In truth the real focus is on β1, but I'll consider the more general problem of both intercepts and slopes here.) Let's consider some possible (but flawed) approaches to this problem.
Common Pooling Model
- In the common pooling model we combine all the data and, ignoring the species label for our observations, fit a single ordinary regression model to all the observations simultaneously. Fig. 1 shows the result for these data.
- We could then evaluate the fit of the model by examining residual plots and calculating useful summary statistics such as $R^2$. If we're satisfied with the diagnostics and the fit, can we reasonably conclude that a single regression model holds for all species?
- First, as was noted in lecture 38, the common pooling model starts by making a very big assumption, namely that the obvious structure in the data does not matter. Ordinary regression assumes that the individual observations used in estimating the model are independent. Given that the observations came to us in groups (species) which in turn came from separate published studies it is unlikely the data are independent. In any case it would be a far better strategy to demonstrate empirically that independence is a reasonable assumption rather than merely assuming it without evidence.
- Second, there are no true models. ("All models are wrong, but some are useful", George Box.) Instead of fitting a single model and claiming it represents truth, we should compare this model with other models and demonstrate that it is the best of those we've considered. If our model list is exhaustive enough, and if in particular it includes models that are consistent with competing and conflicting hypotheses, then we are in a better position to assert that our research hypothesis is true.
- An obvious model to consider as an alternative to the common pooling model is one that allows each species to have its own PLD versus temperature relationship. We consider that model next.
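As a rough illustration of the common pooling fit, here is a minimal Python sketch. It does not use the PLD data from the lecture: the helper `simulate_species_data` and every number inside it (intercepts, slopes, noise levels, group sizes) are made up purely so the example runs, and `fit_line` is a hypothetical helper name.

```python
import numpy as np

def simulate_species_data(n_species=74, seed=0):
    """Synthetic stand-in for the (T, PLD) data; every number here is invented."""
    rng = np.random.default_rng(seed)
    groups = []
    for _ in range(n_species):
        n_i = int(rng.integers(2, 9))          # 2 to 8 observations per species
        b0 = 60.0 + rng.normal(0.0, 8.0)       # species-specific intercept
        b1 = -1.5 + rng.normal(0.0, 0.3)       # species-specific slope
        temp = rng.uniform(5.0, 30.0, size=n_i)
        pld = b0 + b1 * temp + rng.normal(0.0, 2.0, size=n_i)
        groups.append((temp, pld))
    return groups

def fit_line(temp, pld):
    """Least-squares (intercept, slope) and the residual sum of squares."""
    X = np.column_stack([np.ones_like(temp), temp])
    beta, *_ = np.linalg.lstsq(X, pld, rcond=None)
    resid = pld - X @ beta
    return beta, float(resid @ resid)

groups = simulate_species_data()

# Common pooling: stack all observations and discard the species labels.
temp_all = np.concatenate([t for t, _ in groups])
pld_all = np.concatenate([y for _, y in groups])

beta_pooled, rss_pooled = fit_line(temp_all, pld_all)
r2_pooled = 1.0 - rss_pooled / np.sum((pld_all - pld_all.mean()) ** 2)
print("pooled intercept and slope:", beta_pooled)
print("pooled R^2:", r2_pooled)
```

Note that nothing in this fit checks the independence assumption; the species labels are simply thrown away before the model is estimated.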
Separate Regressions (Unpooled) Model
- In a separate regressions (unpooled) model, we fit regression models separately to the data for each species. Thus we fit models of the following form.

$$PLD_{ij} = \beta_{0i} + \beta_{1i} T_{ij} + \varepsilon_{ij}, \quad i = 1, \ldots, 74, \quad j = 1, \ldots, n_i$$

where the first subscript denotes the species number and the second subscript the individual observation for that species. Fig. 2 displays the individual regression lines (respecting the range of the data) for the 74 different species in the data set.
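The separate regressions fit can be sketched in Python along the following lines. The data are synthetic (the generator and all of its numbers are assumptions made only so the code runs, not the lecture's PLD data), and `fit_line` is a hypothetical helper.

```python
import numpy as np

def simulate_species_data(n_species=74, seed=0):
    """Synthetic stand-in for the (T, PLD) data; every number here is invented."""
    rng = np.random.default_rng(seed)
    groups = []
    for _ in range(n_species):
        n_i = int(rng.integers(2, 9))          # 2 to 8 observations per species
        b0 = 60.0 + rng.normal(0.0, 8.0)
        b1 = -1.5 + rng.normal(0.0, 0.3)
        temp = rng.uniform(5.0, 30.0, size=n_i)
        pld = b0 + b1 * temp + rng.normal(0.0, 2.0, size=n_i)
        groups.append((temp, pld))
    return groups

def fit_line(temp, pld):
    """Least-squares (intercept, slope) and the residual sum of squares."""
    X = np.column_stack([np.ones_like(temp), temp])
    beta, *_ = np.linalg.lstsq(X, pld, rcond=None)
    resid = pld - X @ beta
    return beta, float(resid @ resid)

groups = simulate_species_data()

# Unpooled: one regression per species, each using only that species' data.
fits = [fit_line(temp, pld) for temp, pld in groups]
intercepts = np.array([beta[0] for beta, _ in fits])
slopes = np.array([beta[1] for beta, _ in fits])
rss = np.array([r for _, r in fits])
n_obs = np.array([len(temp) for temp, _ in groups])

print("range of fitted slopes:", slopes.min(), slopes.max())
print("species with only two observations:", int(np.sum(n_obs == 2)))
```

Species with only two observations get a residual sum of squares of essentially zero, since a line through two points fits them exactly.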
- In order to compare the common pooling model and the separate regressions model we can use an information-theoretic measure such as AIC. To do this efficiently we can reformulate the separate regressions model so that it is written as a single equation as follows.
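- Create 73 dummy variables of the following form.

$$z_{k,ij} = \begin{cases} 1 & \text{if } i = k \\ 0 & \text{otherwise} \end{cases}$$

Here k ranges from 2 to 74, i is the species to which the observation belongs, and j ranges from 1 to $n_i$, where $n_i$ is the number of observations available for species i.
- Using these dummy variables we can write the 74 separate regressions as a single equation as follows.

$$PLD_{ij} = \beta_0 + \beta_1 T_{ij} + \sum_{k=2}^{74} \alpha_k z_{k,ij} + \sum_{k=2}^{74} \gamma_k z_{k,ij} T_{ij} + \varepsilon_{ij}$$

Recall that $z_{k,1j} = 0$ for all k and all j. So for instance species 1 has intercept β0 and slope β1, species 2 has intercept (β0 + α2) and slope (β1 + γ2), etc. In the usual fashion, the α and γ terms represent effects relative to the baseline species, species 1.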
- Observe that the separate regressions model estimates 149 parameters (74 intercepts + 74 slopes + 1 variance)! Because the models are nested we can compare them with a likelihood ratio test (better yet given that these are normal models, with a partial F test). We can also use AIC to compare models in the usual fashion:
- AIC(common pooling model) = –2 logLik(common pooling model) + 2(3)
- AIC(separate regressions model) = –2 logLik(separate regressions model) + 2(149)
So if the separate regressions model increases the loglikelihood by an amount that is sufficient to compensate for the large number of estimated parameters, we should prefer it over the common pooling model.
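The AIC comparison can be sketched in Python as follows. The data are synthetic (all numbers in `simulate_species_data` are invented for illustration), the function names are hypothetical, and the log-likelihood is the normal log-likelihood evaluated at the maximum likelihood estimates, with variance estimate RSS/n. The separate regressions model is fit through a single dummy-variable design matrix with the first species as baseline.

```python
import numpy as np

def simulate_species_data(n_species=74, seed=0):
    """Synthetic stand-in for the (T, PLD) data; every number here is invented."""
    rng = np.random.default_rng(seed)
    groups = []
    for _ in range(n_species):
        n_i = int(rng.integers(2, 9))
        b0 = 60.0 + rng.normal(0.0, 8.0)
        b1 = -1.5 + rng.normal(0.0, 0.3)
        temp = rng.uniform(5.0, 30.0, size=n_i)
        pld = b0 + b1 * temp + rng.normal(0.0, 2.0, size=n_i)
        groups.append((temp, pld))
    return groups

def rss_of_fit(X, y):
    """Residual sum of squares from a least-squares fit of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return float(resid @ resid)

def gaussian_loglik(rss, n):
    """Maximized normal log-likelihood, using the ML variance estimate rss / n."""
    return -0.5 * n * (np.log(2.0 * np.pi) + np.log(rss / n) + 1.0)

groups = simulate_species_data()
n_species = len(groups)
temp = np.concatenate([t for t, _ in groups])
pld = np.concatenate([y for _, y in groups])
species = np.concatenate([np.full(len(t), i) for i, (t, _) in enumerate(groups)])
n = len(pld)

# Common pooling model: intercept + slope + variance = 3 parameters.
X_pooled = np.column_stack([np.ones(n), temp])
rss_pooled = rss_of_fit(X_pooled, pld)
aic_pooled = -2.0 * gaussian_loglik(rss_pooled, n) + 2 * 3

# Separate regressions as one equation: 73 dummies plus 73 dummy-by-T terms.
Z = (species[:, None] == np.arange(1, n_species)[None, :]).astype(float)
X_sep = np.column_stack([np.ones(n), temp, Z, Z * temp[:, None]])
k_sep = X_sep.shape[1] + 1                     # 148 coefficients + 1 variance
rss_sep = rss_of_fit(X_sep, pld)
aic_sep = -2.0 * gaussian_loglik(rss_sep, n) + 2 * k_sep

print("AIC, common pooling model:", aic_pooled)
print("AIC, separate regressions model:", aic_sep)
```

Which model wins here depends entirely on the invented simulation settings, so the printed values say nothing about the real PLD data.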
Why are these approaches flawed and where do we go next?
- Both of these models are highly flawed. The separate regressions model is guilty of some serious overfitting with 149 estimated parameters. Furthermore any group with only two observations is fit perfectly by this model. The common pooling model on the other hand ignores the data structure and treats correlated observations as if they're independent.
- A hybrid approach is in order. The separate regressions model is attractive in that it gets the data structure right. It correctly identifies the proper units of analysis as the individual species, not the individual observations on that species. It avoids the correlation problem by essentially returning a "score" for that species—a vector consisting of the intercept and slope.
- If our goal is to estimate a common slope for all the species, the separate regressions model offers a solution. The slopes returned for individual species are independent. We could average these slopes in some fashion to obtain an estimate of the common slope. Of course a simple average is not adequate because different amounts of data went into the estimation of each slope. We should trust a slope based on 8 observations more than one based on just two. Furthermore regression diagnostics such as standard errors and coefficients of determination give us information as to how good the individual lines are. We should make use of this information when we try to estimate a common slope.
- In the early days of multilevel modeling, the approach just described was actually carried out and was known as "slopes as outcomes" analysis. It recognized that hierarchical data consist of levels and did a separate analysis at each level—first returning individual slopes and intercepts at the individual observation level and then combining these in some fashion at the species level. The problem with it is that it is extremely ad hoc and it doesn't allow us to simultaneously work with the individual slopes and intercepts (which are correlated) to find a common line. The preferred approach today is multilevel modeling and is described next.
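One way to average the individual slopes while accounting for how well each is estimated is inverse-variance weighting, sketched below in Python. This is an assumption on my part: the notes do not say which weighting scheme "slopes as outcomes" analyses used, and other weights are possible. The data are synthetic, the function names are hypothetical, and species with fewer than three observations are dropped because their slope standard error cannot be estimated.

```python
import numpy as np

def simulate_species_data(n_species=74, seed=0):
    """Synthetic stand-in for the (T, PLD) data; every number here is invented."""
    rng = np.random.default_rng(seed)
    groups = []
    for _ in range(n_species):
        n_i = int(rng.integers(2, 9))
        b0 = 60.0 + rng.normal(0.0, 8.0)
        b1 = -1.5 + rng.normal(0.0, 0.3)
        temp = rng.uniform(5.0, 30.0, size=n_i)
        pld = b0 + b1 * temp + rng.normal(0.0, 2.0, size=n_i)
        groups.append((temp, pld))
    return groups

def slope_and_variance(temp, pld):
    """OLS slope and its estimated sampling variance; needs at least 3 points."""
    n = len(temp)
    sxx = np.sum((temp - temp.mean()) ** 2)
    slope = np.sum((temp - temp.mean()) * (pld - pld.mean())) / sxx
    intercept = pld.mean() - slope * temp.mean()
    rss = np.sum((pld - intercept - slope * temp) ** 2)
    return slope, rss / (n - 2) / sxx

groups = simulate_species_data()

# Level 1: one slope (with its estimated variance) per usable species.
usable = [(t, y) for t, y in groups if len(t) >= 3]
results = [slope_and_variance(t, y) for t, y in usable]
slopes = np.array([s for s, _ in results])
variances = np.array([v for _, v in results])

# Level 2: combine the slopes, trusting precisely estimated ones more.
weights = 1.0 / variances
weighted_slope = np.sum(weights * slopes) / np.sum(weights)
simple_average = slopes.mean()

print("simple average of slopes:", simple_average)
print("inverse-variance weighted slope:", weighted_slope)
```

The sketch treats the estimated variances as if they were known and ignores the intercepts altogether, which is part of what makes this two-stage approach ad hoc.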
The multilevel model approach to hierarchical data
- The multilevel model approach to hierarchical data is a hybrid of the separate regressions model and the common pooling model. In this hybrid approach each model occupies a separate level of the analysis, but the two are linked together so that both models are estimated simultaneously.
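- The level-1 model corresponds to the separate regressions model and takes the following form.

$$PLD_{ij} = \beta_{0i} + \beta_{1i} T_{ij} + \varepsilon_{ij}, \quad \varepsilon_{ij} \sim N(0, \sigma^2)$$

Here $i = 1, \ldots, N$ and $j = 1, \ldots, n_i$, where $n_i$ is the number of observations for species i and N is the total number of species. For this example the level-1 model might also be called the individual (temperature) measurements level model.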
- Observe the subscript i that appears on β0 and β1. This indicates that each species (indexed by i) is allowed to have its own intercept and slope.
- The term $\varepsilon_{ij}$ is the usual error term in ordinary linear regression and represents the random variation of individual observations about the regression line. The $\varepsilon_{ij}$ at different temperatures and for different species are assumed independent.
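- The level-2 model is a model for the individual slopes and intercepts that appear in the level-1 equation. It essentially forms a bridge between the common pooling model and the separate regressions model. For this example the level-2 model might also be called the species-level model. It takes the following form.

$$\begin{aligned} \beta_{0i} &= \beta_0 + u_{0i} \\ \beta_{1i} &= \beta_1 + u_{1i} \end{aligned} \qquad \begin{pmatrix} u_{0i} \\ u_{1i} \end{pmatrix} \sim N\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \Sigma \right), \quad \Sigma = \begin{pmatrix} \tau_0^2 & \tau_{01} \\ \tau_{01} & \tau_1^2 \end{pmatrix}$$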
- The terms β0 and β1 are the values of the intercept and slope in the population and roughly correspond to the intercept and slope in the common pooling model. The terms $u_{0i}$ and $u_{1i}$ are the deviations of the intercept and slope for species i about these population values. Thus they are error terms comparable to $\varepsilon_{ij}$ in the level-1 equation. To distinguish the errors that appear at the different levels, the level-2 errors are also called random effects. The $u_{0i}$ are referred to as the intercept random effects and the $u_{1i}$ are referred to as the slope random effects.
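To see how the two levels are linked, it can help to substitute the level-2 equations into the level-1 equation. Writing $u_{0i}$ and $u_{1i}$ for the intercept and slope random effects of species i and $\varepsilon_{ij}$ for the level-1 error (the symbol names are a notational assumption here), the substitution gives a single composite equation:

```latex
\begin{aligned}
PLD_{ij} &= \beta_{0i} + \beta_{1i} T_{ij} + \varepsilon_{ij} \\
         &= (\beta_0 + u_{0i}) + (\beta_1 + u_{1i})\, T_{ij} + \varepsilon_{ij} \\
         &= \underbrace{\beta_0 + \beta_1 T_{ij}}_{\text{population line}}
            \;+\; \underbrace{u_{0i} + u_{1i} T_{ij}}_{\text{species-level deviation}}
            \;+\; \underbrace{\varepsilon_{ij}}_{\text{observation-level error}}
\end{aligned}
```

The first piece has the form of the common pooling model, while the second piece lets each species depart from it, as in the separate regressions model.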
- How do the multilevel model estimates of the individual slopes and intercepts differ from those obtained from the separate regressions model?
- The random effects, $u_{0i}$ and $u_{1i}$, represent deviations of the individual intercepts and slopes, $\beta_{0i}$ and $\beta_{1i}$, from the population values, β0 and β1, respectively.
- The values of the random effects are constrained in that they are assumed to arise from a common bivariate normal distribution with mean 0 and 2 × 2 variance-covariance matrix Σ (which is estimated from the data) as shown in the level-2 equation above. Thus unlike the separate regressions model, the slopes and intercepts of the individual regression lines in the multilevel model are not free to be anything. This is because there is a common population linear model about which the individual regression lines vary subject to constraints imposed by the covariance matrix of the bivariate normal probability model.
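- The random effects from different species are independent of each other. Thus for any two different species $i \neq k$ we have $\mathrm{Cov}(u_{0i}, u_{0k}) = 0$, $\mathrm{Cov}(u_{1i}, u_{1k}) = 0$, $\mathrm{Cov}(u_{0i}, u_{1k}) = 0$ and $\mathrm{Cov}(u_{1i}, u_{0k}) = 0$.
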
- If we drop just the slope random effect $u_{1i}$ from the level-2 equations then we have a random intercepts model. In this model all species have a regression line with the same slope (the slope of the population model) but each line is allowed to have a separate intercept. The random intercepts model is often a viable alternative to the random slopes and intercepts model. In fact, as we'll see later, there is evidence for these data that the random intercepts model is the more parsimonious choice.

- If we drop just the intercept random effect $u_{0i}$ from the level-2 equations then we have a random slopes model. All the individual regression lines have a common intercept but different slopes. This model rarely makes sense.
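To make the two-level structure concrete, the following Python sketch simulates data from a random slopes and intercepts model and then shows the random intercepts special case. It only generates data; it does not estimate the model. All parameter values (the population intercept and slope, the covariance matrix `Sigma`, the level-1 standard deviation) and the function names are assumptions chosen for illustration.

```python
import numpy as np

def draw_random_effects(rng, n_species, Sigma):
    """Draw (u0i, u1i) independently for each species from N(0, Sigma)."""
    return rng.multivariate_normal(np.zeros(2), Sigma, size=n_species)

def simulate_multilevel(rng, beta, u, sigma, n_per_species):
    """Generate (T, PLD) pairs per species from the level-2 and level-1 equations."""
    data = []
    for (u0, u1), n_i in zip(u, n_per_species):
        temp = rng.uniform(5.0, 30.0, size=int(n_i))
        b0_i = beta[0] + u0                    # level 2: species intercept
        b1_i = beta[1] + u1                    # level 2: species slope
        eps = rng.normal(0.0, sigma, size=int(n_i))
        data.append((temp, b0_i + b1_i * temp + eps))   # level 1
    return data

rng = np.random.default_rng(42)
n_species = 74
beta = np.array([60.0, -1.5])                  # assumed population intercept, slope
Sigma = np.array([[64.0, -1.2],
                  [-1.2, 0.09]])               # assumed 2 x 2 covariance matrix
n_per_species = rng.integers(2, 9, size=n_species)

# Random slopes and intercepts model.
u = draw_random_effects(rng, n_species, Sigma)
data = simulate_multilevel(rng, beta, u, sigma=2.0, n_per_species=n_per_species)

# Random intercepts model: drop the slope random effect, so every species
# shares the population slope but keeps its own intercept.
u_ri = u.copy()
u_ri[:, 1] = 0.0
data_ri = simulate_multilevel(rng, beta, u_ri, sigma=2.0, n_per_species=n_per_species)
species_slopes_ri = beta[1] + u_ri[:, 1]

print("species slopes, full model:", (beta[1] + u[:, 1])[:5])
print("species slopes, random intercepts model:", species_slopes_ri[:5])
```

Because the species-level coefficients are draws from one bivariate normal distribution, they cluster around the population line instead of being free to take any value, which is the constraint the separate regressions model lacks.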
