gllamm for complex problems

General notion

gllamm stands for Generalized Linear Latent And Mixed Models. The primary aims of this software is to provide a maximum likelihood framework for models with unobserved components, such as multilevel models, certain latent variable models, panel data models, or models with common factors.

The official website of the software is http://www.gllamm.org maintained by one of the authors, Sophia Rabe-Hesketh. She and other authors of gllamm, Anders Skrondal and Andrew Pickles, published a book "Generalized Latent Variable Modelling".

gllamm is implemented as a system of Stata modules. To install it on your Stata, type

net from http://www.gllamm.org
net install gllamm, replace

The website also details the current version of the software that can be checked with which Stata command.

Statistical background

Let us consider each of the letters of the word gllamm in turn.

The first letter g means the same as "g" in GLM, generalized linear models. They extend the linear regression concept to some other models such as binary dependent variable, multinomial or ordinal dependent variable, and Poisson regressions. This is done by noting that all of them share a similar form of the density of the response variable:

nice picture for exponential family here

There are many reasons why this is a nice form of the density, including existence of the sufficient statistic, straightforward maximum likelihood estimation, means and variances expressed as functions of theta and phi, and fit / outlier residual diagnostics. The important concepts that the user of gllamm must know to proceed further are link function and family of distributions. The link function is the function that maps the mean μ of the response to the linear predictor η=x'β: η = g(μ), and the inverse link is μ = g-1(η). It can be as simple as an identity link used in linear regression: η=μ, or something more involved as logit link:

nice picture for logit link

The family of distributions deals with the response variable, so it is Gaussian distribution for a linear regression / continuous response, binomial distribution for binary response (note that in this case a variety of links is possible: logit, probit, complementary log-log / Gompertz / extreme value), and Poisson for count variables (gllamm currently does not support the negative binomial link, which is another popular option for count outcomes). gllamm also supports mulinomial logit and ordered logit / probit distributions / links, which are not normally viewed as a part of GLM framework, but share some similarities with them.

For further reading, see McCullagh and Nelder (1989) as the basic classic textbook on the topic, and Hardin and Hilbe (2001) for Stata implementation.

The second letter l is responsible for linearity that is being assumed between the mean and the explanatory variables. As was mentioned earlier, the linearity of the model enters through the linear predictor η=x'β. That is not as restrictive as it may seem, since nothing precludes the researcher from using semiparametric specifications such as polynomial terms or splines. (For the multinomial model, the linear predictor would be the mean utility associated with a particular outcome; it will also give the predicted probability of that outcome through the link function.)

The middle lam part of gllamm stands for Latent And Mixed. The terminology of latent variables is more often used in social sciences for models with common factors, measurement models, and SEMs, while mixed models is a notion of multilevel/hierarchical models common in biostatistics. Any of those approaches implies that there are some unobserved variables that enter the linear predictor part additively. In the simplest case of a single random effect, this can be represented as

ηij = xij + ui
where i indexes groups, or clusters, of observations, and j indexes the observations within that cluster. The multilevel/hierarchical and mixed models can be written down as
ηij = xij + zij'ui
which may have a variety of meanings: gllamm is flexible enough to estimate all those (and some other) models; for the measurement and factor models, some special data handling tricks may have to be used --- see below.

Unfortunately, as long as the u's are unobserved, their contribution to the likelihood is represented as an integral over their distribution. The integrand terms are the conditional probabilities / densities / likelihoods from the GLM model for a given u. Hence each evaluation of the likelihood involves a lot of numerical integration steps, and during the iterative maximization, there will be hundreds of such calls. This is the main reason gllamm is so slow.

The numeric integration is performed by Gaussian quadrature, adaptive quadrature (Rabe-Hesketh, Skrondal and Pickles 2002), or semi-parametric discrete distribution similar (up to certain scaling conventions) to the discrete factor models in econometrics. A number of options is available to tune the performance and speed of gllamm.

Data preparation, syntax and options

The most basic type of a model gllamm is especially easy to apply is the repeated measurement / multiple observations / panel format: several observations per same individual, or several individuals treated as a cluster. Note that gllamm is indifferent to whether the data set is balanced on not. This is pretty much irrelevant in estimation it performs.

Multilevel models

Suppose the data set consists of measurements of standardized test scores score of students. The students are nested in classrooms class, the classrooms are in turn nested in schools school, and schools are nested in districts distr. Suppose also that the student characteristics available to the researcher are race black and other, and a measure of socio-economic status ses. The two simplistic models a researcher can run are
gllamm score , i(class school distr)
for the basic ANOVA model with nested random factors, or "unconditional hierarchical linear model" in Raudenbush and Bryk (2002) terminology. The output will show the variances of effects at each level (so that the researcher can see whether classrooms within schools are indeed different from one another, the schools within a district are similar to one another, and districts themselves are similar to one another); or a regression model with mutlilevel random effects
gllamm score black other ses, i(class school distr)
We have omitted the default options for family and link, which are Gaussian response and identity link, respectively. If the only data that we had were whether the student passed or failed the test pass, then the call would look like
gllamm pass black other ses, i(class school distr) link(logit) family(bin)

Random coefficients

Let us now ignore for simplicity all the levels except the students and schools. A simple enough 2-level model that is easy to estimate with gllamm is the random coefficient model that relaxes the assumption that the effect of race and socio-economic status are the same for all schools. They can be allowed to vary, and the way to indicate that is through the level-2 equations (i.e. equations for the second level units, here, schools; the students are level-1 units). In gllamm framework, one can specify those with eq command that was deemed obsolete as of Stata 6:
gen byte one = 1
eq cons : one
eq black : black
eq other : other
eq ses : ses
gllamm score black other ses, i(school) nrf(4) eqs(black other ses cons)
The added options are: nrf() that says we now have four random effects in the model; and eqs() shows what are the variables a particular random effect is multiplying. The first random effect multiplies the variable black, and thus it is the random component of the black coefficient that is shared by all students in the same school. The last random effect is a random intercept. Note that is now the responsibility of the researcher to indicate that intercepts can be random. Once eqs is specified, gllamm no longer thinks of the model in terms of random intercepts (and there are good reasons to behave like that).

Note that the variances and covariances of the random effects are not so straightforward to interpret as they were in the random intercept only models. It is easy to see (see Fig. 3.1 of Skrondal and Rabe-Hesketh (2004)) that they would depend on such rather arbitrary things as the centering of explanatory variables.

Now, what would be the ways to incorporate school-level characteristics into analysis? Suppose we have a measure like the number of computers per student comp. A straightforward way would be to include this variable directly into the linear predictor part, saying that a given number of computers affects the score directly:
gllamm score black other ses comp, i(school)
On the other hand, as long as it is a school-level characteristic, it may be affecting only the school-level variables. In the current model, such variable is the random effect of the school. The interpretation that can be suggested is that the number of computers per student is a proxy for overall quality of the school.
eq f1: comp one
gllamm score black other ses, i(school) geqs(f1) nocons
The option geqs shows the "generalized" equations, i.e., the regressions for latent variables. The interpretation of equation f1 is now different; comp and one are explanatory variables in that regression. Here we only have one random variable, the random effect of the school, and the second symbol in the equation name shows that it refers to the 1st latent variable in the model. The option nocons lets gllamm know that the constant terms should not be used in the regression: it enters the model through the constant term of the equation f1 (so an alternative way to identify the constant is to omit one from equation f1).

We can also think of a two-level model where the random coefficients may depend on the number of computers:
eq f1: comp
eq f2: comp
eq f3: comp
eq f4: comp
gllamm score black other ses, i(school) ///
  nrf(4) geqs(f1 f2 f3 f4) eqs(black other ses one)
Note again that there are two ways to deal with the fixed effects, and here we have chosen to specify them through the regression equation. If we did otherwise (including one into all of the four "generalized" equations), we would have to specify nocons option, as well as omit the variables from the regression equation:
eq f1: comp one
eq f2: comp one
eq f3: comp one
eq f4: comp one
gllamm score , i(school) nocons ///
  nrf(4) geqs(f1 f2 f3 f4) eqs(black other ses one)

Multiple equations

An important thing that is to be remembered about gllamm is that it assumes only one response variable. This works perfectly for the hierarchical linear models, but other types of models where there are several response variables require certain modifications of the data so that the model looks like a single dependent variable, however complicated that model would be. This is typically done with Stata's reshape command, or some manual work that is roughly equivalent to it. A part of that work is to create a variable that identifies equations; dummy variables corresponding to it; and interactions of actual regressors with those dummies.

Suppose the model consists of three equations:

y1 = β01 + β11x1 + β21x2 + λ1ui + &epsilon1;
y2 = β02 + β12x1 + β13x3 + λ2ui + &epsilon2;
y3 ~ Bernoulli ( Φ( β03 + β14x4) )
Then the appropriate steps in Stata may look as follows:
gen long id = _n
expand 3
bysort id : gen v = _n
g resp = .
replace resp = y1 if v==1
replace resp = y2 if v==2
replace resp = y3 if v==3
tab v , gen(s)
foreach x of varlist x1-x4 {
   forvalues i=1/3 {
     g `x'_`i' = s`i'*`x'
   }
}
Let us now incorporate the random structure into our gllamm model. The following would be a good starting point for your gllamm adventure:
eq het : s1 s2
eq load: s1 s2 s3
gllamm resp x1_1 x2_1 x1_2 x3_2 x4_3 s*, ///
   link(id id probit) lv(v) fam(gauss gauss bin) fv(v) ///
   s(het) i(id) eq(load) nocons
The first part specifies the regression part of the model, i.e. the correspondence between the response and the explanatory variable. lv() and fv() options indicate to gllamm that different observations may have different families and links depending on the value of the variable v. So, when v==1, the link is identity, and the family is gaussian. Note that at the same time resp is effectively regressed on x1_1, x2_1 and s1 which plays a role of the constant. All other explanatory variables are zero when v1==1. Also note that the variance (scale) of the response is modelled by equation het. As long as s* are dummy variables, this equation prescribes that there are three response variables in the model that are allowed to have three different variances. (Actually, as long as the scale of the probit equation is set to 1, there is no need to specify s3 in the het equation.) Three different combinations of s1, s2 and s3 correspond to three values of v. Note also that het equation does not contain a constant because each equation has its own separate constant s*.

Integration options

There are three integration methods provided: Gaussian quadrature, adaptive quadrature, and discrete distribution with freely estimated point masses. The degree of accuracy is controlled by the number of integration points, which is specified as nip(#, #, ...). Each number corresponds to the particular latent variable / random effect. It should be kept in mind that the computation time is roughly proportional to the product of those numbers of integration points, so highly multidimensional problems should be first tackled with a very low number of points, say 4 for each factor. If convergence can be achieved with that, the coefficient estimates should be saved, and gllamm can be restarted with more integration points:
gllamm ... , nrf(4) nip(4, 4, 4, 4) adapt
mat bb = e(b)
gllamm ... , nrf(4) nip(8, 8, 8, 8) adapt from(bb) copy
It almost always seems to be a good idea to specify adaptive integration rather than straightforward Gaussian integration. Even though some of the computational time is spent on finding the optimal reconfiguration of the integration points, the integrals come out to be more accurate.

The freely estimated point masses can be requested by ip(f). It is not quite the same as discrete factor models popular in econometrics. gllamm ..., ip(f) allocates masses in such a way that their mean is zero, and their variance is free, whilst the discrete factor models restrict the lowest value to be put at zero, and the largest value, at 1. Thus the two methods would have different means and variances which translates directly into the scales of the coefficient estimates.

High performance computing

As far as gllamm is so time consuming, the researcher needs to make sure that he uses the best computational resources available. Fortunately, UNC has a strong support for statistical and scientific computation. The three most powerful machines are StatApps Solaris cluster (24 SUN 1.05GHz processors, 48 GB of RAM), Baobab Linux cluster (352 processors , at least 2 Gb RAM per each processor), Yatta (IBM P690 with 32 CPUs and 128GB of Memory) and Chastity IRIX cluster (64 processors MIPS R14000 Processors @ 500MHz, 64 Gb of RAM). See High Performance Computing web-page for a description of the procedures required to obtain an account, upload your data and submit your job for processing.

Examples

Caution: The examples here are not meant to represent any reasonable analysis in terms of the substantive area, but rather the examples of syntax of gllamm that were verified to run without software errors.

Multilevel models with educational data

The analysis follows examples in Raudenbush and Bryk (2002) textbook. The data set contains math achievement results for 7185 students nested in 160 schools, along with their SES, and school-averaged SES.
use http://www.ats.ucla.edu/stat/paperexamples/singer/hsb12.dta

The first analysis is an ANOVA aimed at finding out whether there are differences between students within a school, and between schools.
mat b1 = [12, 2, 8]
gllamm math , i(school) from(b1) copy

The average score is 12.84, with a substantial variation within a school (s.d.=6.27) and between schools (s.d.=2.62). Can this variation be explained in any manner?

The first argument can be put forward is that the performance difference is due to the fact that quality of schools differs because they serve different neighborhoods. The latter can be measured by the average socio-economics status, and the way it is introduced into the model at the school level is through the school means (means-as-outcomes model):
g byte one = 1
eq f1 : one meanses
mat b2 = [2, 6, 12, 6]
constr 77 [mathach]_cons=0
gllamm math , i(school) geqs(f1)  diff adapt constr(77) from(b2) copy
The peculiar constraint given to gllamm serves the identification purpose. Without it, gllamm is going to estimate both the grand mean and the constant in the regression for the latent variable, which in this case is the school mean of the test.

The estimation results:

Notice that the individual student variance remained essentially the same, but the variance between schools has decreased dramatically, so that the R2 in the regression for the latent variable can be found as 1-2.593/6.841=0.62. The mean is a little bit different at 12.64, and the slope for the SES in the level-2 regression is 5.86. The formal tests of the difference between this model and means only model is either through a t-test on the regression slope, or the likelihood ratio test from two models.

Alternatively, one can say that the difference in performance can be explained by the student level characteristics. The students of different socio-demographic characteristics enjoy different returns to their characteristics in different schools; in other words, there are between-school differences in the slopes of individual student characteristics. The way the SES is recorded in this data set is in deviations from the school mean, so with cses variable, only students within the same school can be compared.
* Random coefficients model
eq cses: cses
eq one: one
gllamm math cses, i(school) eqs(one cses)  diff adapt nrf(2)

Due to the way the data are centered, the constant in the regression is the grand mean that we know to be about 12.64. The individual SES is on average strongly associated with higher math grades. The cses measurement are roughly normalized on 1, and the standard deviation within schools is usually between 0.6 and 0.7. The variability in this coefficient between schools is given by the variance of the second latent variable (note the reverse order of coefficients in the level-1 and level-2 regressions... that was an omission on the author's side!). Thus the confidence interval for the cses in a school from this population of schools can be constructed as 2.193 ± 1.96(0.6782+0.1279^2)1/2 = 2.193 ± 1.633 = (0.56, 3.826). Even in the most equitable schools, there still would be some differences between high and low SES students (about 2 point on math score scale).

In general, the variances and covariances of the random effects should be interpreted with a great caution because of lack of invariance to the change of the scale. In this case, due to centering of cses, the entries of the variance-covariance matrix are informative of the following. The variance of slopes is given by var(2), and we have already used it in the above calculations. The variance of intercepts conditional on zero value of cses is given by var(1), and in fact it is greater than what was observed before. The correlation between the random intercepts and the random slopes is insignificantly different from zero. If it were significantly positive, for instance, then we could say that the schools that are better in terms of the average score also tend to have higher returns to individual SES (which would probably be an undesirable result for educational policy: better schools seem to be better because they are providing a better training to their top students only).

The computational time for this whole example was about 5 hours on Baobab cluster.

Selectivity model: treatreg by gllamm

The analysis is based on the part of NLSY data available in Stata through
use http://www.stata-press.com/data/r8/nlswork.dta
This is a panel data set on 4148 individuals observed in 15 time periods between 1968 and 1988. The main preparation steps include
keep if year == 88
xi i.race
The first analysis is a straightforward use of Stata's existing tools. It took Stata 10 seconds to find the MLE.
Note that the link between the two equations (lambda) is not statistically significant, and the selectivity variable (collgrad) is not significant in the main regression, either. We'll see how it may pose difficulties for gllamm.

Preparing the data for gllamm:
* preparing for gllamm
mat bb = e(b)
* from treatreg estimation
expand 2
bysort id year : g byte var = _n
g resp = ln_wage if var == 1
replace resp = coll if var == 2
foreach x of varlist  age grade south _Irace_2 _Irace_3 tenure coll {
  g v1_`x' = `x'*(var == 1)
  g v2_`x' = `x'*(var == 2)
}
g v1 = (var == 1)
g v2 = (var == 2)
Call to gllamm:
gllamm resp v1_tenure v1__Irace_2 v1__Irace_3  v1_grade v1_coll v1 ///
    v2__Irace_2 v2__Irace_3 v2_south v2 , /// 
    fv(var) lv(var) link(id probit) fam(gaus bin) nip(4) adapt ///
    i(id) from(bb) copy nocons difficult

The main regression estimates are identical, but selection equation is a little bit different. Note that the variance of the random effect is insignificantly different from zero, which is an alternative indication of the independence of two equations. This time, MLE took a bit over 10 minutes (606 seconds).

On a second run, we shall use more integration points, and use the current estimates as the starting values:
mat bb1 = e(b)
gllamm resp v1_tenure v1__Irace_2 v1__Irace_3  v1_grade v1_coll v1 ///
    v2__Irace_2 v2__Irace_3 v2_south v2 , /// 
    fv(var) lv(var) link(id probit) fam(gaus bin) nip(8) copy ///
    adapt i(id) from(bb1) nocons difficult

The estimates have hardly changed. It took Stata and gllamm 384 sec. to figure it out.

The final model in this example is the discrete factor model. It was tried with a number of settings, and the best performing one was found to be the model with just two mass points.
gllamm resp v1_tenure v1__Irace_2 v1__Irace_3  v1_grade v1_coll v1  ///
   v2__Irace_2 v2__Irace_3 v2_south v2 ,  ///
   fv(var) lv(var) link(id probit) fam(gaus bin) ip(f) nip(2) ///
   i(id) nocons difficult 

Computation time was 230 seconds. Note that of the two mass points specified, the bulk of the probability mass is given to only one of them, which again is evidence that the two equations (selection and regression) are not particularly strongly related. Sometimes it is argued that the discrete factor model can serve as a source of identification when regressors are the same in both equations, but that's a very weak form of identification that cannot be relied upon exclusively.

Note also that the model is parameterized differently from the one in treatreg. The variance identification condition imposed by treatreg is that the overall variance of the error term in the selection equation is equal to one, while gllamm has to assume that the variance of the unique part of that error term is equal to one, and hence the total variance is but 1 + Var(u).

Sensitivity analysis for the number and location of point masses can be performed with gateaux derivative option.

References

Hardin, J., and Hilbe, J. (2001) Generalized Linear Models and Extensions. Stata Press, College Station, TX.

McCullagh, P., and Nelder, J. A. (1989) Generalized Linear Models. 2nd ed. Chapman and Hall, London.

Skrondal, A., Rabe-Hesketh, S., and Pickles, A. (2002). Reliable estimation of generalized linear mixed models using adaptive quadrature. Stata Journal, 2 (1), 1--21.

Rabe-Hesketh, S., and Skrondal, A. (2004) Generalized Latent Variable Modeling. CRC/Chapman and Hall.

Rabe-Heskteh, S., Skrondal, A., and Pickles, A. (2004) GLLAMM manual. U.C.Berkeley Division of Biostatistics Working Paper No. 160. Raudenbush, S., and Bryk, A. (2002) Hierarchical Linear Models, 2nd edition. Sage.

Questions? Comments?

Send an email to Stas Kolenikov, skolenik at unc.edu.