## Soci709 (former 209) Module 12 - HETEROSCEDASTICITY

(Also spelled heteroskedasticity.)
Resources
ALSM5e pp. 116--119, 421--431; ALSM4e pp. 112--115, 400--409.
STATA reference manual [R] regression diagnostics, [R] regress

### 1.  NATURE OF HETEROSCEDASTICITY

Heteroscedasticity refers to unequal variances of the error ei for different observations.  It may be visually revealed by a "funnel shape" in the plot of the residuals ei against the estimates ^Yi or against one of the independent variables Xk.  Effects of heteroscedasticity are the following
• heteroscedasticity does not bias OLS coefficient estimates
• heteroscedasticity means that OLS standard errors of the estimates are incorrect (often underestimated); therefore statistical inference is invalid
• heteroscedasticity means that OLS is not the best ( = most efficient, minimum variance) estimator of b

### 2.  FORMAL DIAGNOSTIC TESTS FOR HETEROSCEDASTICITY

There are many diagnostic tests for heteroscedasticity.  Tests vary with respect to the statistical assumptions required and their sensitivity to departure from these assumptions (robustness).

#### 1.  (Optional) Brown-Forsythe Test

##### Properties
This test is robust against even serious departures from normality of the errors.
##### Principle
Find out whether the error variance si2 increases or decreases with values of an independent variable Xk (or with values of the estimates ^Y) by the following procedure:
1. split the observations into 2 groups: one group with low values of Xk (or low values of ^Y) and another group with high values of Xk (or high values of ^Y)
2. calculate the median value of the residuals within each group, and the absolute deviations of the residuals from their group median
3. then do a t-test of the difference in the means of these absolute deviations between the two groups; the test statistic is distributed as t with (n-2) df where n is the total number of cases
An example is shown at the following link:
Exhibit:  brown-Forsythe test with the Afifi & Clark depression data

#### 2.  Breusch-Pagan aka Cook-Weisberg Test

##### Properties
This is a large sample test; it assumes normality of errors; it assumes si2 is a specific function of one or several Xk.
##### Principle
Compare the SSR from regressing ei2 on the Xk to SSE from regressing of Y on the Xk, with each SS divided by its df; resulting ratio is distributed as c2 with p-1 df.
This is a large-sample test that assumes that the logarithm of the variance s2 of the error term ei is a linear function of X.
The B-P test statistic is the quantity
c2BP = (SSR*/(p-1) / (SSE/n)2
where
SSR* is the regression sum of squares of the regression of e2 on the Xk
SSE is the error sum of squares of the regression of Y on the Xk
When n is sufficiently large and s2 is constant, c2BP is distributed as a chi-square distribution with 1 df.  Large values of c2BP lead to the conclusion that s2 is not constant.
##### B-P Test in STATA
STATA calls it the Cook-Weisberg test.  The test is obtained with the option hettest used after regress.  The STATA manual states
hettest [varlist] performs 2 flavors of the Cook and Weisberg (1983) test for heteroscedasticity.  This test amounts to testing t=0 in Var(e) = s2exp(zt).  If varlist is not specified, the fitted values are used for z.  If varlist is specified, the variables specified are used for z.
##### References
This test was developed independently by Breusch and Pagan (1979) and Cook and Weisberg (1983).
• Cook, R. D. and S. Weisberg.  1983.  "Diagnostics for Heteroscedasticity in Regression."  Biometrika 70:1-10.
• Breusch, T. S. and A. R. Pagan.  1979.  "A Simple Test for Heteroscedasticity and Random Coefficient Variation." Econometrica 47:1287-1294.

#### 3.  (Optional) Goldfeld-Quandt Test

##### Properties
Test does not assume a large sample.
##### Principle
Sort cases with respect to variable believed related to residual variance; omit about 20% middle cases; run separate regressions in the low group (obtain SSElow) and high group (obtain SSEhigh); test F-distributed ratio SSEhigh/SSElow with (N-d-2p)/2 df in both the numerator and the denominator (where N is the total number of cases, d is the number of omitted cases, and p is the total number of independent variables including the constant term).
##### Reference
• Wilkinson, Blank, and Gruber (1996:274-277).

### 3.  REMEDIAL APPROACH I: TRANSFORMING Y

If heteroscedasticity is found the first strategy is to try finding a transformation of Y that stabilizes the error variance.  One can try various transformations along the ladder of powers or estimate the optimal transformation using the Box-Cox procedure.  One variant of the Box-Cox procedure automatically finds the optimal transformation of Y given a multiple regression model with p independent variables.  (See STATA reference [R] boxcox.  Note that transforming Y can change the regression relationship with the independent variables Xk.

### 4.  (Optional) REMEDIAL APPROACH II: WEIGHTED LEAST SQUARES (WLS)

Weighted least squares is an alternative to finding a transformation that stabilizes Y.  However WLS has drawbacks (explained at the end of this section).  Because of this the robust standard errors approach explaine in Section 5 below has become more popular.

#### 1.  Principle of WLS

Unequal error variance implies that the variance-covariance matrix of the errors ei, s2{e} =
 s12 0 ... 0 0 s22 ... 0 ... ... ... ... 0 0 ... sn2
is such that the variance si2 of ei may be different for each observation.  Errors are still assumed uncorrelated across observations.  Hence the off-diagonal entries of s2{e} are zeroes and the matrix is diagonal.
Assume (for sake of argument) that the si2 are known.
Then the weighted least squares (WLS) criterion is to minimize
Qw = Si=1 to n wi(Yi - b0 - b1Xi1 - ... - bp-1Xi,p-1)2
where the weights wi=1/si2 are inversely proportional to the si2; thus WLS gives less weight to observations with large error variance, and vice-versa.

#### 2.  WLS in Practice

##### 1.  Estimating the si2
In practice the si2 (and the weights wi) are not known and must be estimated.  The general strategy for estimating the si2 (and wi) is
• estimate the regression of Y on the Xk with OLS and obtain the residuals ei; then
• ei2 is an estimator of si2
• |ei| (the absolute value of ei) is an estimator of si
• on the basis of visual evidence (residual plots), regress either ei2 (to estimate the variance function) or |ei| (to estimate the standard deviation function) on
• one Xk, or
• several Xk, or
• ^Y (from the OLS regression), or
• a polynomial function of any of the above
• the fitted value (estimate) from the regression is an estimate
• ^vi of the variance si2 (if dependent variable is ei2 ), or
• ^si of the standard deviation si (if dependent variable is |ei|)
• calculate the weights wi as either
• wi = 1/(^si)2 (if ^si was estimated), or
• wi = 1/^vi (if ^vi was estimated)
##### 2.  Estimating the WLS Regression
Having estimated the wi, the WLS regression can be done either
• using a WLS-capable program, by simply providing the program with a variable containing the weights, say w; the program automatically minimizes Qw; for example, in SYSTAT enter the command weight=w prior to the regression
• using OLS, by multiplying each variable (both dependent and independent, including the constant) by the square root of the wi corresponding to a given observation and running an OLS regression without a constant with the transformed data
These steps can be iterated more than once until the estimates converge (= Iteratively Reweighted Least Squares - IRLS).
##### Example 1
The following exhibits replicate the analysis of blood pressure as function of age in ALSM5e pp. <>; ALSM4e pp. <406-407>.
##### Example 2
The following exhibit carries out a WLS analysis of the depression model with the Afifi & Clark data.

#### 3.  Weighted Least Squares (WLS) as Generalized Least Squares (GLS)

In this section we show that WLS is a special case of a more general approach called Generalized Least Squares (GLS).
##### 1.  Matrix Representation of WLS
Assume the variance-covariance matrix of e, s2{e} as above, with diagonal elements si2 and zeros elsewhere.
The matrix W of weights wi = 1/si2 is defined as W =
 w1 0 ... 0 0 w2 ... 0 ... ... ... ... 0 0 ... wn
Then the WLS estimator of b, bW is given by
(X'WX)bW = X'WY   (normal equations)
bW = (X'WX)-1X'WY
Likewise one can show that
s2{bW} = s2(X'WX)-1
s2{bW} = MSEW(X'WX)-1
MSEW = Swi(Yi - ^Yi)2/(n - p)
The WLS estimates can also be obtained by applying OLS to the data transformed by the "square root" W1/2 of W, where W1/2 contains the square roots of the wi on the diagonal, and zeros elsewhere.
Since W1/2 is symmetric and W1/2W1/2 = W, it follows that
((W1/2X)'(W1/2X))-1(W1/2X)'(W1/2Y)
= (X'W1/2W1/2X)-1(X'W1/2W1/2Y)
= (X'WX)-1(X'WY) = bW
Thus one can obtain bW by multiplying Y and X by the square root of the weight and applying OLS to the transformed data.
##### 2.  WLS is a Special Case of Generalized Least Squares (GLS)
The standard regression model Y = Xb + e assumes that the variance-covariance matrix of the ei is scalar, that is E{ee'} = s2I.  Then the OLS estimator
b = (X'X)-1X'Y
has variance matrix
s2{b} = E{bb'} = E{(X'X)-1X'YY'X(X'X)-1}
s2{b} = (X'X)-1X'E{YY'}X(X'X)-1
s2{b} = (X'X)-1X'E{ee'}X(X'X)-1
When the error variance is the same for all observations (homoscedasticity) then the well-known result for OLS follows:
s2{b} = (X'X)-1X's2IX(X'X)-1  (because E{ee'} = s2I)
s2{b} = s2(X'X)-1X'X(X'X)-1
s2{b} = s2(X'X)-1  (after cancellation)
And the covariance matrix of errors is estimated as before as
s2{b} = MSE(X'X)-1  (estimating s2 as MSE)
and the OLS estimator b is the BLUE of b by the Gauss-Markov theorem.
When E{ee'} is not scalar, it must be represented as E{ee'} = W where W is a (positive definite) symmetric matrix.  Then OLS is no longer the BLUE of b.  Instead, Aitken's (or Generalized Least Squares) theorem states that the BLUE of b is
bGLS = (X'W-1X)-1X'W-1Y
where bGLS is termed the generalized least squares (GLS) estimator.
The matrix W is usually unknown.  When it is possible to estimate W from the data, the resulting estimator is
bEGLS = (X'^W-1X)-1X'^W-1Y
where ^W denotes the estimated matrix W. bEGLS is termed the estimated generalized least squares (EGLS) or feasible generalized least squares (FGLS) estimator.
It may be possible to derive a "square root" of ^W-1, i.e. a symmetric matrix ^W-1/2 such that (^W-1/2)(^W-1/2) = ^W-1.  Then an alternative procedure for EGLS estimation is to premultiply X and Y by ^W-1/2 and use OLS with the transformed data.
In practice, GLS (or EGLS/FGLS) is used when one has an a priori hypothesis concerning the structure of W.  For example
• in the heteroscedasticity case one assumes that W is a diagonal matrix with elements si2 repressenting the error variance for observation i.  Then one only has to estimate the n error variances si2 to estimate W.  One can see that WLS is a special case of EGLS, with ^W-1 = W.
• in regression models for time series data with a first order autoregressive error structure the entries of the W matrix decrease exponentially away from the diagonal (see Module 14).  On the basis of this systematic pattern one can estimate the matrix W and estimate b by EGLS.
• in regression models for panel data in which one has t observations over time on n individual units, one assumes that the error terms contains components that are specific to each unit and/or each time period.  Then W has a distinctive block-diagonal structure that can be reconstructed by estimating a small number of parameters.  Again one can estimate W and estimate b by EGLS.

#### 4.  Recommendations on WLS

The WLS approach to heteroscedasticity has at least two drawbacks.
1. WLS usually necessitates strong assumptions about the nature of the error variance, e.g. that it is a function of particular X variable or of ^Y.  Sometimes the assumption appears reasonable (e.g., error variance is proportional to population size, when the units are areal units); other times it is not.
2. WLS produces an alternative unbiased estimate of b; but the OLS estimate is also unbiased.  When bOLS and bWLS differ, which one should one choose?
Today researchers tend to prefer the robust standard errors approach to heteroscedasticity explained next.

### 5.  REMEDIAL APPROACH III:  ROBUST STANDARD ERRORS

The following discussion relies heavily on Long and Ervin (2000).

#### 1.  Principle of Robust Standard Errors

When heteroscedasticity is present transforming the variables or the use of WLS may be undesirable when
• a transformation of the variables that stabilizes the variances cannot be found
• a suitable transformation is found, but the resulting non-linear model is difficult to interpret substantively
• the weights to use in WLS cannot be found, as when the functional form of the heteroscedasticity is not known
The alternative strategy can be used even when the form of the heteroscedasticity is unknown.  It consists of
1. estimating b using OLS as usual
2. use a heteroscedasticity consistent covariance matrix (HCCM) to estimate the standard errors of the estimates; these standard errors are then called robust standard errors
There are 3 variants of the strategy, labelled HC1, HC2, and HC3.  To explain the principle of HCCM start with the usual multiple regression model
Y = Xb + e
where E{e} = 0 and E{ee'} = W is a positive definite matrix.
Then the covariance matrix of the OLS estimate b = (X'X)-1X'Y is
s2{b} = (X'X)-1X'WX(X'X)-1
When the errors are homoscedastic, W = s2I and the expression for s2{b} reduces to the usual
s2{b} = s2(X'X)-1
OLSCM = s2{b} = MSE(X'X)-1  (where MSE = Sei2/(n-p))
OLSCM denotes the usual OLS covariance matrix of estimates.

#### 2.  Huber-White Robust Standard Errors HC1

The basic idea of robust standard errors is that when the errors are heteroscedastic one can estimate the observation-specific variance si2 with the single observation on the residual as
^Wii = (ei - 0)2/1 = ei2
^W = diag{ei2}
HC1 = (n/(n-p)) (X'X)-1X'diag{ei2}X(X'X)-1
where n/(n-p) is a degree of freedom correction factor that becomes negligible for large samples.
HC1 is called the Huber-White estimator (after Huber 1967; White 1980) or the "sandwich" estimator because of the appearance of the formula.  (See it?)
HC1 is obtained in STATA using the robust option (e.g., regress y x1 x2, robust).

#### 3.  HC2

An alternative to HC1 proposed by MacKinnon and White (1985) is to use a better estimate of the variance of ei based on s2{ei} = s2(1 - hii) where hii represent the leverage of observation i (diagonal element of the hat matrix H); the alternative formula divides the squared residual by (1 - hii)
HC2 =  (X'X)-1X'diag{ei2/(1 - hii)}X(X'X)-1
HC2 is obtained in STATA using the hc2 option (e.g., regress y x1 x2, hc2).

#### 4.  HC3

A third possibility has a less straightforward theoretical motivation (Long and Ervin 2000; although compare the formula for HC3 with that for the deleted residual di in Module 10).  The idea is to "overcorrect" for high variance residuals by dividing the squared residual by (1 - hii)2.  This yields
HC3 = (X'X)-1X'diag{ei2/(1 - hii)2}X(X'X)-1
HC3 is obtained in STATA using the hc3 option (e.g., regress y x1 x2, hc3).

#### 5.  Relative Performance of HC1, HC2 and HC3 Robust Variance Estimators

Long and Erwin (2000) conclude from an extensive series of computer simulations that the HC3 gives the best results overall in small samples in the presence of heteroscedasticity of various forms.  They state
"1.  If there is an a priori reason to suspect that there is heteroscedasticity, HCCM-based tests should be used."
"2.  For samples less than 250, HC3 should be used; when samples are 500 or larger, other versions of the HCCM can also be used.  The superiority of HC3 over HC2 lies in its better properties when testing coefficients that are most strongly affected by heteroscedasticity."
"3.  The decision to correct for heteroscedasticity should not be based on the results of a screening test for heteroscedasticity."
"Given the relative costs of correcting for heteroscedasticity using HC3 when there is homoscedasticity and using OLSCM tests when there is heteroscedasticity, we recommend that HC3-based tests should be used routinely for testing individual coefficients in the linear regression model."

#### 6.  Example of Robust Standard Errors Estimation

The following exhibit shows the use of the HC1 (robust), HC2 (hc2) and HC3 (hc3) robust standard errors with STATA

### 6.  CONCLUSION: DEALING WITH HETEROSCEDASTICITY

Provisional guidelines for dealing with the possibility of heteroscedasticity are
1. look at the plot of OLS residuals against estimates; if there is a suggestion of a funnel shape use a test of heteroscedasticity; use the Breusch-Pagan a.k.a. Cook-Weisberg test as it is easy to do in STATA; use one of the other tests (modified Levene or Goldfeld-Quandt) if you have a reason to, such as a small sample or doubts about normality of errors
2. if there is heteroscedasticity look first for a reasonable transformation that might stabilize the variances of the errors, but without introducing problems of interpretation or upsetting the functional relationship of Y with the independent variables; if such a transformation is found it is a desirable solution
3. if a suitable transformation cannot be found, investigate the possibility of WLS; try estimating the variance function or the standard deviation function; if a convincing function is found (one that has substantial R2 and/or one that makes substantive sense, such as when the error variance is proportional to some measure of the size of the unit) then try WLS; otherwise, use the robust standard error approach instead (next)
4. if the transformation approach and the WLS approach do not seem promising, then use the robust standard errors approach; follow the recommendations of Long and Ervin (2000) to choose between HC1, HC2 and HC3, at least until someone comes up with evidence to the contrary; alternatively, adopt this approach right away after failing to find a good variance-stabilizing transformation, bypassing WLS