Sociology 709
Lecture S
Multicollinearity
Kennedy 205-212
As we learned from our earlier discussion of the Venn diagram from Kennedy, only the independent variation in X is used in estimating the effect of X on Y in a multiple regression.
In other words, if X and Z are highly correlated with each other, then the coefficients for X and Z will be determined by the minority of cases where they don’t vary together.
In the extreme case of perfect collinearity (e.g., x=2*z), we cannot estimate separate effects for X and Z.
More generally, explanatory variables that are highly correlated, but not perfectly collinear, result in the problem of “multicollinearity.”
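The extreme case of perfect collinearity is easy to see directly in Stata. The following is a minimal sketch on made-up data (the seed, the sample size of 100, and the variable definitions are all assumptions chosen for illustration, not part of the lecture files):

```stata
* Hypothetical data illustrating perfect collinearity (x is exactly 2*z)
clear
set seed 709
set obs 100
gen z = invnorm(uniform())
gen x = 2*z
gen y = x + invnorm(uniform())

* Stata cannot separate the two effects, so it omits one of x and z
reg y x z

* The pairwise correlation confirms the exact linear dependence
cor x z
```

Because x carries no information beyond z, the regression reports a coefficient for only one of the two variables.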
Consequences: the coefficient estimates remain unbiased, but their variance is inflated. As a result, we are less certain that any one estimate is close to the true value, even though on average the estimates are centered on the true value (and, as the number of cases increases, they converge toward the true value).
Detecting multicollinearity:
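Recall equation 6.2 on page 120 of Fox (or the equation on p.85 of Baum, which is the same):

$$V(B_j) = \frac{1}{1 - R_j^2} \times \frac{\sigma_\varepsilon^2}{\sum_i (x_{ij} - \bar{x}_j)^2}$$

The term $R_j^2$ is the r-squared from the regression of the variable $X_j$ on all the other explanatory variables. It tells us how strongly $X_j$ is correlated with all these variables.
$1/(1 - R_j^2)$ is called the variance inflation factor (VIF) for $X_j$. It tells us how much the variance of $B_j$ is being inflated by its correlation with the other variables; the standard error of $B_j$ is inflated by the square root of the VIF.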
As suggested by Baum (p.85) a rule of thumb is that you have a problem with multicollinearity if the VIF for a variable is greater than 10.
To test for multicollinearity in Stata, type “estat vif” after your regression.
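To see what “estat vif” is doing, the VIF can also be computed by hand from an auxiliary regression, using the definition VIF = 1/(1 - R-squared). This is a sketch on hypothetical data: the seed, the way z is built to have a correlation of about .95 with x, and the scalar names r2_x and vif_x are all assumptions made for illustration.

```stata
* Hypothetical data, loosely modeled on the example later in these notes
clear
set seed 709
set obs 5000
gen x = invnorm(uniform())
* z is constructed to have a correlation of about .95 with x (assumed)
gen z = .95*x + sqrt(1 - .95^2)*invnorm(uniform())
gen w = invnorm(uniform())
gen y = 4*x + z + w + 5*invnorm(uniform())

* Auxiliary regression: x on the other explanatory variables
quietly reg x z w
scalar r2_x = e(r2)
scalar vif_x = 1/(1 - r2_x)
display "R-squared of x on z and w = " r2_x
display "VIF for x = " vif_x

* Compare with Stata's built-in calculation after the main regression
quietly reg y x z w
estat vif
```

The hand-computed value for x should agree with the line for x in the “estat vif” table.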
I want to do two things in the rest of the lecture.
1) Give you an example of testing for multicollinearity in Stata
Here is the do file for the example: lecs_example.do
2) Show you the results from a simulation
Files needed to run the simulation: lecs1.do, corxz.ado, lecs1.ado
1) Example
. clear
. prog drop _all
. set obs 5000
obs was 0, now 5000
.
. corxz .95 x z
    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
           x |      5000   -.0248928     .9983274  -4.046022   3.702668
           z |      5000   -.0228308     .9960985   -3.73386   3.520801
(obs=5000)
             |        x        z
-------------+------------------
           x |   1.0000
           z |   0.9489   1.0000
.
. gen w=invnorm(uniform())
. gen e4=invnorm(uniform())
.
. sum x z w
    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
           x |      5000   -.0248928     .9983274  -4.046022   3.702668
           z |      5000   -.0228308     .9960985   -3.73386   3.520801
           w |      5000    .0054158     .9852456  -3.393488   3.400945
.
. gen y=4*x+z+w+5*e4
.
.
. cor y x z w
(obs=5000)
             |        y        x        z        w
-------------+------------------------------------
           y |   1.0000
           x |   0.6912   1.0000
           z |   0.6752   0.9489   1.0000
           w |   0.1303  -0.0034  -0.0104   1.0000
.
.
. reg y x z w
      Source |       SS       df       MS              Number of obs =    5000
-------------+------------------------------           F(  3,  4996) = 1662.12
       Model |  125311.865     3  41770.6218           Prob > F      =  0.0000
    Residual |  125554.128  4996  25.1309304           R-squared     =  0.4995
-------------+------------------------------           Adj R-squared =  0.4992
       Total |  250865.994  4999  50.1832354           Root MSE      =  5.0131
------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           x |   3.537127   .2250129    15.72   0.000     3.096003    3.978251
           z |   1.448041   .2255274     6.42   0.000     1.005908    1.890174
           w |   .9645307   .0719836    13.40   0.000     .8234112     1.10565
       _cons |   .0187231   .0709189     0.26   0.792     -.120309    .1577553
------------------------------------------------------------------------------
.
. estat vif
    Variable |       VIF       1/VIF
-------------+----------------------
           z |     10.04    0.099614
           x |     10.04    0.099624
           w |      1.00    0.999468
-------------+----------------------
    Mean VIF |      7.03
.
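The VIF of 10.04 reported for x and z can be checked by hand, using the definition of the VIF as 1/(1 - R-squared). As a rough sketch, assume that w contributes essentially nothing to the auxiliary regression (its correlations with x and z in the output above are -0.0034 and -0.0104), so that the r-squared of x on z and w is approximately the squared correlation between x and z:

```latex
\begin{align*}
R_x^2 &\approx r_{xz}^2 = (0.9489)^2 \approx 0.9004 \\
\mathrm{VIF}_x &= \frac{1}{1 - R_x^2} \approx \frac{1}{1 - 0.9004} = \frac{1}{0.0996} \approx 10.04 \\
\sqrt{\mathrm{VIF}_x} &\approx 3.17
\end{align*}
```

This agrees with the 1/VIF column (about 0.0996). The square root, about 3.17, is the factor by which the standard error is inflated. It is in line with the regression output, where the standard error of x (.2250) is roughly 3.1 times that of w (.0720) even though x and w have nearly the same standard deviation.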
2) Simulations.
Overview: why do a simulation?
The simulated data follow the same model as the example above:
$y = 4x + z + w + 5e$
where x, z, w, and e are standard normal variables
and where x and z are correlated (but w is not correlated with x or z).
I will run 5 simulations. Each simulation consists of 500 replications.
Simulation   Correlation between x and z   # of cases   # of replications
    1                    .5                   1000             500
    2                    .7                   1000             500
    3                    .9                   1000             500
    4                    .95                  1000             500
    5                    .98                  1000             500
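→ Explain the difference between # of cases and # of replications: each replication draws a fresh dataset of 1000 cases and runs one regression on it, so 500 replications give 500 sets of coefficient estimates.
→ Explain the results: in the summaries below, the means of the 500 estimates stay close to the true values of 4, 1, and 1 at every correlation, while the standard deviation of the x and z estimates grows from under .2 at a correlation of .5 to nearly .8 at .98. The standard deviation for w stays between about .15 and .17 throughout.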
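The simulate calls in the log below rely on lecs1.ado and corxz.ado, which are not reproduced in these notes. As a rough idea of what such a program might look like, here is a hypothetical stand-in called lecsketch. The program name, the seed, the default option values, and the way the correlated pair x and z is built are all assumptions; the model y = 4x + z + w + 5e is taken from the example above.

```stata
* Hypothetical stand-in for lecs1.ado (the real file is not shown here)
capture program drop lecsketch
program define lecsketch, rclass
    syntax [, obs(integer 1000) cor(real .5)]
    drop _all
    set obs `obs'
    * Build x and z as standard normals with correlation `cor' (assumed method)
    gen x = invnorm(uniform())
    gen z = `cor'*x + sqrt(1 - `cor'^2)*invnorm(uniform())
    gen w = invnorm(uniform())
    gen e = invnorm(uniform())
    gen y = 4*x + z + w + 5*e
    reg y x z w
    * Return the three slope estimates so that simulate can collect them
    return scalar x = _b[x]
    return scalar z = _b[z]
    return scalar w = _b[w]
end

set seed 709
* One simulation per correlation: 500 replications of 1000 cases each
foreach r in .5 .7 .9 .95 .98 {
    simulate x=r(x) z=r(z) w=r(w), reps(500): lecsketch, obs(1000) cor(`r')
    sum
}
```

Each pass through the loop leaves 500 observations in memory, one per replication, so “sum” describes the sampling distribution of the three coefficients rather than the raw data.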
. simulate x=r(x) z=r(z) w=r(w), reps(500): lecs1, obs(1000) cor(.5)
. sum
    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
           x |       500    4.004177     .1921215   3.496746   4.538243
           z |       500    1.009423      .181314   .4881066   1.654673
           w |       500    .9945636     .1644668   .5140132   1.507911
. simulate x=r(x) z=r(z) w=r(w), reps(500): lecs1, obs(1000) cor(.7)
. sum
    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
           x |       500    3.999184     .2271602   3.332926   4.615234
           z |       500    1.001303     .2179345   .4813399   1.650741
           w |       500    .9940164     .1650191   .4928653   1.449148
. simulate x=r(x) z=r(z) w=r(w), reps(500): lecs1, obs(1000) cor(.9)
. sum
    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
           x |       500    4.025335     .3671216   2.832163    5.10611
           z |       500    .9901153     .3612931  -.0343102   2.288233
           w |       500      .99566     .1496341   .5517082   1.415757
.
. simulate x=r(x) z=r(z) w=r(w), reps(500): lecs1, obs(1000) cor(.95)
. sum
    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
           x |       500     3.99129     .4858451   2.849516   5.467155
           z |       500    1.016175     .4906718   -.668867   2.224641
           w |       500    1.005454     .1596497   .4580643   1.447566
. simulate x=r(x) z=r(z) w=r(w), reps(500): lecs1, obs(1000) cor(.98)
. sum
    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
           x |       500    3.972145     .7681128    1.66325   6.775041
           z |       500    1.022473     .7831032  -1.780295   3.355815
           w |       500     .998398     .1472098   .5489501   1.449503
.
. log close
log: C:\papers\soc709\lecs1.log
log type: text
closed on: 28 Mar 2007, 15:21:45
-------------------------------------------------------------------------------
.
end of do-file
.
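The spread of the simulated coefficients can be compared with what the VIF predicts. The sketch below rests on several assumptions: that the simulated model has the error term 5e, as in the example (lecs1.ado is not shown, so this is not confirmed); that x has a variance close to 1; and that w plays no role in the auxiliary regression, so that the VIF is 1/(1 - r²), where r is the correlation between x and z. With 1000 cases, the standard error without any collinearity would then be about 5/√1000 ≈ .158.

```latex
\[
SE(B_x) \approx \frac{5}{\sqrt{1000}} \times \sqrt{\frac{1}{1 - r^2}}
\]
\[
\begin{array}{cccc}
r & \mathrm{VIF} = 1/(1 - r^2) & \text{predicted } SE(B_x) & \text{simulated SD of } B_x \\
\hline
.5  & 1.33  & .183 & .192 \\
.7  & 1.96  & .221 & .227 \\
.9  & 5.26  & .363 & .367 \\
.95 & 10.26 & .506 & .486 \\
.98 & 25.25 & .795 & .768
\end{array}
\]
```

The predicted and simulated values track each other closely at every correlation, and the simulated standard deviations for w (about .15 to .17) sit near the no-collinearity baseline of .158.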