Sociology 709

Lecture S

Multicollinearity

 

 

Reading: Baum 84-87

Kennedy 205-212

 

As we learned from our earlier discussion of the Venn diagram from Kennedy, only independent variation in X is used in estimating the coefficient of X on Y in a multiple regression.

 

In other words, if X and Z are highly correlated with each other, then the coefficients for X and Z will be determined by the minority of cases where they don’t vary together.

 

In the extreme case of perfect collinearity (e.g., x=2*z), we could not estimate separate effects for X and Z.
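A minimal sketch of the perfect-collinearity case, using made-up data. The seed, sample size, and coefficients are arbitrary assumptions and are not taken from the lecture's do-files; rnormal() is used here in place of the older invnorm(uniform()) that appears in the log. Because x is an exact multiple of z, Stata cannot separate the two effects and should omit one of the two variables from the regression.

```stata
* Hypothetical illustration: x is an exact linear function of z
clear
set seed 709
set obs 1000
generate z = rnormal()
* perfect collinearity: the correlation between x and z is exactly 1
generate x = 2*z
* arbitrary outcome, for illustration only
generate y = 4*x + 5*rnormal()
* only one of x and z can be estimated
regress y x z
```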

 

In general, explanatory variables that are highly correlated, but not perfectly collinear, result in the problem of “multicollinearity.”

 

Consequences:  the coefficients are unbiased, but the variance of the coefficients will be inflated.  As a result, we are less certain that the estimate we got is close to the true value, although on average the estimate equals the true value (and as the number of cases increases, it will converge towards the true value).

 

Detecting multicollinearity:

 

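Recall equation 6.2 on page 120 of Fox (or the equation on p.85 of Baum, which is the same):

$$V(b_j) = \frac{1}{1 - R_j^2} \times \frac{\sigma_\varepsilon^2}{\sum_i (x_{ij} - \bar{x}_j)^2}$$

The term $R_j^2$ is the r-squared from regressing the variable $x_j$ on all the other explanatory variables.  It tells us how strongly $x_j$ is correlated with those variables.

$1/(1 - R_j^2)$ is called the variance inflation factor (VIF) for $b_j$.   It tells us how much the variance of $b_j$ is being inflated by the correlation of $x_j$ with the other variables; the standard error of $b_j$ is inflated by the square root of the VIF.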
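As a worked illustration, suppose a variable is correlated with only one other explanatory variable, with correlation r = 0.95 (the value used in the example later in this lecture). In that case its r-squared on the other explanatory variables is approximately r squared, so:

```latex
\mathrm{VIF} = \frac{1}{1 - R^2} = \frac{1}{1 - 0.95^2} = \frac{1}{0.0975} \approx 10.26,
\qquad
\sqrt{\mathrm{VIF}} \approx 3.20
```

That is, the coefficient's sampling variance is about ten times, and its standard error about 3.2 times, what it would be if the variable were uncorrelated with the other explanatory variables.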

 

As suggested by Baum (p.85) a rule of thumb is that you have a problem with multicollinearity if the VIF for a variable is greater than 10. 

 

To test for multicollinearity in Stata, type “estat vif” after your regression.
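The VIF can also be computed by hand from an auxiliary regression, which makes clear what estat vif reports. The sketch below generates hypothetical data resembling the example later in this lecture; the seed and the way x and z are constructed are assumptions, not the course's own code.

```stata
* Hypothetical data: x and z correlated at about .95, w independent
clear
set seed 709
set obs 5000
generate x = rnormal()
generate z = .95*x + sqrt(1 - .95^2)*rnormal()
generate w = rnormal()
generate y = 4*x + z + w + 5*rnormal()

* Main regression, followed by Stata's built-in VIF table
regress y x z w
estat vif

* VIF for x by hand: regress x on the other explanatory variables,
* then compute 1/(1 - R-squared) from that auxiliary regression
regress x z w
display "VIF for x = " 1/(1 - e(r2))
```

The value displayed on the last line should agree with the VIF that estat vif reports for x.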

 

I want to do two things in the rest of the lecture.

1) Give you an example of testing for multicollinearity in Stata

Here is the do file for the example: lecs_example.do

2) Show you the results from a Monte Carlo simulation of the impact of different degrees of collinearity on the results.

Files needed to run the simulation: lecs1.do, corxz.ado, lecs1.ado
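corxz.ado is not reproduced in these notes. Judging only from the log below (the call "corxz .95 x z" followed by summary statistics and a correlation matrix), a program along the following lines would behave similarly. The name corxz_sketch and everything inside it are assumptions for illustration, not the course's actual code.

```stata
* Hypothetical stand-in for corxz: creates two standard normal
* variables with a chosen population correlation
capture program drop corxz_sketch
program define corxz_sketch
    * arguments: target correlation, then the two new variable names
    args rho xname zname
    generate `xname' = rnormal()
    * z = rho*x + sqrt(1 - rho^2)*noise has variance 1 and corr(x,z) = rho
    generate `zname' = `rho'*`xname' + sqrt(1 - (`rho')^2)*rnormal()
    summarize `xname' `zname'
    correlate `xname' `zname'
end

* Usage, mirroring the call in the example
clear
set seed 709
set obs 5000
corxz_sketch .95 x z
```

The sample correlation will differ slightly from .95 in any one draw, as it does in the log (0.9489).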

 

1) Example

 

. clear

 

. prog drop _all

 

. set obs 5000

obs was 0, now 5000

 

.

. corxz .95 x z

 

    Variable |       Obs        Mean    Std. Dev.       Min        Max

-------------+--------------------------------------------------------

           x |      5000   -.0248928    .9983274  -4.046022   3.702668

           z |      5000   -.0228308    .9960985   -3.73386   3.520801

(obs=5000)

 

             |        x        z

-------------+------------------

           x |   1.0000

           z |   0.9489   1.0000

 

 

.

. gen w=invnorm(uniform())

 

. gen e4=invnorm(uniform())

 

.

. sum x z w

 

    Variable |       Obs        Mean    Std. Dev.       Min        Max

-------------+--------------------------------------------------------

           x |      5000   -.0248928    .9983274  -4.046022   3.702668

           z |      5000   -.0228308    .9960985   -3.73386   3.520801

           w |      5000    .0054158    .9852456  -3.393488   3.400945

 

.

. gen y=4*x+z+w+5*e4

 

.

.

. cor y x z w

(obs=5000)

 

             |        y        x        z        w

-------------+------------------------------------

           y |   1.0000

           x |   0.6912   1.0000

           z |   0.6752   0.9489   1.0000

           w |   0.1303  -0.0034  -0.0104   1.0000

 

 

.

.

. reg y x z w

 

      Source |       SS       df       MS              Number of obs =    5000

-------------+------------------------------           F(  3,  4996) = 1662.12

       Model |  125311.865     3  41770.6218           Prob > F      =  0.0000

    Residual |  125554.128  4996  25.1309304           R-squared     =  0.4995

-------------+------------------------------           Adj R-squared =  0.4992

       Total |  250865.994  4999  50.1832354           Root MSE      =  5.0131

 

------------------------------------------------------------------------------

           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]

-------------+----------------------------------------------------------------

           x |   3.537127   .2250129    15.72   0.000     3.096003    3.978251

           z |   1.448041   .2255274     6.42   0.000     1.005908    1.890174

           w |   .9645307   .0719836    13.40   0.000     .8234112     1.10565

       _cons |   .0187231   .0709189     0.26   0.792     -.120309    .1577553

------------------------------------------------------------------------------

 

.

. estat vif

 

    Variable |       VIF       1/VIF 

-------------+----------------------

           z |     10.04    0.099614

           x |     10.04    0.099624

           w |      1.00    0.999468

-------------+----------------------

    Mean VIF |      7.03

 

.
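The standard error reported for x can be checked against the variance formula, written here as VIF times the residual variance divided by (n-1) times the sample variance of x. The inputs are taken from the output above: 1/VIF = 0.099624 for x, Root MSE squared = 25.1309, n = 5000, and the standard deviation of x = .9983274.

```latex
\widehat{SE}(b_x)
  = \sqrt{\mathrm{VIF}_x \times \frac{s_e^2}{(n-1)\, s_x^2}}
  = \sqrt{\frac{1}{0.099624} \times \frac{25.1309}{4999 \times 0.9983274^2}}
  \approx \sqrt{0.05063}
  \approx 0.2250
```

This matches the reported Std. Err. of .2250129. Dividing by the square root of the VIF (about 3.17) gives roughly 0.071, which is close to the standard error reported for w, the variable that is not correlated with the others.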

 

2)  Simulations.

 

Overview: why do a Monte Carlo simulation?  (Again, I know some of you don’t like this…but it can be very useful in understanding the potential magnitude of the problem).

 

The simulated model is the same as in the example above, $y = 4x + z + w + 5e$, where x, z, w, and e are standard normal variables and where x and z are correlated (but w is not correlated with x or z).
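lecs1.ado is not reproduced in these notes. The sketch below shows one way such a program could be written so that it works with the simulate calls in the log: it draws one sample from the model y = 4x + z + w + 5e, runs the regression, and returns the three coefficients in r(). The name lecs1_sketch and its internals are assumptions, not the course's actual code.

```stata
* Hypothetical stand-in for lecs1: one replication of the simulation
capture program drop lecs1_sketch
program define lecs1_sketch, rclass
    * obs() = number of cases per replication, cor() = corr(x,z)
    syntax , obs(integer) cor(real)
    drop _all
    set obs `obs'
    generate x = rnormal()
    generate z = `cor'*x + sqrt(1 - (`cor')^2)*rnormal()
    generate w = rnormal()
    generate y = 4*x + z + w + 5*rnormal()
    regress y x z w
    * return the estimated coefficients so that simulate can collect them
    return scalar x = _b[x]
    return scalar z = _b[z]
    return scalar w = _b[w]
end

* Usage: 500 replications of 1000 cases each, then summarize the estimates
set seed 709
simulate x=r(x) z=r(z) w=r(w), reps(500): lecs1_sketch, obs(1000) cor(.5)
summarize
```

After simulate, the dataset in memory holds one row per replication, so the Std. Dev. column from summarize describes how much each coefficient varies from sample to sample.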

 

I will run 5 simulations.  Each simulation consists of 500 replications.

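  Simulation | Correlation between x and z | # of cases | # of replications
-------------+-----------------------------+------------+------------------
           1 |                         .5  |       1000 |               500
           2 |                         .7  |       1000 |               500
           3 |                         .9  |       1000 |               500
           4 |                         .95 |       1000 |               500
           5 |                         .98 |       1000 |               500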

 

→ explain the difference between # of cases and # of replications

→ explain the results

 

. simulate x=r(x) z=r(z) w=r(w), reps(500): lecs1, obs(1000) cor(.5)

 

. sum

 

    Variable |       Obs        Mean    Std. Dev.       Min        Max

-------------+--------------------------------------------------------

           x |       500    4.004177    .1921215   3.496746   4.538243

           z |       500    1.009423     .181314   .4881066   1.654673

           w |       500    .9945636    .1644668   .5140132   1.507911

 

 

. simulate x=r(x) z=r(z) w=r(w), reps(500): lecs1, obs(1000) cor(.7)

 

. sum

 

    Variable |       Obs        Mean    Std. Dev.       Min        Max

-------------+--------------------------------------------------------

           x |       500    3.999184    .2271602   3.332926   4.615234

           z |       500    1.001303    .2179345   .4813399   1.650741

           w |       500    .9940164    .1650191   .4928653   1.449148

 

 

. simulate x=r(x) z=r(z) w=r(w), reps(500): lecs1, obs(1000) cor(.9)

 

. sum

 

    Variable |       Obs        Mean    Std. Dev.       Min        Max

-------------+--------------------------------------------------------

           x |       500    4.025335    .3671216   2.832163    5.10611

           z |       500    .9901153    .3612931  -.0343102   2.288233

           w |       500      .99566    .1496341   .5517082   1.415757

 

.

. simulate x=r(x) z=r(z) w=r(w), reps(500): lecs1, obs(1000) cor(.95)

 

. sum

 

    Variable |       Obs        Mean    Std. Dev.       Min        Max

-------------+--------------------------------------------------------

           x |       500     3.99129    .4858451   2.849516   5.467155

           z |       500    1.016175    .4906718   -.668867   2.224641

           w |       500    1.005454    .1596497   .4580643   1.447566

 

 

. simulate x=r(x) z=r(z) w=r(w), reps(500): lecs1, obs(1000) cor(.98)

 

. sum

 

    Variable |       Obs        Mean    Std. Dev.       Min        Max

-------------+--------------------------------------------------------

           x |       500    3.972145    .7681128    1.66325   6.775041

           z |       500    1.022473    .7831032  -1.780295   3.355815

           w |       500     .998398    .1472098   .5489501   1.449503

 

 

.

. log close

       log:  C:\papers\soc709\lecs1.log

  log type:  text

 closed on:  28 Mar 2007, 15:21:45

-------------------------------------------------------------------------------

 

.

end of do-file

 

.
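The simulation results can be compared with what the variance formula predicts. In each "sum" table above, the Std. Dev. column is the spread of a coefficient across the 500 replications, while the number of cases (n = 1000) enters the formula itself. Taking the error variance as 25 (since the error term is 5e), the variance of x as 1, and the VIF as 1/(1 - rho squared), where rho is the correlation between x and z, the predicted standard deviation of the coefficient on x is approximately:

```latex
SD(b_x) \approx \sqrt{\frac{1}{1-\rho^2} \times \frac{\sigma_e^2}{(n-1)\,\mathrm{Var}(x)}}
        = \sqrt{\frac{1}{1-\rho^2} \times \frac{25}{999}}

\begin{array}{cccc}
\rho & \mathrm{VIF} & \text{predicted } SD(b_x) & \text{observed } SD(b_x) \\
\hline
0.50 & 1.33  & 0.183 & 0.192 \\
0.70 & 1.96  & 0.222 & 0.227 \\
0.90 & 5.26  & 0.363 & 0.367 \\
0.95 & 10.26 & 0.507 & 0.486 \\
0.98 & 25.25 & 0.795 & 0.768
\end{array}
```

For w, which is uncorrelated with x and z, the VIF is 1 and the predicted standard deviation is about 0.158 in every simulation, in line with the observed values of roughly .15 to .165. In all five simulations the means stay close to the true values of 4, 1, and 1: collinearity inflates the variance of the estimates without biasing them.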