Sociology 709

 

Lecture Y

 

Instrumental variables and structural equation models: the good, the bad, and the ugly

 

The goal of this lecture is to provide a very introductory overview of the fundamental problems involved in using instrumental variables (IV) or structural equation models (SEM).

 

SEM can be thought of as an extended IV model, so we will discuss IV models first.

 

Both of these approaches are often used to make causal interpretations from non-experimental data.  The key point of this lecture is that you can only do this by making additional assumptions, and your results will only be as good as those additional assumptions are. 

 

Takeaway message:  don’t get overly impressed by the complicated math of these models.  Focus on what the fundamental assumptions are [as discussed below] and whether they are believable—they cannot be tested, only argued theoretically.  Many times researchers will bury the assumptions and not discuss them explicitly—they are the pied pipers of empirical research.  Don’t use these models yourself without being very confident of the assumptions you are making.

 

Note: this criticism of SEM applies to its use for causal modeling with endogenous variables, not as a method for dealing with multiple indicators of latent variables [explain]

 

IV estimation

IV is used when one of your independent variables (X) is hypothesized to be correlated with the error term.  In other words, some unobserved factor in the error term is correlated with X.

 

Example: Imagine you are studying the effect of education on income, and you hypothesize that an unobserved factor (let’s call it “ambition”, but it could also be intrinsic ability) positively affects both education and income.  As a result the coefficient on education will be upwardly biased (recall our earlier discussion of omitted variable bias).

 

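For example, the true equation might be:

$$\text{income} = \beta_0 + \beta_1\,\text{education} + \beta_2\,\text{ambition} + u$$

but we don’t observe ambition, so we estimate:

$$\text{income} = \beta_0 + \beta_1\,\text{education} + \varepsilon, \qquad \varepsilon = \beta_2\,\text{ambition} + u$$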

The IV approach is as follows.  If there is another variable Z that is correlated with education but not with ambition (and everything else in the error term) then we can use Z to get around the omitted variable bias.
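A short derivation may help show why the assumption about Z carries all the weight. The symbols below ($\beta_1$ for the true effect of education, $\beta_2$ for the effect of ambition, $u$ for the remaining error) are notation assumed here for illustration. Taking the covariance of Z with both sides of the true equation and dividing by the covariance of Z with education:

```latex
\begin{aligned}
\operatorname{Cov}(Z,\text{income})
  &= \beta_1 \operatorname{Cov}(Z,\text{education})
   + \beta_2 \operatorname{Cov}(Z,\text{ambition})
   + \operatorname{Cov}(Z,u) \\[4pt]
\frac{\operatorname{Cov}(Z,\text{income})}{\operatorname{Cov}(Z,\text{education})}
  &= \beta_1
   + \frac{\beta_2 \operatorname{Cov}(Z,\text{ambition}) + \operatorname{Cov}(Z,u)}
          {\operatorname{Cov}(Z,\text{education})}
\end{aligned}
```

If Z is uncorrelated with ambition and with $u$, the second term vanishes and the ratio equals $\beta_1$. If it is not, the bias remains, and it grows as the covariance between Z and education shrinks.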

 

Steps:

1) Convince yourself that Z is correlated with education but not ambition.

2) Regress X on Z and use the results to predict $\hat{X}$.

$\hat{X}$ is the component of X driven by Z, and is uncorrelated with the error term (provided step 1 is true).

3) Regress income on $\hat{X}$ to find the true effect of education (provided the assumptions hold true).

This approach is called “Two Stage Least Squares”
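The steps above can be sketched on simulated data. This is a minimal illustration in Python, not the lecture's own do file: the sample size, the seed, and the way x is built from z and e are assumptions chosen so that z is independent of e, while the outcome equation y = 5x + 5e follows the gen command that appears later in the Stata log.

```python
import random
from math import sqrt

def mean(v):
    return sum(v) / len(v)

def cov(a, b):
    """Sample covariance of two equal-length lists."""
    ma, mb = mean(a), mean(b)
    return sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b)) / (len(a) - 1)

def ols(y, x):
    """Intercept and slope from a bivariate OLS regression of y on x."""
    slope = cov(x, y) / cov(x, x)
    return mean(y) - slope * mean(x), slope

# Hypothetical data: z (instrument) and e (ambition) are independent by design.
rng = random.Random(709)
n = 20000
z = [rng.gauss(0, 1) for _ in range(n)]
e = [rng.gauss(0, 1) for _ in range(n)]
# x depends on both z and e, and has variance 0.25 + 0.25 + 0.5 = 1.
x = [0.5 * zi + 0.5 * ei + sqrt(0.5) * rng.gauss(0, 1) for zi, ei in zip(z, e)]
y = [5 * xi + 5 * ei for xi, ei in zip(x, e)]   # true effect of x is 5

_, naive_slope = ols(y, x)          # ignores e, so it absorbs part of e's effect
a1, b1 = ols(x, z)                  # step 2: regress x on z
xhat = [a1 + b1 * zi for zi in z]   # the component of x driven by z
_, tsls_slope = ols(y, xhat)        # step 3: regress y on the predicted x

print(f"naive OLS slope: {naive_slope:.2f}")
print(f"2SLS slope:      {tsls_slope:.2f}")
```

Under this design the naive slope should land near 7.5 and the two-stage slope near 5, up to sampling noise. The sketch only recovers the coefficient; it does not compute standard errors.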

 

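Even if the assumptions are correct, the standard errors from a manual second-stage regression are wrong, because they are computed from residuals around the predicted X rather than around the observed X.  The ivreg command in Stata will compute the correct standard errors for you.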

 

Let’s go through an empirical example.  We’ll call it “The Good” because it shows how well the approach works when the assumptions hold.

 

The do file for this example is iv.do

 

I created a data set with 2,000 cases and 4 variables:

 

. des

 

Contains data

  obs:         2,000                         

 vars:             4                         

 size:        40,000 (99.6% of memory free)

-------------------------------------------------------------------------------

              storage  display     value

variable name   type   format      label      variable label

-------------------------------------------------------------------------------

x               float  %9.0g                  education

e               float  %9.0g                  ambition

z               float  %9.0g                  iv

y               float  %9.0g                  income

-------------------------------------------------------------------------------

 

X is correlated with e and Z, but Z and e are not correlated with each other:

 

. cor

(obs=2000)

 

             |        x        e        z

-------------+---------------------------

           x |   1.0000

           e |   0.5000   1.0000

           z |   0.5000   0.0000   1.0000

 

 

 

. reg y x

 

      Source |       SS       df       MS              Number of obs =    2000

-------------+------------------------------           F(  1,  1998) = 5994.00

       Model |   112443.75     1   112443.75           Prob > F      =  0.0000

    Residual |  37481.2497  1998  18.7593842           R-squared     =  0.7500

-------------+------------------------------           Adj R-squared =  0.7499

       Total |      149925  1999  74.9999998           Root MSE      =  4.3312

 

------------------------------------------------------------------------------

           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]

-------------+----------------------------------------------------------------

           x |        7.5    .096873    77.42   0.000     7.310017    7.689983

       _cons |   2.21e-09   .0968488     0.00   1.000    -.1899352    .1899352

------------------------------------------------------------------------------

 

Note:  This model is biased…the true coefficient is 5.
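The size of the bias can be checked by hand. This is a sketch that assumes income was generated as y = 5x + 5e (the gen command that appears in the log for the second scenario) and that x and e have unit variance, so that their covariance equals the correlation of 0.5 reported above:

```latex
\[
\operatorname{plim}\, b_{OLS}
  = 5 + 5 \cdot \frac{\operatorname{Cov}(x,e)}{\operatorname{Var}(x)}
  = 5 + 5 \cdot \frac{0.5}{1}
  = 7.5
\]
```

Under those assumptions the formula reproduces the estimated coefficient of 7.5.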

 

. outreg using iv, se replace

 

. reg y z

 

      Source |       SS       df       MS              Number of obs =    2000

-------------+------------------------------           F(  1,  1998) =  181.64

       Model |  12493.7501     1  12493.7501           Prob > F      =  0.0000

    Residual |  137431.249  1998  68.7844091           R-squared     =  0.0833

-------------+------------------------------           Adj R-squared =  0.0829

       Total |      149925  1999  74.9999998           Root MSE      =  8.2936

 

------------------------------------------------------------------------------

           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]

-------------+----------------------------------------------------------------

           z |        2.5   .1854977    13.48   0.000     2.136211    2.863789

       _cons |   9.25e-09   .1854514     0.00   1.000    -.3636983    .3636983

------------------------------------------------------------------------------

 

. outreg using iv, se append

 

. reg x z

 

      Source |       SS       df       MS              Number of obs =    2000

-------------+------------------------------           F(  1,  1998) =  666.00

       Model |  499.750001     1  499.750001           Prob > F      =  0.0000

    Residual |     1499.25  1998  .750375376           R-squared     =  0.2500

-------------+------------------------------           Adj R-squared =  0.2496

       Total |        1999  1999           1           Root MSE      =  .86624

 

------------------------------------------------------------------------------

           x |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]

-------------+----------------------------------------------------------------

           z |         .5   .0193746    25.81   0.000     .4620035    .5379965

       _cons |  -9.34e-10   .0193698    -0.00   1.000     -.037987     .037987

------------------------------------------------------------------------------

 

Now we predict x based on z

 

. outreg using iv, se append

 

. predict xhat

(option xb assumed; fitted values)

 

. cor xhat e

(obs=2000)

 

             |     xhat        e

-------------+------------------

        xhat |   1.0000

           e |   0.0000   1.0000

 

 

Now we regress y on the predicted x, free of correlation with ambition.

 

. reg y xhat

 

      Source |       SS       df       MS              Number of obs =    2000

-------------+------------------------------           F(  1,  1998) =  181.64

       Model |  12493.7501     1  12493.7501           Prob > F      =  0.0000

    Residual |  137431.249  1998  68.7844091           R-squared     =  0.0833

-------------+------------------------------           Adj R-squared =  0.0829

       Total |      149925  1999  74.9999998           Root MSE      =  8.2936

 

------------------------------------------------------------------------------

           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]

-------------+----------------------------------------------------------------

        xhat |          5   .3709955    13.48   0.000     4.272422    5.727579

       _cons |   9.46e-09   .1854514     0.00   1.000    -.3636983    .3636983

------------------------------------------------------------------------------
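The coefficient on xhat can also be recovered from the two earlier regressions. Because xhat is 0.5 times z (plus a constant that is essentially zero), regressing y on xhat simply rescales the coefficient from the regression of y on z. As a worked check using the estimates reported above:

```latex
\[
b_{IV}
  = \frac{\text{coefficient on } z \text{ in the regression of } y \text{ on } z}
         {\text{coefficient on } z \text{ in the regression of } x \text{ on } z}
  = \frac{2.5}{0.5}
  = 5
\]
```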

 

. outreg using iv, se append

 

. ivreg y (x = z)

 

Instrumental variables (2SLS) regression

 

      Source |       SS       df       MS              Number of obs =    2000

-------------+------------------------------           F(  1,  1998) =  499.50

       Model |  99950.0001     1  99950.0001           Prob > F      =  0.0000

    Residual |  49974.9995  1998  25.0125122           R-squared     =  0.6667

-------------+------------------------------           Adj R-squared =  0.6665

       Total |      149925  1999  74.9999998           Root MSE      =  5.0013

 

------------------------------------------------------------------------------

           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]

-------------+----------------------------------------------------------------

           x |          5   .2237187    22.35   0.000     4.561254    5.438746

       _cons |   1.39e-08   .1118314     0.00   1.000    -.2193183    .2193183

------------------------------------------------------------------------------

Instrumented:  x

Instruments:   z

------------------------------------------------------------------------------
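ivreg reproduces the coefficient of 5 but reports a smaller standard error than the manual second stage (0.2237 versus 0.3710). One way to see where the difference comes from, offered as a sketch rather than a description of Stata's internals: the manual second stage measures residuals around xhat, while the IV residuals are measured around the observed x. The rest of the standard error formula is the same, so the two standard errors should differ by the ratio of the two Root MSE values:

```latex
\[
SE_{IV}
  \approx SE_{manual} \times \frac{\text{Root MSE}_{IV}}{\text{Root MSE}_{manual}}
  = 0.3710 \times \frac{5.0013}{8.2936}
  \approx 0.2237
\]
```

This agrees with the standard error in the ivreg table to four decimal places.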

 

OK, so far so good…it worked.  Now we turn to the bad & ugly.

 

What if the instrument Z is correlated with e?

 

. * scenario 2: the bad & ugly

. cor

(obs=2000)

 

             |        x        e        z

-------------+---------------------------

           x |   1.0000

           e |   0.5000   1.0000

           z |   0.1000   0.1000   1.0000

 

 

.

. gen y=5*x+5*e

 

. des

 

Contains data

  obs:         2,000                         

 vars:             4                         

 size:        40,000 (99.6% of memory free)

-------------------------------------------------------------------------------

              storage  display     value

variable name   type   format      label      variable label

-------------------------------------------------------------------------------

x               float  %9.0g                  education

e               float  %9.0g                  ambition

z               float  %9.0g                  iv

y               float  %9.0g                  income

-------------------------------------------------------------------------------

Sorted by: 

     Note:  dataset has changed since last saved

 

.

. reg y x

 

      Source |       SS       df       MS              Number of obs =    2000

-------------+------------------------------           F(  1,  1998) = 5994.00

       Model |   112443.75     1   112443.75           Prob > F      =  0.0000

    Residual |  37481.2497  1998  18.7593842           R-squared     =  0.7500

-------------+------------------------------           Adj R-squared =  0.7499

       Total |      149925  1999  74.9999998           Root MSE      =  4.3312

 

------------------------------------------------------------------------------

           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]

-------------+----------------------------------------------------------------

           x |        7.5    .096873    77.42   0.000     7.310017    7.689983

       _cons |   2.21e-09   .0968488     0.00   1.000    -.1899352    .1899352

------------------------------------------------------------------------------

 

. outreg using iv2, se replace

 

. reg y z

 

      Source |       SS       df       MS              Number of obs =    2000

-------------+------------------------------           F(  1,  1998) =   27.00

       Model |  1999.00002     1  1999.00002           Prob > F      =  0.0000

    Residual |      147926  1998  74.0370368           R-squared     =  0.0133

-------------+------------------------------           Adj R-squared =  0.0128

       Total |      149925  1999  74.9999998           Root MSE      =  8.6045

 

------------------------------------------------------------------------------

           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]

-------------+----------------------------------------------------------------

           z |          1   .1924501     5.20   0.000     .6225761    1.377424

       _cons |   2.74e-08    .192402     0.00   1.000    -.3773295    .3773295

------------------------------------------------------------------------------

 

. outreg using iv2, se append

 

. reg x z

 

      Source |       SS       df       MS              Number of obs =    2000

-------------+------------------------------           F(  1,  1998) =   20.18

       Model |  19.9900003     1  19.9900003           Prob > F      =  0.0000

    Residual |     1979.01  1998  .990495497           R-squared     =  0.0100

-------------+------------------------------           Adj R-squared =  0.0095

       Total |        1999  1999           1           Root MSE      =  .99524

 

------------------------------------------------------------------------------

           x |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]

-------------+----------------------------------------------------------------

           z |         .1   .0222597     4.49   0.000     .0563453    .1436547

       _cons |   3.69e-09   .0222542     0.00   1.000    -.0436438    .0436438

------------------------------------------------------------------------------

 

. outreg using iv2, se append

 

. predict xhat

(option xb assumed; fitted values)

 

. cor xhat e

(obs=2000)

 

             |     xhat        e

-------------+------------------

        xhat |   1.0000

           e |   0.1000   1.0000

 

 

. reg y xhat

 

      Source |       SS       df       MS              Number of obs =    2000

-------------+------------------------------           F(  1,  1998) =   27.00

       Model |  1999.00001     1  1999.00001           Prob > F      =  0.0000

    Residual |      147926  1998  74.0370368           R-squared     =  0.0133

-------------+------------------------------           Adj R-squared =  0.0128

       Total |      149925  1999  74.9999998           Root MSE      =  8.6045

 

------------------------------------------------------------------------------

           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]

-------------+----------------------------------------------------------------

        xhat |         10   1.924501     5.20   0.000     6.225761    13.77424

       _cons |  -1.08e-08    .192402    -0.00   1.000    -.3773295    .3773295

------------------------------------------------------------------------------

 

. outreg using iv2, se append

 

. predict uhat, resid

 

. cor z uhat e

(obs=2000)

 

             |        z     uhat        e

-------------+---------------------------

           z |   1.0000

        uhat |   0.0000   1.0000

           e |   0.1000   0.8602   1.0000

 

 

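Important point: Z is uncorrelated with the residuals from this regression by construction, even though it is correlated with ambition (r = 0.10).  In real data we never observe ambition, so the residuals cannot tell us whether or not Z is correlated with it.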
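The same point can be reproduced on simulated data. The Python sketch below is an illustration under assumed values: the sample size, the seed, and the mixing weights are hypothetical, chosen so that the population correlations match this scenario (0.5 between x and e, 0.1 between x and z, 0.1 between z and e). The residual here is taken around the IV fit on the observed x; the zero correlation with z holds equally for residuals from the second-stage regression on the predicted x.

```python
import random
from math import sqrt

def mean(v):
    return sum(v) / len(v)

def cov(a, b):
    """Sample covariance of two equal-length lists."""
    ma, mb = mean(a), mean(b)
    return sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b)) / (len(a) - 1)

def corr(a, b):
    """Sample correlation of two equal-length lists."""
    return cov(a, b) / sqrt(cov(a, a) * cov(b, b))

rng = random.Random(709)
n = 100000

# Hypothetical weights solving: a + 0.1*b = 0.1 (corr of x with z)
#                               0.1*a + b = 0.5 (corr of x with e)
a, b = 0.05 / 0.99, 0.49 / 0.99
c = sqrt(1 - (0.1 * a + 0.5 * b))   # keeps the variance of x equal to 1

z = [rng.gauss(0, 1) for _ in range(n)]
e = [0.1 * zi + sqrt(0.99) * rng.gauss(0, 1) for zi in z]   # invalid: e depends on z
x = [a * zi + b * ei + c * rng.gauss(0, 1) for zi, ei in zip(z, e)]
y = [5 * xi + 5 * ei for xi, ei in zip(x, e)]               # true effect of x is 5

iv_slope = cov(z, y) / cov(z, x)        # IV estimate with a single instrument
iv_const = mean(y) - iv_slope * mean(x)
uhat = [yi - iv_const - iv_slope * xi for xi, yi in zip(x, y)]

print(f"IV slope:      {iv_slope:.2f}")
print(f"corr(z, uhat): {corr(z, uhat):.4f}")   # zero by construction
print(f"corr(z, e):    {corr(z, e):.4f}")      # not zero, but unobservable in practice
```

The residual is uncorrelated with z because the IV slope is defined as the value that makes that covariance zero, so the check passes whether or not the instrument is valid.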

 

. ivreg y (x = z)

 

Instrumental variables (2SLS) regression

 

      Source |       SS       df       MS              Number of obs =    2000

-------------+------------------------------           F(  1,  1998) =   79.92

       Model |  99949.9999     1  99949.9999           Prob > F      =  0.0000

    Residual |  49974.9996  1998  25.0125123           R-squared     =  0.6667

-------------+------------------------------           Adj R-squared =  0.6665

       Total |      149925  1999  74.9999998           Root MSE      =  5.0013

 

------------------------------------------------------------------------------

           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]

-------------+----------------------------------------------------------------

           x |         10   1.118593     8.94   0.000     7.806268    12.19373

       _cons |  -9.49e-09   .1118314    -0.00   1.000    -.2193183    .2193183

------------------------------------------------------------------------------

Instrumented:  x

Instruments:   z

------------------------------------------------------------------------------
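The estimate of 10 is what the large-sample IV formula predicts once the instrument is allowed to be correlated with ambition. This is a worked check that assumes y = 5x + 5e, as in the gen command above, and unit variances, so that covariances equal the correlations reported for this scenario:

```latex
\[
\operatorname{plim}\, b_{IV}
  = 5 + 5 \cdot \frac{\operatorname{Cov}(z,e)}{\operatorname{Cov}(z,x)}
  = 5 + 5 \cdot \frac{0.1}{0.1}
  = 10
\]
```

Because the link between z and x is weak (0.1), even a small correlation between z and e (0.1) doubles the estimate. In this scenario the IV estimate is further from the true value of 5 than the naive OLS estimate of 7.5.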

 

.

. ** key note: z is not correlated with the residual from the regression uhat

. ** but z is correlated with e...except we will never know this in real data

.

end of do-file

 

 

The comparison between these two examples boils down to whether Z, the instrument for X (education), is correlated with e (ambition).  This has to be settled by theoretical argument, because it cannot be tested with the data.

 

In practice, different choices of instrument lead to different results, and to heated arguments about whether each instrument is correlated with the error term.

 

Question: is there a variable that might be correlated with education but not ambition (ability)?  What other IV situations can you think of?

 

 

Causality and structural equation models.

 

[Draw simple SEM model, discuss problem of correlated error terms]