Sociology 709
Lecture Y
Instrumental variables and structural equation models: the good, the bad, and the ugly
The goal of this lecture is to provide a very introductory overview of the fundamental problems involved in using instrumental variables (IV) or structural equation models (SEM).
SEM can be thought of as an extended IV model, so we will discuss IV models first.
Both of these approaches are often used to make causal interpretations from non-experimental data. The key point of this lecture is that you can only do this by making additional assumptions, and your results will only be as good as those additional assumptions are.
Takeaway message: don’t get overly impressed by the complicated math of these models. Focus on what the fundamental assumptions are [as discussed below] and whether they are believable—they cannot be tested, only argued theoretically. Many times researchers will bury the assumptions and not discuss them explicitly—they are the pied pipers of empirical research. Don’t use these models yourself without being very confident of the assumptions you are making.
Note: this criticism of SEM applies to its use for causal modeling with endogenous variables, not as a method for dealing with multiple indicators of latent variables [explain]
IV estimation
IV is used when one of your explanatory variables (call it X) is hypothesized to be correlated with the error term. In other words, some unobserved factor in your error term is correlated with X.
Example: Imagine you are studying the effect of education on income, and you hypothesize that an unobserved factor (let’s call it “ambition”, but it could also be intrinsic ability) positively affects both education and income. As a result the coefficient on education will be upwardly biased (recall our earlier discussion of omitted variable bias).
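For example, the true equation might be:
$$\text{income} = \beta_0 + \beta_1\,\text{education} + \beta_2\,\text{ambition} + \varepsilon$$
but we don’t observe ambition, so we estimate:
$$\text{income} = \beta_0 + \beta_1\,\text{education} + u, \qquad u = \beta_2\,\text{ambition} + \varepsilon$$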
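The size of the bias can be written down explicitly. The sketch below assumes the true equation is income = β0 + β1·education + β2·ambition + ε, with ε unrelated to education; the β notation and the large-sample arrow are assumptions of this sketch rather than something the lecture spells out.

```latex
\hat{\beta}_1^{OLS} \;\xrightarrow{\;p\;}\;
\beta_1 + \beta_2\,
\frac{\operatorname{Cov}(\text{education},\ \text{ambition})}
     {\operatorname{Var}(\text{education})}
```

If ambition raises income (β2 > 0) and is positively correlated with education, the second term is positive, so OLS overstates the effect of education.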
The IV approach is as follows. If there is another variable Z that is correlated with education but not with ambition (or with anything else in the error term), then we can use Z to get around the omitted variable bias.
Steps:
1) Convince yourself that Z is correlated with education but not ambition.
2) Regress X on Z and use the results to predict X̂ (the fitted values, called xhat in the Stata output). X̂ is the component of X driven by Z, and it is uncorrelated with the error term (provided step 1 is true).
3) Regress income on X̂ to estimate the true effect of education (provided the assumptions hold true).
This approach is called “Two Stage Least Squares”
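With one instrument, the two steps collapse into a simple ratio, which makes the key assumption visible. This is a sketch in assumed notation: Y is income, X is education, A is ambition, β1 and β2 are their true coefficients, and π̂ is the first-stage slope; it treats Z as uncorrelated with any other part of the error term.

```latex
\begin{aligned}
\text{First stage:}\quad  & \hat{\pi} = \frac{\operatorname{Cov}(X,Z)}{\operatorname{Var}(Z)},
  \qquad \hat{X} = \text{constant} + \hat{\pi} Z \\[4pt]
\text{Second stage:}\quad & \hat{\beta}^{IV}
  = \frac{\operatorname{Cov}(Y,\hat{X})}{\operatorname{Var}(\hat{X})}
  = \frac{\hat{\pi}\,\operatorname{Cov}(Y,Z)}{\hat{\pi}^{2}\operatorname{Var}(Z)}
  = \frac{\operatorname{Cov}(Y,Z)}{\operatorname{Cov}(X,Z)} \\[4pt]
\text{Large samples:}\quad & \hat{\beta}^{IV} \;\xrightarrow{\;p\;}\;
  \beta_1 + \beta_2\,\frac{\operatorname{Cov}(Z,A)}{\operatorname{Cov}(Z,X)}
\end{aligned}
```

The estimate equals β1 only when Cov(Z, A) = 0. When Z is only weakly related to X, the denominator is small, so even a little correlation between Z and A produces a large bias.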
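Even if the assumptions are correct, the standard errors from running the second stage by hand are wrong, because they are based on residuals computed with X̂ rather than the actual X. The ivreg command in Stata (ivregress 2sls in newer versions) runs both stages and reports the corrected standard errors for you.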
Let’s go through an empirical example. We’ll call it “The Good” because it shows how well the approach works when the assumptions hold.
The do file for this example is iv.do
I created a data set with 2,000 cases and 4 variables:
. des

Contains data
  obs:         2,000
 vars:             4
 size:        40,000 (99.6% of memory free)
-------------------------------------------------------------------------------
              storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------------------------------
x               float  %9.0g                  education
e               float  %9.0g                  ambition
z               float  %9.0g                  iv
y               float  %9.0g                  income
-------------------------------------------------------------------------------
X is correlated with e and Z, but Z and e are not correlated with each other:
. cor
(obs=2000)

             |        x        e        z
-------------+---------------------------
           x |   1.0000
           e |   0.5000   1.0000
           z |   0.5000   0.0000   1.0000
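One way to build a data set with exactly this correlation structure is Stata's corr2data command. This is only a sketch: how the lecture's file was actually created is not shown, and the formula for y is borrowed from the "gen y=5*x+5*e" line that appears in the second scenario's log.

```stata
* Sketch only: an assumed way to construct data like this example.
clear

* Target correlations, in the order x (education), e (ambition), z (instrument)
matrix C = (1, .5, .5 \ .5, 1, 0 \ .5, 0, 1)

* corr2data creates variables whose sample correlations match C exactly
* (means 0 and standard deviations 1 by default)
corr2data x e z, n(2000) corr(C)

* Outcome: the true effect of x is 5; e plays the role of omitted "ambition"
generate y = 5*x + 5*e

label variable x "education"
label variable e "ambition"
label variable z "iv"
label variable y "income"

* Check the correlation structure
correlate x e z
```

Because corr2data forces the sample correlations to match the target matrix exactly, the regression coefficients come out as round numbers instead of varying from one random draw to the next.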
. reg y x

      Source |       SS       df       MS              Number of obs =    2000
-------------+------------------------------           F(  1,  1998) = 5994.00
       Model |   112443.75     1   112443.75           Prob > F      =  0.0000
    Residual |  37481.2497  1998  18.7593842           R-squared     =  0.7500
-------------+------------------------------           Adj R-squared =  0.7499
       Total |      149925  1999  74.9999998           Root MSE      =  4.3312

------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           x |        7.5    .096873    77.42   0.000     7.310017    7.689983
       _cons |   2.21e-09   .0968488     0.00   1.000    -.1899352    .1899352
------------------------------------------------------------------------------
Note: This estimate is biased. The true coefficient is 5, but OLS gives 7.5 because x also picks up the effect of the omitted variable e.
. outreg using iv, se replace

. reg y z

      Source |       SS       df       MS              Number of obs =    2000
-------------+------------------------------           F(  1,  1998) =  181.64
       Model |  12493.7501     1  12493.7501           Prob > F      =  0.0000
    Residual |  137431.249  1998  68.7844091           R-squared     =  0.0833
-------------+------------------------------           Adj R-squared =  0.0829
       Total |      149925  1999  74.9999998           Root MSE      =  8.2936

------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           z |        2.5   .1854977    13.48   0.000     2.136211    2.863789
       _cons |   9.25e-09   .1854514     0.00   1.000    -.3636983    .3636983
------------------------------------------------------------------------------
. outreg using iv, se append

. reg x z

      Source |       SS       df       MS              Number of obs =    2000
-------------+------------------------------           F(  1,  1998) =  666.00
       Model |  499.750001     1  499.750001           Prob > F      =  0.0000
    Residual |     1499.25  1998  .750375376           R-squared     =  0.2500
-------------+------------------------------           Adj R-squared =  0.2496
       Total |        1999  1999           1           Root MSE      =  .86624

------------------------------------------------------------------------------
           x |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           z |         .5   .0193746    25.81   0.000     .4620035    .5379965
       _cons |  -9.34e-10   .0193698    -0.00   1.000     -.037987     .037987
------------------------------------------------------------------------------
. outreg using iv, se append

Now we predict x based on z:

. predict xhat
(option xb assumed; fitted values)

. cor xhat e
(obs=2000)

             |     xhat        e
-------------+------------------
        xhat |   1.0000
           e |   0.0000   1.0000
Now we regress y on the predicted x, free of correlation with ambition.

. reg y xhat

      Source |       SS       df       MS              Number of obs =    2000
-------------+------------------------------           F(  1,  1998) =  181.64
       Model |  12493.7501     1  12493.7501           Prob > F      =  0.0000
    Residual |  137431.249  1998  68.7844091           R-squared     =  0.0833
-------------+------------------------------           Adj R-squared =  0.0829
       Total |      149925  1999  74.9999998           Root MSE      =  8.2936

------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        xhat |          5   .3709955    13.48   0.000     4.272422    5.727579
       _cons |   9.46e-09   .1854514     0.00   1.000    -.3636983    .3636983
------------------------------------------------------------------------------
. outreg using iv, se append

. ivreg y (x = z)

Instrumental variables (2SLS) regression

      Source |       SS       df       MS              Number of obs =    2000
-------------+------------------------------           F(  1,  1998) =  499.50
       Model |  99950.0001     1  99950.0001           Prob > F      =  0.0000
    Residual |  49974.9995  1998  25.0125122           R-squared     =  0.6667
-------------+------------------------------           Adj R-squared =  0.6665
       Total |      149925  1999  74.9999998           Root MSE      =  5.0013

------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           x |          5   .2237187    22.35   0.000     4.561254    5.438746
       _cons |   1.39e-08   .1118314     0.00   1.000    -.2193183    .2193183
------------------------------------------------------------------------------
Instrumented:  x
Instruments:   z
------------------------------------------------------------------------------
OK, so far so good…it worked. Now we turn to the bad & ugly.
What if the instrument Z is correlated with e?
. * scenario 2: the bad & ugly

. cor
(obs=2000)

             |        x        e        z
-------------+---------------------------
           x |   1.0000
           e |   0.5000   1.0000
           z |   0.1000   0.1000   1.0000
. gen y=5*x+5*e

. des

Contains data
  obs:         2,000
 vars:             4
 size:        40,000 (99.6% of memory free)
-------------------------------------------------------------------------------
              storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------------------------------
x               float  %9.0g                  education
e               float  %9.0g                  ambition
z               float  %9.0g                  iv
y               float  %9.0g                  income
-------------------------------------------------------------------------------
Sorted by:
     Note:  dataset has changed since last saved
. reg y x

      Source |       SS       df       MS              Number of obs =    2000
-------------+------------------------------           F(  1,  1998) = 5994.00
       Model |   112443.75     1   112443.75           Prob > F      =  0.0000
    Residual |  37481.2497  1998  18.7593842           R-squared     =  0.7500
-------------+------------------------------           Adj R-squared =  0.7499
       Total |      149925  1999  74.9999998           Root MSE      =  4.3312

------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           x |        7.5    .096873    77.42   0.000     7.310017    7.689983
       _cons |   2.21e-09   .0968488     0.00   1.000    -.1899352    .1899352
------------------------------------------------------------------------------
. outreg using iv2, se replace

. reg y z

      Source |       SS       df       MS              Number of obs =    2000
-------------+------------------------------           F(  1,  1998) =   27.00
       Model |  1999.00002     1  1999.00002           Prob > F      =  0.0000
    Residual |      147926  1998  74.0370368           R-squared     =  0.0133
-------------+------------------------------           Adj R-squared =  0.0128
       Total |      149925  1999  74.9999998           Root MSE      =  8.6045

------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           z |          1   .1924501     5.20   0.000     .6225761    1.377424
       _cons |   2.74e-08    .192402     0.00   1.000    -.3773295    .3773295
------------------------------------------------------------------------------
. outreg using iv2, se append

. reg x z

      Source |       SS       df       MS              Number of obs =    2000
-------------+------------------------------           F(  1,  1998) =   20.18
       Model |  19.9900003     1  19.9900003           Prob > F      =  0.0000
    Residual |     1979.01  1998  .990495497           R-squared     =  0.0100
-------------+------------------------------           Adj R-squared =  0.0095
       Total |        1999  1999           1           Root MSE      =  .99524

------------------------------------------------------------------------------
           x |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           z |         .1   .0222597     4.49   0.000     .0563453    .1436547
       _cons |   3.69e-09   .0222542     0.00   1.000    -.0436438    .0436438
------------------------------------------------------------------------------
. outreg using iv2, se append

. predict xhat
(option xb assumed; fitted values)

. cor xhat e
(obs=2000)

             |     xhat        e
-------------+------------------
        xhat |   1.0000
           e |   0.1000   1.0000
. reg y xhat

      Source |       SS       df       MS              Number of obs =    2000
-------------+------------------------------           F(  1,  1998) =   27.00
       Model |  1999.00001     1  1999.00001           Prob > F      =  0.0000
    Residual |      147926  1998  74.0370368           R-squared     =  0.0133
-------------+------------------------------           Adj R-squared =  0.0128
       Total |      149925  1999  74.9999998           Root MSE      =  8.6045

------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        xhat |         10   1.924501     5.20   0.000     6.225761    13.77424
       _cons |  -1.08e-08    .192402    -0.00   1.000    -.3773295    .3773295
------------------------------------------------------------------------------
. outreg using iv2, se append

. predict uhat, resid

. cor z uhat e
(obs=2000)

             |        z     uhat        e
-------------+---------------------------
           z |   1.0000
        uhat |   0.0000   1.0000
           e |   0.1000   0.8602   1.0000
Important point: Z is not correlated with the residuals from this regression, so the residuals give no warning that anything is wrong. With real data, where ambition is unobserved, we can’t tell whether or not Z is correlated with it.
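The zero correlation between z and uhat is built in by construction and says nothing about the instrument. A short sketch of why, in assumed notation: â and b̂ are the coefficients from regressing y on xhat, and ĉ and d̂ are those from regressing x on z.

```latex
\begin{aligned}
\hat{u} &= y - \hat{a} - \hat{b}\,\hat{x}, \qquad \hat{x} = \hat{c} + \hat{d}\,z \\[4pt]
\text{OLS forces}\quad & \operatorname{Cov}(\hat{u},\hat{x}) = 0 \\[4pt]
\Rightarrow\quad & \operatorname{Cov}(\hat{u}, z)
  = \frac{\operatorname{Cov}(\hat{u},\hat{x})}{\hat{d}} = 0
  \qquad (\hat{d} \neq 0)
\end{aligned}
```

Since xhat is just a rescaled z, OLS residuals from a regression on xhat are always uncorrelated with z, whatever the relationship between z and ambition. Here the residuals are in fact strongly correlated with e (0.8602), but that can only be seen because the data were simulated.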
. ivreg y (x = z)

Instrumental variables (2SLS) regression

      Source |       SS       df       MS              Number of obs =    2000
-------------+------------------------------           F(  1,  1998) =   79.92
       Model |  99949.9999     1  99949.9999           Prob > F      =  0.0000
    Residual |  49974.9996  1998  25.0125123           R-squared     =  0.6667
-------------+------------------------------           Adj R-squared =  0.6665
       Total |      149925  1999  74.9999998           Root MSE      =  5.0013

------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           x |         10   1.118593     8.94   0.000     7.806268    12.19373
       _cons |  -9.49e-09   .1118314    -0.00   1.000    -.2193183    .2193183
------------------------------------------------------------------------------
Instrumented:  x
Instruments:   z
------------------------------------------------------------------------------
. ** key note: z is not correlated with the residual from the regression uhat
. ** but z is correlated with e...except we will never know this in real data
.
end of do-file
The comparison between these two examples boils down to a question of whether Z, the instrument for X (education), is correlated with e (ambition) or not. This is a theoretical argument, because it cannot be tested.
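Both IV coefficients can be checked by hand from the "reg y z" and "reg x z" output, since with a single instrument the IV estimate is the ratio of those two slopes. The second equality in each line uses the large-sample bias expression, true effect + 5 × Cov(z, e) / Cov(z, x). It is a sketch that relies on all variables having variance 1 in these data, so that covariances equal the correlations shown.

```latex
\begin{aligned}
\text{The Good:}\quad         & \hat{\beta}^{IV} = \frac{2.5}{0.5} = 5  = 5 + 5\cdot\frac{0}{0.5} \\[4pt]
\text{The Bad \& Ugly:}\quad  & \hat{\beta}^{IV} = \frac{1}{0.1}   = 10 = 5 + 5\cdot\frac{0.1}{0.1}
\end{aligned}
```

In the second scenario the IV estimate (10) is further from the true value of 5 than the plain OLS estimate (7.5). A correlation of only 0.1 between Z and ambition does this much damage because Z is also only weakly correlated with education.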
In practice, the choice of instrument leads to heated arguments (about whether the instrument is correlated with the error term), and different instruments give different results.
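The sensitivity of the results to the instrument can be illustrated with a small simulation. This is a hypothetical sketch and not part of iv.do: it assumes data built with corr2data, keeps corr(x, z) at 0.1 as in the second scenario, and varies only corr(z, e). It uses ivregress 2sls, the newer form of the ivreg command.

```stata
* Hypothetical sensitivity check: vary corr(z, e), hold corr(x, z) at 0.1.
* Variable order in the correlation matrix: x (education), e (ambition), z (iv)
foreach r in 0 .05 .1 {
    matrix C = (1, .5, .1 \ .5, 1, `r' \ .1, `r', 1)

    * Build 2,000 cases whose sample correlations match C exactly
    corr2data x e z, n(2000) corr(C) clear

    * Same outcome equation as in the log: true effect of x is 5
    generate y = 5*x + 5*e

    * IV estimate using z as the instrument for x
    quietly ivregress 2sls y (x = z)
    display "corr(z,e) = `r':  IV coefficient on x = " %6.3f _b[x]
}
```

Under these assumptions the three coefficients should come out at about 5, 7.5 and 10. None of the three data sets would look different to a researcher who cannot observe e.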
Question: is there a variable that might be correlated with education but not ambition (ability)? What other IV situations can you think of?
Causality and structural equation models
[Draw simple SEM model, discuss problem of correlated error terms]