Sociology 709 Review Sheet

 

Go over midterm review sheet first (everything builds on the first half of the course)

 

Categorical dependent variables

 

Lecture K: Logit and probit models

What is the formula for a logit model showing the relationship between the log-odds and the probability that Y=1?

What is the formula for a probit model showing the relationship between the Z-score and the probability that Y=1?

Why do we want to model the log-odds (logit) or Z-score (probit)?
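
As a study aid, the two relationships asked about above are commonly written as follows. The notation is an assumption here and may differ from the lecture's: p is the probability that Y=1, alpha and beta are the intercept and slope, and Phi is the standard normal cumulative distribution function.

```latex
% Logit: the log-odds are linear in X
\ln\!\left(\frac{p}{1-p}\right) = \alpha + \beta X
\quad\Longleftrightarrow\quad
p = \frac{e^{\alpha + \beta X}}{1 + e^{\alpha + \beta X}}

% Probit: the Z-score is linear in X
\Phi^{-1}(p) = \alpha + \beta X
\quad\Longleftrightarrow\quad
p = \Phi(\alpha + \beta X)
```

In both cases the left-hand side can take any value on the real line, while the implied probability p always stays between 0 and 1.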

How do you interpret the output for a logit or probit model?

Sample syntax:

logit y x

probit y x
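
A minimal sketch of how the output might be explored. It uses Stata's built-in auto dataset rather than course data, so foreign and mpg are stand-ins for y and x. The margins command assumes Stata 11 or later.

```stata
* Load an example dataset shipped with Stata (not the course data)
sysuse auto, clear

* Logit: coefficients are changes in the log-odds that foreign==1
logit foreign mpg

* Redisplay the same model as odds ratios (exponentiated coefficients)
logit, or

* Average marginal effect of mpg on the probability that foreign==1
margins, dydx(mpg)

* Probit: coefficients are changes in the Z-score
probit foreign mpg

* Predicted probabilities from the most recent model (the probit)
predict phat, pr
summarize phat
```

The sign and significance of a coefficient can be read directly from either model, but the size of the effect on the probability depends on where along the curve it is evaluated, which is why predicted probabilities or marginal effects are often reported.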

 

Lecture L: Multinomial logit, ordered probit [This lecture will not be on the final exam Spring 2007]

Multinomial logit: the dependent variable has more than two unordered categories.  We model the log-odds of observing category k rather than a base category j.

mlogit y x

where y= 1, 2, 3, 4 (for example)

Example output:

. tab statefip re, col nofreq

                      |                re
    State (FIPS code) |     white      black   hispanic |     Total
----------------------+---------------------------------+----------
        Massachusetts |     25.73      12.35      22.76 |     23.89
             Michigan |     30.50      24.21      19.51 |     28.55
       North Carolina |     43.78      63.44      57.72 |     47.56
----------------------+---------------------------------+----------
                Total |    100.00     100.00     100.00 |    100.00

 


 

. xi: mlogit statefip i.re, baseout(25)
i.re              _Ire_1-3            (naturally coded; _Ire_1 omitted)

Iteration 0:   log likelihood = -13733.058
Iteration 1:   log likelihood = -13577.333
Iteration 2:   log likelihood = -13575.265
Iteration 3:   log likelihood = -13575.261

Multinomial logistic regression                   Number of obs   =      13037
                                                  LR chi2(4)      =     315.59
                                                  Prob > chi2     =     0.0000
Log likelihood = -13575.261                       Pseudo R2       =     0.0115

------------------------------------------------------------------------------
    statefip |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
Michigan     |
      _Ire_2 |   .5033837   .0953817     5.28   0.000      .316439    .6903285
      _Ire_3 |  -.3242781   .0846035    -3.83   0.000    -.4900979   -.1584584
       _cons |   .1701275   .0266333     6.39   0.000     .1179271    .2223278
-------------+----------------------------------------------------------------
North Caro~a |
      _Ire_2 |   1.105145   .0851317    12.98   0.000     .9382904    1.272001
      _Ire_3 |   .3987839   .0689982     5.78   0.000     .2635499    .5340179
       _cons |   .5316914   .0247155    21.51   0.000       .48325    .5801329
------------------------------------------------------------------------------
(statefip==Massachusetts is the base outcome)
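
Because this model contains only the race/ethnicity dummies, its coefficients can be recovered, up to rounding, from the column percentages in the table above. The worked check below assumes re is coded 1 = white, 2 = black, 3 = hispanic, so that _Ire_2 is the black dummy and _Ire_3 the hispanic dummy. That coding is inferred from the column order; it is not stated in the output.

```latex
% Constant: log-odds of Michigan vs. Massachusetts for whites (the omitted group)
\text{\_cons}_{MI} = \ln\!\left(\frac{30.50}{25.73}\right) \approx 0.170

% _Ire_2: difference in that log-odds between blacks and whites
\text{\_Ire\_2}_{MI} = \ln\!\left(\frac{24.21}{12.35}\right) - \ln\!\left(\frac{30.50}{25.73}\right)
\approx 0.673 - 0.170 = 0.503

% Exponentiating gives the ratio of the two odds (Stata calls this a relative risk ratio)
e^{0.503} \approx 1.65
```

Read this way, the odds of living in Michigan rather than Massachusetts are roughly 65 percent higher for black respondents than for white respondents in this sample.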

 

 

Ordered probit: for ordinal data (example: happiness questions).  It extends the probit to more than two ordered categories by estimating cut points (thresholds) on the same underlying Z-score scale.
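
One common way to write the ordered probit, offered as a sketch: an underlying Z-score is compared against a set of estimated cut points. The symbols kappa (cut points), beta, and K (number of categories) are notation assumed here, not taken from the lecture.

```latex
% Probability of falling in ordered category k,
% with \kappa_0 = -\infty and \kappa_K = +\infty
P(Y = k) = \Phi(\kappa_k - \beta X) - \Phi(\kappa_{k-1} - \beta X),
\qquad k = 1, \dots, K
```

With K = 2 this reduces to the ordinary probit, with the single cut point playing the role of the intercept with its sign reversed. In Stata the model is fit with oprobit y x.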

 

 

Diagnosing and dealing with problems in the data

 

Lecture P: Influential cases

How much effect is the inclusion or exclusion of a specific case having on your results?

(DFBETA, Cook’s D)

DFBETA is the change in each coefficient when each observation is deleted in turn, for all coefficients j and cases i.

Cook’s D presents a summary of the DFBETAs for each case.
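
In symbols, one standard way to write these two diagnostics is given below. The notation is assumed for illustration: b_j is the estimate from the full sample, b_{j(-i)} is the estimate with case i deleted, k is the number of estimated coefficients, and s^2 is the residual mean square. Stata's dfbeta() option reports a scaled version of this kind.

```latex
% Scaled change in coefficient j when case i is deleted
\mathrm{DFBETA}_{ij} = \frac{b_j - b_{j(-i)}}{\mathrm{SE}(b_j)}

% Cook's D: one number per case, combining the shifts in all coefficients
D_i = \frac{(\mathbf{b} - \mathbf{b}_{(-i)})' \, \mathbf{X}'\mathbf{X} \, (\mathbf{b} - \mathbf{b}_{(-i)})}{k \, s^2}
```

DFBETA therefore gives one value per case per coefficient, while Cook's D gives one value per case.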

* example syntax

regress prestige income educ

predict d, cooksd

predict dfbeta_inc, dfbeta(income)

summarize d, detail

list d dfbeta_inc occtitle if d>.5

 

Lecture Q: Omitted variable bias

 

Why is this a problem?

Can you guess what direction the bias might be (up or down)?

Formulas Q1 and Q2, Q2b → important to understand the direction of bias.

→ Think about how fixed-effects models might be a solution.

From equation Q2: if X is correlated with Z (i.e., their covariance is not equal to zero), then omitting Z from the model will bias our estimate of the coefficient on X.  If the coefficient on Z is positive and the correlation is positive, then the estimated coefficient on X will be upwardly biased (i.e., the estimate will be larger than it should be).
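
The textbook version of this result for one included variable X and one omitted variable Z is sketched below. It may not match the lecture's equation Q2 term for term; the symbols beta_X, beta_Z, and b_X are assumptions.

```latex
% True model
Y = \alpha + \beta_X X + \beta_Z Z + \varepsilon

% Expected value of the coefficient on X when Z is left out
E[b_X] = \beta_X + \beta_Z \, \frac{\mathrm{Cov}(X, Z)}{\mathrm{Var}(X)}
```

The bias term is a product of two pieces, so its direction follows from their signs: if the effect of Z on Y and the covariance of X with Z have the same sign the bias is upward, if they have opposite signs it is downward, and if either is zero there is no bias.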

 

Gronniger, “Familial Obesity as a Proxy for Omitted Variables in the Obesity-Mortality Relationship,” Demography 42(4), November 2005: 719-735.

Why would evidence of the effect of "coresidential obesity" on mortality suggest that the mortality risk of obesity is overstated in the literature?

 

Lecture R: Heteroskedasticity

 

What assumption of the error term does heteroskedasticity violate?

What are the implications of heteroskedasticity (i.e. why is it a problem)?

An example of heteroskedastic data (school test data).

How to detect it (note: this is not a universal test):

estat hettest, iid (run after the regression; see example below)

Solutions:

a) Robust standard errors (often loosely called robust regression).  In Stata, “reg y x, robust” (see example below).  Point estimates are the same as OLS, but the variance-covariance matrix of the estimates (VCE) is estimated with the Huber-White “sandwich” estimator.

b) Weighted Least Squares (WLS): if we have knowledge about the pattern of heteroskedasticity, we can weight the data with weights that are proportional to the inverse of the error variance for each case.

example syntax:

gen wgt=1/(100*(x)^2)

reg y x [aw=wgt]
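
A short sketch of the detection step and the robust standard error fix, run on Stata's built-in auto dataset rather than the school test data; price and weight are stand-ins for y and x.

```stata
* Load an example dataset shipped with Stata (not the school test data)
sysuse auto, clear

* Ordinary least squares
regress price weight

* Breusch-Pagan / Cook-Weisberg test; iid drops the normality assumption
estat hettest, iid

* Same point estimates, Huber-White robust standard errors
regress price weight, vce(robust)
```

Comparing the two regress tables shows identical coefficients with different standard errors, t statistics, and confidence intervals.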

 

Lecture S: Multicollinearity

Why is multicollinearity a problem?

Why is it less of a problem than omitted variable bias?

How do you detect it?

Example syntax:

estat vif
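
For reference, the variance inflation factor that estat vif reports for predictor j is usually defined as below, where R_j^2 comes from regressing predictor j on all the other predictors. The symbol names are assumed here.

```latex
\mathrm{VIF}_j = \frac{1}{1 - R_j^2}
```

A VIF of 10, a threshold often quoted as a rule of thumb, corresponds to R_j^2 = 0.9: ninety percent of the variation in that predictor is already accounted for by the other predictors.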

What can you do about it?

 

Lecture T: Weighting data

Clustering versus sampling weights.

What characteristics of the sampling design affect estimates such as totals, means, proportions, and regression coefficients? What characteristics of the sampling design affect standard errors, p-values, and confidence intervals?

Why does clustering affect the standard errors?

Intraclass correlation

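Deff and Deft (design effects): Deff is the ratio of the design-based variance of an estimate to its variance under simple random sampling; Deft is the corresponding ratio of standard errors.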

example syntax:

svyset dnum [pw=pw], fpc(fpc)
svy: mean api00
svy: total enroll
estat eff, deff deft
svy: regress api00 meals ell avg_ed
estat eff, deff deft
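
The quantities reported by estat eff can be tied back to the intraclass correlation with the usual approximation below. This is a sketch under the assumption of equal cluster sizes; m (cluster size) and rho (intraclass correlation) are illustrative symbols, and Stata's exact definitions differ slightly in how the simple-random-sampling benchmark is constructed.

```latex
% Design effect: variance under the actual design relative to simple random sampling
\mathrm{Deff} = \frac{\mathrm{Var}_{\text{design}}(\hat\theta)}{\mathrm{Var}_{\text{SRS}}(\hat\theta)},
\qquad \mathrm{Deft} \approx \sqrt{\mathrm{Deff}}

% Approximation for clusters of equal size m with intraclass correlation \rho
\mathrm{Deff} \approx 1 + (m - 1)\rho
```

For example, with 20 cases per cluster and an intraclass correlation of 0.05, Deff is about 1 + 19(0.05) = 1.95, so standard errors are about 1.4 times as large as simple random sampling would suggest.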

 

 

Lecture U: Missing data

Why is missing data a problem?

MCAR, MAR, and nonignorable missing data

Why would listwise deletion be a bad strategy? (Why is it better than some other approaches?)

What is the difference between single and multiple imputation?

example syntax:

ice lnwage hgc sex age age2 exp80 exp802 ten, dryrun
 
ice lnwage hgc sex age age2 exp80 exp802 ten using impute, ///
passive(exp802: exp80^2) cmd(sex: logit) m(5) replace
 
drop _all
use impute
 
micombine reg lnwage hgc sex age* exp80* ten*
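
Results from the m imputed datasets are usually combined with Rubin's rules, sketched here. It is assumed, not confirmed by this sheet, that micombine applies these rules; the symbols are illustrative, with b_l the coefficient estimated in imputed dataset l.

```latex
% Point estimate: average over the m imputed datasets
\bar{b} = \frac{1}{m} \sum_{l=1}^{m} b_l

% Within-imputation and between-imputation variance
W = \frac{1}{m} \sum_{l=1}^{m} \mathrm{SE}(b_l)^2, \qquad
B = \frac{1}{m-1} \sum_{l=1}^{m} (b_l - \bar{b})^2

% Total variance of the combined estimate
T = W + \left(1 + \frac{1}{m}\right) B
```

The between-imputation term B is what single imputation cannot supply: with only one filled-in dataset the uncertainty about the imputed values is ignored and standard errors come out too small.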

 

 

Advanced Topics

 

Lecture V: Longitudinal data, fixed effects & random effects

Cross sectional:

Fixed Effects:

 

Advantages of this model: [explain] Got rid of u!
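
A common way to write the two models is sketched below. The notation is assumed and the lecture's may differ: i indexes cases, t indexes years, and u_i is an unobserved characteristic of case i that does not change over time.

```latex
% Cross-sectional (pooled) model: u_i is unobserved and sits in the error term
y_{it} = \alpha + \beta x_{it} + u_i + \varepsilon_{it}

% Fixed effects: subtract each case's own mean over time
y_{it} - \bar{y}_i = \beta (x_{it} - \bar{x}_i) + (\varepsilon_{it} - \bar{\varepsilon}_i)
```

Because u_i is constant within a case, it equals its own mean and drops out of the second equation, so any correlation between u_i and x can no longer bias the estimate of beta. The cost is that variables that never change within a case drop out in the same way.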

 

What was the logic of the fixed effects model with the teenage pregnancy data?

Example syntax:

Setting the data for longitudinal analysis

example:

iis caseid

tis year

example regressions:

xi: reg lnwage i.sex pctfem hgc hrswk tenur exp*

xi: xtreg lnwage i.sex pctfem hgc hrswk tenur exp*, fe

xi: xtreg lnwage i.sex pctfem hgc hrswk tenur exp*, re
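
A runnable sketch of the same workflow on the nlswork example panel from Stata's website, not the course data (it needs internet access). In Stata 10 and later, xtset does the job of iis and tis in one step. The variables ln_wage, tenure, hours, and ttl_exp belong to that example dataset.

```stata
* Example panel from Stata's website (not the course data)
webuse nlswork, clear

* Declare the panel structure: idcode identifies persons, year is the time variable
xtset idcode year

* Fixed effects and random effects versions of the same wage equation
xtreg ln_wage tenure hours ttl_exp, fe
estimates store fe
xtreg ln_wage tenure hours ttl_exp, re
estimates store re

* Hausman test: a small p-value casts doubt on the random effects assumption
hausman fe re
```

The Hausman test compares the two sets of coefficients; large systematic differences suggest that the unobserved case-level term is correlated with the predictors, which favors fixed effects.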

 

 

Lecture X: Maximum Likelihood

What is the maximum likelihood approach?

Explain a graph of the likelihood of obtaining a series of heads (H) or tails (T), for example HTHHH, where the Y-axis is the likelihood and the X-axis is p, the probability of the coin coming up heads.

Explain, graphically, how the computer “climbs the hill” to find the maximum likelihood estimate.  You can use our example in class of walking up a hill with a blindfold on.
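
For the HTHHH example, the top of the hill can also be found by hand. The worked steps below assume independent tosses with the same probability p of heads on each toss.

```latex
% Likelihood of HTHHH (four heads, one tail) as a function of p
L(p) = p^4 (1 - p)

% Log-likelihood and its slope
\ln L(p) = 4 \ln p + \ln(1 - p), \qquad
\frac{d \ln L}{dp} = \frac{4}{p} - \frac{1}{1 - p}

% Setting the slope to zero gives the top of the hill
\frac{4}{p} = \frac{1}{1 - p} \;\Rightarrow\; \hat{p} = \frac{4}{5} = 0.8
```

The slope is positive for p below 0.8 and negative above it, which is the information a hill-climbing routine uses to decide which direction to step next.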

 

 

Lecture Y: Instrumental variables

What is the key assumption of an instrumental variable model?  Can the assumption be tested?
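
The usual statement of the instrumental variable setup with a single instrument is sketched below. Z denotes the instrument and epsilon the error term; the notation is assumed rather than taken from the lecture.

```latex
% Model of interest, with X possibly correlated with the error term
Y = \alpha + \beta X + \varepsilon

% Instrument Z: relevance (can be checked) and exclusion (an assumption)
\mathrm{Cov}(Z, X) \neq 0, \qquad \mathrm{Cov}(Z, \varepsilon) = 0

% Simple IV estimator with one instrument
\hat{\beta}_{IV} = \frac{\mathrm{Cov}(Z, Y)}{\mathrm{Cov}(Z, X)}
```

The first condition can be examined in the data by regressing X on Z. The second involves the unobserved error term, so with a single instrument it has to be defended on substantive grounds.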

Can you determine causality in cross-sectional data without making additional assumptions?