University of North Carolina
at Chapel Hill

SOCI 209 - LINEAR REGRESSION MODELS - Spring 2006
Professor François Nielsen

Assignment 4 - Released Tue 4 April
DUE Tue 18 April

Problems on Outlying & Influential Observations, Collinearity, Heteroskedasticity & Autocorrelated Errors

Use a regression program of your choice to do problems requiring data analysis.
ALSM5e = Applied Linear Statistical Models 5e (2004) OR Applied Linear Regression Models 4e (2004) (new editions).
ALSM4e = Applied Linear Statistical Models 4e (1996) OR Applied Linear Regression Models 3e (1996) (old editions).
1.  ALSM5e 10.3 p. 414 [ALSM4e 9.3 p. 392] (just discard influential cases?)

2.  This problem uses the Yule data set.  It focuses on diagnostics and remedial measures for outliers and influential cases.

a.  Using the Yule data, estimate the full model of pauperism for the 32 unions (cases), and generate the principal regression diagnostics for outlying and influential observations.
. * use the menu to open the Yule data if it is not in your working directory
. use yule
. regress paup outratio propold pop
. *next calculate leverage, studentized residual, and cook
. predict leverage, leverage
. predict student, rstudent
. predict cook, cooksd
b.  Use the studentized deleted residuals to identify outliers in the Y dimension, using the Bonferroni procedure with an initial alpha = .01 level.  State the decision rule and conclusion.
. * next identify cases with high values of student
. list union student if abs(student) > 2
. * next calculate Bonferroni significance for student
. display .01/(2*32)
You then need to calculate the p-value from the appropriate distribution for all cases using a generate statement.
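As a minimal sketch of that step, assuming the full model has p = 4 parameters (intercept plus 3 predictors) so that the studentized deleted residuals have n - p - 1 = 27 degrees of freedom, the critical value and the case-wise p-values might be obtained as follows. The variable name pstudent is hypothetical.

```stata
* Bonferroni critical value: upper tail area alpha/(2n) of the t distribution with 27 df
display invttail(27, .01/(2*32))
* two-sided p-value for each case (pstudent is a hypothetical variable name)
generate pstudent = 2*ttail(27, abs(student))
* a case is flagged as a Y outlier if its two-sided p-value is below alpha/n
list union student pstudent if pstudent < .01/32
```

Comparing abs(student) with the critical value and comparing pstudent with .01/32 are two ways of stating the same decision rule.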
c.  Identify any X-outlying (high-leverage) observation using the appropriate diagnostic and rule of thumb; that's easy, the following commands show you how.
. * next calculate 2(p/n) cutoff for leverage
. display 2*4/32
.25
. list union leverage if leverage > 2*4/32
d.  Identify any influential observation by looking at an index plot of Cook's distance (COOK in SYSTAT) and calculating the corresponding percentiles of the appropriate F distribution for cases with high values of COOK; compare the percentiles with the cutoffs suggested in the textbook.
. * next do stem and leaf display for cook (can't find
. * how to make index plot in STATA)
. stem cook
. list union cook if cook > .3
. * next do index plot
. gen secno=_n
. * this generates sequence number secno; next do plot
. * the old syntax (STATA 7 and earlier) was: graph cook secno, twoway connect(l[.])
. * in STATA 8 and later use twoway connected instead
. twoway connected cook secno
. * next calculate percentile of F distribution for each value of cook
. gen fperc=100*F(4,28,cook)
. list union cook fperc if fperc>10
What union(s) appear(s) to be problematic (influential), if any?
e.  Use the Hadi procedure for robust outlier detection.
. * next use Hadi multivariate outliers procedure creating variables hadiout
. * (1 if case is Hadi outlier, 0 otherwise) and Mahalanobis distance
. hadimvo paup outratio propold pop, generate(hadiout mahadist)
. * next list Hadi outliers
. list union mahadist if hadiout
Are the results of the Hadi procedure consistent with those of the other diagnostics?  Why (substantively) are these particular unions deviant?  (How would you find out more about the various neighborhoods of metropolitan London in the late 19th century?)  On what substantive grounds could one justify removing these deviant cases?
f.  After selecting out the outliers identified by the Hadi procedure, estimate the following 3 models.  If appropriate, estimate a final (4th) trimmed model.
. * next do regressions with 30 cases that are not outliers
. reg paup outratio if ~hadiout
. reg paup outratio propold if ~hadiout
. reg paup outratio propold pop if ~hadiout 
. reg paup outratio pop if ~hadiout
Present the regression results in a tabular form suitable for publication.
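One possible way to assemble such a table, sketched here under the assumption that you are running STATA 8 or later (where estimates store and estimates table are available), is to store each fit and print them side by side. The names m1, m2 and m3 are hypothetical.

```stata
* store each trimmed-sample model under a hypothetical name
quietly regress paup outratio if ~hadiout
estimates store m1
quietly regress paup outratio propold if ~hadiout
estimates store m2
quietly regress paup outratio propold pop if ~hadiout
estimates store m3
* coefficients with standard errors, plus N and R-squared for each model
estimates table m1 m2 m3, b(%9.3f) se(%9.3f) stats(N r2)
```

The output is only a starting point; a publication table would still need variable labels and a note on the sample.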

g.  Re-estimate the full model with the 32 cases using robust regression.  Use STATA's rreg procedure.
. * next do robust regression with original 32 cases
. rreg paup outratio propold pop
How do these estimates compare to OLS with the 32 cases and OLS with the outliers removed?
3.  ALSM5e 10.13 p. 416 [ALSM4e 9.13 p. 395] (cosmetics sales; clues of collinearity)
. regress y x1 x2 x3
. corr x1 x2 x3
4.  ALSM5e 10.14 p. 417 [ALSM4e 9.14 p. 395] (cosmetics sales; interpreting VIF, advantage of experiment)  Note that VIFk = 1/TOLk, and conversely TOLk = 1/VIFk .
. reg y x1 x2 x3
. vif
. reg y x1
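The relation between VIF and tolerance noted in problem 4 can be checked by hand. The sketch below assumes the same variable names (x1, x2, x3) and shows the calculation for x1 only: the tolerance is 1 minus the R-squared of the auxiliary regression of x1 on the other predictors.

```stata
* auxiliary regression of x1 on the remaining predictors
quietly regress x1 x2 x3
* tolerance is 1 - R-squared of the auxiliary regression; VIF is its reciprocal
display "TOL1 = " 1 - e(r2)
display "VIF1 = " 1/(1 - e(r2))
```

The value displayed for VIF1 should agree with the entry for x1 in the output of vif.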
5.  ALSM5e 11.6 p. 472 [ALSM4e 10.6 p. 445] (computer-assisted learning; handling heteroscedasticity)  This is a small but complete paradigm for handling heteroskedasticity.  Disregard the detailed instructions in the textbook.  Instead use the data set and do the following steps.
a. Run the regression and plot the residuals against the fitted values; do you see any funny funnel pattern?
      regress y x
      predict residuals, residuals
      predict yhat, xb
      scatter residuals yhat
       
    b. Run the regression of Y on X and calculate the Breusch-Pagan (aka Cook-Weisberg) test of heteroskedasticity.
      reg y x
      hettest
    Is the test significant?  What does this mean?

    c. In STATA, rerun the regression 3 times using (1) the Huber-White robust standard errors (option robust), (2) the MacKinnon-White HC2 standard errors (option hc2), and (3) the MacKinnon-White HC3 standard errors (option hc3).

      reg y x, robust
      reg y x, hc2
      reg y x, hc3
    Construct a table (see exhibit at the end of Module 12 for an example) comparing the width of the 95% CI obtained using options robust, hc2, and hc3; comment on the relative performance of the 3 CI estimators.
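As a rough sketch of how the CI widths for the slope might be computed rather than read off the output, the width of a 95% CI is twice the t critical value times the standard error. This assumes the predictor is named x, as in the commands above.

```stata
* width of the 95% CI for the slope under each standard error estimator
quietly regress y x, robust
display "robust: " 2*invttail(e(df_r), .025)*_se[x]
quietly regress y x, hc2
display "hc2:    " 2*invttail(e(df_r), .025)*_se[x]
quietly regress y x, hc3
display "hc3:    " 2*invttail(e(df_r), .025)*_se[x]
```

Since the point estimate and degrees of freedom are the same in all three runs, any difference in width comes entirely from the standard error.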


6.  ALSM5e 12.13 p. 504 [ALSM4e 12.13 p. 523] (advertising agency; detecting autocorrelation of errors)  In part b. since the cases are ordered over time and the observations are equally spaced, the plot of residuals against time is the same as an index plot (plot of a variable against the case number).

reg y x
predict resids, residuals

* automatic variable _n is the case number
generate obs=_n
scatter resids obs
* dwstat requires the data to be declared as time series
tsset obs
reg y x
dwstat
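To see what the Durbin-Watson statistic measures, it can also be computed by hand as the sum of squared successive differences of the residuals divided by the sum of squared residuals. The sketch below assumes the data are sorted in time order and that resids holds the residuals from predict; dnum, dden and num are hypothetical names.

```stata
* squared successive differences of the residuals (missing for the first case)
generate dnum = (resids - resids[_n-1])^2
* squared residuals
generate dden = resids^2
quietly summarize dnum
scalar num = r(sum)
quietly summarize dden
display "D = " num/r(sum)
```

The result should match the value reported by dwstat.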
7.  ALSM5e 12.14 p. 504 [ALSM4e 12.14 p. 523] (advertising agency; Cochrane-Orcutt procedure) Omit part f and g.
* declare the time variable; obs is the case number from problem 6 (generate obs=_n)
tsset obs
prais y x, rhotype(reg)
prais y x, corc
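For comparison with the prais output, a single Cochrane-Orcutt iteration can be sketched by hand. This is an illustration under stated assumptions, not a replacement for prais: it assumes obs is the case number generated in problem 6, that the data are in time order, and that the hypothetical names ehat, rho, ystar and xstar are not already in use.

```stata
* declare the time variable so that the lag operator L. is available
tsset obs
quietly regress y x
predict ehat, residuals
* estimate rho by regressing e(t) on e(t-1) through the origin
quietly regress ehat L.ehat, noconstant
scalar rho = _b[L.ehat]
display "rho = " rho
* transform both variables and refit; the first case is lost to the lag
generate ystar = y - rho*L.y
generate xstar = x - rho*L.x
regress ystar xstar
```

The slope of the transformed regression estimates the original slope, while the transformed intercept estimates the original intercept multiplied by (1 - rho).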


Last modified 3 Apr 2006