SOCI 209 - LINEAR REGRESSION MODELS - Spring 2006
Professor François Nielsen
Assignment 4 - Released Tue 4 April
DUE Tue 18 April
ALSM5e = Applied Linear Statistical Models 5e (2004) OR Applied Linear Regression Models 4e (2004) (new editions).
ALSM4e = Applied Linear Statistical Models 4e (1996) OR Applied Linear Regression Models 3e (1996) (old editions).
1. ALSM5e 10.3 p. 414 [ALSM4e 9.3 p. 392] (just discard influential cases?)
2. This problem uses the Yule data set. It focuses on diagnostics and remedial measures for outliers and influential cases.
a. Use the Yule data, estimate the full model of pauperism for the 32 unions (cases), and generate the principal regression diagnostics for outlying and influential observations.
. * use the menu to open yule.dta if it is not in your working directory
. use yule
. regress paup outratio propold pop
. * next calculate leverage, studentized residual, and cook
. predict leverage, leverage
. predict student, rstudent
. predict cook, cooksd
b. Use the studentized deleted residuals to identify outliers in the Y dimension, using the Bonferroni procedure with an initial alpha = .01 level. State the decision rule and conclusion.
. * next identify cases with high values of student
. list union student if abs(student) > 2
. * next calculate Bonferroni significance for student
. display .01/(2*32)
You then need to calculate the p-value for the appropriate distribution for all cases using a generate statement.
c. Identify any X-outlying (high-leverage) observation using the appropriate diagnostic and rule of thumb; that's easy, the following commands show you how.
. * next calculate 2(p/n) cutoff for leverage
. display 2*4/32
.25
. list union leverage if leverage > 2*4/32
d. Identify any influential observation by looking at an index plot of Cook's distance (COOK in SYSTAT) and calculating the corresponding percentiles of the appropriate F distribution for cases with high values of COOK; compare the percentiles with the cutoffs suggested in the textbook. What union(s) appear(s) to be problematic (influential), if any?
. * next do stem and leaf display for cook (can't find
. * how to make index plot in STATA)
. stem cook
. list union cook if cook > .3
. * next do index plot
. gen secno=_n
. * this generates sequence number secno; next do plot
. * you'll have to experiment with the following command, as latest versions of STATA
. * may use different syntax for graph
. * (in STATA 8 and later, try: twoway connected cook secno)
. graph cook secno, twoway connect(l[.])
. * next calculate percentile of F distribution for each value of cook
. gen fperc=100*F(4,28,cook)
. list union cook fperc if fperc>10
e. Use the Hadi procedure for robust outlier detection. Are results of the Hadi procedure consistent with those of the other diagnostics? Why (substantively) are these particular unions deviant? (How would you find out more about the various neighborhoods of metropolitan London in the late 19th century?) On what substantive grounds could one justify removing these deviant cases?
. * next use Hadi multivariate outliers procedure creating variables hadiout
. * (1 if case is Hadi outlier, 0 otherwise) and Mahalanobis distance
. hadimvo paup outratio propold pop, generate(hadiout mahadist)
. * next list Hadi outliers
. list union mahadist if hadiout
f. After selecting out the outliers identified by the Hadi procedure, estimate the following 3 models. If appropriate, estimate a final (4th) trimmed model. Present the regression results in a tabular form suitable for publication.
. * next do regressions with 30 cases that are not outliers
. reg paup outratio if ~hadiout
. reg paup outratio propold if ~hadiout
. reg paup outratio propold pop if ~hadiout
. reg paup outratio pop if ~hadiout
g. Re-estimate the full model with the 32 cases using robust regression. Use STATA's rreg procedure. How do these estimates compare to OLS with the 32 cases and OLS with the outliers removed?
. * next do robust regression with original 32 cases
. rreg paup outratio propold pop
3. ALSM5e 10.13 p. 416 [ALSM4e 9.13 p. 395] (cosmetics sales; clues of collinearity).
. regress y x1 x2 x3
. corr x1 x2 x3
4. ALSM5e 10.14 p. 417 [ALSM4e 9.14 p. 395] (cosmetics sales; interpreting VIF, advantage of experiment) Note that VIF_k = 1/TOL_k, and conversely TOL_k = 1/VIF_k.
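The identity between VIF and tolerance in problem 4 can be checked by hand. The lines below are only a sketch, assuming the cosmetics sales variables are named x1, x2 and x3 as in the hint commands. The tolerance of x1 is 1 minus the R-squared from regressing x1 on the other predictors, and the VIF is its reciprocal; the result should agree with the value STATA's vif command reports for x1.

```stata
* auxiliary regression of x1 on the other predictors
* (x1 x2 x3 are the variable names assumed from the hint commands)
quietly regress x1 x2 x3
* tolerance is 1 - R-squared of the auxiliary regression
display "TOL1 = " 1 - e(r2)
* VIF is the reciprocal of the tolerance
display "VIF1 = " 1/(1 - e(r2))
```

Repeating this with x2 and then x3 as the dependent variable gives the other two values.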
. reg y x1 x2 x3
. vif
. reg y x1
5. ALSM5e 11.6 p. 472 [ALSM4e 10.6 p. 445] (computer-assisted learning; handling heteroscedasticity) This is a small but complete paradigm for handling heteroskedasticity. Disregard the detailed instructions in the textbook. Instead use the data set and do the following steps.
a. Run the regression and plot the residuals against the fitted values (estimates); do you see any funny funnel pattern?
c. In STATA, rerun the regression 3 times using (1) the Huber-White robust standard errors (option robust), (2) the MacKinnon-White HC2 standard errors (option hc2), and (3) the MacKinnon HC3 standard errors (option hc3).
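A minimal sketch of steps a and c, assuming the response and the predictor are stored as y and x. Those names are hypothetical, so substitute the names in your data set. The options are written here in the vce() form documented in recent versions of STATA; the bare options robust, hc2 and hc3 named in step c request the same standard errors.

```stata
* y and x are assumed variable names, not given in the assignment
regress y x
* step a: residual-versus-fitted plot with a reference line at zero
rvfplot, yline(0)
* step c (1): Huber-White robust standard errors
regress y x, vce(robust)
* step c (2): HC2 standard errors
regress y x, vce(hc2)
* step c (3): HC3 standard errors
regress y x, vce(hc3)
```

The coefficient estimates are identical across the four runs; only the standard errors, t statistics and confidence intervals change.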
6. ALSM5e 12.13 p. 504 [ALSM4e 12.13 p. 523] (advertising agency; detecting autocorrelation of errors)
In part b, since the cases are ordered over time and the observations are
equally spaced, the plot of residuals against time is the same as an index
plot (plot of a variable against the case number).
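. reg y x
. predict resids, residuals
. * automatic variable _n is case number
. generate obs=_n
. scatter resids obs
. * declare obs as the time variable; dwstat and prais need tsset data
. tsset obs
. reg y x
. * (newer versions of STATA: estat dwatson)
. dwstat
7. ALSM5e 12.14 p. 504 [ALSM4e 12.14 p. 523] (advertising agency; Cochrane-Orcutt procedure) Omit parts f and g.
. prais y x, rhotype(reg)
. prais y x, corc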
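To see what the Cochrane-Orcutt option is doing, one pass of the procedure can be carried out by hand. This is a sketch under stated assumptions, not the textbook's worked solution: the names t, e, rhohat, ystar and xstar are hypothetical, and y and x are the variable names used in the hint commands. Because prais iterates by default, its estimates may differ somewhat from this single pass.

```stata
* t, e, rhohat, ystar and xstar are hypothetical names
generate t = _n
tsset t
* OLS fit and residuals
regress y x
predict e, residuals
* slope of e(t) on e(t-1) through the origin estimates rho
regress e L.e, noconstant
scalar rhohat = _b[L.e]
* transformed variables; the first case is lost to the lag
generate ystar = y - scalar(rhohat)*L.y
generate xstar = x - scalar(rhohat)*L.x
* regression on the transformed variables
regress ystar xstar
* intercept on the original scale
display _b[_cons]/(1 - scalar(rhohat))
```

The slope from the transformed regression is already on the original scale; only the intercept needs to be divided by 1 minus the estimate of rho.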