SOCI 709 (formerly 209) - LINEAR REGRESSION MODELS - Spring 2006
Professor François Nielsen
Assignment 1 - Released Thu 26 Jan
DUE Tue 14 Feb
For this first set of problems, a statistical program is not strictly necessary (every problem can be worked in a spreadsheet program), but it will make your life much easier for Problems 5, 6, and 12.
1. ALSM5e 1.2 p. 33 [ALSM4e 1.2 p. 36] (functional vs. statistical relation)
2. ALSM5e 1.5 p. 33 [ALSM4e 1.5 p. 36] (linear regression model versus regression function)
3. ALSM5e 1.11 p. 34 [ALSM4e 1.11 p. 37] (meaning of β0 and β1 in simple regression model)
4. ALSM5e 1.16 p. 34 [ALSM4e 1.16 p. 38] (is normality assumption required for validity of OLS?)
5. ALSM5e 1.19 p. 35 [ALSM4e 1.19 p. 38] (grade point average example) Additional question: e. What is the point estimate of the change in the mean response (in standard deviations of Y) when X increases by one standard deviation? *NOTE: The data sets needed for Problems 5, 6, and 12 differ between the new and old editions of the textbook: the data set for ALSM5e is called “knnch01pr19.dta” and the data set for ALSM4e is called “nknwch01pr19.dta.” Both data sets can be found on the Soci 209 website under the “datasets” link. (You may use either data set, but for Problem 12 be sure to use Xh=28 with ALSM5e and Xh=4.7 with ALSM4e.)
6. ALSM5e 1.23 p. 36 [ALSM4e 1.23 p. 39] (grade point average example; residuals)
7. ALSM5e 1.30 p. 37 [ALSM4e 1.30 p. 41] (what is regression function when β1 = 0)
8. ALSM5e 2.1 p. 89 [ALSM4e 2.1 p. 86] (relationship between CI and hypothesis test)
9. ALSM5e 2.3 p. 90 [ALSM4e 2.3 p. 86] (meaning of non-significance)
10. ALSM5e 2.9 p. 91 [ALSM4e 2.9 p. 88] (why s{^Yh} not on printout?) Hint: no calculations are needed; just look up the formula 2.30 for s{^Yh} on ALSM5e p. 53 [ALSM4e p. 58] and think.
11. Optional -- You will not be penalized for skipping this question. ALSM5e 2.12 p. 91 [ALSM4e 2.12 p. 88] (limiting behavior of s²{pred} vs. s²{^Yh})
12. ALSM5e 2.13 p. 91 [ALSM4e 2.13 p. 88] (grade point average example; CI for E{Yh} vs. CI for Yh(new)) Parts a, b, and c only. To cut down on calculations we are providing you with the following results: for Xh=28, ^Yh=3.2012 and s{^Yh}=0.0706; MSE=0.3883; t(0.975; 118)=1.9803. For part a use formula ALSM5e 2.33 p. 54 [ALSM4e 2.33 p. 59] and proceed as in Example 1 ALSM5e p. 54 [ALSM4e p. 60]. For part b make sure you understand why this situation is different from that in part a and then use formula ALSM5e 2.36 p. 58 [ALSM4e 2.36 p. 64] and ALSM5e 2.38 p. 59 [ALSM4e 2.38 p. 65] and proceed as in the example on ALSM5e p. 59 [ALSM4e p. 65].
13. Ransacking the standard regression output. The table below presents descriptive statistics and the (censored) output of a simple regression based on a sample of 56 countries circa 1975. The dependent variable (T20) is income inequality, measured as the share of total income accruing to the top quintile (20%) of incomes, and the independent variable (NRI) is the natural rate of population increase, calculated as the crude birth rate minus the crude death rate (with both rates per 1,000 population per year). The corresponding scatterplot is attached.
                   T20       NRI
N of cases          56        56
Minimum         37.300     2.000
Maximum         67.300    34.000
Mean            51.116    19.732
Standard Dev     8.192     9.972
DEP VAR: T20   N: 56   MULTIPLE R: _____   SQUARED MULTIPLE R: _____
ADJUSTED SQUARED MULTIPLE R: xxxx   STANDARD ERROR OF ESTIMATE: _____

VARIABLE   COEFFICIENT   STD ERROR   STD COEF   TOLERANCE   T       P(2 TAIL)
CONSTANT   38.237        1.498       _____      xxxxx       _____   _____
NRI        0.653         0.068       _____      xxxxx       _____   _____

ANALYSIS OF VARIANCE
SOURCE       SUM-OF-SQUARES   DF   MEAN-SQUARE   F-RATIO   P
REGRESSION   2329.667         __   ______        ______    _____
RESIDUAL     1361.448         __   ______
Do the following:
a. On the basis of the information available, reconstruct all the missing entries of the table (indicated by underscore), briefly explaining for each entry how you obtained it. Disregard entries that have been x'ed out.
b. The value of the t-statistic t* corresponding to b1 can in fact be calculated in two ways from the information given. What's the other way to calculate this statistic (other than the way you have used)?
c. Test, using the test statistic t*, whether there is a linear association between income inequality (T20) and the natural rate of population increase (NRI). Use a level of significance of α = .05. State the alternatives, decision rule, and conclusion. (Use the P-value approach.)
d. What is the P-value of the test in c.? (Give the best estimate of the P-value you can get, whether from the relevant table in the textbook, from the calculator on the web, or from the display command in Stata; see the instructions at the Tables link in the sidebar.) How does the P-value support the conclusion reached in c.?
e. Now test the hypothesis that there is a positive association between income inequality and the natural rate of population increase. State the alternatives, decision rule, and conclusion. How does the P-value corresponding to this test relate to the P-value discussed in d.?
f. Calculate the .95 confidence interval for β1. Did you expect this interval to include the value zero, and why?
g. Which measure, r² or r, has the more clear-cut operational interpretation? Explain.
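If you would rather check your hand calculations in a script than in a spreadsheet, the Python sketch below collects the standard simple-regression identities that Problems 5e, 12, and 13 rely on. It is optional and not part of the assignment. The function names are made up for illustration, the code assumes a model with a single predictor (so the residual degrees of freedom are n - 2), and the numbers in the usage lines are invented rather than taken from any problem.

```python
import math


def standardized_slope(b1, sd_x, sd_y):
    """Change in the mean response, in SDs of Y, per one-SD increase in X."""
    return b1 * sd_x / sd_y


def mean_response_ci(y_hat, s_yhat, t_crit):
    """Confidence interval for E{Yh}: ^Yh +/- t * s{^Yh}."""
    half_width = t_crit * s_yhat
    return (y_hat - half_width, y_hat + half_width)


def prediction_interval(y_hat, s_yhat, mse, t_crit):
    """Prediction interval for Yh(new), using s2{pred} = MSE + s2{^Yh}."""
    s_pred = math.sqrt(mse + s_yhat ** 2)
    half_width = t_crit * s_pred
    return (y_hat - half_width, y_hat + half_width)


def simple_regression_anova(ssr, sse, n):
    """ANOVA quantities for a regression with ONE predictor and n cases."""
    df_reg = 1          # one slope parameter
    df_res = n - 2      # n cases minus two estimated parameters
    msr = ssr / df_reg
    mse = sse / df_res
    return {
        "df_reg": df_reg,
        "df_res": df_res,
        "msr": msr,
        "mse": mse,
        "f_ratio": msr / mse,
        "r_squared": ssr / (ssr + sse),
        "std_error_of_estimate": math.sqrt(mse),
    }


def t_statistic(coefficient, std_error):
    """t* for testing that a coefficient equals zero."""
    return coefficient / std_error


if __name__ == "__main__":
    # Invented inputs, for illustration only (not from any problem).
    print(standardized_slope(b1=2.0, sd_x=3.0, sd_y=12.0))
    print(mean_response_ci(y_hat=10.0, s_yhat=0.5, t_crit=2.0))
    print(prediction_interval(y_hat=10.0, s_yhat=0.3, mse=0.16, t_crit=2.0))
    print(simple_regression_anova(ssr=80.0, sse=20.0, n=12))
```

Notice that the prediction interval is always wider than the confidence interval for the mean response at the same Xh, because MSE is added under the square root. The critical value t_crit is passed in rather than computed, so you still need to look it up in the textbook table or obtain it from your statistical program.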