SOCI 709 (formerly 209) - LINEAR
REGRESSION MODELS - Spring 2006
Professor François Nielsen
Assignment 2 - Released Tue 16 February
DUE Thu 2 Mar
ALSM5e = Applied Linear Statistical
Models 5e (2004) OR Applied Linear Regression Models
4e (2004) (i.e., new editions).
ALSM4e = Applied Linear Statistical
Models 4e (1996) OR Applied Linear Regression Models
3e (1996) (i.e., old editions).
1. ALSM5e 5.1 p. 209 [ALSM4e 5.1 p. 211] (matrix addition, subtraction, multiplication)
2. Optional. ALSM5e 5.8 p. 210 [ALSM4e 5.8 p. 212] (linear dependence and rank)
3. ALSM5e 5.15 p. 211 [ALSM4e 5.15 p. 213] (simultaneous equations)
4. Optional. ALSM5e 5.17 p. 211 [ALSM4e 5.17 p. 213] (E{.} of linear function of RVs) Do parts a. and b. only. Hint: look at the theorems and the example in ALSM5e p. 196 [ALSM4e p. 197].
5. Optional. ALSM5e 5.20 p. 211 [ALSM4e 5.20 p. 213] (matrix of a quadratic form) Hint: look at ALSM5e pp. 205-206 [ALSM4e pp. 206-207] for the relationship between a quadratic form and its matrix. It also helps to remember that the matrix of a quadratic form is symmetric.
6. Optional. ALSM5e 5.29 p. 212 [ALSM4e 5.29 p. 214] (unbiasedness of b)
7. ALSM5e 5.30 p. 212 [ALSM4e 5.30 p. 214] (alternative matrix expression for Ŷh)
8. This problem uses the data
in the craft.dta file. The following two paragraphs provide
some substantive background concerning the data. They are not essential
to the assignment.
In a 1959 article ("Bureaucratic
and Craft Administration of Production." Administrative Science
Quarterly 4: 168-187.) Arthur Stinchcombe studies the mode of organization
of firms in the construction industry. He argues that, in the construction
industry, professionalization of the labor force (through apprenticeship,
craft trade-unions, licensing laws, etc.) serves the same functions as,
and is an alternative to, bureaucratic administration in mass production
industries. Stinchcombe calls this type of control "craft administration".
In craft administration work activities are controlled by the internalized
professional skills of the workers, while in bureaucratic administration
work activities (e.g., on the assembly line) are controlled through detailed
work instructions followed by relatively unskilled workers. Stinchcombe
reckons that craft administration of production is more effective than
bureaucratic administration in the construction industry because of the
instability in the volume of work and product mix that construction firms
typically experience. Craft administration allows construction firms
to quickly assemble the mix of specialized craftsmen needed for a particular
project, relying for control on the professional socialization of workers
instead of a permanent bureaucratic apparatus.
In support of
his argument, Stinchcombe compares 9 sectors of the construction industry
in Ohio with respect to the percentage of clerks in the labor force, used
as a measure of bureaucratization. The independent variables are
an index of seasonality of employment, used as a measure of the instability
of the environment, and the mean size of firms in the sector (Stinchcombe
1959:178, Table 3). On the basis of a visual inspection, he claims
that the degree of bureaucratization in a sector (% clerks) is negatively
associated with seasonality, controlling for mean size of firm. This
pattern supports his prediction that greater instability (seasonality)
should be associated with less bureaucracy (and, conversely, more reliance
on craft administration), keeping mean size of firms constant.
Your mission
is to verify Stinchcombe's claim with multiple regression analysis, using
matrix methods with STATA. The listing below shows the steps to take; they
are somewhat different from the ones shown in class (end of Module
4) but closer to the "textbook" notation.
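For reference, the quantities computed in the listing can be written in matrix notation as follows. This is only a restatement of the STATA steps, with n taken to be the number of rows of X and p the number of columns of X (3 here, counting the constant); s{b_k} denotes the square root of the k-th diagonal element of the covariance matrix of b.

```latex
\begin{align*}
\mathbf{b} &= (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y} \\
SSE &= \mathbf{y}'\mathbf{y} - \mathbf{b}'\mathbf{X}'\mathbf{y} \\
MSE &= \frac{SSE}{n-p} \\
\mathbf{s}^2\{\mathbf{b}\} &= MSE\,(\mathbf{X}'\mathbf{X})^{-1} \\
t^*_k &= \frac{b_k}{s\{b_k\}} \\
\mathbf{H} &= \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'
\end{align*}
```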
The variables are:
clerks (% clerks in the labor force - the measure of bureaucracy)
size (mean size of firms in sector)
season (index of seasonality of employment - the measure of environmental instability)
Use the following steps (lines with asterisks are optional comments describing the step):
* create a constant term that is 1 for each case
generate const=1
* setup matrices y and X (beware that STATA is case sensitive)
mkmat clerks, matrix(y)
mkmat const season size, matrix(X)
matrix list y
matrix list X
* calculate y'y, X'X and X'y
matrix yy=y'*y
matrix XX=X'*X
matrix Xy=X'*y
* calculate number of observations and df
matrix nobs=rowsof(X)
matrix df=nobs[1,1]-colsof(X)
* calculate b as (X'X)^(-1) X'y
matrix b=syminv(XX)*Xy
matrix list b
* calculate SSE and MSE
matrix SSE=yy-b'*Xy
matrix MSE=SSE/df[1,1]
*calculate covariance matrix of b, call it V
matrix V=syminv(XX)*MSE
* calculate the t-ratio t* for each coefficient (parentheses are not brackets!)
display b[1,1]/sqrt(V[1,1])
display b[2,1]/sqrt(V[2,2])
display b[3,1]/sqrt(V[3,3])
* calculate the 2-sided P-value for each coefficient using the following formula
* where t* is one of the t-ratios you just calculated; copy and paste the
* value of t* from your output each time (abs() is the absolute value function)
display 2*ttail(df[1,1],abs(t*))
* decide which coefficient(s) is (are) significant at the .05 level
* calculate the hat matrix H
matrix H=X*syminv(XX)*X'
matrix list H
* calculate the trace of H (=sum of diagonal elements)
matrix tr=trace(H)
matrix list tr
* guess a general formula giving the value of the trace of H
* end of STATA commands
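As an alternative to copying each t* by hand, the t-ratios and P-values can be computed in a loop. This is a minimal sketch, not part of the required steps: it assumes the matrices b, V and df from the listing above already exist, and the scalar names tstar and pval are made up for illustration.

```stata
* sketch: assumes matrices b, V and df were created by the steps above
* tstar and pval are hypothetical scalar names, not part of the assignment
forvalues j = 1/3 {
    * t-ratio for coefficient j: estimate divided by its standard error
    scalar tstar = b[`j',1]/sqrt(V[`j',`j'])
    * two-sided P-value with df degrees of freedom
    scalar pval = 2*ttail(df[1,1], abs(tstar))
    display "coefficient `j': t* = " tstar "   P = " pval
}
```

The loop runs over rows 1 to 3 of b because X has three columns (const, season, size); a different number of predictors would need a different upper bound.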
9. Confirm your results in problem 8 by doing the regression with the regress command.
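One possible way to set up the comparison is sketched below, assuming craft.dta is still in memory. After regress, STATA stores the coefficients in e(b) and their covariance matrix in e(V). Note that e(b) is a row vector with the constant (_cons) listed last, whereas b from problem 8 is a column vector with the constant first.

```stata
* sketch: same model as in problem 8, fitted with regress
regress clerks season size
* stored coefficient vector (row vector, _cons last)
matrix list e(b)
* stored covariance matrix of the coefficients (compare with V)
matrix list e(V)
```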
10. This problem uses the file world209.dta, extracted from the World Handbook of Political and Social Indicators. (Click on Data Sets in the side bar; World Handbook files are in a special section at the bottom of the page; world209.htm is the list of variables.) We look at variables that may affect the life expectancy of females. The computer instructions listed below are for STATA, but you can use another computer program that produces equivalent output.
(1) Make a scatterplot matrix of the following variables, in this order:
l10v111 - Logarithm base 10 of GNP/cap,75
v175 - Protein per cap/diem,74
v181 - Doctors/Million Pop,75
v207 - Crude Birth Rates,75
v227 - Literacy Rates,75
v195 - Life Expectancy: Females,75
Note that the dependent variable v195 is listed last, so that the bottom row of panels will contain the bivariate relationships of life expectancy, measured on the vertical axis, with each independent variable, measured on the horizontal axis. (You can print or save your graph by using the Graph menu.)
STATA command:
graph matrix l10v111 v175 v181 v207 v227 v195, half
(2) Do a simple regression of life expectancy (v195) on Doctors/Million (v181).
regress v195 v181
To check the shape of this relationship, fit a nonparametric regression curve to the data using the lowess algorithm:
graph twoway lowess v195 v181 || scatter v195 v181
What does the lowess curve indicate concerning the linearity of the association between v195 and v181? Would you call this relationship "linear"?
ladder v181
gladder v181
qladder v181
Choose a transformation and generate the transformed variable; say
generate v181new = <your transformation of v181>
and fit the lowess regression curve again with
graph twoway lowess v195 v181new || scatter v195 v181new
Does the plot look more linear with v181new as the independent variable?
(3) Calculate the correlation matrix of life expectancy and the other independent variables (using the transformed variable v181new instead of v181). The command is
correlate l10v111 v175 v181new v207 v227 v195
What variable has the strongest correlation with female life expectancy? What is the correlation between the crude birth rate and literacy, and what does the sign of the correlation indicate?
(4) Do a multiple regression of v195 on l10v111, v175, v181new, v207, v227, followed by the command vif (this will produce the variance inflation factors VIF and/or tolerances TOL=1/VIF):
regress v195 l10v111 v175 v181new v207 v227
vif
From your output do the following
(5) Use the following commands to calculate the yhats and the residuals, and produce four additional plots for residual analysis:
predict yhat, xb
predict e, resid
a. A residual plot (residual by fitted value) with a lowess regression line
graph twoway lowess e yhat, yline(0) || scatter e yhat
b. A boxplot of the residuals
graph box e
c. A histogram of the residuals with kernel density estimator
graph twoway hist e || kdensity e
d. A normal probability plot of the residuals
qnorm e, grid
To interpret the normal probability plot (aka quantile normal plot) see ALSM5e Figure 3.9 p. 112; ALSM4e Figure 3.9 p. 108. On the basis of the four plots discuss the distribution of the residuals with respect to linearity, homoskedasticity, normality, and the possible presence of outliers. If there are outliers (flagged by the box plot, for example) find out what countries they are. One way to do this is to list cases with residuals above a certain value (if the outlying residuals are positive) or below a certain value (if the outlying residuals are negative), or both. See the box plot for the cutoff values. Assuming that outliers have residuals greater than 10 or smaller than -10 (your own values may be different) you would do
list v3 e if (e>10) & (e<.)
list v3 e if (e<-10)
(The condition e<. is necessary because STATA represents missing values as a very large number.) v3 is a four-letter country code. Can you guess the identities of the outliers? You can find the full name of a country in the variable list world209.htm (look for the list of country names at the end).
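If the same cutoff is used on both sides, the two list commands can be combined into one using the absolute value function. This is a sketch that assumes the illustrative cutoff of 10 from above; your own cutoff may differ, and asymmetric cutoffs would still need two separate conditions.

```stata
* sketch: list cases whose residual exceeds 10 in absolute value
* abs(e) is missing when e is missing, so e<. is still needed to exclude those cases
list v3 e if abs(e) > 10 & e < .
```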