University of North Carolina
at Chapel Hill

SOCI 709 (formerly 209) - LINEAR REGRESSION MODELS - Spring 2006
Professor François Nielsen

Assignment 2 - Released Tue 16 February
DUE Thu 2 Mar

ALSM5e = Applied Linear Statistical Models 5e (2004) OR Applied Linear Regression Models 4e (2004) (i.e., new editions).
ALSM4e = Applied Linear Statistical Models 4e (1996) OR Applied Linear Regression Models 3e (1996) (i.e., old editions).

PROBLEMS ON MATRIX REPRESENTATION OF REGRESSION MODEL

Problems in this section do not require using a statistical computer program.

1. ALSM5e 5.1 p. 209 [ALSM4e 5.1 p. 211] (matrix addition, subtraction, multiplication)

2. Optional. ALSM5e 5.8 p. 210 [ALSM4e 5.8 p. 212] (linear dependence and rank)

3. ALSM5e 5.15 p. 211 [ALSM4e 5.15 p. 213] (simultaneous equations)

4. Optional. ALSM5e 5.17 p. 211 [ALSM4e 5.17 p. 213] (E{.} of linear function of RVs)  Do parts a. and b. only.  Hint: look at the theorems and the example in ALSM5e p. 196 [ALSM4e p. 197].

5. Optional. ALSM5e 5.20 p. 211 [ALSM4e 5.20 p. 213] (matrix of a quadratic form)  Hint: look at ALSM5e pp. 205-206 [ALSM4e pp. 206-207] for the relationship between a quadratic form and its matrix.  It also helps to remember that the matrix of a quadratic form is symmetric.

6. Optional. ALSM5e 5.29 p. 212 [ALSM4e 5.29 p. 214] (unbiasedness of b)

7. ALSM5e 5.30 p. 212 [ALSM4e 5.30 p. 214] (alternative matrix expression for Ŷh)

8.  This problem uses the data in the craft.dta file.  The following two paragraphs provide some substantive background concerning the data. They are not essential to the assignment.
  In a 1959 article ("Bureaucratic and Craft Administration of Production."  Administrative Science Quarterly 4: 168-187.) Arthur Stinchcombe studies the mode of organization of firms in the construction industry. He argues that, in the construction industry, professionalization of the labor force (through apprenticeship, craft trade-unions, licensing laws, etc.) serves the same functions as, and is an alternative to, bureaucratic administration in mass production industries. Stinchcombe calls this type of control "craft administration".  In craft administration work activities are controlled by the internalized professional skills of the workers, while in bureaucratic administration work activities (e.g., on the assembly line) are controlled through detailed work instructions followed by relatively unskilled workers.  Stinchcombe reckons that craft administration of production is more effective than bureaucratic administration in the construction industry because of the instability in the volume of work and product mix that construction firms typically experience.  Craft administration allows construction firms to quickly assemble the mix of specialized craftsmen needed for a particular project, relying for control on the professional socialization of workers instead of a permanent bureaucratic apparatus.
    In support of his argument, Stinchcombe compares 9 sectors of the construction industry in Ohio with respect to the percentage of clerks in the labor force, used as a measure of bureaucratization.  The independent variables are an index of seasonality of employment, used as a measure of the instability of the environment, and the mean size of firms in the sector (Stinchcombe 1959:178, Table 3).  On the basis of a visual inspection, he claims that the degree of bureaucratization in a sector (% clerks) is negatively associated with seasonality, controlling for mean size of firm.  This pattern supports his prediction that greater instability (seasonality) should be associated with less bureaucracy (and, conversely, more reliance on craft administration), keeping mean size of firms constant.
    Your mission is to verify Stinchcombe's claim with multiple regression analysis, using matrix methods with STATA. The listing below shows the steps to take; they are somewhat different from the ones shown in class (end of Module 4) but closer to the "textbook" notation.
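For reference, the quantities computed in the STATA listing correspond to the following textbook expressions. This is only a notational summary: n (the number of cases) and p (the number of columns of X, including the constant) are symbols assumed here in the ALSM style and do not appear in the listing, where the difference n - p is stored as df.

```latex
\begin{align*}
\mathbf{b} &= (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y} \\
SSE &= \mathbf{y}'\mathbf{y} - \mathbf{b}'\mathbf{X}'\mathbf{y} \\
MSE &= SSE/(n-p) \\
\mathbf{s}^2\{\mathbf{b}\} &= MSE\,(\mathbf{X}'\mathbf{X})^{-1} \\
t^*_k &= b_k / s\{b_k\} \\
\mathbf{H} &= \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'
\end{align*}
```

Here s{b_k} is the square root of the k-th diagonal element of the covariance matrix, which the listing calls V.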

The variables are:
clerks (% clerks in the labor force - the measure of bureaucracy)
size (mean size of firms in sector)
season (index of seasonality of employment - the measure of environmental instability)
Use the following steps (lines with asterisks are optional comments describing the step):

* create a constant term that is 1 for each case
generate const=1
* setup matrices y and X (beware that STATA is case sensitive)
mkmat clerks, matrix(y)
mkmat const season size, matrix(X)
matrix list y
matrix list X
* calculate y'y, X'X and X'y
matrix yy=y'*y
matrix XX=X'*X
matrix Xy=X'*y
* calculate number of observations and df
matrix nobs=rowsof(X)
matrix df=nobs[1,1]-colsof(X)
* calculate b as (X'X)^-1 X'y
matrix b=syminv(XX)*Xy
matrix list b
* calculate SSE and MSE
matrix SSE=yy-b'*Xy
matrix MSE=SSE/df[1,1]
* calculate covariance matrix of b, call it V
matrix V=syminv(XX)*MSE
* calculate the t-ratio t* for each coefficient (parentheses are not brackets!)
display b[1,1]/sqrt(V[1,1])
display b[2,1]/sqrt(V[2,2])
display b[3,1]/sqrt(V[3,3])
* calculate the 2-sided P-value for each coefficient using the following formula
* where t* is one of the t-ratios you just calculated; copy and paste the
* value of t* from your output each time (abs() is the absolute value function)
display 2*ttail(df[1,1],abs(t*))
* decide which coefficient(s) is (are) significant at the .05 level
* calculate the hat matrix H
matrix H=X*syminv(XX)*X'
matrix list H
* calculate the trace of H (=sum of diagonal elements)
matrix tr=trace(H)
matrix list tr
* guess a general formula giving the value of the trace of H
* end of STATA commands
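The copy-and-paste step for the P-values can be avoided by storing each t-ratio in a scalar. The following is a minimal sketch, assuming the matrices b, V and df from the listing above already exist; the scalar name tstar is a hypothetical choice, not part of the assignment.

```stata
* loop over the three coefficients (rows of b: const, season, size)
forvalues j = 1/3 {
    * t-ratio for coefficient j, stored in a scalar (tstar is an illustrative name)
    scalar tstar = b[`j',1]/sqrt(V[`j',`j'])
    * 2-sided P-value using the residual df computed earlier
    display "coefficient " `j' ": t* = " tstar "   P = " 2*ttail(df[1,1], abs(tstar))
}
```

The results should match what the separate display commands produce, since the same formulas are applied.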

9.  Confirm your results in problem 8 by running the same regression with the regress command.
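One possible way to line up the regress output with the matrix results is sketched below. It assumes craft.dta is still in memory with the variable names given in problem 8.

```stata
* same model as in problem 8; regress adds the constant automatically
regress clerks season size
* coefficient vector and covariance matrix stored by regress
* (note: regress places the constant last, not first)
matrix list e(b)
matrix list e(V)
* residual degrees of freedom and MSE, for comparison with df and MSE
display e(df_r)
display e(rmse)^2
```

Because regress lists the constant last, the rows and columns of e(V) appear in a different order than those of V in problem 8.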

10.  This problem uses the file world209.dta, extracted from the World Handbook of Political and Social Indicators.  (Click on Data Sets in side bar; World Handbook files are in a special section at the bottom of the page; world209.htm is the list of variables.)  We look at variables that may affect the life expectancy of females.  The computer instructions listed below are for STATA, but you can use another computer program that produces equivalent output.

(1)  Make a scatterplot matrix of the following variables, in this order:

l10v111 - Logarithm base 10 of GNP/cap,75
v175 - Protein per cap/diem,74
v181 - Doctors/Million Pop,75
v207 - Crude Birth Rates,75
v227 - Literacy Rates,75
v195 - Life Expectancy: Females,75
STATA command:
graph matrix l10v111 v175 v181 v207 v227 v195, half
Note that the dependent variable v195 is listed last, so that the bottom row of panels will contain the bivariate relationships of life expectancy, measured on the vertical axis, with each independent variable, measured on the horizontal axis.  (You can print or save your graph by using the Graph menu.)

(2)  Do a simple regression of life expectancy (v195) on Doctors/Million (v181).

regress v195 v181
To check the shape of this relationship fit a nonparametric regression curve to the data using the lowess algorithm:
graph twoway lowess v195 v181 || scatter v195 v181
What does the lowess curve indicate concerning the linearity of the association between v195 and v181? Would you call this relationship "linear"?
No?  Let's try to find a transformation of v181 on the ladder of powers that makes its distribution look more normal; this often also helps to straighten the relationship.  The following three commands represent different ways of choosing the best transformation.  (With ladder, what you want is the transformation with the lowest chi-square.)
ladder v181
gladder v181
qladder v181
Choose a transformation and generate the transformed variable; say
generate v181new = <your transformation of v181>
and fit the lowess regression curve again with
graph twoway lowess v195 v181new || scatter v195 v181new
Does the plot look more linear with v181new as the independent variable?
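As an illustration of the generate step only, the sketch below assumes (hypothetically) that ladder favored the log transformation. Your own output may point to a different rung of the ladder, in which case the expression on the right-hand side changes accordingly.

```stata
* ILLUSTRATION ONLY: assumes ladder favored the log transformation
generate v181new = log(v181)
* alternative if the square root looked best instead
* generate v181new = sqrt(v181)
```

Note that log() returns a missing value for any case where v181 is zero, so check the number of valid cases after transforming.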

(3)  Calculate the correlation matrix of life expectancy and the independent variables (using the transformed variable v181new instead of v181).  The command is

correlate l10v111 v175 v181new v207 v227 v195
What variable has the strongest correlation with female life expectancy?  What is the correlation between the crude birth rate and literacy, and what does the sign of the correlation indicate?

(4)  Do a multiple regression of v195 on l10v111, v175, v181new, v207, v227, followed by the command vif (this will produce the variance inflation factors VIF and/or tolerances TOL=1/VIF):

regress v195 l10v111 v175 v181new v207 v227
vif
From your output do the following
a.  On the basis of the F statistic, assess the significance of the multiple regression model as a whole.
b.  Assess the significance of each regression coefficient at the .05 level.
c.  Comparing values of the appropriate coefficients, assess the relative impacts of the independent variables on female life expectancy (i.e., which variables have stronger effects, weaker effects?). Hint: you may want to redo the regression with the beta option.
d.  Looking at the VIF or TOL and using the standard rule of thumb, check for evidence of collinearity involving any of the independent variables.
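The hint in part c. can be followed along these lines. This is a sketch that assumes v181new has already been generated; estat vif is the form of the vif command used in more recent STATA releases.

```stata
* same regression, reporting standardized (beta) coefficients
regress v195 l10v111 v175 v181new v207 v227, beta
* newer syntax for the variance inflation factors
estat vif
```

The beta option replaces the confidence intervals in the output with standardized coefficients, which are expressed in standard deviation units and can therefore be compared across variables.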

(5)  Use the following commands to calculate the yhats and the residuals, and produce four additional plots for residual analysis:

predict yhat, xb
predict e, resid
a.  A residual plot (residual by fitted value) with a lowess regression line
graph twoway lowess e yhat, yline(0) || scatter e yhat
b.  A boxplot of the residuals
graph box e
c.  A histogram of the residuals with kernel density estimator
graph twoway hist e || kdensity e
d.  A normal probability plot of the residuals
qnorm e, grid
To interpret the normal probability plot (aka quantile normal plot) see ALSM5e Figure 3.9 p. 112; ALSM4e Figure 3.9 p. 108. On the basis of the four plots discuss the distribution of the residuals with respect to linearity, homoskedasticity, normality, and the possible presence of outliers.  If there are outliers (flagged by the box plot, for example) find out what countries they are. One way to do this is to list cases with residuals above a certain value (if the outlying residuals are positive) or below a certain value (if the outlying residuals are negative), or both.  See the box plot for the cutoff values.  Assuming that outliers have residuals greater than 10 or smaller than -10 (your own values may be different) you would do
list v3 e if (e>10) & (e<.)
list v3 e if (e<-10)
(The condition e<. is necessary because STATA represents missing values as a very large number.) v3 is a four-letter country code.  Can you guess the identities of the outliers?  You can find the full name of a country in the variable list world209.htm (look for the list of country names at the end).
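The two list commands can also be combined into one. The sketch below uses the same illustrative cutoff of 10, which you should replace with the value suggested by your own box plot.

```stata
* list both tails at once; missing residuals are excluded explicitly
list v3 e if abs(e) > 10 & !missing(e)
```

The !missing(e) condition plays the same role as e<. above: abs() of a missing value is itself missing, and STATA treats missing as larger than any number.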



Last modified 16 Feb 2006