SOCI208 - Module 17 - Multiple Regression

1.  Need for Models With More Than One Independent Variable

1.  Motivations for Multiple Regression Analysis

The 2 principal motivations for models with more than one independent variable are: The second motivation is very important for scientific applications of regression analysis.  It is discussed further in the next section.

2.  Supporting a Causal Statement by Eliminating Alternative Hypotheses

Theories about social phenomena are made up of causal statements.  In Constructing Social Theories, Arthur Stinchcombe (1968) defines a causal law as follows: "A causal law is a statement or proposition in a theory which says that there exist environments ... in which a change in the value of one variable is associated with a change in the value of another variable and can produce this change without any change in other variables in the environment" (p. 31).
Such a causal statement can be represented schematically as
X (independent) --> Y (dependent)
Stinchcombe argues further that one of the requirements to support or refute a causal theory is to ascertain nonspuriousness.  Ascertaining nonspuriousness means checking whether one or more other variables affect both X and Y and thereby produce an apparent association between X and Y that is spuriously attributed to a causal influence of X on Y.
Multiple regression analysis can be used to ascertain nonspuriousness, to an extent that depends on the design of the study.
In the context of regression analysis, spuriousness is called specification bias.  Specification bias is a more general and continuous notion than spuriousness.  The idea is that if a regression model of Y on X excludes a variable that is both associated with X and a cause of Y (the model is then called misspecified), the estimated association of Y with X will be inflated (or, conversely, deflated) relative to its true value.  The regression estimator, in a sense, falsely "attributes" to X a causal influence that is in reality due to the omitted variable(s).
Ascertaining nonspuriousness is equivalent to eliminating alternative hypotheses on the source of the relationship between X and Y by adding variables explicitly to the regression model.

3.  The D-Score Data: an Example of Spurious Association

The D-score data (Koopmans 1987) illustrate how a spurious association can be elucidated using multiple regression analysis.
A test of cognitive development is administered to a sample of 12 children with ages ranging from 3 to 10.  The cognitive development measure is called a D-score.
The simple regression of D-score on sex is carried out.  Sex is represented by the variable BOY (coded Boy = 1, Girl = 0).  The regression reveals a significant positive effect of BOY on D-score: boys score significantly higher than girls (P-value = 0.039).

Table 1.  Simple Regression Analysis of the D-Score Data Set

Example from Koopmans, Lambert.  1987.  Introduction to Contemporary Statistical Methods.  (2d edition.)  PWS-Kent.  Pp. 554-557.

Data
 Case number          OBS       DSCORE          AGE          BOY         BOY$
        1            1.000        8.610        3.330        0.000 G
        2            2.000        9.400        3.250        0.000 G
        3            3.000        9.860        3.920        0.000 G
        4            4.000        9.910        3.500        0.000 G
        5            5.000       10.530        4.330        1.000 B
        6            6.000       10.610        4.920        0.000 G
        7            7.000       10.590        6.080        1.000 B
        8            8.000       13.280        7.420        1.000 B
        9            9.000       12.760        8.330        1.000 B
       10           10.000       13.440        8.000        0.000 G
       11           11.000       14.270        9.250        1.000 B
       12           12.000       14.130       10.750        1.000 B

Pearson Correlation Matrix
 
                    DSCORE          AGE          BOY
 DSCORE              1.000
 AGE                 0.957        1.000
 BOY                 0.600        0.647        1.000

Simple Linear Regression

Dep Var: DSCORE   N: 12   Multiple R: 0.600   Squared multiple R: 0.360
Adjusted squared multiple R: 0.296   Standard error of estimate: 1.671
 
Effect         Coefficient    Std Error     Std Coef Tolerance     t   P(2 Tail)
CONSTANT            10.305        0.682        0.000      .      15.109    0.000
BOY                  2.288        0.965        0.600     1.000    2.372    0.039
 
Analysis of Variance
Source             Sum-of-Squares   df  Mean-Square     F-ratio       P
Regression                15.709     1       15.709       5.629       0.039
Residual                  27.910    10        2.791

-------------------------------------------------------------------------------
*** WARNING ***
Case           10 is an outlier        (Studentized Residual =        2.566)
 
Durbin-Watson D Statistic          1.183
First Order Autocorrelation        0.315


However, a symbolic plot of D-score against age, using symbols to identify sex (B = Boy, G = Girl), reveals a systematic pattern.

Q - What is the pattern in the following figure?



A multiple regression analysis is then carried out, with D-score as the dependent variable and both BOY and AGE as independent variables.
The results are shown in Table 2.  This time the effect of BOY becomes non-significant (P-value is 0.799); the effect of AGE on D-score is strongly significant.  One concludes that the significant effect of sex (represented by the variable BOY) in the first regression was spurious.  It was a consequence of the (accidental) association in the sample between age and sex, i.e. the tendency (visible in the scatterplot) for boys to be older than girls, combined with the strong effect of age on D-score.  Introducing ("controlling for") age in the model has eliminated the spurious effect of sex on cognitive development.

Table  2.  Multiple Regression of D-Score on BOY and AGE

Dep Var: DSCORE   N: 12   Multiple R: 0.958   Squared multiple R: 0.917
Adjusted squared multiple R: 0.899   Standard error of estimate: 0.634
 
Effect         Coefficient    Std Error     Std Coef Tolerance     t   P(2 Tail)
CONSTANT             6.927        0.506        0.000      .      13.697    0.000
BOY                 -0.126        0.480       -0.033     0.581   -0.262    0.799
AGE                  0.753        0.097        0.979     0.581    7.775    0.000
 
Analysis of Variance
Source             Sum-of-Squares   df  Mean-Square     F-ratio       P
Regression                40.002     2       20.001      49.765       0.000
Residual                   3.617     9        0.402

-------------------------------------------------------------------------------
Durbin-Watson D Statistic          2.277
First Order Autocorrelation       -0.313
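
The coefficients in Tables 1 and 2 can be re-estimated from the 12 cases listed in Table 1.  The Python sketch below is not part of the original SYSTAT analysis: the function name ols and the choice of solving the normal equations by Gauss-Jordan elimination are illustrative assumptions, chosen so the example needs no statistical library.

```python
# D-score data copied from Table 1 (Koopmans 1987)
dscore = [8.61, 9.40, 9.86, 9.91, 10.53, 10.61,
          10.59, 13.28, 12.76, 13.44, 14.27, 14.13]
age = [3.33, 3.25, 3.92, 3.50, 4.33, 4.92,
       6.08, 7.42, 8.33, 8.00, 9.25, 10.75]
boy = [0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1]


def ols(y, *xs):
    """Least squares coefficients [b0, b1, ...] (hypothetical helper).

    Builds the normal equations (X'X) b = X'y, with a column of 1s for
    the intercept, and solves them by Gauss-Jordan elimination.
    """
    rows = [[1.0] + [float(x[i]) for x in xs] for i in range(len(y))]
    p = len(rows[0])
    # Augmented matrix [X'X | X'y]
    a = [[sum(r[j] * r[k] for r in rows) for k in range(p)]
         + [sum(r[j] * yi for r, yi in zip(rows, y))]
         for j in range(p)]
    for col in range(p):
        # Partial pivoting: bring the largest entry of this column up
        pivot = max(range(col, p), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        pv = a[col][col]
        a[col] = [v / pv for v in a[col]]
        for r in range(p):
            if r != col:
                f = a[r][col]
                a[r] = [vr - f * vc for vr, vc in zip(a[r], a[col])]
    return [a[j][p] for j in range(p)]


model1 = ols(dscore, boy)        # DSCORE on BOY (Table 1)
model2 = ols(dscore, boy, age)   # DSCORE on BOY and AGE (Table 2)

print("Model 1 (constant, BOY):     ", [round(b, 3) for b in model1])
print("Model 2 (constant, BOY, AGE):", [round(b, 3) for b in model2])
```

The printed coefficients should agree with the Coefficient columns of Tables 1 and 2 up to rounding.  The coefficient of BOY changes sign and shrinks toward 0 once AGE enters the model.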


4.  The Mechanism of Specification Bias aka Spuriousness

The mechanism of spuriousness aka specification bias is presented graphically in the next exhibit.
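
The mechanism can also be checked numerically on the D-score data.  The sketch below relies on a standard least squares identity that is not stated in these notes and is brought in here as an assumption: the coefficient of BOY in the simple regression equals its coefficient in the multiple regression plus the coefficient of the omitted variable AGE times the slope of AGE regressed on BOY.  Because BOY is a 0/1 variable, that slope is just the difference in mean age between boys and girls.  The variable names are illustrative.

```python
# Coefficients copied from Table 1 (Model 1) and Table 2 (Model 2)
b_boy_simple = 2.288      # BOY alone
b_boy_multiple = -0.126   # BOY, controlling for AGE
b_age_multiple = 0.753    # AGE, controlling for BOY

# AGE and BOY columns copied from the data listing in Table 1
age = [3.33, 3.25, 3.92, 3.50, 4.33, 4.92,
       6.08, 7.42, 8.33, 8.00, 9.25, 10.75]
boy = [0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1]

boys_age = [a for a, b in zip(age, boy) if b == 1]
girls_age = [a for a, b in zip(age, boy) if b == 0]

# Slope of AGE on BOY = mean age of boys minus mean age of girls
age_gap = sum(boys_age) / len(boys_age) - sum(girls_age) / len(girls_age)

# Part of the simple BOY coefficient that is really an AGE effect
spurious_part = b_age_multiple * age_gap

# Direct BOY effect plus spurious part should rebuild the simple coefficient
reconstructed = b_boy_multiple + spurious_part

print(f"Boys are older than girls by {age_gap:.3f} years on average")
print(f"Spurious part of BOY effect: {spurious_part:.3f}")
print(f"Reconstructed simple coefficient: {reconstructed:.3f}")
print(f"Simple coefficient in Table 1:    {b_boy_simple:.3f}")
```

The boys in the sample are about 3.2 years older than the girls, and that age gap multiplied by the effect of age accounts for essentially all of the apparent effect of sex in Model 1.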

5.  Standard Tabular Presentation of Regression Results

1.  Standard Presentation
The standard journal presentation of multiple regression results is aimed in part at facilitating the elaboration model by examining the effect of introducing a new "test" variable in the model.
The following table presents the results of the regression analysis of the D-score data in standard tabular format.
 
Table 3.  Unstandardized Regression Coefficients of Cognitive Development (D-score) on Sex and Age for 12 Children Aged 3 to 10 (t Ratios in Parentheses)

Independent variable        Model 1       Model 2
Constant                   10.305***      6.927***
                          (15.109)      (13.697)
Boy (boy=1, girl=0)         2.288*        -.126
                           (2.372)       (-.262)
Age (years)                   --           .753***
                                         (7.775)
R-square                     .360          .917
Adjusted R-square            .296          .899

Note:  * p < .05  ** p < .01  *** p < .001  (2-tailed tests)
2.  Suggestions on Preparing Tables of Regression Results
The following guidelines help in preparing tables of results acceptable to most professional journals.

2.  The Multiple Regression Model

1.  The Multiple Regression Model With p - 1 Independent Variables

The multiple linear regression model with p - 1 independent variables can be written
Yi = b0 + b1Xi1 + b2Xi2 + ... + bp-1Xi,p-1 + ei     i = 1, ..., n
where
Yi is the response for the ith case
Xi1, Xi2, ..., Xi,p-1 are the values of the p - 1 independent variables for the ith case, assumed to be known constants
b0, b1, ..., bp-1 are parameters
ei are independent N(0, σ²)
(The independent variables are indexed 1 to p - 1 so that the total number of independent variables, including the implicit column of 1s associated with the intercept b0, is equal to p.)
The interpretation of the parameters is
  1. b0, the Y intercept, indicates the mean of the distribution of Y when X1 = X2 = ... = Xp-1 = 0
  2. bk (k = 1, 2, ..., p - 1) indicates the change in the mean response E{Y} (measured in Y units) when Xk increases by one unit while all the other independent variables remain constant
  3. σ² is the common variance of the distribution of Y

2.  Example - Proposition 14 and The UFW

An example of the use of multiple regression analysis is based on the study by McVeigh (1993) of the United Farm Workers (UFW) movement in California in the 1970s.  The UFW movement was founded and led by Cesar Chavez to represent the interests of farm laborers in California.  Proposition 14 was placed on the ballot in California in 1976 by the UFW, and contained provisions favorable to the movement, including a section allowing union organizers limited access to the work site on the grower's property for organization purposes.  The California growers, who employed farm workers, were opposed to the proposition.  A YES vote on Proposition 14 therefore represented support for the UFW, while a NO vote represented support for the growers.
McVeigh assembled data on the 58 counties of California to test various hypotheses relating support for the UFW, represented by the percent voting YES on Proposition 14, to social characteristics of counties.  One of McVeigh's hypotheses is that support for Proposition 14 should be negatively related to the percentage of farm laborers in the labor force.  This is because (1) counties with large proportions of farm laborers have economies that are highly dependent on agriculture, so that a victory by UFW is likely to threaten the economic interests of a large segment of the population whose livelihood depends on agricultural production; this segment of the population is likely to vote NO on Proposition 14; and (2) even in counties where the proportion of farm workers is relatively large, the farm workers represent only a small percentage of the population and therefore do not constitute a substantial voting bloc in favor of Proposition 14.
McVeigh has a number of additional hypotheses relating social characteristics of the California counties to support for the UFW.  For example, he argues that the strategy of growers of casting Proposition 14 (which had provisions allowing access by union organizers to farm workers on the work site) as an attack on private property would lead to lower support for UFW in counties with large proportions of owner-occupied houses, since homeowners would presumably feel most threatened by the proposition.  These and other hypotheses can be investigated with multiple regression analysis.  McVeigh estimated the multiple regression of YESON14 on seven independent variables, equivalent to the equation
YESON14 = b0 + b1LABOR + b2CARTER + b3GINI + b4SPANORIG + b5MDINC + b6LOGPOP + b7HOMEOWN + ei
The variables are defined as
LABOR, % farm laborers in the labor force
CARTER, % vote for Jimmy Carter for president (a sure sign of liberalism)
GINI, the Gini coefficient of inequality of the distribution of family income
SPANORIG, percent population of Hispanic origin
MDINC, median family income
LOGPOP, logarithm of county population
HOMEOWN, percent housing owner occupied
The SYSTAT multiple-regression output follows.

Table 4.  Regression of Yes Vote on Proposition 14 in California Counties, 1976.

 DEP VAR: YESON14      N:      58  MULTIPLE R: 0.891  SQUARED MULTIPLE R: 0.793
 ADJUSTED SQUARED MULTIPLE R:  .764    STANDARD ERROR OF ESTIMATE:        4.258

 VARIABLE      COEFFICIENT    STD ERROR     STD COEF TOLERANCE    T   P(2 TAIL)
 CONSTANT          -48.632       16.312        0.000      .      -2.981    0.004
 LABOR              -2.086        0.615       -0.389     0.315   -3.393    0.001
 CARTER              0.530        0.133        0.301     0.729    3.993    0.000
 GINI               68.438       35.317        0.161     0.600    1.938    0.058
 SPANORIG            0.011        0.104        0.011     0.422    0.109    0.914
 MDINC               3.003        0.591        0.497     0.431    5.080    0.000
 LOGPOP              1.389        0.551        0.289     0.314    2.522    0.015
 HOMEOWN            -0.249        0.086       -0.248     0.566   -2.896    0.006
 

                        ANALYSIS OF VARIANCE

 SOURCE       SUM-OF-SQUARES   DF  MEAN-SQUARE     F-RATIO       P
 REGRESSION        3475.600     7      496.514      27.389       0.000
 RESIDUAL           906.426    50       18.129



 

3.  Basic Results For Multiple Regression Model

1.  Correlation Matrix and Splom

The simple correlation coefficients among variables in the multiple regression model are often presented in the form of a matrix.
The correlations can also be presented graphically in a corresponding scatterplot matrix, or splom.  As presented in the next exhibit, the dependent variable (YESON14) is listed last, so the correlations involving it appear together on the bottom row of the splom, with each panel showing the dependent variable on the vertical axis.  The splom uses the HALF option so that only one panel is shown for each correlation, to reduce the visual clutter.

2.  Estimated Regression Function ^Y

The estimated regression function for the multiple regression model with p - 1 variables is
^Y = b0 + b1X1 + ... + bp-1Xp-1
where b0, b1, ..., bp-1 are estimated as the solution of the ordinary least squares normal equations.  (These equations will be derived in matrix notation in SOCI209.)
On the standard multiple regression printout the estimated coefficients bk are presented, together with the estimated standard errors s{bk} and the t-ratio t* = bk/s{bk} (see later).
Example: keeping constant the other variables in the model, the estimated coefficient for LABOR (% farm laborers in labor force) is -2.086, so that an increase of 1 percentage point in farm laborers is associated with a decline of about 2 percentage points in the Yes vote for Proposition 14.

3.  Analysis of Variance (ANOVA)

1.  Fitted Values ^Yi
The fitted values ^Yi are defined in a way analogous to simple regression as
^Yi = b0 + b1Xi1 + ... + bp-1Xi,p-1
Note that ^Yi is a single number associated with each case, regardless of the number p - 1 of independent variables in the model.
2.  Sums of Squares
The sums of squares are defined identically in simple and multiple regression, as
SSTO = Σ(Yi - Y.)²
SSE = Σ(Yi - ^Yi)²
SSR = Σ(^Yi - Y.)²
3.  Degrees of Freedom
The degrees of freedom associated with various sums of squares are
SSTO has n - 1 df associated with it, with 1 df lost because the sample mean is estimated from the data (same as before)
SSE has n - p df because the n residuals ei = Yi - ^Yi are calculated using p parameters b0, b1, ..., bp-1 estimated from the data
SSR has p - 1 df because of the p estimated parameters b0, b1, ..., bp-1 used to calculate the fitted values ^Yi, minus 1 df associated with a constraint on the sum of the fitted values (see NWW p. 604)
4.  Mean Squares
Mean squares are sums of squares divided by their respective degrees of freedom (df).
In particular, MSE = SSE/(n - p) is again the estimate of σ², the common variance of e and of Y.
5.  ANOVA Table
Analysis of variance results are summarized in an ANOVA table analogous to the one for simple regression.  Table 5a shows the general format of the ANOVA table and Table 5b shows the table for the UFW example.
 
Table 5a.  General Format of ANOVA Table for Multiple Regression

Source of variation   SS                      df      MS
Regression            SSR = Σ(^Yi - Y.)²      p - 1   MSR = SSR/(p - 1)
Error                 SSE = Σ(Yi - ^Yi)²      n - p   MSE = SSE/(n - p)
Total                 SSTO = Σ(Yi - Y.)²      n - 1   sY² = SSTO/(n - 1)

Table 5b.  ANOVA Table for UFW Example

Source of variation   SS                  df   MS
Regression            SSR = 3475.600       7   MSR = 496.514
Error                 SSE = 906.426       50   MSE = 18.129
Total                 SSTO = 4382.026     57   sY² = 76.878
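
The entries of the ANOVA table for the UFW example follow from the two sums of squares on the Table 4 printout together with n and p.  A minimal Python sketch of that arithmetic (the variable names are illustrative, not SYSTAT's):

```python
# Quantities read from the SYSTAT printout in Table 4
n = 58          # number of counties
p = 8           # estimated parameters: intercept plus 7 independent variables
ssr = 3475.600  # regression sum of squares
sse = 906.426   # residual (error) sum of squares

# The total sum of squares is the sum of the two components
ssto = ssr + sse

# Degrees of freedom
df_regression = p - 1
df_error = n - p
df_total = n - 1

# Mean squares = sum of squares / degrees of freedom
msr = ssr / df_regression
mse = sse / df_error
var_y = ssto / df_total   # sample variance of Y

print(f"Regression  SS = {ssr:9.3f}  df = {df_regression:2d}  MS = {msr:8.3f}")
print(f"Error       SS = {sse:9.3f}  df = {df_error:2d}  MS = {mse:8.3f}")
print(f"Total       SS = {ssto:9.3f}  df = {df_total:2d}  MS = {var_y:8.3f}")
```

Note that the degrees of freedom add up in the same way as the sums of squares: (p - 1) + (n - p) = n - 1.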

4.  Coefficient of Multiple Determination R²

1.  Coefficient of Multiple Determination R²
The coefficient of multiple determination R² is defined analogously to the simple regression r² as
R² = SSR/SSTO = 1 - (SSE/SSTO)
where
0 <= R² <= 1
Example: in the UFW example
R² = SSR/SSTO = 3475.600/4382.026 = 0.793
as shown on the printout of Table 4.
2.  Coefficient of Multiple Correlation
The coefficient of multiple correlation R is the positive square root of R², so that
R = +(R²)^(1/2)
where R is always positive (R >= 0).
Q - Why is R always positive in the multiple regression context?
3.  Adjusted R-Square Ra²
The adjusted coefficient of multiple determination Ra² adjusts for the number of independent variables in the model (to correct for the fact that R² never decreases when independent variables are added to the model).  It is calculated as
Ra² = 1 - ((n - 1)/(n - p))(SSE/SSTO) = 1 - MSE/(SSTO/(n - 1))
Example: In the UFW printout the adjusted R-square Ra² is
1 - ((58 - 1)/(58 - 8))(906.426/4382.026) = .764
as contrasted with the ordinary (unadjusted) R² = 0.793.
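
The three measures of fit can be reproduced from the sums of squares of the UFW example.  The following Python sketch (illustrative variable names) also checks that the two expressions given for the adjusted R-square agree:

```python
import math

# Quantities from the UFW printout (Table 4)
n, p = 58, 8
sse = 906.426
ssto = 4382.026

# Coefficient of multiple determination and multiple correlation
r2 = 1 - sse / ssto
r = math.sqrt(r2)   # positive square root by definition

# Adjusted R-square, first form: penalize SSE/SSTO by (n - 1)/(n - p)
r2_adj = 1 - ((n - 1) / (n - p)) * (sse / ssto)

# Adjusted R-square, second form: 1 - MSE / (sample variance of Y)
mse = sse / (n - p)
var_y = ssto / (n - 1)
r2_adj_alt = 1 - mse / var_y

print(f"R-square          = {r2:.3f}")
print(f"Multiple R        = {r:.3f}")
print(f"Adjusted R-square = {r2_adj:.3f}")
```

The results should match the MULTIPLE R, SQUARED MULTIPLE R, and ADJUSTED SQUARED MULTIPLE R entries of Table 4 up to rounding.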

4.  F Test for Regression Relation (Screening Test)

The F test for regression relation (aka screening test) tests the existence of a relation between the dependent variable and the entire set of independent variables.  The test involves the hypothesis setup
H0: b1 = b2 = ... = bp-1 = 0
H1: Not all bk = 0  (k = 1, 2, ..., p - 1)
The test statistic is (same as for simple linear regression)
F* = MSR/MSE
which, when H0 is true, is distributed as F(p - 1; n - p), with the same df as the numerator and denominator, respectively, of the ratio MSR/MSE.

Using the P-value method, calculate the P-value P{F(p - 1; n - p) > F*}.
If P-value < α conclude H1 (not all coefficients = 0, so there is a statistical relation); otherwise conclude H0 (there is no statistical relation).
Using the decision theory method, choose a significance level α.
Then the decision rule is

if F* <= F(1 - α; p - 1, n - p), conclude H0
if F* > F(1 - α; p - 1, n - p), conclude H1
Example: In the UFW example
F* = 496.514/18.129 = 27.389
Using the P-value method, P{F(7, 50) > 27.389} = .000000 to six decimal places.  Since P-value < .05 = α, conclude H1, that not all regression coefficients are 0.
Using the decision theory method, find F(0.95; 7, 50) = 2.199202.  Since F* = 27.389 > 2.199, conclude H1, that not all regression coefficients are 0, with this method also.
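
The decision theory version of the screening test reduces to one division and one comparison.  In the Python sketch below the critical value F(0.95; 7, 50) is not computed; it is copied from the text above, since the Python standard library has no F distribution.  Variable names are illustrative.

```python
# Mean squares from the UFW ANOVA table
n, p = 58, 8
msr = 496.514
mse = 18.129

# Test statistic for H0: b1 = b2 = ... = bp-1 = 0
f_star = msr / mse

# Critical value F(0.95; 7, 50), copied from the text (alpha = 0.05)
f_crit = 2.199202

# Decision rule: conclude H1 when F* exceeds the critical value
reject_h0 = f_star > f_crit

print(f"F* = {f_star:.3f} on ({p - 1}, {n - p}) df")
print(f"Critical value = {f_crit:.3f}")
print("Conclude H1" if reject_h0 else "Conclude H0")
```

Because the printed mean squares are rounded, F* computed this way can differ from the F-RATIO on the printout in the last decimal place.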

5.  Inference Concerning Individual Regression Coefficients

Statistical inference on individual regression coefficients bk is carried out in the same way as for simple regression, except that the t tests are now based on the Student t distribution with n - p df (corresponding to the n - p df associated with MSE), instead of the n - 2 df of the simple regression model.

1.  CI for bk

The 1 - α confidence limits for a coefficient bk of a multiple regression model are given by
bk ± t(1 - α/2; n - p)s{bk}
where s{bk} is the estimated standard deviation of bk and is provided on the standard regression printout next to bk.  (The calculation of s{bk} is discussed in SOCI209.)

Example:  In the UFW example, calculate a 95% CI for the coefficient of LABOR.  The ingredients are

b1 = -2.086; s{b1} = 0.615; n = 58; p = 8; α = .05
Calculate n - p = 50 and t(0.975; 50) = 2.008559.  Thus the confidence limits are
L = -2.086 - (2.008559)(0.615) = -3.321264
U = -2.086 + (2.008559)(0.615) = -0.850736
In other words, with 95% confidence
-3.321264 <= b1 <= -0.850736
That is, with 95% confidence, the change in the YES vote for Proposition 14 associated with an increase of 1 percentage point in the percentage of farm laborers is between -3.321 and -0.851 percentage points.
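
The confidence interval for the LABOR coefficient can be written as a small reusable function.  This is a sketch only: the function name coef_ci is hypothetical, and the critical value t(0.975; 50) is taken from the text rather than computed.

```python
def coef_ci(b, se, t_crit):
    """Confidence limits b +/- t_crit * se for a regression coefficient."""
    half_width = t_crit * se
    return b - half_width, b + half_width


# LABOR coefficient and its standard error from Table 4
b1 = -2.086
se_b1 = 0.615

# t(0.975; 50), copied from the text (n - p = 58 - 8 = 50 df)
t_crit = 2.008559

lower, upper = coef_ci(b1, se_b1, t_crit)
print(f"95% CI for the LABOR coefficient: [{lower:.6f}, {upper:.6f}]")

# The interval excludes 0, which agrees with the two-sided t test at alpha = 0.05
print("Interval excludes 0:", not (lower <= 0 <= upper))
```

The same function applies to any other row of the printout by substituting that row's coefficient and standard error.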

2.  Hypothesis Tests for bk

1. Two-Sided Tests
The most common tests concerning bk involve the null hypothesis that bk = 0.
The alternatives are
H0: bk = 0
H1: bk <> 0
The test statistic is
t* = bk/s{bk}
where s{bk} is the estimated standard deviation of bk.
When bk = 0, t* ~ t(n - p).

Example: Test that the coefficient of LABOR is different from 0.  The hypotheses are

H0: b1 = 0
H1: b1 <> 0
The test statistic (aka "t ratio") is
t* = b1/s{b1} = -2.086/0.615 = -3.393 (provided on printout under "T")
When b1= 0, t* is distributed as t(n - p) = t(50).
Using the P-value method, find the 2-tailed P-value = P{|t(50)| > |-3.393|} = (2)P{t(50) < -3.393} = 0.001.
Since P-value = 0.001 < 0.05 = α, conclude H1, that b1 <> 0.
Using the decision theory method, choose a significance level, say α = 0.05.  The critical value is t(0.975; 50) = 2.008559.
Since |t*| = |-3.393| > 2.008559, conclude H1, that b1 <> 0, by this method also.
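
The two-tailed P-value for the LABOR coefficient can be approximated without statistical tables by integrating the Student t density numerically.  The Python sketch below is one way to do this, not the method SYSTAT uses: the helper names t_pdf and two_tailed_p are hypothetical, and Simpson's rule with 2000 steps is an arbitrary choice.

```python
import math


def t_pdf(x, df):
    """Density of the Student t distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)


def two_tailed_p(t_star, df, steps=2000):
    """P{|t(df)| > |t_star|}, computed as 1 - 2 * (area from 0 to |t_star|).

    The area is obtained with Simpson's rule; steps must be even.
    """
    upper = abs(t_star)
    h = upper / steps
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 1 - 2 * (total * h / 3)


# LABOR coefficient and standard error from Table 4; n - p = 50 df
b1, se_b1, df = -2.086, 0.615, 50

t_star = b1 / se_b1
p_value = two_tailed_p(t_star, df)

print(f"t* = {t_star:.3f}")
print(f"two-tailed P-value = {p_value:.4f}")
print("Conclude H1" if p_value < 0.05 else "Conclude H0")
```

As a check on the integration, two_tailed_p(2.008559, 50) should return approximately 0.05, since 2.008559 is the critical value t(0.975; 50) used above.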
2.  One-Sided Tests
One-sided tests for a coefficient bk are carried out by dividing the 2-sided P-value by 2, as before (provided the sign of the estimated bk agrees with the direction of the alternative hypothesis).

6.  CI for E{Yh}

It is often important to estimate the mean response E{Yh} for given values of the independent variables.
The values of the independent variables for which E{Yh} is to be estimated are denoted
Xh1, Xh2, ..., Xh,p-1
(This set of values of the X variables may or may not correspond to one of the cases in the data set.)
The estimator of E{Yh} is
^Yh = b0 + b1Xh1 + b2Xh2 + ... + bp-1Xh,p-1
The 1 - α confidence limits for the mean response E{Yh} are then given by
^Yh ± t(1 - α/2; n - p)s{^Yh}
where s{^Yh} is the estimated standard deviation of ^Yh.
The quantity s{^Yh} can be calculated with a statistical program using a technique explained in the last section of this module.

Example: In the UFW example, one can obtain the predicted Yes vote ^Yh and its estimated standard error s{^Yh} by adding to the data set a "dummy" case with the chosen Xhk values for the independent variables, and a missing value for the dependent variable.  Using SYSTAT, go to the data window and add a case (row) with YESON14 = ., LABOR = 4.0, CARTER = 61, GINI = 0.36, SPANORIG = 15, MDINC = 10, LOGPOP = 9.5, and HOMEOWN = 50.  The ID number for the new case is 59.  Then run the regression model and save the residuals.  Open the file of residuals.  The desired quantities are given for case 59 as

^Yh = ESTIMATE = 30.934
s{^Yh} = SEPRED = 2.578
Choosing α = 0.05, the 0.95 confidence limits for E{Yh} are then calculated as
L = 30.934 - (2.578)(2.008559) = 25.756
U = 30.934 + (2.578)(2.008559) = 36.112
where 2.008559 is t(0.975; 50).
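
The point estimate ^Yh can be checked by hand from the coefficients of Table 4 and the chosen Xh values, and the confidence limits then follow from SEPRED.  In the Python sketch below the dictionary layout and variable names are illustrative; the standard error 2.578 is copied from the SYSTAT output because it cannot be derived from the printed coefficients alone.

```python
# Estimated coefficients from Table 4
coef = {
    "CONSTANT": -48.632, "LABOR": -2.086, "CARTER": 0.530, "GINI": 68.438,
    "SPANORIG": 0.011, "MDINC": 3.003, "LOGPOP": 1.389, "HOMEOWN": -0.249,
}

# Values of the independent variables for the added case 59
x_h = {
    "LABOR": 4.0, "CARTER": 61, "GINI": 0.36, "SPANORIG": 15,
    "MDINC": 10, "LOGPOP": 9.5, "HOMEOWN": 50,
}

# ^Yh = b0 + b1*Xh1 + ... + bp-1*Xh,p-1, using the rounded printed coefficients
y_hat_from_coef = coef["CONSTANT"] + sum(coef[k] * x_h[k] for k in x_h)

# Values reported by SYSTAT for case 59
estimate = 30.934   # ESTIMATE
sepred = 2.578      # SEPRED, the estimated standard deviation of ^Yh

# t(0.975; 50), copied from the text
t_crit = 2.008559

lower = estimate - t_crit * sepred
upper = estimate + t_crit * sepred

print(f"^Yh from rounded coefficients: {y_hat_from_coef:.3f}")
print(f"95% CI for E{{Yh}}: [{lower:.3f}, {upper:.3f}]")
```

The hand-computed ^Yh differs from ESTIMATE only in the third decimal place, because the printed coefficients are rounded.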

7.  Prediction Interval for Yh(new)

Skip this topic, unless you are a business type.
 

8.  Other Elements of the Multiple Regression Printout

Two additional elements of the standard regression output become relevant in the multiple-regression context.

1.  Standardized Regression Coefficients

The standardized regression coefficient  bk* is  calculated as:
bk*  =  bk(s(Xk)/s(Y))
where s(Xk) and s(Y) denote the sample standard deviations of Xk and Y, respectively.
Thus the standardized coefficient bk* is calculated as the original (unstandardized) regression coefficient bk multiplied by the ratio of the standard deviation of Xk to the standard deviation of Y.
Conversely, one can recover the unstandardized coefficient from the standardized one as
bk  =   bk*(s(Y)/s(Xk))
The standardized coefficient bk* measures the change in standard deviations of Y associated with an increase of one standard deviation of Xk.
Standardized coefficients permit comparisons of the relative strength of the effects of different independent variables, measured in different metrics (= units).
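
The conversion between unstandardized and standardized coefficients can be illustrated with the D-score data of Table 1, for which the raw data are available.  The Python sketch below (illustrative variable names) standardizes the coefficient of BOY from the simple regression and then converts it back.

```python
import statistics

# DSCORE and BOY columns copied from the data listing in Table 1
dscore = [8.61, 9.40, 9.86, 9.91, 10.53, 10.61,
          10.59, 13.28, 12.76, 13.44, 14.27, 14.13]
boy = [0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1]

# Unstandardized coefficient of BOY from Table 1
b_boy = 2.288

# Sample standard deviations (n - 1 in the denominator)
s_x = statistics.stdev(boy)
s_y = statistics.stdev(dscore)

# Standardized coefficient: bk* = bk * s(Xk) / s(Y)
b_std = b_boy * s_x / s_y

# Converse: bk = bk* * s(Y) / s(Xk)
b_back = b_std * s_y / s_x

print(f"s(BOY) = {s_x:.3f}, s(DSCORE) = {s_y:.3f}")
print(f"standardized coefficient = {b_std:.3f}")
print(f"recovered unstandardized coefficient = {b_back:.3f}")
```

The result should agree with the Std Coef entry for BOY in Table 1 (0.600).  In a simple regression the standardized coefficient also equals the Pearson correlation between the two variables, which is the 0.600 shown in the correlation matrix.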

Example:  In the results of Table 4 for the UFW data, standardized coefficients are provided in the column headed STD COEF.  One sees that the standardized coefficient for MDINC (median family income) is 0.497, and the standardized coefficient for HOMEOWN (% housing owner occupied) is -0.248.  Using standardized coefficients it is possible to say that the positive effect of MDINC (0.497) on the Yes vote on Proposition 14 is about twice as strong as the negative effect of HOMEOWN (-0.248).  Standardization converts the original metric of each variable into standard deviation units.

Q - Why did rich counties give such strong support to the UFW?

2.  Tolerance or Variance Inflation Factor

The standard multiple regression output often provides a diagnostic measure of the collinearity of a predictor with the other predictors in the model, either the tolerance (TOL) or the variance inflation factor (VIF).
1.  Tolerance (TOL)
TOL = 1 - Rk²
where Rk² is the R-square of the regression of Xk on the other p - 2 predictors in the regression and a constant.  TOL can vary between 0 and 1; a common rule of thumb is that
TOL < .1
is an indication that collinearity may unduly influence the results.
2.  Variance Inflation Factor
VIF = 1/TOL = 1/(1 - Rk²)
The variance inflation factor is the inverse of the tolerance.  Large values of VIF therefore indicate a high level of collinearity.
The corresponding rule of thumb is that
 VIF > 10
is an indication that collinearity may unduly influence the results.
We will discuss the mechanism and consequences of collinearity in SOCI209.

Example:  In the UFW and Proposition 14 printout (Table 4), TOL values are given in the column headed TOLERANCE.  TOL values range from 0.729 (CARTER) down to 0.314 (LOGPOP).  The smallest TOL value is thus well above the 0.1 cutoff, so by this rule of thumb there is no indication of a collinearity problem in this regression model.
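
Both diagnostics are simple to compute.  The Python sketch below first reproduces the tolerance in Table 2 from the correlation between AGE and BOY in Table 1 (with only two predictors, Rk² is the squared correlation between them), then converts the Table 4 tolerances to VIFs and applies the rules of thumb.  Variable names are illustrative.

```python
# Two-predictor case (D-score data): Rk^2 is the squared AGE-BOY correlation
r_age_boy = 0.647               # from the correlation matrix in Table 1
tol_dscore = 1 - r_age_boy ** 2
vif_dscore = 1 / tol_dscore
print(f"D-score model: TOL = {tol_dscore:.3f}, VIF = {vif_dscore:.2f}")

# Tolerances copied from the TOLERANCE column of Table 4 (UFW example)
tolerance = {
    "LABOR": 0.315, "CARTER": 0.729, "GINI": 0.600, "SPANORIG": 0.422,
    "MDINC": 0.431, "LOGPOP": 0.314, "HOMEOWN": 0.566,
}

# VIF is the inverse of the tolerance
vif = {name: 1 / tol for name, tol in tolerance.items()}

# Rule of thumb: TOL < 0.1 (equivalently VIF > 10) signals possible trouble
flagged = [name for name, tol in tolerance.items() if tol < 0.1]

for name in tolerance:
    print(f"{name:9s} TOL = {tolerance[name]:.3f}  VIF = {vif[name]:.2f}")
print("Predictors flagged for collinearity:", flagged)
```

The tolerance computed for the D-score model should match the 0.581 shown for both BOY and AGE in Table 2, and no predictor in the UFW model is flagged.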

9.  Multiple Regression in Practice

Instructions to do multiple regression with a variety of options are provided in the following exhibits.




Last modified 20 Nov 2002