SOCI208 - Module 17 - Multiple Regression
1. Need for Models With More Than One Independent Variable
1. Motivations for Multiple Regression Analysis
The 2 principal motivations for models with more than one independent variable
are:
-
to make the predictions of the model more precise by adding other factors
believed to affect the dependent variable to reduce the proportion of error
variance associated with SSE; for example
-
in a model explaining prestige of current occupation as a function of years
of education, add SES of family of origin and IQ
-
in a study of the sale prices of homes in a county, include as many characteristics
of the house as can affect the price (such as heated area, land area,
age of the house, number of bathrooms, etc.) to obtain the best fitting
model, in order to derive estimates ^Yh of the values of houses
in the county (for tax purposes) that are as accurate as possible
-
to support a causal theory by eliminating potential sources of spuriousness.
This is sometimes called the elaboration model. EX:
-
in a model of socioeconomic success as a function of SES of family of origin,
add IQ of subject to control for a possible inflated effect of SES that
overestimates childhood environmental influences on adult outcome
The second motivation is very important for scientific applications of
regression analysis. It is discussed further in the next section.
2. Supporting a Causal Statement by Eliminating Alternative Hypotheses
Theories about social phenomena are made up of causal statements.
In Constructing Social Theories, Arthur Stinchcombe (1968) defines
a causal statement or law as follows: "A causal law is a statement or proposition
in a theory which says that there exist environments ... in which a change
in the value of one variable is associated with a change in the value of
another variable and can produce this change without any change in other
variables in the environment" (p. 31).
Such a causal statement can be represented schematically as
X (independent) --> Y (dependent)
Stinchcombe argues further that one of the requirements to support or refute
a causal theory is to ascertain nonspuriousness. Ascertaining
nonspuriousness means checking whether one or more other variables affect
both X and Y and thereby produce an apparent association between X and
Y that is spuriously attributed to a causal influence of X on Y.
Multiple regression analysis can be used to ascertain nonspuriousness
to an extent that depends on the design of the study:
-
with experimental data (in which the values of some of the variables in
X are deliberately set by the experimenter) ascertaining nonspuriousness
is effective, since the value of the "treatment" has been deliberately
dissociated (through random assignment) from the values of the other independent
variables characterizing the elements
-
with observational data the task of ascertaining nonspuriousness remains
open-ended, as it is never possible to prove that all potential sources
of spuriousness have been controlled
In the context of regression analysis, spuriousness is called specification
bias. Specification bias is a more general and continuous notion
than spuriousness. The idea is that if a regression model of Y on
X excludes a variable that is both associated with X and a cause
of Y (the model is then called misspecified) the estimated association
of Y with X will be inflated (or, conversely, deflated) relative to its
true value. The regression estimator, in a sense, falsely "attributes"
to X a causal influence that is in reality due to the omitted variable(s).
Ascertaining nonspuriousness is equivalent to eliminating alternative
hypotheses on the source of the relationship between X and Y by adding
variables explicitly to the regression model.
3. The D-Score Data: an Example of Spurious Association
The D-score data (Koopmans 1987) illustrate how a spurious association
can be elucidated using multiple regression analysis.
A test of cognitive development is administered to a sample of 12 children
with ages ranging from 3 to 10. The cognitive development measure
is called a D-score.
The simple regression of D-score on sex is carried out. Sex is
represented by the variable BOY (coded Boy = 1, Girl = 0). The regression
reveals a significant positive effect of BOY on D-score: boys score significantly
higher than girls (P-value = 0.039).
Table 1. Simple Regression Analysis of the D-Score Data Set
Example from Koopmans,
Lambert. 1987. Introduction to Contemporary Statistical Methods.
(2d edition.) PWS-Kent. Pp. 554-557.
Data
Case number     OBS   DSCORE      AGE     BOY  BOY$
          1   1.000    8.610    3.330   0.000  G
          2   2.000    9.400    3.250   0.000  G
          3   3.000    9.860    3.920   0.000  G
          4   4.000    9.910    3.500   0.000  G
          5   5.000   10.530    4.330   1.000  B
          6   6.000   10.610    4.920   0.000  G
          7   7.000   10.590    6.080   1.000  B
          8   8.000   13.280    7.420   1.000  B
          9   9.000   12.760    8.330   1.000  B
         10  10.000   13.440    8.000   0.000  G
         11  11.000   14.270    9.250   1.000  B
         12  12.000   14.130   10.750   1.000  B
Pearson Correlation Matrix
          DSCORE     AGE     BOY
DSCORE     1.000
AGE        0.957   1.000
BOY        0.600   0.647   1.000
Simple Linear Regression
Dep Var: DSCORE   N: 12   Multiple R: 0.600   Squared multiple R: 0.360
Adjusted squared multiple R: 0.296   Standard error of estimate: 1.671
Effect      Coefficient  Std Error  Std Coef  Tolerance       t  P(2 Tail)
CONSTANT         10.305      0.682     0.000          .  15.109      0.000
BOY               2.288      0.965     0.600      1.000   2.372      0.039
Analysis of Variance
Source      Sum-of-Squares  df  Mean-Square  F-ratio      P
Regression          15.709   1       15.709    5.629  0.039
Residual            27.910  10        2.791
-------------------------------------------------------------------------------
*** WARNING ***
Case 10 is an outlier (Studentized Residual = 2.566)
Durbin-Watson D Statistic      1.183
First Order Autocorrelation    0.315
However, a symbolic plot of D-score against age, using symbols to identify
sex (B = Boy, G = Girl), reveals a systematic pattern.
Q - What is the pattern in the following figure?
A multiple regression analysis is then carried out, with D-score as
the dependent variable and both BOY and AGE as independent variables.
The results are shown in Table 2. This time the effect of BOY
becomes non-significant (P-value is 0.799); the effect of AGE on D-score
is strongly significant. One concludes that the significant effect
of sex (represented by the variable BOY) in the first regression was spurious.
It was a consequence of the (accidental) association in the sample between
age and sex, i.e. the tendency (visible in the scatterplot) for boys to
be older than girls, combined with the strong effect of age on D-score.
Introducing ("controlling for") age in the model has eliminated the spurious
effect of sex on cognitive development.
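The comparison between the two regressions can be reproduced in a few lines. The sketch below is an illustration only, not the SYSTAT procedure used for the tables: it assumes the data exactly as listed in Table 1 and solves the least-squares normal equations with a small hand-written helper (`ols` is a hypothetical name introduced here).

```python
# D-score data as listed in Table 1 (12 children).
DSCORE = [8.61, 9.40, 9.86, 9.91, 10.53, 10.61,
          10.59, 13.28, 12.76, 13.44, 14.27, 14.13]
AGE = [3.33, 3.25, 3.92, 3.50, 4.33, 4.92,
       6.08, 7.42, 8.33, 8.00, 9.25, 10.75]
BOY = [0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1]


def ols(columns, y):
    """Return [b0, b1, ...] solving the normal equations (X'X) b = X'y.

    `columns` is a list of predictor columns; the constant is added here.
    """
    rows = [[1.0] + [float(col[i]) for col in columns] for i in range(len(y))]
    p = len(rows[0])
    # Augmented matrix [X'X | X'y].
    a = [[sum(r[j] * r[k] for r in rows) for k in range(p)]
         + [sum(r[j] * yi for r, yi in zip(rows, y))]
         for j in range(p)]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(p):
        pivot_row = max(range(c, p), key=lambda r: abs(a[r][c]))
        a[c], a[pivot_row] = a[pivot_row], a[c]
        pivot = a[c][c]
        a[c] = [v / pivot for v in a[c]]
        for r in range(p):
            if r != c:
                factor = a[r][c]
                a[r] = [vr - factor * vc for vr, vc in zip(a[r], a[c])]
    return [a[j][p] for j in range(p)]


model1 = ols([BOY], DSCORE)        # constant, BOY
model2 = ols([BOY, AGE], DSCORE)   # constant, BOY, AGE
print("Model 1:", [round(b, 3) for b in model1])
print("Model 2:", [round(b, 3) for b in model2])
```

Up to rounding, the BOY coefficient should move from about 2.29 in the first model to about -0.13 once AGE is included, in line with Tables 1 and 2.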
Table 2. Multiple Regression of D-Score on BOY and AGE
Dep Var: DSCORE   N: 12   Multiple R: 0.958   Squared multiple R: 0.917
Adjusted squared multiple R: 0.899   Standard error of estimate: 0.634
Effect      Coefficient  Std Error  Std Coef  Tolerance       t  P(2 Tail)
CONSTANT          6.927      0.506     0.000          .  13.697      0.000
BOY              -0.126      0.480    -0.033      0.581  -0.262      0.799
AGE               0.753      0.097     0.979      0.581   7.775      0.000
Analysis of Variance
Source      Sum-of-Squares  df  Mean-Square  F-ratio      P
Regression          40.002   2       20.001   49.765  0.000
Residual             3.617   9        0.402
-------------------------------------------------------------------------------
Durbin-Watson D Statistic      2.277
First Order Autocorrelation   -0.313
4. The Mechanism of Specification Bias aka Spuriousness
The mechanism of spuriousness aka specification bias is presented
graphically in the next exhibit.
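As a rough algebraic sketch of the same mechanism (added for illustration; the superscripts and the symbol d are notation introduced here, not part of the original exhibit), the least-squares coefficients of the D-score regressions with and without AGE are linked by a standard identity:

$$b_{BOY}^{(1)} = b_{BOY}^{(2)} + b_{AGE}^{(2)}\, d$$

where the superscripts (1) and (2) refer to the models of Tables 1 and 2, and d is the slope of the regression of the omitted variable AGE on BOY, which here is simply the difference between the mean ages of boys and girls in Table 1 (7.693 - 4.487 = 3.207). Plugging in the printed values gives -0.126 + (0.753)(3.207) = 2.289, which agrees with the Model 1 coefficient of 2.288 up to rounding. The coefficient of BOY in the misspecified model thus absorbs the effect of AGE in proportion to the association between AGE and BOY.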
5. Standard Tabular Presentation of Regression Results
1. Standard Presentation
The standard journal presentation of multiple
regression results is aimed in part at facilitating the elaboration model
by examining the effect of introducing a new "test" variable in the model.
The following table presents the results of
the regression analysis of the D-score data in standard tabular format.
Table 3. Unstandardized Regression Coefficients of Cognitive
Development (D-score) on Sex and Age for 12 Children Aged 3 to 10 (t Ratios
in Parentheses)
| Independent variable | Model 1 | Model 2 |
| Constant | 10.305*** | 6.927*** |
| | (15.109) | (13.697) |
| Boy (boy=1, girl=0) | 2.288* | -.126 |
| | (2.372) | (-.262) |
| Age (years) | -- | .753*** |
| | | (7.775) |
| R-square | .360 | .917 |
| Adjusted R-square | .296 | .899 |
| Note: * p < .05 ** p < .01 *** p < .001 (2-tailed tests) |
2. Suggestions on Preparing Tables of Regression Results
The following guidelines help prepare tables of results acceptable
to most professional journals.
-
the title of the table contains information on the type of regression coefficients
shown (here, unstandardized coefficients), the dependent variable, the
independent variables (when there are too many to list in the title, one
says "on selected independent variables"), the nature of the units of observation
(children in a given age range), and the sample size (12). When n
is not the same in all the models (e.g., because of missing data), state
the maximum n in the title and specify the actual sample sizes in a row
labeled "N" placed below "Adjusted R-square". When appropriate, add
to the title information on elements of the larger context, such as geographic
location and time frame.
-
the independent variables are introduced one at a time in successive models
shown in the different columns of the table; variants of this strategy
often introduce together sets of related variables, such as
-
a set of indicators representing a categorical variable
-
different powers of X representing a polynomial function
-
variables related conceptually, e.g. father's education, mother's education,
and family income together representing family SES
-
significance levels of the coefficients are indicated with asterisks. American
Sociological Review usage is shown here. Check the main journals
in your field for usage. A legend at the bottom of the table indicates
the meaning of the symbols and specifies the type of test used (1-tailed
or 2-tailed). Both 2-tailed and 1-tailed tests can be used in the
same table by using a different symbol for 1-tailed tests. EX:
add a line at the bottom with: + p < .05 ++ p < .01 +++
p < .001 (1-tailed tests)
-
both R-square and Adjusted R-square are shown. Reviewers will often
insist that you show the adjusted R-square, even though N may be so large
it makes no difference. Give it to them. Never omit the regular
("unadjusted") R-square, though, as this can be used to reconstruct F-tests
from the table more easily (see SOCI209)!
-
the t-ratios (coefficient estimates divided by their standard error) are
shown in parentheses below the regression coefficients. Some people
present the standard error instead of the t-ratio, but this is a deplorable
practice because the standard errors are in the metric of the corresponding
regression coefficients. Thus standard errors are in general not
comparable across coefficients (unless the independent variables are in
the same metric) and they suffer different degrees of rounding when a fixed
number of decimal places is used. Because of this t-ratios for some
coefficients may not be computable with sufficient precision from the table,
which may lead to incorrect judgements of significance. By contrast
t-ratios are all in the same metric (that of a Student t variate with n-p
df) and are therefore directly comparable across coefficients, and they
convey the same (optimal) degree of precision across all coefficients when
a fixed number of decimals is used. Thus it is much better to
present the t-ratios than the standard errors of the coefficients.
-
a place holder (--) is used in place of the regression coefficient to show
that a variable is not included in a model; this is especially helpful
in large tables with many columns
-
the independent variables are labeled with human readable text, not the
computer symbol, in such a way that
-
the name of the variable is consistent with the numerical scale (e.g.,
SES must have values that are large for high SES and small for low SES;
values of "Democracy" must be high for democratic countries and low for
non-democratic ones)
-
the coding of a 0,1 indicator variable is explicitly defined when any doubt
is possible (e.g., an indicator called SEX must specify whether it is coded
as 1 for male and 0 for female, or the other way around)
-
the best way to label an indicator variable is with the name of the category
that is coded 1 (e.g., instead of calling the indicator SEX or GENDER,
call it MALE (with 1 for male and 0 for female) or FEMALE
(with 1 for female and 0 for male))
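The asterisk and t-ratio conventions described above are easy to apply mechanically. The following is a minimal sketch, assuming the three 2-tailed cutoffs in the legend of Table 3; the function names `stars` and `cell` are hypothetical and chosen only for this illustration.

```python
def stars(p_value):
    """Return the asterisks for a 2-tailed P-value (legend of Table 3)."""
    if p_value < 0.001:
        return "***"
    if p_value < 0.01:
        return "**"
    if p_value < 0.05:
        return "*"
    return ""


def cell(coefficient, t_ratio, p_value):
    """Format one table entry: coefficient with stars, t ratio in parentheses."""
    return f"{coefficient:.3f}{stars(p_value)} ({t_ratio:.3f})"


print(cell(2.288, 2.372, 0.039))    # BOY in Model 1
print(cell(-0.126, -0.262, 0.799))  # BOY in Model 2
```

A journal table would place the t ratio on the line below the coefficient, as in Table 3; the one-line form is used here only to keep the example short.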
2. The Multiple Regression Model
1. The Multiple Regression Model With p - 1 Independent Variables
The multiple linear regression model with p - 1 independent variables can
be written
$$Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \dots + \beta_{p-1} X_{i,p-1} + \varepsilon_i, \qquad i = 1, \dots, n$$
where
$Y_i$ is the response for the ith case
$X_{i1}, X_{i2}, \dots, X_{i,p-1}$ are the values of the p - 1 independent
variables for the ith case, assumed to be known constants
$\beta_0, \beta_1, \dots, \beta_{p-1}$ are parameters
$\varepsilon_i$ are independent $N(0, \sigma^2)$
(The independent variables are indexed 1 to p - 1 so that the total number
of independent variables, including the implicit column of 1s associated
with the intercept $\beta_0$, is equal to p.)
The interpretation of the parameters is
-
$\beta_0$, the Y intercept, indicates the mean of the distribution of Y when
$X_1 = X_2 = \dots = X_{p-1} = 0$
-
$\beta_k$ (k = 1, 2, ..., p - 1) indicates
the change in the mean response E{Y} (measured in Y units) when $X_k$
increases by one unit while all the other independent variables remain
constant
-
$\sigma^2$ is the common variance of the distribution of Y
2. Example - Proposition 14 and The UFW
An example of the use of multiple regression analysis is based on the study
by McVeigh (1993) of the United Farm Workers (UFW) movement in California
in the 1970s. The UFW movement was founded and led by Cesar Chavez
to represent the interests of farm laborers in California. Proposition
14 was placed on the ballot in California in 1976 by the UFW, and contained
provisions favorable to the movement, including a section allowing union
organizers limited access to the work site on the grower's property for
organization purposes. The California growers, who employed farm
workers, were opposed to the proposition. A YES vote on Proposition
14 therefore represented support for the UFW, while a NO vote represented
support for the growers.
McVeigh assembled data on the 58 counties of California to test various
hypotheses relating support for the UFW, represented by the percent voting
YES on Proposition 14, with social characteristics of counties. One
of McVeigh's hypotheses is that support for Proposition 14 should be negatively
related with the percentage of farm laborers in the labor force.
This is because (1) counties with large proportions of farm laborers have
economies that are highly dependent on agriculture, so that a victory by
UFW is likely to threaten the economic interests of a large segment of
the population whose livelihood depends on agricultural production; this
segment of the population is likely to vote NO on Proposition 14; and (2),
even in counties where the proportion of farm workers is relatively large,
the farm workers represent only a small percentage of the population and
therefore do not constitute a substantial voting bloc in favor of Proposition
14.
McVeigh has a number of additional hypotheses relating social characteristics
of the California counties with support for the UFW. For example,
he argues that the strategy of growers of casting Proposition 14 (which
had provisions allowing access by union organizers to farm workers on the
work site) as an attack on private property would lead to lower support
for UFW in counties with large proportions of owner-occupied houses, since
homeowners would presumably feel most threatened by the proposition.
This and other hypotheses can be investigated with multiple regression analysis.
McVeigh estimated the multiple regression of YESON14 on seven independent
variables equivalent to the equation
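$$\text{YESON14}_i = \beta_0 + \beta_1\,\text{LABOR}_i + \beta_2\,\text{CARTER}_i + \beta_3\,\text{GINI}_i + \beta_4\,\text{SPANORIG}_i + \beta_5\,\text{MDINC}_i + \beta_6\,\text{LOGPOP}_i + \beta_7\,\text{HOMEOWN}_i + \varepsilon_i$$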
The variables are defined as
LABOR, % farm laborers in the labor force
CARTER, % vote for Jimmy Carter for president (a sure sign of liberalism)
GINI, the Gini coefficient of inequality of the distribution of family
income
SPANORIG, percent population of Hispanic origin
MDINC, median family income
LOGPOP, logarithm of county population
HOMEOWN, percent housing owner occupied
The SYSTAT multiple-regression output follows.
Table 4. Regression of Yes Vote on Proposition 14 in California
Counties, 1976.
DEP VAR: YESON14   N: 58   MULTIPLE R: 0.891   SQUARED MULTIPLE R: 0.793
ADJUSTED SQUARED MULTIPLE R: .764   STANDARD ERROR OF ESTIMATE: 4.258
VARIABLE   COEFFICIENT  STD ERROR  STD COEF  TOLERANCE       T  P(2 TAIL)
CONSTANT       -48.632     16.312     0.000          .  -2.981      0.004
LABOR           -2.086      0.615    -0.389      0.315  -3.393      0.001
CARTER           0.530      0.133     0.301      0.729   3.993      0.000
GINI            68.438     35.317     0.161      0.600   1.938      0.058
SPANORIG         0.011      0.104     0.011      0.422   0.109      0.914
MDINC            3.003      0.591     0.497      0.431   5.080      0.000
LOGPOP           1.389      0.551     0.289      0.314   2.522      0.015
HOMEOWN         -0.249      0.086    -0.248      0.566  -2.896      0.006
ANALYSIS OF VARIANCE
SOURCE      SUM-OF-SQUARES  DF  MEAN-SQUARE  F-RATIO      P
REGRESSION        3475.600   7      496.514   27.389  0.000
RESIDUAL           906.426  50       18.129
3. Basic Results For Multiple Regression Model
1. Correlation Matrix and Splom
The simple correlation coefficients among variables in the multiple regression
model are often presented in the form of a matrix.
The correlations can also be presented graphically in a corresponding
scatterplot matrix, or splom. As presented in the next exhibit, the
dependent variable (YESON14) is listed last, so the correlations involving
it appear together on the bottom row of the splom, with each panel showing
the dependent variable on the vertical axis. The splom uses the HALF
option so that only one panel is shown for each correlation, to reduce
the visual clutter.
2. Estimated Regression Function ^Y
The estimated regression function for the multiple regression model with
p - 1 variables is
$$\hat{Y} = b_0 + b_1 X_1 + \dots + b_{p-1} X_{p-1}$$
where $b_0, b_1, \dots, b_{p-1}$ are estimated
as the solution of the ordinary least squares normal equations. (These
equations will be derived in matrix notation in SOCI209.)
On the standard multiple regression printout the estimated coefficients
$b_k$ are presented, together with the estimated standard errors
$s\{b_k\}$ and the t-ratio $t^* = b_k/s\{b_k\}$ (see
later).
Example: keeping constant the other variables in the model, the estimated
coefficient for LABOR (% farm laborers in labor force) is -2.086, so that
an increase of one percentage point in farm laborers is associated with a decline
of about 2 percentage points in Yes votes for Proposition 14.
3. Analysis of Variance (ANOVA)
1. Fitted Values ^Yi
The fitted values $\hat{Y}_i$ are defined in a way analogous to simple
regression as
$$\hat{Y}_i = b_0 + b_1 X_{i1} + \dots + b_{p-1} X_{i,p-1}$$
Note that $\hat{Y}_i$ is a single number associated with each case,
regardless of the number p - 1 of independent variables in the model.
2. Sums of Squares
The sums of squares are defined identically in simple and multiple regression,
as
$$SSTO = \sum (Y_i - \bar{Y})^2$$
$$SSE = \sum (Y_i - \hat{Y}_i)^2$$
$$SSR = \sum (\hat{Y}_i - \bar{Y})^2$$
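The three sums of squares can be checked numerically on the D-score data. This is a minimal sketch, assuming the data as listed in Table 1 and the simple regression of DSCORE on BOY; it relies on the fact that with a single 0/1 predictor the least-squares fitted value of each case is the mean of Y in that case's group, so no regression routine is needed.

```python
# D-score data as listed in Table 1.
DSCORE = [8.61, 9.40, 9.86, 9.91, 10.53, 10.61,
          10.59, 13.28, 12.76, 13.44, 14.27, 14.13]
BOY = [0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1]

n = len(DSCORE)
y_bar = sum(DSCORE) / n

# Fitted values for the regression of DSCORE on BOY: the group means.
group_mean = {}
for g in (0, 1):
    ys = [y for y, b in zip(DSCORE, BOY) if b == g]
    group_mean[g] = sum(ys) / len(ys)
fitted = [group_mean[b] for b in BOY]

ssto = sum((y - y_bar) ** 2 for y in DSCORE)
sse = sum((y - f) ** 2 for y, f in zip(DSCORE, fitted))
ssr = sum((f - y_bar) ** 2 for f in fitted)
print(round(ssto, 3), round(sse, 3), round(ssr, 3))
```

The values should agree with the Analysis of Variance part of Table 1 (regression 15.709, residual 27.910), and SSTO should equal SSR + SSE.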
3. Degrees of Freedom
The degrees of freedom associated with various sums of squares are
SSTO has n - 1 df associated with it, with 1 df lost because
the sample mean is estimated from the data (same as before)
SSE has n - p df because the n residuals $e_i = Y_i - \hat{Y}_i$ are
calculated using p parameters $b_0, b_1, \dots, b_{p-1}$
estimated from the data
SSR has p - 1 df because of the p estimated parameters $b_0, b_1, \dots, b_{p-1}$
used to calculate the $\hat{Y}_i$, minus 1 df associated with a constraint
on the sum of the fitted values (see NWW p. 604)
4. Mean Squares
Mean squares are sums of squares divided by their respective degrees of
freedom (df).
In particular, MSE = SSE/(n - p) is again the estimate of $\sigma^2$,
the common variance of $\varepsilon$ and of Y.
5. ANOVA Table
Analysis of variance results are summarized in an ANOVA table analogous
to the one for simple regression. Table 5a shows the general format
of the ANOVA table and Table 5b shows the table for the UFW example.
Table 5a. General Format of ANOVA Table for Multiple Regression
| Source of variation | SS | df | MS |
| Regression | $SSR = \sum(\hat{Y}_i - \bar{Y})^2$ | p - 1 | MSR = SSR/(p - 1) |
| Error | $SSE = \sum(Y_i - \hat{Y}_i)^2$ | n - p | MSE = SSE/(n - p) |
| Total | $SSTO = \sum(Y_i - \bar{Y})^2$ | n - 1 | $s_Y^2$ = SSTO/(n - 1) |
Table 5b. ANOVA Table for UFW Example
| Source of variation | SS | df | MS |
| Regression | SSR = 3475.600 | 7 | MSR = 496.514 |
| Error | SSE = 906.426 | 50 | MSE = 18.129 |
| Total | SSTO = 4382.026 | 57 | $s_Y^2$ = 76.878 |
4. Coefficient of Multiple Determination $R^2$
1. Coefficient of Multiple Determination $R^2$
The coefficient of multiple determination $R^2$ is defined analogously
to the simple regression $r^2$ as
$$R^2 = \frac{SSR}{SSTO} = 1 - \frac{SSE}{SSTO}$$
where
$$0 \le R^2 \le 1$$
Example: in the UFW example
$$R^2 = SSR/SSTO = 3475.600/4382.026 = 0.793$$
as shown on the printout of Table 4.
2. Coefficient of Multiple Correlation
The coefficient of multiple correlation R is the positive square root of $R^2$
so that
$$R = +\sqrt{R^2}$$
where R is always positive ($R \ge 0$).
Q - Why is R always positive in the multiple regression context?
3. Adjusted R-Square $R_a^2$
The adjusted coefficient of multiple determination $R_a^2$
adjusts for the number of independent variables in the model (to correct
for the fact that $R^2$ can never decrease when independent variables
are added to the model). It is calculated as
$$R_a^2 = 1 - \frac{n - 1}{n - p}\,\frac{SSE}{SSTO} = 1 - \frac{MSE}{SSTO/(n - 1)}$$
Example: In the UFW printout the adjusted R-square $R_a^2$ is
1 - ((58 - 1)/(58 - 8))(906.426/4382.026) = .764
as contrasted with the ordinary (unadjusted) $R^2$ = 0.793
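The arithmetic of the two coefficients can be checked directly. A minimal sketch, assuming only the sums of squares printed in Table 4 and n = 58, p = 8 (7 independent variables plus the constant):

```python
sse, ssto = 906.426, 4382.026   # residual and total sums of squares, Table 4
n, p = 58, 8                    # 58 counties; 7 predictors plus the constant

r2 = 1 - sse / ssto
r2_adj = 1 - ((n - 1) / (n - p)) * (sse / ssto)
print(round(r2, 3), round(r2_adj, 3))
```

The factor (n - 1)/(n - p) is always at least 1, so the adjusted value can never exceed the unadjusted one.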
4. F Test for Regression Relation (Screening Test)
The F test for regression relation (aka screening test) tests the
existence of a relation between the dependent variable and the entire
set of independent variables. The test involves the hypothesis
setup
$$H_0: \beta_1 = \beta_2 = \dots = \beta_{p-1} = 0$$
$$H_1: \text{not all } \beta_k = 0, \qquad k = 1, 2, \dots, p - 1$$
The test statistic is (same as for simple linear
regression)
$$F^* = \frac{MSR}{MSE}$$
which, under $H_0$, is distributed as F(p - 1; n - p), the same
df as the numerator and denominator, respectively, in the ratio MSR/MSE.
Using the P-value method, calculate the P-value $P\{F(p - 1; n - p) > F^*\}$.
If P-value < $\alpha$, conclude $H_1$
(not all coefficients = 0 so there is a statistical relation), otherwise
conclude $H_0$ (there is no statistical relation).
Using the decision theory method, choose a significance level $\alpha$.
Then the decision rule is
if $F^* \le F(1 - \alpha; p - 1, n - p)$, conclude $H_0$
if $F^* > F(1 - \alpha; p - 1, n - p)$, conclude $H_1$
Example: In the UFW example
F* = 496.514/18.129 = 27.389
Using the P-value method, $P\{F(7, 50) > 27.389\}$ < .000001 (shown as 0.000 on
the printout). Since the P-value is below $\alpha$ = .05, conclude $H_1$,
that not all regression coefficients are 0.
Using the decision theory method, find F(0.95; 7, 50) = 2.199202.
Since $F^*$ = 27.389 > 2.199, conclude $H_1$, that not all regression
coefficients are 0 with this method also.
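The screening test can be reproduced from the ANOVA quantities alone. The sketch below is an illustration under stated assumptions: the sums of squares come from Table 4, and the critical value 2.199 is copied from the text rather than computed. It also shows the equivalent form of the F ratio that uses only R-square, n and p, which is what makes it possible to reconstruct F tests from a published table.

```python
ssr, sse = 3475.600, 906.426    # regression and residual sums of squares, Table 4
n, p = 58, 8

f_star = (ssr / (p - 1)) / (sse / (n - p))   # MSR / MSE

# Equivalent form using only R-square, n and p.
r2 = ssr / (ssr + sse)
f_from_r2 = (r2 / (p - 1)) / ((1 - r2) / (n - p))

f_crit = 2.199   # F(0.95; 7, 50) as quoted in the text, not computed here
decision = "H1" if f_star > f_crit else "H0"
print(round(f_star, 3), round(f_from_r2, 3), decision)
```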
5. Inference Concerning Individual Regression Coefficients
Statistical inference on individual regression coefficients $\beta_k$
is carried out in the same way as for simple regression, except that the
t tests are now based on the Student t distribution with n - p df (corresponding
to the n - p df associated with MSE), instead of the n - 2 df of the simple
regression model.
1. CI for $\beta_k$
The 1 - $\alpha$ confidence limits for a coefficient $\beta_k$
of a multiple regression model are given by
$$b_k \pm t(1 - \alpha/2;\, n - p)\, s\{b_k\}$$
where $s\{b_k\}$ is the estimated standard deviation of $b_k$
and is provided on the standard regression printout next to $b_k$.
(The calculation of $s\{b_k\}$ is discussed in SOCI209.)
Example: In the UFW example, calculate a 95% CI for the coefficient
of LABOR. The ingredients are
$b_1$ = -2.086; $s\{b_1\}$ = 0.615; n = 58; p = 8; $\alpha$ = .05
Calculate n - p = 50 and t(0.975; 50) = 2.008559. Thus the confidence
limits are
L = -2.086 - (2.008559)(0.615) = -3.321264
U = -2.086 + (2.008559)(0.615) = -0.850736
In other words one can say that with 95% confidence
$$-3.321264 \le \beta_1 \le -0.850736$$
That is, with 95% confidence, the change in the YES vote for Proposition
14 associated with an increase of one percentage point in the percentage of
farm laborers is between -3.321 and -0.851 percentage points, i.e. a decrease
of between 0.851 and 3.321 points.
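The interval arithmetic is the same for every coefficient, so it can be wrapped in a small helper. A minimal sketch: `t_interval` is a hypothetical name introduced here, and the critical value t(0.975; 50) = 2.008559 is taken from the text rather than computed.

```python
def t_interval(estimate, std_error, t_crit):
    """Return (lower, upper) limits: estimate -/+ t_crit * std_error."""
    half_width = t_crit * std_error
    return estimate - half_width, estimate + half_width


T_CRIT = 2.008559   # t(0.975; 50), taken from the text

# LABOR coefficient and its standard error from Table 4.
lower, upper = t_interval(-2.086, 0.615, T_CRIT)
print(round(lower, 3), round(upper, 3))
```

Because both limits are negative, the interval excludes 0, which is consistent with the significant t ratio for LABOR in Table 4.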
2. Hypothesis Tests for $\beta_k$
1. Two-Sided Tests
The most common tests concerning $\beta_k$
involve the null hypothesis that $\beta_k = 0$.
The alternatives are
$$H_0: \beta_k = 0$$
$$H_1: \beta_k \ne 0$$
The test statistic is
$$t^* = \frac{b_k}{s\{b_k\}}$$
where $s\{b_k\}$ is the estimated standard deviation of $b_k$.
When $\beta_k = 0$, $t^* \sim t(n - p)$.
Example: Test that the coefficient of LABOR is different from 0.
The hypotheses are
$$H_0: \beta_1 = 0$$
$$H_1: \beta_1 \ne 0$$
The test statistic (aka "t ratio") is
$$t^* = b_1/s\{b_1\} = -2.086/0.615 = -3.393$$
(provided on printout under "T")
When $\beta_1 = 0$, $t^*$ is distributed as t(n - p) = t(50).
Using the P-value method, find the 2-tailed P-value = $P\{|t(50)| > |-3.393|\}
= (2)P\{t(50) < -3.393\} = 0.001$.
Since P-value = 0.001 < 0.05 = $\alpha$, conclude
$H_1$, that $\beta_1 \ne 0$.
Using the decision theory method, choose significance level, say $\alpha$
= 0.05. The critical value t(0.975; 50) = 2.008559.
Since $|t^*| = |-3.393| > 2.008559$, conclude $H_1$, that $\beta_1 \ne 0$,
by this method also.
2. One-Sided Tests
One-sided tests for a coefficient $\beta_k$
are carried out by dividing the 2-sided P-value by 2, as before, provided the
estimate $b_k$ has the sign specified by the alternative hypothesis.
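The P-values for these t tests can be approximated without a statistics package. The sketch below is an illustration, not the method used by SYSTAT: it integrates the Student t density numerically with Simpson's rule (function names are hypothetical), using the LABOR coefficient and standard error from Table 4. Recomputed from those rounded values, the t ratio comes out near -3.392 rather than the printed -3.393.

```python
import math


def t_density(x, df):
    """Density of the Student t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log(1 + x * x / df))


def two_tailed_p(t_star, df, steps=2000):
    """P{|t(df)| > |t_star|}, integrating the density with Simpson's rule."""
    b = abs(t_star)
    h = b / steps                      # steps must be even for Simpson's rule
    total = t_density(0.0, df) + t_density(b, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_density(i * h, df)
    central = 2 * total * h / 3        # P{-b < t(df) < b}, by symmetry
    return 1 - central


t_star = -2.086 / 0.615                # LABOR, Table 4
p_two_sided = two_tailed_p(t_star, df=50)
p_one_sided = p_two_sided / 2          # valid if the sign of b agrees with H1
print(round(t_star, 3), round(p_two_sided, 4), round(p_one_sided, 4))
```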
6. CI for E{Yh}
It is often important to estimate the mean response E{Yh} for
given values of the independent variables.
The values of the independent variables for which $E\{Y_h\}$
is to be estimated are denoted
$$X_{h1}, X_{h2}, \dots, X_{h,p-1}$$
(This set of values of the X variables may or may not correspond to one
of the cases in the data set.)
The estimator of $E\{Y_h\}$ is
$$\hat{Y}_h = b_0 + b_1 X_{h1} + b_2 X_{h2} + \dots + b_{p-1} X_{h,p-1}$$
The 1 - $\alpha$ confidence limits for the mean response
$E\{Y_h\}$ are then given by
$$\hat{Y}_h \pm t(1 - \alpha/2;\, n - p)\, s\{\hat{Y}_h\}$$
where $s\{\hat{Y}_h\}$ is the estimated standard deviation of $\hat{Y}_h$.
The quantity s{^Yh} can be calculated with a statistical
program using a technique explained in the last section of this module.
Example: In the UFW example, one can obtain the predicted Yes vote ^Yh
and its estimated standard error s{^Yh} by adding to the data
set a "dummy" case with the chosen Xhk values for the independent
variables, and a missing value for the dependent variable. Using
SYSTAT, go to the data window and add a case (row) with YESON14 = ., LABOR
= 4.0, CARTER = 61, GINI = 0.36, SPANORIG = 15, MDINC = 10, LOGPOP = 9.5,
and HOMEOWN = 50. The ID number for the new case is 59. Then
run the regression model and save the residuals. Open the file of
residuals. The desired quantities are given for case 59 as
$\hat{Y}_h$ = ESTIMATE = 30.934
$s\{\hat{Y}_h\}$ = SEPRED = 2.578
Choosing $\alpha$ = 0.05, the 0.95 confidence limits
for $E\{Y_h\}$ are then calculated as
L = 30.934 - (2.578)(2.008559) = 25.756
U = 30.934 + (2.578)(2.008559) = 36.112
where 2.008559 is t(0.975; 50).
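The point estimate can be cross-checked by hand from the coefficients. A minimal sketch, assuming the rounded coefficients of Table 4 and the hypothetical county described above; the standard error SEPRED and the critical value are copied from the text, since computing s{^Yh} requires quantities not shown on the printout.

```python
# Rounded coefficients from Table 4.
COEF = {"CONSTANT": -48.632, "LABOR": -2.086, "CARTER": 0.530, "GINI": 68.438,
        "SPANORIG": 0.011, "MDINC": 3.003, "LOGPOP": 1.389, "HOMEOWN": -0.249}
# Values of the independent variables for the added case (ID 59).
X_H = {"LABOR": 4.0, "CARTER": 61, "GINI": 0.36, "SPANORIG": 15,
       "MDINC": 10, "LOGPOP": 9.5, "HOMEOWN": 50}

y_hat = COEF["CONSTANT"] + sum(COEF[k] * x for k, x in X_H.items())

SEPRED = 2.578      # s{^Yh}, from the saved SYSTAT residual file
T_CRIT = 2.008559   # t(0.975; 50), from the text
lower = y_hat - T_CRIT * SEPRED
upper = y_hat + T_CRIT * SEPRED
print(round(y_hat, 3), round(lower, 3), round(upper, 3))
```

Because the coefficients are rounded to three decimals, the estimate differs from the SYSTAT value of 30.934 in the third decimal place.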
7. Prediction Interval for Yh(new)
Skip this topic, unless you are a business type.
8. Other Elements of the Multiple Regression Printout
Two additional elements of the standard regression output become relevant
in the multiple-regression context.
1. Standardized Regression Coefficients
The standardized regression coefficient $b_k^*$ is calculated
as:
$$b_k^* = b_k\,\frac{s(X_k)}{s(Y)}$$
where $s(X_k)$ and $s(Y)$ denote the sample standard deviations of
$X_k$ and Y, respectively.
Thus the standardized coefficient $b_k^*$ is calculated as the
original (unstandardized) regression coefficient $b_k$ multiplied
by the ratio of the standard deviation of $X_k$ to the standard
deviation of Y.
Conversely, one can recover the unstandardized coefficient from the
standardized one as
$$b_k = b_k^*\,\frac{s(Y)}{s(X_k)}$$
The standardized coefficient $b_k^*$ measures the change in standard
deviations of Y associated with an increase of one standard deviation of
$X_k$.
Standardized coefficients permit comparisons of the relative strength
of the effects of different independent variables, measured in different
metrics
(= units).
Example: In the results of Table 4 for the UFW data, standardized
coefficients are provided in the column headed STD COEF. One sees
that the standardized coefficient for MDINC (median county income) is
0.497, and the standardized coefficient for HOMEOWN (% home ownership)
is -0.248. Using standardized coefficients it is possible to say
that the positive effect of MDINC (0.497) on Yes vote on Proposition 14
is about twice as strong as the negative effect of HOMEOWN (-0.248) on
Yes vote. Standardization converts the original metric of each
variable into standard deviation units.
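The standardization formula can be verified on the small D-score data set, since the UFW county data are not listed in these notes. A minimal sketch, assuming the data of Table 1 and the unstandardized BOY coefficient of the simple regression (2.288):

```python
import statistics

# D-score data as listed in Table 1.
DSCORE = [8.61, 9.40, 9.86, 9.91, 10.53, 10.61,
          10.59, 13.28, 12.76, 13.44, 14.27, 14.13]
BOY = [0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1]

b_boy = 2.288   # unstandardized coefficient of BOY, Table 1

# Standardized coefficient: b * s(X) / s(Y), using sample standard deviations.
b_star = b_boy * statistics.stdev(BOY) / statistics.stdev(DSCORE)
print(round(b_star, 3))
```

The result should match the Std Coef entry for BOY in Table 1 (0.600), which in a simple regression is also the correlation between DSCORE and BOY.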
Q - Why did rich counties give such strong support to the UFW?
2. Tolerance or Variance Inflation Factor
The standard multiple regression output often provides a diagnostic measure
of the collinearity of a predictor with the other predictors in the model,
either the tolerance (TOL) or the
variance inflation factor
(VIF).
1. Tolerance (TOL)
$$TOL = 1 - R_k^2$$
where $R_k^2$ is the R-square of the regression of $X_k$
on the other p - 2 predictors in the regression and a constant. TOL
can vary between 0 and 1;
-
TOL close to 1 means that $R_k^2$ is close to 0, indicating
that $X_k$ is not highly correlated with the other predictors in
the model
-
TOL close to 0 means that $X_k$ is highly correlated with the other
predictors; one then says that $X_k$ is collinear with the
other predictors
A common rule of thumb is that
TOL < .1
is an indication that collinearity may unduly influence the results.
2. Variance Inflation Factor
$$VIF = (TOL)^{-1} = (1 - R_k^2)^{-1}$$
The variance inflation factor is the inverse of the tolerance. Large
values of VIF therefore indicate a high level of collinearity.
The corresponding rule of thumb is that
VIF > 10
is an indication that collinearity may unduly influence the results.
We will discuss the mechanism and consequences of collinearity in SOCI209.
Example: In the UFW and Proposition 14 printout (Table 4), TOL
values are given in the column headed TOLERANCE. TOL values range
from 0.729 (CARTER) down to 0.314 (LOGPOP). The smallest TOL value
is thus well above the 0.1 cutoff, so one concludes there is no collinearity
problem in this regression model.
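Both diagnostics are simple to compute. The sketch below makes two assumptions for illustration: it uses the D-score data of Table 1, where each predictor has only one other predictor so that $R_k^2$ reduces to a squared correlation, and it converts the tolerances printed in Table 4 to variance inflation factors. The helper name `r_squared_simple` is hypothetical.

```python
# D-score predictors as listed in Table 1.
AGE = [3.33, 3.25, 3.92, 3.50, 4.33, 4.92,
       6.08, 7.42, 8.33, 8.00, 9.25, 10.75]
BOY = [0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1]


def r_squared_simple(x, y):
    """Squared Pearson correlation: the R-square of a one-predictor regression."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)


# Tolerance and VIF of AGE (and of BOY) in the two-predictor D-score model.
tol = 1 - r_squared_simple(AGE, BOY)
vif = 1 / tol
print(round(tol, 3), round(vif, 3))

# Tolerances printed in Table 4, converted to variance inflation factors.
TABLE4_TOL = {"LABOR": 0.315, "CARTER": 0.729, "GINI": 0.600, "SPANORIG": 0.422,
              "MDINC": 0.431, "LOGPOP": 0.314, "HOMEOWN": 0.566}
table4_vif = {name: 1 / t for name, t in TABLE4_TOL.items()}
print({name: round(v, 2) for name, v in table4_vif.items()})
```

The tolerance computed for the D-score model should match the value 0.581 shown in Table 2, and every VIF derived from Table 4 stays well below the cutoff of 10.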
9. Multiple Regression in Practice
Instructions to do multiple regression with a variety of options are provided
in the following exhibits.
Last modified 20 Nov 2002