University of North Carolina
at Chapel Hill
SOCI 709 - LINEAR REGRESSION MODELS
- Spring 2006
Professor François Nielsen
Paper Assignment
Released Tue 21 Mar - DUE Tue 25 Apr
The purpose of this paper is to give you more
experience with regression analysis as well as an opportunity to present
and interpret data in a format similar to that used for published papers.
You do not need to do any reading of the substantive literature for this
assignment.
1. Choosing a Topic
You can choose the topic and data for your paper
in any of the following ways:
- Use your own dependent variable and a data set that you already have or
  that you can easily obtain. Using your own data set and model is highly
  encouraged as long as
  - you have a dependent variable that is at least arguably continuous
    (this might be an ordinal variable with, say, 4 categories)
  - you follow the outline of analysis below (although your data set may
    have different kinds of issues than the ones illustrated below and your
    analysis should differ accordingly)
  - you do not spend a lot of time collecting the data and agonizing about
    substantive issues (since the focus of this paper is on the mechanics
    of regression analysis, not the substantive issues)
- Do the analysis of income inequality using the world209 data set as
  outlined below
- Choose and analyze another dependent variable from the world209 data set
- Channel Blau and Duncan by using one year of the GSS data to analyze the
  determinants of occupational prestige (PRESTIGE): regress it on years of
  education (EDUC), father's occupation (PAPRES16), father's education
  (PAEDUC), cognitive ability (WORDSUM), SEX and other variables you may
  think of. Compare the achievement model for men and women, or for blacks
  and whites. You can see the beginning of such an analysis on my Social
  Stratification site at http://www.unc.edu/~nielsen/soci230/odocs/status.htm
- If you are interested in real estate, do a regression analysis of the data
  set for home prices in a mid-western town, from the point of view of a
  real estate consultant trying to discover the most important attributes
  of a house in determining sale price
- Use a dependent variable of your choice with any (sufficiently rich) data
  set that may be available through the course site or on the CD that comes
  with the textbook, with the same proviso about following the general
  outline of analysis below
2. Outline for Analysis of Income Inequality
in World Handbook Data
For the suggested topic use the world209.syd
(World Handbook of Political and Social Indicators) data set. If
you use SYSTAT, the easiest way to handle files is to copy the data set
into your own network directory and then, within SYSTAT, set up that
directory as your "project directory". For example, suppose you are
a sociology student and your name is Brigitte Bardot. Within SYSTAT
click Edit -> Options -> File location -> Project Directory, and
enter
z:\asnt1\users\sociology\students\bardot
Then you can save files of residuals or other
output without specifying a path (see below).
To simplify, we have chosen the dependent
variable for you, and also your first independent variable. Have
your dependent variable be the percent of income owned by a country's richest
5% (v152), a measure of income inequality. Have your first independent
variable be the proportion of the labor force in agriculture (v286).
Part 1 - Do the Analysis
At this stage you are doing the analysis.
You are not yet writing the paper. You only need to write notes
to yourself to keep track of your findings.
1. Run descriptive statistics (mean,
standard deviation, minimum and maximum) and produce box plots and stem
and leaf plots for the dependent and independent variables. Note (for
yourself) the following: What does the distribution of each variable look
like? Are there any outliers (extreme values)? You can use
the menus to do this.
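If you want to check the menu output by hand, the same quantities can be sketched in a few lines of Python (an assumption on my part: the course itself uses SYSTAT, and the numbers below are made up, not taken from world209). The outlier rule shown is the common 1.5 x IQR box plot convention, which may differ slightly from the hinges your package uses.

```python
import statistics

def describe(values):
    """Return mean, standard deviation, minimum and maximum of a list."""
    return {
        "mean": statistics.mean(values),
        "sd": statistics.stdev(values),  # sample SD (n - 1 denominator)
        "min": min(values),
        "max": max(values),
    }

def boxplot_outliers(values):
    """Flag values lying more than 1.5 * IQR beyond the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < low or v > high]

# Made-up illustrative values, NOT taken from world209.
toy = [12.0, 14.5, 15.0, 16.2, 17.1, 18.3, 19.0, 20.4, 45.0]
summary = describe(toy)
outliers = boxplot_outliers(toy)
print(summary)
print("possible outliers:", outliers)
```

A value flagged this way is only a candidate for closer inspection; whether to keep it is a judgment you make later in the analysis.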
2. Regress v152 on v286 and save the
residuals. Is the relationship positive or negative? Interpret
the regression coefficient. What does the R² indicate about
the magnitude of this relationship? Is the relationship significant
and if so at what level?
SYSTAT
commands (you can replace myresid with a more imaginative
file name):
mglh
model v152=constant+v286
save myresid
estimate
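To see what the regression output is made of, here is a minimal Python sketch of a bivariate least-squares fit (Python is my substitution for illustration, not the course software; the x and y values are invented and are not v286 and v152). It returns the slope, intercept, R², the t ratio of the slope, and the residuals you would save.

```python
import math

def simple_ols(x, y):
    """Least-squares fit of y = a + b*x; returns basic regression results."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx                      # slope
    a = my - b * mx                    # intercept
    resid = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    sse = sum(e ** 2 for e in resid)   # residual sum of squares
    sst = sum((yi - my) ** 2 for yi in y)
    se_b = math.sqrt(sse / (n - 2) / sxx)  # standard error of the slope
    return {"intercept": a, "slope": b, "r2": 1 - sse / sst,
            "t_slope": b / se_b, "residuals": resid}

# Invented toy data, NOT the world209 variables.
x = [10, 20, 30, 40, 50, 60]
y = [14, 17, 19, 24, 25, 30]
fit = simple_ols(x, y)
print(fit["intercept"], fit["slope"], fit["r2"], fit["t_slope"])
```

The slope is read as the expected change in the dependent variable for a one-unit increase in the independent variable; the t ratio is compared with a t distribution on n - 2 degrees of freedom to judge significance.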
3. Produce a box plot, stem and leaf, and
a normal probability plot of the residuals. Also look at the plot
of residuals against estimate that SYSTAT produces automatically.
What do these indicate? Does the relationship appear linear?
If not, do an appropriate transformation.
SYSTAT
commands:
stats
box residual
stem residual
pplot residual
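For those curious about what a normal probability plot actually plots, the sketch below (Python, used here only for illustration; the residuals are invented) pairs each sorted residual with the standard normal quantile expected at its rank. The plotting positions (i - 0.5)/n are one common convention among several, so your package may use slightly different ones.

```python
from statistics import NormalDist, mean

def normal_plot_points(residuals):
    """Pair each sorted residual with its expected standard normal quantile."""
    n = len(residuals)
    ordered = sorted(residuals)
    expected = [NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(expected, ordered))

def pearson(pairs):
    """Pearson correlation of a list of (x, y) pairs."""
    xs, ys = zip(*pairs)
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

# Invented residuals for illustration only.
toy_resid = [-2.1, 0.4, 1.3, -0.7, 0.2, -1.0, 1.9]
points = normal_plot_points(toy_resid)
straightness = pearson(points)
for expected, observed in points:
    print(f"{expected:7.3f}  {observed:6.2f}")
print("correlation of the plotted points:", round(straightness, 3))
```

Points falling close to a straight line (a correlation near 1) are consistent with roughly normal residuals; systematic curvature or a few stray points at the ends suggest skewness or outliers.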
4. Look at the list of the variables available
in the world209.syd file. (The list of variables in the World
Handbook is linked as world209.htm under Data Sets in the side-bar.
The variables included in world209.syd are marked with an asterisk.)
What are some other variables that might account for the bivariate relationship
between proportion of labor force in agriculture and income inequality?
Select 2 to 5 additional variables to include in the model. Choose
them from those available in the data set.
5. Run descriptive statistics and plots
(box and stem and leaf) for all the variables. Any outliers?
Create a correlation matrix of all of your variables and a splom (or let
SYSTAT produce the splom automatically when you produce the correlations).
Look at the relationships between the dependent variable v152 and all the
potential independent variables in the splom. Are these relationships
linear? If you suspect nonlinearity, redo the splom specifying smooth=lowess
to see if a curvilinear relationship appears. If you find (a) curvilinear
relationship(s), think of the possibility of transforming the variable(s)
or using polynomial regression to model the curvilinearity.
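As a rough illustration of the polynomial option, the following Python/NumPy sketch (not SYSTAT, and with invented data that are curved by construction) compares the R² of a straight-line fit with that of a fit that adds the squared term.

```python
import numpy as np

def r_squared(y, fitted):
    """Proportion of the variance of y accounted for by the fitted values."""
    sse = np.sum((y - fitted) ** 2)
    sst = np.sum((y - y.mean()) ** 2)
    return 1 - sse / sst

# Invented toy data, NOT from world209; y rises and then falls with x.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([2.1, 3.9, 5.2, 6.1, 6.4, 6.2, 5.4, 4.1])

r = np.corrcoef(x, y)[0, 1]                       # simple correlation
linear = np.polyval(np.polyfit(x, y, 1), x)       # y = a + b*x
quadratic = np.polyval(np.polyfit(x, y, 2), x)    # y = a + b*x + c*x^2
r2_linear = r_squared(y, linear)
r2_quadratic = r_squared(y, quadratic)
print(f"r = {r:.3f}, linear R2 = {r2_linear:.3f}, quadratic R2 = {r2_quadratic:.3f}")
```

A large jump in R² when the squared term enters is a sign that the straight line was missing a curve; a transformation of the variable (a log, say) is the other route mentioned above.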
6. At this point if I (Nielsen) ask
myself "how would I do it" I realize that what I would actually do is estimate
first the full model, with all the independent variables I have selected,
and do a comprehensive analysis of potential problems with that regression
(outliers, collinearity, heteroskedasticity, etc.). I would decide
how to deal with any outliers and influential cases at this stage
too. Then, when the data set is "cleaned up" (for instance, if I
have decided to exclude deviant observations) I would redo all the regressions
starting with the first model, and introducing additional variables in
turn. I would not redo the whole diagnostic shebang for each one
of the intermediate regression models.
7. Estimate several multiple regression
models in which you successively add the additional independent variables,
singly or in groups of substantively related variables (such as a polynomial
function consisting of an independent variable and its square, or a set
of indicators representing the same categorical variable, or variables
pertaining to the same substantive area, such as "labor force composition"
variables). [Save the residuals of each regression - but see 6.]
Note (for yourself) the following: What does the R² indicate
about the fit of the model? Is the overall model significant?
Are individual coefficients statistically significant, and if so at what
level? Interpret the regression coefficient for each additional independent
variable. Is the relationship positive or negative? What does
the coefficient mean in terms of the effect of the variable on income inequality?
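A sequence of nested models of this kind might be sketched as follows in Python with NumPy (my choice for illustration, not the course software). The data are simulated and the names x1, x2, x3 are hypothetical placeholders, not world209 variables.

```python
import numpy as np

def fit_ols(y, predictors):
    """OLS of y on a constant plus the given list of predictor arrays."""
    X = np.column_stack([np.ones(len(y))] + list(predictors))
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    n, p = X.shape
    sse = float(resid @ resid)
    sst = float(np.sum((y - y.mean()) ** 2))
    r2 = 1 - sse / sst
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p)
    se = np.sqrt(np.diag(sse / (n - p) * np.linalg.inv(X.T @ X)))
    return {"beta": beta, "se": se, "t": beta / se, "r2": r2, "adj_r2": adj_r2}

# Simulated data, NOT world209; x3 is unrelated to y by construction.
rng = np.random.default_rng(0)
n = 60
x1 = rng.uniform(0, 80, n)
x2 = rng.normal(10, 3, n)
x3 = rng.normal(0, 1, n)
y = 10 + 0.2 * x1 - 0.5 * x2 + rng.normal(0, 2, n)

models = [[x1], [x1, x2], [x1, x2, x3]]   # each model adds one variable
results = [fit_ols(y, cols) for cols in models]
for k, res in enumerate(results, start=1):
    print(f"Model {k}: R2 = {res['r2']:.3f}, adj R2 = {res['adj_r2']:.3f}")
    print("   b:", np.round(res["beta"], 3), " t:", np.round(res["t"], 2))
```

Note how R² can only stay the same or rise as variables are added, which is why adjusted R² and the t ratios of the new coefficients are the more informative things to watch from one column to the next.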
8. In any of the models, is there evidence
of collinearity? How can you tell? If so, find a way to deal
with it.
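One common way to "tell" is the variance inflation factor (VIF), the reciprocal of the tolerance: regress each independent variable on all the others and compute 1 / (1 - R²). The Python/NumPy sketch below is an illustration under my own assumptions (simulated data, hypothetical variables a, b, c), not a prescribed procedure for this assignment.

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of X (X has no constant column)."""
    n, k = X.shape
    out = []
    for j in range(k):
        target = X[:, j]
        # Regress column j on a constant and all the other columns.
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ beta
        r2 = 1 - resid @ resid / np.sum((target - target.mean()) ** 2)
        out.append(1 / (1 - r2))
    return np.array(out)

# Simulated data: b is almost a copy of a, c is unrelated to both.
rng = np.random.default_rng(1)
a = rng.normal(size=50)
b = a + 0.1 * rng.normal(size=50)
c = rng.normal(size=50)
vifs = vif(np.column_stack([a, b, c]))
print(np.round(vifs, 1))
```

A frequently quoted rule of thumb treats a VIF above about 10 (tolerance below 0.1) as a warning sign, but it is only a rule of thumb; unstable coefficients and inflated standard errors when a variable enters the model tell the same story.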
9. [But see 6. You can do the
whole treatment only for the full model, and estimate the intermediate
models afterwards.] For each model look at the scatter plot of the
residuals against the estimate that SYSTAT produces automatically with
the regression. Produce a box plot, stem and leaf, and normal probability
plot of the residuals. Look at the diagnostics for outliers and influential
cases. What do they indicate? Do you find evidence of other
problems (such as heteroskedasticity, outliers and/or influential cases,
nonlinear relationships)? How can you tell? If so, decide whether
you need to take corrective action.
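The usual diagnostics for outliers and influential cases (leverage, studentized residuals, Cook's distance) can be computed from the hat matrix. The sketch below uses Python with NumPy as a stand-in for your statistical package, on invented data in which the last case is deliberately placed far from the others.

```python
import numpy as np

def influence_measures(X, y):
    """Leverage, internally studentized residuals and Cook's D.

    X must already include the constant column.
    """
    n, p = X.shape
    H = X @ np.linalg.inv(X.T @ X) @ X.T          # hat matrix
    leverage = np.diag(H)
    resid = y - H @ y
    mse = resid @ resid / (n - p)
    student = resid / np.sqrt(mse * (1 - leverage))
    cooks_d = student ** 2 * leverage / (p * (1 - leverage))
    return leverage, student, cooks_d

# Invented data: nine cases near a line, plus one far-out case (x = 25).
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 25], dtype=float)
y = np.array([2.6, 2.9, 3.6, 3.9, 4.6, 4.9, 5.6, 5.9, 6.6, 2.0])
X = np.column_stack([np.ones(len(x)), x])
leverage, student, cooks_d = influence_measures(X, y)
for i in range(len(x)):
    print(f"case {i}: h = {leverage[i]:.3f}, t = {student[i]:6.2f}, D = {cooks_d[i]:7.3f}")
```

In this toy example the last case combines high leverage with a large residual, so its Cook's distance dwarfs the others. Deleting it would change the sign of the slope, which is exactly the kind of case you need to notice and make a decision about.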
Part 2 - Write Up the Results
At this stage you are writing the paper as if
you were going to submit it to a journal for publication (although in a
somewhat simplified form for purposes of this assignment).
Prepare your results in standard publication-style
tables. You will need to prepare two tables.
- Table 1 contains the correlation matrix of all the variables, with the
  basic statistics for all variables (mean, standard deviation, minimum
  value, maximum value). The basic statistics are often combined with the
  correlations as the last row of the correlation matrix. See the following
  published examples for models (I prefer the second example because it has
  the minimum and maximum values of the variables).
  Exhibit: model for Table 1 (from Firebaugh, Glenn. 1983. ASR 48:263.)
  Exhibit: another model for Table 1 (from Nielsen, François. 1994. ASR 59:668)
- Table 2 contains the regression results. The leftmost column is the
  regression of v152 on v286. Successive columns show the regressions of
  v152 on v286 and additional variables that you add singly or in groups (if
  the group makes sense substantively). The last column contains the most
  inclusive model, or a trimmed model if you decide to drop variables that
  are nonsignificant in the most inclusive model (in which case, you know
  what test you need to carry out, right?).
See module 5, Yule data set example, for a model
of Table 2 and additional recommendations on how to present regression
results. It is an extremely good
idea to prepare these two tables before you start writing up
the results, and then write the results "around" (referring to) the tables.
This way of doing it will save you hours of confusion, believe us!
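For the trimmed model in the last column of Table 2, the test in question is an F test of the joint significance of the dropped variables, comparing the residual sums of squares of the full and trimmed models. Here is a minimal Python/NumPy sketch of that computation, on simulated data with hypothetical variable names (it computes the F ratio and its degrees of freedom; the p-value is left to your statistical package or an F table).

```python
import numpy as np

def sse_of(y, X):
    """Residual sum of squares from the OLS regression of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return float(resid @ resid)

def nested_f(y, X_full, X_trimmed):
    """F ratio for H0: the coefficients of the dropped columns are all zero."""
    n, p_full = X_full.shape
    q = p_full - X_trimmed.shape[1]          # number of variables dropped
    sse_full = sse_of(y, X_full)
    sse_trim = sse_of(y, X_trimmed)
    f = ((sse_trim - sse_full) / q) / (sse_full / (n - p_full))
    return f, q, n - p_full

# Simulated data, NOT world209: y depends on x1 only.
rng = np.random.default_rng(2)
n = 50
x1, x2, x3 = rng.normal(size=(3, n))
y = 1.0 + 0.8 * x1 + rng.normal(size=n)
const = np.ones(n)
X_full = np.column_stack([const, x1, x2, x3])
X_trimmed = np.column_stack([const, x1])     # x2 and x3 dropped
f, df1, df2 = nested_f(y, X_full, X_trimmed)
print(f"F({df1}, {df2}) = {f:.3f}")
```

A small F ratio (nonsignificant at the chosen level) means the dropped variables do not jointly improve the fit, which justifies reporting the trimmed model.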
Write the paper using the following sections.
1. Introduction - Explain the
dependent variable and the independent variable, and what relationship
you expect to find between the proportion of the labor force in agriculture
and income inequality. Then list the other independent variables
you have selected and explain why you have selected them, and how you expect
these variables to be related to income inequality and/or how you expect
the inclusion of these variables to affect the original relationship between
income inequality and labor force in agriculture. (This may be very
short. In a real paper this is where you would place a review of
the literature, summarize relevant debates in the field, etc.).
2. Data - Explain how the variables
are measured, including the units (from the variable lists). Insert
Table 1 in this section and refer to Table 1 to discuss any unusual feature
of the data, such as very high correlations between independent variables.
If you have had to transform one or several variables, explain what transformation(s)
you have used and why (e.g., to model a nonlinear relationship, etc.).
This section too may be very short -- if you haven't done anything too
perverse with your data.
3. Methods - Explain that you
will be using multiple regression analysis and what auxiliary analyses
you have performed. List the diagnostic tests you have carried out,
and any corrective action that you have taken. For example, if you
have thrown out some influential cases, explain who they are and why you
have dealt with them that way. (If you have had to use the Zarathustra-KR2
algorithm and the Buzz-Lightyear correction to treat an extremely exotic
pathology in your data, this is where you explain that too.)
4. Results - Discuss the
columns of Table 2 in turn. For each model, describe the effect(s)
of the newly introduced variable(s), their significance, direction, and
how introducing the new variables does or does not affect the coefficients
of variables that are already in the model. Finally discuss the most
inclusive model, or the trimmed model if you estimated one (in which case
report the result of the test of joint significance for the variables you
dropped).
5. Conclusion - Summarize the
major highlights of your analysis of income inequality. What further
research should be done to extend your results (i.e., what variables that
are not available in the file would you like to add to the model, what additional
data should be collected)?
Turn in additional output as
an appendix only if you think it is relevant. This may include
graphs or auxiliary analyses that document a particular problem with the
data. You do not need to provide copies of all the analyses
you carried out.
Last modified 20 Mar 2006