University of North Carolina
at Chapel Hill

SOCI 709 - LINEAR REGRESSION MODELS - Spring 2006
Professor François Nielsen

Paper Assignment

Released Tue 21 Mar - DUE Tue 25 Apr

The purpose of this paper is to give you more experience with regression analysis as well as an opportunity to present and interpret data in a format similar to that used for published papers.  You do not need to do any reading of the substantive literature for this assignment.

1.  Choosing a Topic

You can choose the topic and data for your paper in any of the following ways:
  1. Use your own dependent variable and a data set that you already have or that you can easily obtain.  Using your own data set and model is highly encouraged as long as
    1.  
    2. you have a dependent variable that is at least arguably continuous (this might be an ordinal variable with, say, 4 categories)
    3. you follow the outline of analysis below (although your data set may have different kinds of issues than the ones illustrated below and your analysis should differ accordingly)
    4. you do not spend a lot of time collecting the data and agonizing about substantive issues (since the focus of this paper is on the mechanics of regression analysis, not the substantive issues)
  2. Do the analysis of income inequality using the world209 data set as outlined below
  3. Choose and analyze another dependent variable from the world209 data set
  4. Channel Blau and Duncan by using one year of the GSS data to analyze the determinants of occupational prestige (PRESTIGE), regressing it on years of education (EDUC), father's occupation (PAPRES16), father's education (PAEDUC), cognitive ability (WORDSUM), SEX and other variables you may think of.  Compare the achievement model for men and women, or for blacks and whites.  You can see the beginning of such an analysis on my Social Stratification site at http://www.unc.edu/~nielsen/soci230/odocs/status.htm
  5. If you are interested in real estate, do a regression analysis of the data set for home prices in a mid-western town, from the point of view of a real estate consultant trying to discover the most important attributes of a house in determining sale price
  6. Use a dependent variable of your choice with any (sufficiently rich) data set that may be available through the course site or on the CD that comes with the textbook, with the same proviso about following the general outline of analysis below

2.  Outline for Analysis of Income Inequality in World Handbook Data

For the suggested topic, use the world209.syd (World Handbook of Political and Social Indicators) data set.  If you use SYSTAT, the easiest way to handle files is to copy the data set into your own network directory and then, within SYSTAT, set up that directory as your "project directory".  For example, suppose you are a sociology student and your name is Brigitte Bardot.  Within SYSTAT click Edit -> Options -> File location -> Project Directory, and enter
z:\asnt1\users\sociology\students\bardot
Then you can save files of residuals or other output without specifying a path (see below).
To simplify, we have chosen the dependent variable for you, and also your first independent variable.  Have your dependent variable be the percent of income owned by a country's richest 5% (v152), a measure of income inequality.  Have your first independent variable be the proportion of the labor force in agriculture (v286).

Part 1 - Do the Analysis

At this stage you are doing the analysis.  You are not yet writing the paper.   You only need to write notes to yourself to keep track of your findings.
1.  Run descriptive statistics (mean, standard deviation, minimum and maximum) and produce box plots and stem and leaf plots for the dependent and independent variables.  Note (for yourself) the following: What does the distribution of each variable look like?  Are there any outliers (extreme values)?  You can use the menus to do this.
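
For readers who want to check the numbers outside SYSTAT, the summary statistics and a box-plot style outlier check in step 1 can be sketched in Python.  This is only an illustration: the values are made up (not taken from world209), the function names are hypothetical, and the 1.5 x IQR fences are a common convention that may differ slightly from the hinges SYSTAT uses.

```python
import statistics

# Hypothetical toy values standing in for one variable (NOT from world209).
values = [12.0, 15.5, 14.2, 13.8, 16.1, 15.0, 14.7, 38.0, 13.3, 15.9]

def describe(xs):
    """Return the summary statistics requested in step 1."""
    return {
        "n": len(xs),
        "mean": statistics.mean(xs),
        "sd": statistics.stdev(xs),  # sample standard deviation (n - 1)
        "min": min(xs),
        "max": max(xs),
    }

def boxplot_outliers(xs):
    """Flag values lying more than 1.5 * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(xs, n=4)
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in xs if x < low_fence or x > high_fence]

summary = describe(values)
outliers = boxplot_outliers(values)
print(summary)
print("possible outliers:", outliers)
```

A flagged value is only a candidate for closer inspection, not automatically a case to drop.
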
2.  Regress v152 on v286 and save the residuals.  Is the relationship positive or negative?  Interpret the regression coefficient.  What does the R² indicate about the magnitude of this relationship?  Is the relationship significant and if so at what level?
    SYSTAT commands (you can replace myresid with a more imaginative file name):
      mglh
      model v152=constant+v286
      save myresid
      estimate
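
To make the quantities asked about in step 2 concrete, here is a minimal Python sketch of a bivariate least-squares fit.  It is not a substitute for the SYSTAT run: the data are made up (not from world209) and the function name simple_ols is hypothetical.

```python
import math

# Hypothetical toy data (NOT from world209): x stands in for v286, y for v152.
x = [10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0]
y = [14.0, 15.5, 18.0, 19.0, 22.5, 23.0, 26.0, 27.5]

def simple_ols(x, y):
    """Least-squares fit of y = a + b*x; returns coefficients, R^2 and residuals."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    sse = sum(e ** 2 for e in residuals)        # unexplained sum of squares
    sst = sum((yi - mean_y) ** 2 for yi in y)   # total sum of squares
    se_slope = math.sqrt(sse / (n - 2) / sxx)   # standard error of the slope
    return {
        "intercept": intercept,
        "slope": slope,
        "r2": 1 - sse / sst,
        "t_slope": slope / se_slope,
        "residuals": residuals,
    }

fit = simple_ols(x, y)
print(f"slope={fit['slope']:.3f}  R2={fit['r2']:.3f}  t={fit['t_slope']:.2f}")
```

The sign of the slope answers the "positive or negative" question, R² is the share of variance in y accounted for by x, and the t ratio is what you compare against a t distribution with n - 2 degrees of freedom.
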
3.  Produce a box plot, stem and leaf, and a normal probability plot of the residuals.  Also look at the plot of residuals against estimate that SYSTAT produces automatically.  What do these indicate?  Does the relationship appear linear?  If not, do an appropriate transformation.
    SYSTAT commands:
      stats
      box residual
      stem residual
      pplot residual
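
The idea behind the normal probability plot in step 3 can be sketched as follows: sort the residuals and pair each with the standard normal quantile expected for its rank.  The residuals below are made up, and the plotting position (i - 0.5) / n is an assumption; SYSTAT may use a slightly different formula.

```python
from statistics import NormalDist

# Hypothetical residuals (NOT produced from world209).
residuals = [-2.1, 0.4, 1.3, -0.7, 0.2, 3.9, -1.2, 0.1, -0.5, -1.4]

def normal_plot_points(resid):
    """Pair each sorted residual with its expected standard normal quantile."""
    n = len(resid)
    ordered = sorted(resid)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    expected = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(expected, ordered))

points = normal_plot_points(residuals)
for z, e in points:
    print(f"{z:6.2f}  {e:6.2f}")
```

If the residuals are roughly normal, the pairs fall close to a straight line; points that bend away at either end suggest skewness or outliers.
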
4.  Look at the list of the variables available in the world209.syd file.  (The list of variables in the World Handbook is linked as world209.htm under Data Sets in the side-bar.  The variables included in world209.syd are marked with an asterisk.)  What are some other variables that might account for the bivariate relationship between proportion of labor force in agriculture and income inequality?  Select 2 to 5 additional variables to include in the model.  Choose them from those available in the data set.

5.  Run descriptive statistics and plots (box and stem and leaf) for all the variables.  Any outliers?  Create a correlation matrix of all of your variables and a splom (or let SYSTAT produce the splom automatically when you produce the correlations).  Look at the relationships between the dependent variable v152 and all the potential independent variables in the splom.  Are these relationships linear?  If you suspect nonlinearity, redo the splom specifying smooth=lowess to see if a curvilinear relationship appears.  If you find (a) curvilinear relationship(s), think of the possibility of transforming the variable(s) or using polynomial regression to model the curvilinearity.
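
As an illustration of what the correlation matrix in step 5 contains, here is a short Python sketch that computes pairwise Pearson correlations.  The columns and their names are placeholders with made-up values, not variables from world209.

```python
import math

# Hypothetical toy columns (NOT from world209); names are placeholders.
data = {
    "y":  [14.0, 15.5, 18.0, 19.0, 22.5, 23.0, 26.0, 27.5],
    "x1": [10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0],
    "x2": [3.0, 1.0, 4.0, 1.5, 5.0, 9.0, 2.0, 6.0],
}

def pearson(a, b):
    """Pearson correlation between two equal-length lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    sab = sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b))
    saa = sum((ai - mean_a) ** 2 for ai in a)
    sbb = sum((bi - mean_b) ** 2 for bi in b)
    return sab / math.sqrt(saa * sbb)

def correlation_matrix(columns):
    """Nested dict of correlations for every pair of columns."""
    names = list(columns)
    return {r: {c: pearson(columns[r], columns[c]) for c in names} for r in names}

corr = correlation_matrix(data)
for name, row in corr.items():
    print(name.ljust(3), "  ".join(f"{v:6.3f}" for v in row.values()))
```

Keep in mind that a correlation only measures linear association, which is why the text asks you to look at the splom as well.
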

6.  At this point if I (Nielsen) ask myself "how would I do it" I realize that what I would actually do is estimate first the full model, with all the independent variables I have selected, and do a comprehensive analysis of potential problems with that regression (outliers, collinearity, heteroskedasticity, etc.).  I would decide how to deal with any outliers and influential cases at this stage too.  Then, when the data set is "cleaned up" (for instance, if I have decided to exclude deviant observations) I would redo all the regressions starting with the first model, and introducing additional variables in turn.  I would not redo the whole diagnostic shebang for each one of the intermediate regression models.

7.  Estimate several multiple regression models in which you successively add the additional independent variables, singly or in groups of substantively related variables (such as a polynomial function consisting of an independent variable and its square, or a set of indicators representing the same categorical variable, or variables pertaining to the same substantive area, such as "labor force composition" variables).  [Save the residuals of each regression - but see 6.]  Note (for yourself) the following:  What does the R² indicate about the fit of the model?  Is the overall model significant?  Are individual coefficients statistically significant, and if so at what level?  Interpret the regression coefficient for each additional independent variable.  Is the relationship positive or negative?  What does the coefficient mean in terms of the effect of the variable on income inequality?
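
When variables enter (or are dropped) as a group, their joint contribution can be assessed by comparing the error sums of squares of the smaller and larger models with an F statistic.  The Python sketch below illustrates that calculation under stated assumptions: the data are made up (not from world209), all function names are hypothetical, and the small linear solver is there only to keep the example self-contained.

```python
# Hypothetical toy data (NOT from world209); names are placeholders.
y  = [6.2, 6.4, 10.9, 8.3, 14.1, 11.8, 16.3, 15.9, 20.8, 19.7]
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
x2 = [5.0, 3.0, 8.0, 1.0, 9.0, 2.0, 7.0, 4.0, 10.0, 6.0]
x3 = [1.0, 0.0, 1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.0]

def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    z = [0.0] * n
    for i in range(n - 1, -1, -1):
        z[i] = (M[i][n] - sum(M[i][j] * z[j] for j in range(i + 1, n))) / M[i][i]
    return z

def fit_sse(y, predictors):
    """Regress y on a constant plus predictors; return (SSE, number of parameters)."""
    n = len(y)
    X = [[1.0] + [col[i] for col in predictors] for i in range(n)]
    k = len(X[0])
    XtX = [[sum(X[i][a] * X[i][b] for i in range(n)) for b in range(k)] for a in range(k)]
    Xty = [sum(X[i][a] * y[i] for i in range(n)) for a in range(k)]
    beta = solve(XtX, Xty)
    fitted = [sum(bj * xj for bj, xj in zip(beta, row)) for row in X]
    return sum((yi - fi) ** 2 for yi, fi in zip(y, fitted)), k

def nested_f(y, base, added):
    """F statistic for adding the 'added' predictors to the 'base' model."""
    sse_reduced, _ = fit_sse(y, base)
    sse_full, k_full = fit_sse(y, base + added)
    df1, df2 = len(added), len(y) - k_full
    f_stat = ((sse_reduced - sse_full) / df1) / (sse_full / df2)
    return f_stat, df1, df2

f_stat, df1, df2 = nested_f(y, [x1], [x2, x3])
print(f"F({df1}, {df2}) = {f_stat:.2f}")
```

The resulting F is compared with the critical value of an F distribution with (df1, df2) degrees of freedom; the Python standard library has no F distribution, so the sketch stops at the statistic.
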

8.  In any of the models, is there evidence of collinearity?  How can you tell?  If so, find a way to deal with it.
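
One common way to quantify collinearity is the variance inflation factor (VIF): regress each independent variable on all the others and compute 1 / (1 - R²).  The following Python sketch illustrates the calculation; the data are made up (not from world209), x2 is deliberately constructed to track x1 closely, and all function names are hypothetical.

```python
# Hypothetical toy predictors (NOT from world209); x2 nearly duplicates x1.
predictors = {
    "x1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
    "x2": [1.1, 2.3, 2.9, 4.2, 4.8, 6.1, 7.2, 7.9],
    "x3": [5.0, 1.0, 4.0, 2.0, 8.0, 3.0, 7.0, 6.0],
}

def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    z = [0.0] * n
    for i in range(n - 1, -1, -1):
        z[i] = (M[i][n] - sum(M[i][j] * z[j] for j in range(i + 1, n))) / M[i][i]
    return z

def r_squared(y, others):
    """R^2 from regressing y on a constant plus the columns in 'others'."""
    n = len(y)
    X = [[1.0] + [col[i] for col in others] for i in range(n)]
    k = len(X[0])
    XtX = [[sum(X[i][a] * X[i][b] for i in range(n)) for b in range(k)] for a in range(k)]
    Xty = [sum(X[i][a] * y[i] for i in range(n)) for a in range(k)]
    beta = solve(XtX, Xty)
    fitted = [sum(bj * xj for bj, xj in zip(beta, row)) for row in X]
    mean_y = sum(y) / n
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    sst = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - sse / sst

def vif(columns):
    """Variance inflation factor for each predictor given all the others."""
    result = {}
    for name, col in columns.items():
        others = [c for other, c in columns.items() if other != name]
        result[name] = 1.0 / (1.0 - r_squared(col, others))
    return result

vifs = vif(predictors)
for name, value in vifs.items():
    print(f"{name}: VIF = {value:.1f}")
```

A frequently quoted rule of thumb treats a VIF above about 10 as a warning sign, but it is a convention rather than a test; here x1 and x2 should show very large values while x3 stays small.
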

9.  [But see 6.  You can do the whole treatment only for the full model.  And estimate the intermediate models afterwards.]  For each model look at the scatter plot of the residuals against the estimate that SYSTAT produces automatically with the regression.  Produce a box plot, stem and leaf, and normal probability plot of the residuals.  Look at the diagnostics for outliers and influential cases.  What do they indicate?  Do you find evidence of other problems (such as heteroskedasticity, outliers and/or influential cases, nonlinear relationships)?  How can you tell?  If so, decide whether you need to take corrective action.
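
To show what the diagnostics for outliers and influential cases measure, here is a Python sketch for the simplest case of one independent variable (two parameters).  It computes leverage, internally studentized residuals and Cook's distance.  The data are made up (not from world209), with the last case deliberately extreme, and the function name influence is hypothetical; the full model in the assignment has more predictors, for which SYSTAT's saved diagnostics are the practical route.

```python
import math

# Hypothetical toy data (NOT from world209); the last case is deliberately extreme.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 20.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 20.0]

def influence(x, y):
    """Leverage, studentized residuals and Cook's distance for y = a + b*x."""
    n, p = len(x), 2  # p = number of estimated parameters
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sxx
    intercept = mean_y - slope * mean_x
    resid = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    s2 = sum(e ** 2 for e in resid) / (n - p)  # residual variance
    leverage = [1 / n + (xi - mean_x) ** 2 / sxx for xi in x]
    student = [e / math.sqrt(s2 * (1 - h)) for e, h in zip(resid, leverage)]
    cooks = [r ** 2 * h / (p * (1 - h)) for r, h in zip(student, leverage)]
    return leverage, student, cooks

leverage, student, cooks = influence(x, y)
for i, (h, r, d) in enumerate(zip(leverage, student, cooks), start=1):
    print(f"case {i}: leverage={h:.3f}  student={r:6.2f}  cook={d:.3f}")
```

Leverage depends only on how unusual a case is on x, while Cook's distance combines leverage with the size of the residual, so a case can have a modest residual and still dominate the fit.
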

Part 2 - Write Up the Results

At this stage you are writing the paper as if you were going to submit it to a journal for publication (although in a somewhat simplified form for purposes of this assignment).

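Prepare your results in standard publication-style tables.  You will need to prepare two tables: Table 1, which goes in the Data section, and Table 2, which is discussed column by column in the Results section (see the section descriptions below).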

See module 5, Yule data set example, for a model of Table 2 and additional recommendations on how to present regression results.  It is an extremely good idea to prepare these two tables before you start writing up the results, and then write the results "around" (referring to) the tables.  This way of doing it will save you hours of confusion, believe us!

Write the paper using the following sections.

1.  Introduction - Explain the dependent variable and the independent variable, and what relationship you expect to find between the proportion of the labor force in agriculture and income inequality.  Then list the other independent variables you have selected and explain why you have selected them, and how you expect these variables to be related to income inequality and/or how you expect the inclusion of these variables to affect the original relationship between income inequality and labor force in agriculture.  (This may be very short.  In a real paper this is where you would place a review of the literature, summarize relevant debates in the field, etc.).

2.  Data - Explain how the variables are measured, including the units (from the variable lists).  Insert Table 1 in this section and refer to Table 1 to discuss any unusual feature of the data, such as very high correlations between independent variables.  If you have had to transform one or several variables, explain what transformation(s) you have used and why (e.g., to model a nonlinear relationship, etc.).  This section too may be very short -- if you haven't done anything too perverse with your data.

3.  Methods - Explain that you will be using multiple regression analysis and what auxiliary analyses you have performed.  List the diagnostic tests you have carried out, and any corrective action that you have taken.  For example, if you have thrown out some influential cases, explain who they are and why you have dealt with them that way.  (If you have had to use the Zarathustra-KR2 algorithm and the Buzz-Lightyear correction to treat an extremely exotic pathology in your data, this is where you explain that too.)

4.  Results -  Discuss the columns of Table 2 in turn.  For each model, describe the effect(s) of the newly introduced variable(s), their significance, direction, and how introducing the new variables does or does not affect the coefficients of variables that are already in the model.  Finally discuss the most inclusive model, or the trimmed model if you estimated one (in which case report the result of the test of joint significance for the variables you dropped).

5.  Conclusion - Summarize the major highlights of your analysis of income inequality.  What further research should be done to extend your results (i.e., what variables that are not available in the file would you like to add to the model, what additional data should be collected)?

Turn in additional output as an appendix only if you think it is relevant.  This may include graphs or auxiliary analyses that document a particular problem with the data.  You do not need to provide copies of all the analyses you carried out.



Last modified 20 Mar 2006