Sociology 709

 

Problem set #9

 

ps9.dta is a data set of hypothetical math test scores from a survey of schools.  Students were sampled within schools, so the data are “clustered” at the school level.  In addition, students were sampled at different rates by gender: 100% of gender A students and 40% of gender B students were sampled.  What effect do the sampling and clustering have on the mean of math scores and on regression estimates of the effects of SES and gender on math scores?

 

. des

 

Contains data from ps9.dta

  obs:         1,400                         

 vars:             5                          4 Apr 2007 20:20

 size:        33,600 (99.7% of memory free)

-------------------------------------------------------------------------------

              storage  display     value

variable name   type   format      label      variable label

-------------------------------------------------------------------------------

schid           float  %9.0g                  school id

gender          float  %9.0g                  gender a=0, b=1

ses             float  %9.0g                  parental soc-econ status

math            float  %9.0g                  math test score

p               float  %9.0g                  sampling probability

-------------------------------------------------------------------------------

Sorted by: 

 

 

1.  Effect of sampling and clustering on means

In Stata,

a.  create a weight variable, wgt,  that is the inverse of the sampling probability

b.  use the command svyset to set the data for survey analysis

the general syntax is

svyset psuid [pw=wgt_var], fpc(varname)

where psuid is the cluster variable and wgt_var is the variable holding the sampling weights

* note: ignore the fpc option for this homework.

c.  use the command “mean x” to find the mean of the math scores.  Find the mean under three conditions: (1) ignoring weights and clusters, (2) with pweights only, (3) with pweights and clusters.  After running the third condition, inspect the design effects (deff’s).

Syntax example:

mean x

mean x [pw=wgt_var]

svy: mean x

estat eff, deff deft

d.  Explain what is going on with the mean and the standard error of the mean under these three conditions.  Why does using the pweights give us the right mean but the wrong standard error?
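
As a minimal sketch of how steps a through c might look with this data set's variable names filled in (schid, math, and p come from the describe output above; the assumption that ps9.dta sits in the current working directory is ours):

```stata
* Sketch only: assumes ps9.dta is in the current working directory
use ps9, clear

* (a) sampling weight = inverse of the sampling probability p
generate wgt = 1/p

* (b) declare schools as the clusters and wgt as the pweight (no fpc)
svyset schid [pw=wgt]

* (c) the mean of math under the three conditions
mean math                  // 1. ignores weights and clusters
mean math [pw=wgt]         // 2. pweights only
svy: mean math             // 3. pweights and clusters
estat effects, deff deft   // design effects for condition 3
```

Run this from a do-file rather than line by line, since the // comments are not recognized when typed interactively.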

 

2.  Understanding design effects and intra-class correlation.

a.  figure out how many clusters (schools) there are in the data.

b.  figure out how many students were sampled per school

(tab schid)

c.  based on the formula we derived in class, what is the intra-class correlation for mathematics scores in this example? (note: the formula may not be precisely correct in this case because of the sampling design, but it is close enough to use here)  Interpret the meaning of the intra-class correlation.
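
One way to set up part c, assuming the formula derived in class is the standard Kish approximation for equal-sized clusters (an assumption; the class notes may use different notation). Here deff is the design effect reported for the mean, m is the number of students sampled per school, and rho is the intra-class correlation:

```latex
\[
  \mathrm{deff} \approx 1 + (m - 1)\,\rho
  \quad\Longrightarrow\quad
  \rho \approx \frac{\mathrm{deff} - 1}{m - 1}
\]
```

The deff that Stata reports after svy: mean reflects the unequal weights as well as the clustering, which is one reason the result is only approximate.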

 

3.  Run a regression of math scores on ses and gender, using the three conditions that you used in part c of #1.  Do the different conditions affect the point estimates of the coefficients?  Do they affect the estimates of the standard errors?  Explain.  Interpret the deff’s for the standard errors.

example syntax:

reg y x z

reg y x z [pw=wgt_var]

svy: reg y x z

estat eff, deff deft

(note: don’t use the xi command here.  Because gender is dichotomous, you can add it directly without the xi command)

 

 

4.  Step 3 of the paper

a.  write a rough draft of the introduction and literature review of your paper (1 page each)