Sociology 709
Problem set #9
ps9.dta is a data set of hypothetical math test scores from a survey of schools. The students were sampled by school, so the data are “clustered” at the school level. In addition, students were sampled by gender: 100% of gender A students were sampled, and 40% of gender B students were sampled. What effect do the sampling and clustering have on the mean of math scores and on regression estimates of the effect of SES and gender on math scores?
. des

Contains data from ps9.dta
  obs:         1,400
 vars:             5                          4 Apr 2007 20:20
 size:        33,600 (99.7% of memory free)
-------------------------------------------------------------------------------
              storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------------------------------
schid           float  %9.0g                  school id
gender          float  %9.0g                  gender a=0, b=1
ses             float  %9.0g                  parental soc-econ status
math            float  %9.0g                  math test score
p               float  %9.0g                  sampling probability
-------------------------------------------------------------------------------
Sorted by:
1. Effect of sampling and clustering on means
In Stata,
a. create a weight variable, wgt, that is the inverse of the sampling probability
b. use the command svyset to set the data for survey analysis
the general syntax is
svyset psuid [pw=wgt_var], fpc(varname)
where psuid is the cluster variable and wgt_var is the variable for the sampling weights
* note: ignore the fpc for this homework.
c. use the command “mean x” to find the mean of the math scores. Find the mean under three conditions: (1) ignoring weights and clusters, (2) with pweights only, (3) with pweights and clusters. After running the third condition, inspect the design effects (deff and deft).
Syntax example:
mean x
mean x [pw=wgt_var]
svy: mean x
estat eff, deff deft
d. Explain what is going on with the mean and the standard error of the mean under these three conditions. Why does using the pweights give us the right mean but the wrong standard error?
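The steps in parts a through c can be sketched as follows. This is a minimal sketch, not a definitive solution: it assumes ps9.dta sits in the current working directory, uses the variable names shown in the describe output (schid, math, p), and uses wgt as the weight name suggested in part a. It shows only the command sequence, not any results.

```stata
* Sketch of problem 1, steps a-c (assumes ps9.dta is in the working directory)
use ps9.dta, clear

* a. weight = inverse of the sampling probability
generate wgt = 1/p

* b. declare the survey design: schools are the clusters (PSUs), wgt the pweight
*    (no fpc option, per the note in the handout)
svyset schid [pw=wgt]

* c. the mean of math under the three conditions
mean math                  // 1. ignoring weights and clusters
mean math [pw=wgt]         // 2. pweights only
svy: mean math             // 3. pweights and clusters

* design effects for the third condition
estat effects, deff deft
```

Comparing the point estimate and the standard error across the three runs is the comparison that part d asks you to explain.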
2. Understanding design effects and intra-class correlation.
a. figure out how many clusters (schools) there are in the data.
b. figure out how many students were sampled per school
(tab schid)
c. based on the formula we derived in class, what is the intra-class correlation for mathematics scores in this example? (note the formula in this case may not be precisely correct because of sampling, but it is close enough to use here) Interpret the meaning of the intra-class correlation.
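The formula derived in class is not reproduced in this handout. As an assumption, the block below uses the common approximation for equal-sized clusters (often attributed to Kish), where m is the number of sampled students per school and \rho is the intra-class correlation. The numbers in the second line are made up for illustration and are not taken from ps9.dta.

```latex
% Assumed formula: design effect for clusters of equal size m
\[
\mathrm{deff} \approx 1 + (m - 1)\,\rho
\quad\Longrightarrow\quad
\rho \approx \frac{\mathrm{deff} - 1}{m - 1}
\]
% Illustration with hypothetical values (not results from ps9.dta)
\[
\mathrm{deff} = 3.0,\quad m = 21
\quad\Longrightarrow\quad
\rho \approx \frac{3.0 - 1}{21 - 1} = 0.10
\]
```

Under this reading, \rho is the share of the total variance in math scores that lies between schools rather than between students within the same school.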
3. Run a regression of math scores on ses and gender, using the three conditions that you used in part c of #1. Do the different conditions affect the point estimates of the coefficients? Do they affect the estimates of the standard errors? Explain. Interpret the design effects (deffs) for the standard errors.
example syntax:
reg y x z
reg y x z [pw=wgt_var]
svy: reg y x z
estat eff, deff deft
(note: don’t use the xi command here. Because gender is dichotomous, you can add it directly without the xi command)
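With the variable names from the describe output filled in, the example syntax above might look like the sketch below. It assumes ps9.dta is in the current working directory and rebuilds the weight (named wgt here, following problem 1a) and the svyset declaration so that it runs on its own. It shows the command sequence only.

```stata
* Sketch of problem 3 (assumes ps9.dta is in the working directory)
use ps9.dta, clear

* rebuild the weight and survey design from problem 1
generate wgt = 1/p
svyset schid [pw=wgt]

* regression of math on ses and gender under the three conditions
regress math ses gender               // 1. ignoring weights and clusters
regress math ses gender [pw=wgt]      // 2. pweights only
svy: regress math ses gender          // 3. pweights and clusters

* design effects for the coefficients from the third condition
estat effects, deff deft
```

Because gender is already coded 0/1, it enters the model directly, which is why no xi prefix appears.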
4. Step 3 of the paper
a) rough draft of the introduction and literature review of your paper (1 page on each)