Sociology 709

Lecture T

 

Using weights in your analysis.

 

Basic weights

 

If each case represents a number of other identical cases, use frequency weights.

 

Example: You have data on average income, inc, by county, and you want to find the national average. Weight the data by the number of people in the county.

sum inc [fw=population]
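
As a minimal sketch of what the frequency weight does, the toy data below use two hypothetical counties. The county names, incomes, and populations are invented for illustration and are not from any real data set.

```stata
* Toy data: two hypothetical counties (values invented for illustration)
clear
input str8 county inc population
"A" 30000 1000
"B" 50000 3000
end

* Unweighted mean treats the two counties equally: (30000 + 50000)/2 = 40000
sum inc

* Frequency-weighted mean counts each county once per resident:
* (30000*1000 + 50000*3000)/4000 = 45000
sum inc [fw=population]
```

The weighted mean is pulled toward county B because it holds three quarters of the population.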

 

If you have weights representing the inverse probability of being sampled, then use probability weights.

 

Example: You have individual data on income and want to find the average. Because of stratified sampling, some groups had a higher probability of being sampled than others. Let p be each case's probability of being sampled. The weight should be the inverse of p. Reason: 1/p is the number of people in the population “represented” by that case.

gen wgt=1/p

/* summarize does not accept pweights, so use mean instead */
mean inc [pw=wgt]
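
A small worked example may help show why the inverse-probability weight recovers the population mean. The two groups, their sizes, sampling probabilities, and mean incomes below are all hypothetical, and the sketch assumes each group's sample mean equals its population mean.

```latex
\begin{align*}
\text{Group A: } & N_A = 100,\; p_A = 0.5 \Rightarrow 50 \text{ respondents},\; w_A = 1/p_A = 2,\; \bar{y}_A = 60 \\
\text{Group B: } & N_B = 900,\; p_B = 0.1 \Rightarrow 90 \text{ respondents},\; w_B = 1/p_B = 10,\; \bar{y}_B = 40 \\
\bar{y}_{\text{unweighted}} &= \frac{50(60) + 90(40)}{140} \approx 47.1 \\
\bar{y}_{\text{weighted}} &= \frac{\sum_i w_i y_i}{\sum_i w_i}
  = \frac{50(2)(60) + 90(10)(40)}{50(2) + 90(10)} = \frac{42000}{1000} = 42
\end{align*}
```

The weights sum to the population size (1000), and the weighted mean matches the population mean (100(60) + 900(40))/1000 = 42. The unweighted mean is too high because group A is oversampled.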

 

If the weight is not strictly related to the sampling probability, but is inversely proportional to the variance of the observation, then use analytic weights (aweights in Stata; see, for example, our discussion of WLS for heteroskedasticity).
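
One common case is data where each observation is a group mean. The variance of a mean based on n cases is proportional to 1/n, so n can serve as the analytic weight. The sketch below simulates such data; the variable names (n, meaneduc, meaninc) and all parameter values are hypothetical.

```stata
* Simulated group-level data; all names and values are hypothetical
clear
set seed 709
set obs 200
gen n = ceil(50*runiform())                 // group size, 1 to 50
gen meaneduc = rnormal(12, 2)               // group mean of education
* The error in a group mean has variance proportional to 1/n
gen meaninc = 20 + 2*meaneduc + rnormal(0, 10/sqrt(n))

regress meaninc meaneduc                    // OLS ignores the unequal variances
regress meaninc meaneduc [aw=n]             // WLS: weight inversely proportional to variance
```

Both regressions estimate the same slope, but the weighted version should generally do so more precisely because it down-weights the noisy small groups.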

 

Survey Weights

 

CPC tutorial on basic characteristics of survey data

 

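Complex survey design may involve clustering: instead of sampling individuals directly, the survey first samples groups (clusters) such as schools, districts, or neighborhoods, and then samples cases within each selected group.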

 

Why does clustering pose a problem for data analysis? 

 

The error terms for cases within the same cluster may be correlated with each other, violating the iid assumption about the error term.

 

[from the CPC Stata tutorial]

What characteristics of the sampling design affect estimates such as totals, means, proportions, and regression coefficients? Answer: Sampling weights.
What characteristics of the sampling design affect standard errors, p-values, and confidence intervals? Answer: Sampling weights, clustering, and stratification.

 

 

Let’s take the case of estimating the mean for a variable.

 

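(1)    $Y_{ij} = \mu + \alpha_j + \epsilon_{ij}$

where $\mu$ is the mean of $Y$,

$\alpha_j$ is a cluster-specific error term for cluster j,

and $\epsilon_{ij}$ is the individual-specific error term for case i in cluster j.

(2)    $E[\alpha_j] = 0$, $Var(\alpha_j) = \sigma_\alpha^2$

(3)    $E[\epsilon_{ij}] = 0$, $Var(\epsilon_{ij}) = \sigma_\epsilon^2$

$\alpha_j$ and $\epsilon_{ij}$ are uncorrelated with each other, and the individual error terms are uncorrelated across cases.

As a result,

(4)    $Var(Y_{ij}) = \sigma_\alpha^2 + \sigma_\epsilon^2$

And observations of Y within the same cluster will be correlated with each other. For two different cases i and k in the same cluster j,

(5)    $Cov(Y_{ij}, Y_{kj}) = E[(Y_{ij} - \mu)(Y_{kj} - \mu)] = E[(\alpha_j + \epsilon_{ij})(\alpha_j + \epsilon_{kj})]$

Note: E[.] is the “expectation of .”, i.e., the average.

(6)    $Cov(Y_{ij}, Y_{kj}) = E[\alpha_j^2] = \sigma_\alpha^2$

→ This violates the OLS assumption of independent error terms.

(7)   Intra-class correlation: $\rho = \frac{\sigma_\alpha^2}{\sigma_\alpha^2 + \sigma_\epsilon^2}$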

 

Next, we will derive the variance of the mean under cluster sampling.

 

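First, under a simple random sample where there is only 1 case per cluster (i.e., no clustering of cases),

(8)    $Var(\bar{Y}) = \frac{\sigma_\alpha^2 + \sigma_\epsilon^2}{N}$    where N is the number of cases in the sample, and $\sigma_\alpha^2$ and $\sigma_\epsilon^2$ are the variances of the cluster-specific and individual-specific error terms.

Under cluster sampling,

(9)    $Var(\bar{Y}) = \frac{\sigma_\alpha^2}{C} + \frac{\sigma_\epsilon^2}{CM}$    where C is the number of clusters and M is the number of cases per cluster (N = C x M)

Notice that the variance of the mean declines as N increases in eq 8, but the cluster component in eq 9 only declines as C increases.

Let’s modify equation 9. Write $\sigma^2 = \sigma_\alpha^2 + \sigma_\epsilon^2$ for the total variance of Y and $\rho = \sigma_\alpha^2 / \sigma^2$ for the intra-class correlation, so that $\sigma_\alpha^2 = \rho\sigma^2$ and $\sigma_\epsilon^2 = (1 - \rho)\sigma^2$.

(10)    $Var(\bar{Y}) = \frac{M\sigma_\alpha^2 + \sigma_\epsilon^2}{CM} = \frac{\sigma^2}{N}\left[M\rho + (1 - \rho)\right] = \frac{\sigma^2}{N}\left[1 + (M - 1)\rho\right]$

The factor $1 + (M - 1)\rho$ is the design effect: the ratio of the variance of the mean under cluster sampling to its variance under simple random sampling with the same N.

If M = 1, then the design effect = 1: we have simple random sampling and do not have to worry about clustering.

If $\rho = 1$, then Y is the same for everyone in the same cluster (e.g., the average teacher salary or size of the school for education data), the design effect = M, and the “real” sample size for variable Y is not N = C x M but C, the number of clusters in the data.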
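
A worked example with hypothetical numbers shows how quickly clustering erodes precision. It takes the design effect to be 1 + (M - 1)ρ, where ρ is the intra-class correlation, and assumes C = 100 clusters of M = 20 cases each (N = 2000) with ρ = 0.05. None of these values come from real data.

```latex
\begin{align*}
\text{deff} &= 1 + (M - 1)\rho = 1 + (20 - 1)(0.05) = 1.95 \\
N_{\text{eff}} &= \frac{N}{\text{deff}} = \frac{2000}{1.95} \approx 1026 \\
\frac{SE_{\text{cluster}}}{SE_{\text{SRS}}} &= \sqrt{\text{deff}} = \sqrt{1.95} \approx 1.40
\end{align*}
```

Even a modest intra-class correlation of 0.05 roughly halves the effective sample size and inflates the standard error of the mean by about 40 percent.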

 

 

The upshot: if the intra-class correlation is greater than 0 for a particular variable and there is more than one case per cluster, the design effect exceeds 1, and standard errors computed under the assumption of SRS will be biased downward (too small).
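
The simulation below is a minimal sketch of this point, not part of the original example. It builds data from a cluster-specific error plus an individual-specific error and compares the standard error of the mean with and without allowing for clustering. The variable names and all parameter values (50 clusters, 20 cases each, error variances of 1 and 4) are hypothetical.

```stata
* Simulated clustered data; all parameter values are hypothetical
clear
set seed 709
set obs 50                        // C = 50 clusters
gen cid = _n
gen alpha = rnormal(0, 1)         // cluster-specific error, variance 1
expand 20                         // M = 20 cases per cluster, N = 1000
gen eps = rnormal(0, 2)           // individual-specific error, variance 4
gen y = 10 + alpha + eps

loneway y cid                     // reports the estimated intra-class correlation
mean y                            // standard error assuming SRS
mean y, vce(cluster cid)          // standard error allowing for clustering
```

With these values the true intra-class correlation is 1/(1 + 4) = 0.2, so the design effect is 1 + 19(0.2) = 4.8. The cluster-robust standard error should therefore be roughly sqrt(4.8), or about 2.2 times, the SRS standard error, although any single simulated sample will differ somewhat.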

 

 

Calculation of the correct standard errors can be problematic when the design of the survey is complex: there may not be a tractable formula.

 

Methods: 1) Taylor series linearization, 2) balanced repeated replication (BRR), 3) jackknife estimation, 4) bootstrap standard errors.
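
As a rough sketch of how two of these methods are requested in Stata, the commands below compare linearized and jackknife standard errors for the same mean. They use the UCLA school data and survey design from the example in this lecture. BRR and bootstrap are left out because, as far as I know, Stata's svy versions of them typically rely on replicate weights supplied with the survey, which this data set is not assumed to have.

```stata
* Same data and design as the UCLA example used in this lecture
use http://www.ats.ucla.edu/stat/stata/library/apiclus1, clear
svyset dnum [pw=pw], fpc(fpc)

svy: mean api00                   // default: Taylor series linearization
svy jackknife: mean api00         // jackknife: re-estimates, dropping one PSU at a time
```

The point estimate is the same under both methods; only the estimated standard error changes.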

 

We will be going over bootstrap standard errors later in the course.

 

 

For more details on the use of Stata for survey data analysis, see the UCLA webpage

 

Example: We will use the example from the UCLA webpage on a cluster sample of schools (log file: ucla survey example.log).

 

do-file for this example

 

use http://www.ats.ucla.edu/stat/stata/library/apiclus1, clear
 
tabulate stype
 
      stype |      Freq.     Percent        Cum.
------------+-----------------------------------
          E |        144       78.69       78.69
          H |         14        7.65       86.34
          M |         25       13.66      100.00
------------+-----------------------------------
      Total |        183      100.00
 
tabulate dnum
 
   district |
     number |      Freq.     Percent        Cum.
------------+-----------------------------------
         61 |         13        7.10        7.10
        135 |         34       18.58       25.68
        178 |          4        2.19       27.87
        197 |         13        7.10       34.97
        255 |         16        8.74       43.72
        406 |          2        1.09       44.81
        413 |          1        0.55       45.36
        437 |          4        2.19       47.54
        448 |         12        6.56       54.10
        510 |         21       11.48       65.57
        568 |          9        4.92       70.49
        637 |         11        6.01       76.50
        716 |         37       20.22       96.72
        778 |          2        1.09       97.81
        815 |          4        2.19      100.00
------------+-----------------------------------
      Total |        183      100.00
 
svyset dnum [pw=pw], fpc(fpc)
 
      pweight: pw
          VCE: linearized
     Strata 1: <one>
         SU 1: dnum
        FPC 1: fpc
 
/* list fpc pw dnum -- to see the values for these items */
 
svy: mean api00
 
(running mean on estimation sample)
 
Survey: Mean estimation
 
Number of strata =       1          Number of obs    =     183
Number of PSUs   =      15          Population size  =  9235.4
                                    Design df        =      14
 
--------------------------------------------------------------
             |             Linearized
             |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
       api00 |   644.1694   23.54224      593.6763    694.6625
--------------------------------------------------------------
 
svy: total enroll
 
(running total on estimation sample)
 
Survey: Total estimation
 
Number of strata =       1          Number of obs    =     183
Number of PSUs   =      15          Population size  =  9235.4
                                    Design df        =      14
 
--------------------------------------------------------------
             |             Linearized
             |      Total   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
      enroll |    5076846    1389984       2095626     8058066
--------------------------------------------------------------
 
svy: regress api00 meals ell avg_ed
 
(running regress on estimation sample)
 
Survey: Linear regression
 
Number of strata   =         1                  Number of obs      =       157
Number of PSUs     =        15                  Population size    = 9235.4001
                                                Design df          =        14
                                                F(   3,     12)    =     54.36
                                                Prob > F           =    0.0000
                                                R-squared          =    0.6978
 
------------------------------------------------------------------------------
             |             Linearized
       api00 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       meals |  -2.948702   .3266161    -9.03   0.000    -3.649224    -2.24818
         ell |  -.2227005   .3938377    -0.57   0.581    -1.067398    .6219974
      avg_ed |   16.42832   15.32151     1.07   0.302    -16.43304    49.28968
       _cons |   755.4386   55.61202    13.58   0.000     636.1626    874.7145
------------------------------------------------------------------------------
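
One way to gauge how much the survey design matters for these estimates is Stata's estat effects postestimation command, which reports design effects after a svy estimation command. The sketch below reloads the same data so that it runs on its own. No output is shown because it has not been run for these notes.

```stata
* Same data and design as above
use http://www.ats.ucla.edu/stat/stata/library/apiclus1, clear
svyset dnum [pw=pw], fpc(fpc)

svy: mean api00
estat effects, deff deft          // design effect and its square root

svy: regress api00 meals ell avg_ed
estat effects, deff deft          // one DEFF/DEFT per coefficient
```

DEFF is the ratio of the design-based variance to the variance expected under simple random sampling, and DEFT is the corresponding ratio of standard errors.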

 

 

What happens if we ignore the clustering?