Lecture 26—Monday, February 27, 2006
What was covered?
- Including categorical variables with three levels in regression models
- Indicator, deviation, and Helmert coding schemes for categorical regressors
Using Categorical Variables in Multiple Regression—Continued
A categorical variable with three levels—the wrong way to include it in a regression model
- Consider again the sociological example in which we use the amount of education required to work in an occupation to predict the degree of prestige people assign to it. This time we include a categorical variable that classifies the type of occupation into three levels: blue collar, professional, and white collar. Suppose we enter it into the regression model in the same way we included gender, i.e., we code occupation type numerically using the successive integers 1, 2, and 3 as shown below.
Label | Occupation Type
blue collar | 1
professional | 2
white collar | 3
- We fit the regression model: prestige = β0 + β1 education + β2 type.
- In this model, as in the gender example, we assume that all three occupations exhibit the same relationship between education and prestige but that they start from different baselines. (This is reflected by the absence of the interaction term education:type in the model.)
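- Using the coding scheme above yields the following equations for the occupations.
blue collar (type = 1): prestige = (β0 + β2) + β1 education
professional (type = 2): prestige = (β0 + 2β2) + β1 education
white collar (type = 3): prestige = (β0 + 3β2) + β1 education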
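- Plotted against education, the three occupation types appear as parallel lines with common slope β1 and intercepts β0 + β2, β0 + 2β2, and β0 + 3β2.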

- Unlike the situation with the two-level categorical variable gender, our numeric coding scheme for occupation has imposed an unwarranted assumption on the location of the intercepts. While the locations of the intercepts for blue collar and professional workers are arbitrary, the location of the intercept for white collar workers is not.
- Our numeric coding scheme has constrained, for a fixed level of education, the difference in prestige between blue collar and professional workers to be exactly the same as the difference between white collar and professional workers, β2.
- Furthermore the prestige difference between white collar and blue collar workers, for a fixed education level, is 2β2, twice the difference between either one and professional workers.
- These assumptions may turn out to be correct, but in general we'd be better off testing them instead of merely assuming them. In order to do that we need a more general coding scheme than the one we've used.
- Note: The change we need to make is more fundamental than just changing the numerical values of the levels. If we change the number assignments (something other than 1, 2, and 3) in the occupational type variable, we'll just end up with a different set of arbitrary constraints on the intercepts.
- What we need is a way of coding occupation that doesn't enforce any constraints on the estimates we obtain. The standard way of doing this is to use dummy regressors.
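The constraint imposed by numeric coding can be checked directly. The following is a minimal sketch on simulated data, not the lecture's data set: the variable names, group sizes, and true intercept shifts are all hypothetical, and the shifts are deliberately chosen to be unequally spaced.

```r
# Simulated data; every name and number here is hypothetical.
set.seed(1)
type <- factor(rep(c("bc", "prof", "wc"), each = 20))   # alphabetical levels
education <- runif(60, 6, 16)
# True intercept shifts are NOT equally spaced across the three types.
shift <- c(0, 25, 5)[as.integer(type)]
prestige <- 10 + 3 * education + shift + rnorm(60, sd = 2)

# Numeric coding 1, 2, 3 (bc, prof, wc), entered as a single regressor.
type_num <- as.numeric(type)
fit_num <- lm(prestige ~ education + type_num)

# Implied intercepts beta0 + beta2 * code for code = 1, 2, 3.
b <- coef(fit_num)
intercepts_num <- unname(b["(Intercept)"]) + unname(b["type_num"]) * 1:3
diff(intercepts_num)   # both gaps equal the type_num coefficient

# Entering type as a factor removes the constraint; because the numeric
# model is nested in the factor model, an F test can assess the constraint.
fit_fac <- lm(prestige ~ education + type)
cmp <- anova(fit_num, fit_fac)
cmp
```

Whatever the data look like, the two gaps in `diff(intercepts_num)` are identical, because the model has only one coefficient with which to describe three intercepts.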
A categorical variable with three levels—a correct way to include it in a regression model
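- With dummy (indicator) coding we replace occupation type by two regressors, X1 and X2, each of which takes only the values 0 and 1.
Occupation Type | X1 | X2
blue collar | 0 | 0
professional | 1 | 0
white collar | 0 | 1
Note: by default R alphabetizes the levels of a character variable and defines the dummy regressors in alphabetical order. As a result the first level alphabetically becomes the baseline level.
- If we fit a linear regression model in R with prestige as the response and education and occupation as predictors by
lm(prestige~education+occupation)
we obtain the following regression model by default.
prestige = β0 + β1 education + β2 X1 + β3 X2
- Since X1 = 0 and X2 = 0 for blue collar workers, the regression model for them is the following.
prestige = β0 + β1 education
- Since X1 = 1 and X2 = 0 for professional workers, the regression model for them is the following.
prestige = (β0 + β2) + β1 education
- Since X1 = 0 and X2 = 1 for white collar workers, the regression model for them is the following.
prestige = (β0 + β3) + β1 education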
- Plotted against education, the three occupation types appear as parallel lines with common slope β1 and intercepts β0, β0 + β2, and β0 + β3.

- Since β0, β2, and β3 are estimated separately, there are no constraints placed on the location of the intercepts. It is also clear why we only need n − 1 regressors to independently describe n levels of a categorical variable. In the additive model, the categorical variable levels just serve to change the intercept of the regression line. Since there is already an intercept in the model, it is used for one of the levels of the categorical variable.
- There are other coding schemes besides indicator coding. I consider some of those next.
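To see the dummy regressors that R constructs for the additive model, one can inspect the design matrix. This is a sketch on simulated data; the names `type`, `education`, and `prestige` and all numeric values are assumptions made for illustration, with levels abbreviated "bc", "prof", and "wc".

```r
# Simulated data; every name and number here is hypothetical.
set.seed(1)
type <- factor(rep(c("bc", "prof", "wc"), each = 20))
education <- runif(60, 6, 16)
shift <- c(0, 25, 5)[as.integer(type)]
prestige <- 10 + 3 * education + shift + rnorm(60, sd = 2)

levels(type)   # alphabetical order, so "bc" is the baseline level

fit <- lm(prestige ~ education + type)

# The design matrix holds the two 0/1 regressors built from the factor.
X <- model.matrix(fit)
colnames(X)
unique(X[, c("typeprof", "typewc")])   # the three distinct (X1, X2) patterns

# One separately estimated intercept per occupation type.
b <- coef(fit)
intercepts <- c(bc   = unname(b["(Intercept)"]),
                prof = unname(b["(Intercept)"] + b["typeprof"]),
                wc   = unname(b["(Intercept)"] + b["typewc"]))
intercepts
```

R orders the coefficients as intercept, education, then the two dummy regressors, which matches β0, β1, β2, and β3 in the model above.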
Coding Schemes of Regressors for Categorical Variables
- There are many possible coding schemes for categorical variables. I outline the standard options available in R below. To simplify our discussion I consider a model in which the only predictor is the variable occupation. Thus in R notation we fit the model:
lm(prestige~occupation)
- In all of these coding schemes we represent occupation using two regressors. What does change is the way the levels of these regressors are coded.
Dummy (Indicator) Coding
- As explained above, with dummy coding our occupation variable is converted into two dummy (indicator) variables: X1, which is 1 for professional workers and 0 otherwise, and X2, which is 1 for white collar workers and 0 otherwise. By default the level that is first alphabetically becomes the baseline level. This is the contr.treatment coding scheme of R. If we specify the regression model lm(prestige~occupation), R actually fits the model:
prestige = β0 + β1 X1 + β2 X2
- Assume we are fitting a generalized linear model with a normally distributed random component and an identity link. The regression equation then predicts the mean as a function of X1 and X2. Formally we refer to this as the conditional mean of y (prestige), conditional on the values that are specified for the predictors. Thus when we choose values for X1 and X2 that correspond to the various occupation types, we obtain the means of those occupation types. With the dummy coding scheme we obtain the following equations for the means.
blue collar mean (X1 = 0, X2 = 0) = β0
professional mean (X1 = 1, X2 = 0) = β0 + β1
white collar mean (X1 = 0, X2 = 1) = β0 + β2
- By subtracting the mean for blue collar workers from each of the other equations we obtain expressions for β1 and β2.
β1 = professional mean − blue collar mean
β2 = white collar mean − blue collar mean
- Thus in the dummy coding scheme each coefficient measures a difference in conditional mean between one classification level and the classification level that was chosen as baseline. The intercept corresponds to the mean of the baseline classification level.
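The interpretation of the dummy-coded coefficients can be verified numerically. The sketch below uses simulated, deliberately unbalanced data; the group means, group sizes, and variable names are hypothetical.

```r
# Simulated data; every name and number here is hypothetical.
set.seed(2)
type <- factor(rep(c("bc", "prof", "wc"), times = c(15, 20, 10)))
true_means <- c(bc = 35, prof = 68, wc = 42)
prestige <- unname(true_means[as.character(type)]) + rnorm(45, sd = 5)

# Coding matrix for a three-level factor: rows are levels, columns are X1, X2.
contr.treatment(3)

fit <- lm(prestige ~ type)           # contr.treatment is R's default
m <- tapply(prestige, type, mean)    # sample mean of each level

# Coefficients versus the mean-based expressions derived above.
est <- unname(coef(fit))
by_hand <- unname(c(m["bc"], m["prof"] - m["bc"], m["wc"] - m["bc"]))
cbind(est, by_hand)
```

The two columns agree up to floating-point error: the intercept is the blue collar mean and the other two coefficients are differences from it.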
Deviation (Effects) Coding
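- In deviation coding each regressor takes the value 1 for the level it represents, 0 for the other level that has its own regressor, and −1 for the last level.
Occupation Type | X1 | X2
blue collar | 1 | 0
professional | 0 | 1
white collar | −1 | −1
In R deviation coding is denoted contr.sum. To assign this contrast to the variable occupation type we would use the following statement:
contrasts(type)<-'contr.sum'
- The model prestige = β0 + β1 X1 + β2 X2 then yields the following equations for the means.
blue collar mean = β0 + β1
professional mean = β0 + β2
white collar mean = β0 − β1 − β2
- This time let's add the three equations.
blue collar mean + professional mean + white collar mean = 3β0
β0 = (blue collar mean + professional mean + white collar mean) / 3
So we see that the intercept in deviation coding corresponds to the mean of all three level means.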
- From the equations for blue collar and professional and using the formula for β0 above we immediately obtain interpretations for β1 and β2.
- β1 = the difference between the mean for blue collar and the overall mean
- β2 = the difference between the mean for professional and the overall mean
- Thus in deviation coding the coefficients measure the distance between individual levels and the mean of all the levels.
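A short numerical check of the deviation-coding interpretation follows. It is a sketch on simulated data with hypothetical names, means, and group sizes. The groups are unbalanced on purpose, to show that the intercept is the unweighted mean of the three level means rather than the mean of all observations.

```r
# Simulated data; every name and number here is hypothetical.
set.seed(2)
type <- factor(rep(c("bc", "prof", "wc"), times = c(15, 20, 10)))
true_means <- c(bc = 35, prof = 68, wc = 42)
prestige <- unname(true_means[as.character(type)]) + rnorm(45, sd = 5)

contrasts(type) <- "contr.sum"
contrasts(type)    # rows are levels; the last level is coded -1, -1

fit <- lm(prestige ~ type)
m <- tapply(prestige, type, mean)
grand <- mean(m)   # unweighted mean of the three level means

est <- unname(coef(fit))
by_hand <- unname(c(grand, m["bc"] - grand, m["prof"] - grand))
cbind(est, by_hand)

# With unequal group sizes this differs from the mean of all observations.
c(mean_of_level_means = grand, mean_of_observations = mean(prestige))
```

The deviation for the last level, white collar, has no coefficient of its own; it can be recovered as the negative of the sum of β1 and β2.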
Helmert Coding
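- In R Helmert coding is denoted contr.helmert and is the default coding scheme for factors in some versions of S-PLUS. Assigning this contrast to the variable occupation type is done with the following statement.
contrasts(type)<-'contr.helmert'
- For a three-level factor, contr.helmert codes the regressors as follows.
Occupation Type | X1 | X2
blue collar | −1 | −1
professional | 1 | −1
white collar | 0 | 2
- The model prestige = β0 + β1 X1 + β2 X2 then yields the following equations for the means.
blue collar mean = β0 − β1 − β2
professional mean = β0 + β1 − β2
white collar mean = β0 + 2β2
- As with deviation coding, if we add all three equations together we isolate β0 and find that the intercept represents the mean of all the levels.
β0 = (blue collar mean + professional mean + white collar mean) / 3
- We can isolate β1 by subtracting the blue collar mean from the professional mean.
professional mean − blue collar mean = 2β1, so β1 = (professional mean − blue collar mean) / 2
The best way to interpret this coefficient is to observe that if β1 is not significantly different from zero we would conclude that the mean prestige for professionals and the mean prestige for blue collar workers are not significantly different from each other.
- We can isolate β2 by taking the average of the blue collar and professional means and then subtracting the result from the white collar mean.
white collar mean − (blue collar mean + professional mean) / 2 = 3β2, so β2 = [white collar mean − (blue collar mean + professional mean) / 2] / 3
The best way to interpret this coefficient is to observe that if β2 is not significantly different from zero we would conclude that the mean prestige of white collar workers is not significantly different from the average of the blue collar and professional means.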
- Essentially Helmert coding compares the current level with the average of all the levels that precede it. Thus Helmert contrast coding is especially appropriate if there is a natural order to the categories because then sequential comparisons of this sort make sense.
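The Helmert interpretation can be checked the same way. Again this is a sketch on simulated data; the variable names, true means, and group sizes are hypothetical.

```r
# Simulated data; every name and number here is hypothetical.
set.seed(2)
type <- factor(rep(c("bc", "prof", "wc"), times = c(15, 20, 10)))
true_means <- c(bc = 35, prof = 68, wc = 42)
prestige <- unname(true_means[as.character(type)]) + rnorm(45, sd = 5)

contrasts(type) <- "contr.helmert"
contrasts(type)    # rows are levels, columns are X1 and X2

fit <- lm(prestige ~ type)
m <- tapply(prestige, type, mean)

est <- unname(coef(fit))
by_hand <- unname(c(
  mean(m),                                   # beta0: mean of the level means
  (m["prof"] - m["bc"]) / 2,                 # beta1: half of prof minus bc
  (m["wc"] - (m["bc"] + m["prof"]) / 2) / 3  # beta2: wc versus average of earlier levels
))
cbind(est, by_hand)
```

Because of the divisors 2 and 3, the Helmert coefficients are scaled versions of the sequential comparisons, so their tests of significance are more directly useful than their raw magnitudes.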