Lecture 36—Wednesday, March 22, 2006
What was covered?
- Goodness of fit tests for ungrouped binary data
- grouping by the predictor
- grouping by predicted logits, the Hosmer-Lemeshow test
- Measures of model performance for logistic regression
Terminology defined
Assessing model fit with ungrouped binary data
Method 1: Grouping on predictors
- Grouping on values of the predictors is a straightforward extension of the goodness-of-fit test for grouped binary data. Typically this approach involves cutting a continuous variable at various values to form groups and then obtaining the expected and observed frequencies of success and failure for the observations in each group.
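- Suppose we form g groups this way with $n_1, n_2, \ldots, n_g$ observations occurring in each of the groups. Let $\hat{p}_i$ be the model-based estimated probability for observation i. The table below shows the calculations required for obtaining the expected frequencies.

| Grouping | 1 | 2 | … | g |
|---|---|---|---|---|
| Successes | $\sum_{i \in 1} \hat{p}_i$ | $\sum_{i \in 2} \hat{p}_i$ | … | $\sum_{i \in g} \hat{p}_i$ |
| Failures | $\sum_{i \in 1} (1 - \hat{p}_i)$ | $\sum_{i \in 2} (1 - \hat{p}_i)$ | … | $\sum_{i \in g} (1 - \hat{p}_i)$ |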
- Notice that the sum is taken over observations that occur in the group based on their value of x. As usual we demand that no more than 20% of the cells have expected frequencies less than 5. Some researchers argue that for small tables we should require all the cells to have at least 5 observations.
- We obtain a similar table for the observed frequencies. Here yi is the observed presence-absence variable (coded 1 and 0 respectively).
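| Grouping | 1 | 2 | … | g |
|---|---|---|---|---|
| Successes | $\sum_{i \in 1} y_i$ | $\sum_{i \in 2} y_i$ | … | $\sum_{i \in g} y_i$ |
| Failures | $\sum_{i \in 1} (1 - y_i)$ | $\sum_{i \in 2} (1 - y_i)$ | … | $\sum_{i \in g} (1 - y_i)$ |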
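- The Pearson chi-square statistic is constructed by summing over the 2g categories of successes and failures. Writing $O_k$ and $E_k$ for the observed and expected frequencies in cell k, the statistic is

$$X^2 = \sum_{k=1}^{2g} \frac{(O_k - E_k)^2}{E_k}$$

The degrees of freedom are the number of x-categories, g, minus the number of parameters estimated in the model, p.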
- With multiple predictors the number of ways in which the categories can be formed is daunting. This leaves it uncertain whether a non-significant result reflects adequate fit or merely the specific manner in which the categories were formed. Typically when there is more than one predictor, the approach described in the next section is the preferred one (and certainly easier to do).
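For a single continuous predictor, the calculations in Method 1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not code from the course: the function name `grouped_pearson`, its arguments, and the toy data are all hypothetical. It returns the statistic and its degrees of freedom only; a p-value would come from the upper tail of a chi-square distribution with that many degrees of freedom.

```python
from bisect import bisect_right


def grouped_pearson(x, y, p_hat, cutpoints, n_params):
    """Pearson chi-square after grouping observations on a predictor x.

    x         : values of the predictor used to form the groups
    y         : observed 0/1 responses
    p_hat     : fitted probabilities from the logistic regression
    cutpoints : sorted cut values; they define g = len(cutpoints) + 1 groups
    n_params  : number of parameters p estimated in the model
    Returns (observed, expected, X2, df); observed and expected hold
    [successes, failures] for each group.
    """
    g = len(cutpoints) + 1
    observed = [[0.0, 0.0] for _ in range(g)]
    expected = [[0.0, 0.0] for _ in range(g)]
    for xi, yi, pi in zip(x, y, p_hat):
        j = bisect_right(cutpoints, xi)   # index of the group containing xi
        observed[j][0] += yi              # observed successes: sum of y_i
        observed[j][1] += 1 - yi          # observed failures: sum of (1 - y_i)
        expected[j][0] += pi              # expected successes: sum of p_hat_i
        expected[j][1] += 1 - pi          # expected failures: sum of (1 - p_hat_i)
    # Sum over all 2g cells; this fails if any expected frequency is zero.
    x2 = sum(
        (o - e) ** 2 / e
        for obs, exp in zip(observed, expected)
        for o, e in zip(obs, exp)
    )
    df = g - n_params
    return observed, expected, x2, df


# Made-up data, far too small for a real test; for illustration only.
x = [1, 2, 3, 4, 5, 6]
y = [0, 1, 0, 1, 1, 1]
p_hat = [0.25, 0.25, 0.5, 0.5, 0.75, 0.75]
observed, expected, x2, df = grouped_pearson(x, y, p_hat, [2.5, 4.5], n_params=2)
```

In practice the expected frequencies should be checked against the "no more than 20% of cells below 5" guideline before the statistic is interpreted.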
Method 2: Hosmer-Lemeshow test—grouping on estimated logits
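- The Hosmer-Lemeshow test groups observations using the percentiles of the estimated probabilities (or equivalently, the estimated logits). Typically quintiles or deciles are used. For deciles we would form the following ten groups, where $P_k$ denotes the k-th percentile of the estimated probabilities $\hat{p}_i$.

$$\hat{p}_i \le P_{10}, \quad P_{10} < \hat{p}_i \le P_{20}, \quad \ldots, \quad P_{80} < \hat{p}_i \le P_{90}, \quad \hat{p}_i > P_{90}$$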
- We then calculate the expected and observed number of successes and failures in each of the intervals, just as we did above for groupings based on the x-variable, and construct the Pearson chi-square statistic by summing over the categories for both successes and failures. Unlike the typical chi-square statistic, the degrees of freedom for the Hosmer-Lemeshow statistic are not obtained by subtracting the number of estimated parameters from the number of independent groups. Instead the degrees of freedom are $g - 2$, regardless of how many parameters were estimated.

- The ability of the Hosmer-Lemeshow test to detect true lack of fit has been called into question lately and some alternative tests have been proposed. One of those alternatives is implemented in Frank Harrell's Design package for R and was demonstrated in lab.
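The Hosmer-Lemeshow calculation can be sketched in a few lines of Python. This is a hedged illustration rather than a reference implementation: the function name `hosmer_lemeshow` is hypothetical, and the groups are formed by sorting on the fitted probabilities and cutting into g consecutive blocks of nearly equal size, which only approximates grouping on percentiles (tied probabilities may be split across groups). It is not the alternative test from the Design package.

```python
def hosmer_lemeshow(y, p_hat, g=10):
    """Hosmer-Lemeshow statistic with g groups of (nearly) equal size.

    y     : observed 0/1 responses
    p_hat : fitted probabilities from the logistic regression
    Returns (X2, df) with df = g - 2.
    """
    pairs = sorted(zip(p_hat, y))          # order by estimated probability
    n = len(pairs)
    x2 = 0.0
    for j in range(g):
        # j-th block of the sorted observations
        chunk = pairs[j * n // g:(j + 1) * n // g]
        exp_succ = sum(p for p, _ in chunk)
        obs_succ = sum(yi for _, yi in chunk)
        exp_fail = len(chunk) - exp_succ
        obs_fail = len(chunk) - obs_succ
        # Pearson contributions from the success and failure cells;
        # this fails if an expected frequency is zero.
        x2 += (obs_succ - exp_succ) ** 2 / exp_succ
        x2 += (obs_fail - exp_fail) ** 2 / exp_fail
    return x2, g - 2


# Made-up data for illustration only; real use needs many more observations.
y = [0, 1, 0, 1, 0, 1, 1, 1]
p_hat = [0.25, 0.25, 0.25, 0.25, 0.75, 0.75, 0.75, 0.75]
stat, df = hosmer_lemeshow(y, p_hat, g=4)
```

The statistic would then be compared with a chi-square distribution on `df` degrees of freedom.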
Measures of model performance unique to logistic regression: the classification table
- One use of logistic regression is to classify new observations as successes or failures based on the values of their predictors. Since logistic regression estimates a probability, we need to choose a cut-off value c such that if $\hat{p}_i > c$ we classify the observation to be a success ($\hat{Y}_i = 1$), otherwise we classify it as a failure ($\hat{Y}_i = 0$). Here $\hat{p}_i$ is the probability estimated by the logistic regression model.
- One diagnostic that we can then use for a logistic regression model is how well it predicts the data that went into fitting the model. Suppose we have a data set with 575 observations consisting of 147 successes ($Y_i = 1$) and 428 failures ($Y_i = 0$). The model predicts that of these 575 observations 27 are successes ($\hat{Y}_i = 1$) and 548 are failures ($\hat{Y}_i = 0$). Results such as this can be organized into what's called a classification table (also called a confusion matrix).
| Predicted (using decision rule) | Observed $Y_i = 1$ | Observed $Y_i = 0$ | Total |
|---|---|---|---|
| $\hat{Y}_i = 1$ | 16 | 11 | 27 |
| $\hat{Y}_i = 0$ | 131 | 417 | 548 |
| Total | 147 | 428 | 575 |
In this table c = 0.5 was used in the decision rule.
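As a rough sketch of how such a table is produced from fitted probabilities, the Python below cross-tabulates predicted against observed classes. The function name `classification_table` and the toy data are hypothetical, and the rule "classify as a success when the estimated probability exceeds c" (with a probability exactly equal to c counted as a failure) is an assumption of this sketch.

```python
def classification_table(y, p_hat, c=0.5):
    """Cross-tabulate predicted class against observed class.

    Rows are predicted (1, then 0); columns are observed (1, then 0),
    matching the layout of the table above.
    """
    table = [[0, 0], [0, 0]]
    for yi, pi in zip(y, p_hat):
        row = 0 if pi > c else 1      # predicted success goes in the first row
        col = 0 if yi == 1 else 1     # observed success goes in the first column
        table[row][col] += 1
    return table


# Made-up data for illustration only.
y = [1, 0, 1, 0]
p_hat = [0.9, 0.6, 0.4, 0.1]
table_half = classification_table(y, p_hat, c=0.5)
table_low = classification_table(y, p_hat, c=0.3)
```

Changing c moves observations between the two rows while leaving the column totals fixed.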
- The four interior cells of the table can be classified as follows.

| | $Y_i = 1$ | $Y_i = 0$ |
|---|---|---|
| $\hat{Y}_i = 1$ | True Positives (TP) | False Positives (FP) |
| $\hat{Y}_i = 0$ | False Negatives (FN) | True Negatives (TN) |
- Using the numbers in the table, we can conclude the following.
- The rate of correct group classification (the fraction of observations correctly classified) is

$$\frac{TP + TN}{TP + FP + FN + TN} = \frac{16 + 417}{575} \approx 0.753$$
- The true positive rate, i.e., the probability of detecting a true signal, is

$$\frac{TP}{TP + FN} = \frac{16}{147} \approx 0.109$$

This is also called the sensitivity of the decision rule.
- The true negative rate, i.e., the probability of correctly identifying the absence of a signal, is

$$\frac{TN}{TN + FP} = \frac{417}{428} \approx 0.974$$

This is also called the specificity of the decision rule.
- Specificity and sensitivity seem like sensible measures for rating the quality of a logistic regression model. There are some problems with this idea.
- The choice of c in the decision rule is arbitrary. Why choose 0.5? Why is 0.5 necessarily optimal?
- We would seem to be losing a lot of information by dichotomizing the probabilities. This reduces a continuum to a binary yes-no decision. Suppose the predicted probabilities for two observations are 0.49 and 0.51, and our decision rule uses c = 0.5, thus classifying these observations as a failure and a success, respectively. Are we really to believe that these two observations are all that different? In general, if the estimated probability is 0.51, then out of 100 observations with the characteristics of this observation we would expect on average 51 successes and 49 failures, yet our decision rule would always classify such an observation as a success.
- Sensitivity and specificity are not entirely functions of the decision rule. They also reflect the margins of the table. Thus the same model evaluated on two different populations can give very different impressions of performance if sensitivity and specificity are used to evaluate it.
- We'll consider some possible solutions to these problems next time.
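The three rates computed from the classification table can be reproduced with a short Python sketch. The function name `classification_rates` is hypothetical; the counts are the ones in the table for c = 0.5.

```python
def classification_rates(tp, fp, fn, tn):
    """Fraction correctly classified, sensitivity, and specificity."""
    total = tp + fp + fn + tn
    return {
        "correct": (tp + tn) / total,      # fraction correctly classified
        "sensitivity": tp / (tp + fn),     # true positive rate
        "specificity": tn / (tn + fp),     # true negative rate
    }


# Counts from the classification table with c = 0.5.
rates = classification_rates(tp=16, fp=11, fn=131, tn=417)
for name, value in rates.items():
    print(f"{name}: {value:.3f}")
```

With these counts the rates come out at roughly 0.753, 0.109, and 0.974, matching the hand calculations: the rule is rarely wrong about failures but finds few of the successes.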