Lecture 37—Friday, March 24, 2006
What was covered?
- confusion matrix
- sensitivity and specificity
- ROC curve
- area under the curve (AUC)
- cross-validation
- the latent variable derivation of the logistic and probit regression models
Terminology defined
The confusion (classification) matrix
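- Last time we began consideration of the classification table (confusion matrix) as a way to evaluate logistic regression models. We create a classification table by picking a cut-off value c, 0 ≤ c ≤ 1, with which to predict the value of the dichotomous response Y. Our decision rule based on the cut-off is the following, where $\hat{p}_i$ is the probability estimated by the model for observation i.

$$\hat{Y}_i = \begin{cases} 1, & \hat{p}_i > c \\ 0, & \hat{p}_i \le c \end{cases}$$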
- Using this decision rule we can construct the following generic confusion matrix.
| Predicted (using decision rule) | Observed Yi = 1 | Observed Yi = 0 | Total |
|---|---|---|---|
| $\hat{Y}_i = 1$ | A | C | A+C |
| $\hat{Y}_i = 0$ | B | D | B+D |
| Total | A+B | C+D | A+B+C+D |
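- From the confusion matrix a number of quantities of interest can be defined. They include the following. Here A is the number of observed ones that are predicted to be ones and D is the number of observed zeros that are predicted to be zeros.

$$\text{sensitivity (true positive rate)} = \frac{A}{A+B}, \qquad \text{specificity (true negative rate)} = \frac{D}{C+D}$$

$$\text{false positive rate} = \frac{C}{C+D} = 1 - \text{specificity}, \qquad \text{false negative rate} = \frac{B}{A+B} = 1 - \text{sensitivity}$$
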
- As noted last time there are problems with attempting to use these quantities to rate logistic regression models.
- The choice of c is arbitrary. All of the statistics will change if a different c is used.
- When we predict a dichotomous response $\hat{Y}_i$ by reducing the continuum of estimated probabilities $\hat{p}_i$ to a dichotomy, 0 or 1, this entails a loss of information.
- All the rates calculated above depend on the margins of the classification table, A + B and C + D. The margins of the table are determined by the population from which the data come and do not depend on the model. Thus specificity and sensitivity are not just properties of the model, they also depend on the underlying population. As a result these quantities are clearly not absolute measures of model quality. Using the same decision rule with different populations may yield very different values for these summary statistics and thus different conclusions.
- I address each of these issues in turn.
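To make the quantities in the confusion matrix concrete, here is a minimal Python sketch. It assumes the table is laid out with A = predicted 1 and observed 1, B = predicted 0 and observed 1, C = predicted 1 and observed 0, and D = predicted 0 and observed 0, and that we predict 1 when the estimated probability exceeds c. The data and the function names are made up for illustration.

```python
def confusion_counts(y, p_hat, c):
    """Return (A, B, C, D) for cut-off c, predicting 1 when p_hat > c."""
    A = B = C = D = 0
    for yi, pi in zip(y, p_hat):
        pred = 1 if pi > c else 0
        if yi == 1 and pred == 1:
            A += 1          # observed 1, predicted 1
        elif yi == 1:
            B += 1          # observed 1, predicted 0
        elif pred == 1:
            C += 1          # observed 0, predicted 1
        else:
            D += 1          # observed 0, predicted 0
    return A, B, C, D

def sensitivity(A, B, C, D):
    return A / (A + B)      # fraction of observed ones predicted correctly

def specificity(A, B, C, D):
    return D / (C + D)      # fraction of observed zeros predicted correctly

# Hypothetical observed responses and fitted probabilities.
y     = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
p_hat = [0.9, 0.8, 0.6, 0.3, 0.7, 0.4, 0.35, 0.2, 0.1, 0.05]

for c in (0.25, 0.5, 0.75):
    counts = confusion_counts(y, p_hat, c)
    print(c, counts, sensitivity(*counts), specificity(*counts))
```

Running this with the three cut-offs shows the first problem listed above: the same fitted probabilities give different sensitivity and specificity for each choice of c.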
Make the choice of c less arbitrary
- If we think of specificity and sensitivity as functions of c, the cut-off, then a couple of things are obvious.
- If c is chosen to be a very low value (near 0), then we will predict $\hat{Y}_i = 1$ for almost all of the observations i. As a result the sensitivity will be very high (near 1) and the specificity will be very low (near 0).
- If c is chosen to be a very high value (near 1), then we will predict $\hat{Y}_i = 0$ for nearly all observations i. As a result the sensitivity will be very low (near 0) and the specificity will be very high (near 1).
- Based on these considerations we expect sensitivity to be a monotone non-increasing function of c and specificity to be a monotone non-decreasing function of c. Fig. 1 illustrates the typical scenario.
- Fig. 1 suggests a possible strategy for choosing c. Choose the value of c that simultaneously maximizes both the sensitivity and the specificity. In the diagram this occurs where the two curves cross. At this point we estimate c = 0.436 and the specificity (sensitivity) to be equal to 0.782.
- Of course this raises the question, why this criterion? While the notion of maximizing both specificity and sensitivity is intuitively appealing it assumes that the two quantities are equally important. It's not difficult to envision situations where sensitivity might be more important than specificity, or vice versa.
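The crossing-point strategy can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the data are made up, the rule predicts 1 when the estimated probability exceeds c, and the grid search and the function names (`sens_spec`, `crossing_cutoff`) are my own choices rather than anything used to produce Fig. 1.

```python
def sens_spec(y, p_hat, c):
    """Sensitivity and specificity when we predict 1 for p_hat > c."""
    pos = [p for yi, p in zip(y, p_hat) if yi == 1]
    neg = [p for yi, p in zip(y, p_hat) if yi == 0]
    sens = sum(p > c for p in pos) / len(pos)
    spec = sum(p <= c for p in neg) / len(neg)
    return sens, spec

def crossing_cutoff(y, p_hat, grid_size=1000):
    """First cut-off on a grid where |sensitivity - specificity| is smallest."""
    best_c, best_gap = None, float("inf")
    for j in range(grid_size + 1):
        c = j / grid_size
        sens, spec = sens_spec(y, p_hat, c)
        gap = abs(sens - spec)
        if gap < best_gap:
            best_c, best_gap = c, gap
    return best_c

# Hypothetical observed responses and fitted probabilities.
y     = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
p_hat = [0.9, 0.8, 0.7, 0.55, 0.3, 0.6, 0.45, 0.4, 0.2, 0.1]

best_c = crossing_cutoff(y, p_hat)
print(best_c, sens_spec(y, p_hat, best_c))
```

With real data the two step functions may never be exactly equal, which is why the sketch looks for the smallest gap rather than an exact crossing. If sensitivity matters more than specificity, the `gap` criterion would have to be replaced by something that weights the two differently.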
Use all possible values of c simultaneously
- The standard way to use all possible cut-offs c simultaneously is through what's called a ROC curve. ROC stands for "receiver operating characteristic" and is a concept derived from signal detection theory. Its early use was in assessing the quality of radar operators who were faced with the task of differentiating noise from signals, and friend from foe.
- To understand the ROC curve we need to graphically visualize the classification process based on the logistic regression model. The goal of habitat suitability modeling is to construct a model that distinguishes good habitat from bad habitat. Essentially we attempt to take what appears to be a single population of data values and divide it statistically into two: good habitats and bad habitats.
- Fig. 2 is my attempt to visualize the final habitat suitability model. The use of bell-shaped curves in the figure is done out of convenience only; I mean nothing substantive by it. It's merely an attempt to graphically depict the confusion matrix. In each figure the closer the two bell-shaped curves are to each other, the more "confused" we are as to what constitutes good or bad habitat. Better models are those that completely separate the two populations (depicted here by two curves with minimal overlap).
Fig. 2 Visualizing habitat suitability models
- The ROC curve is used to display how changing c affects the TPR (true positive rate or sensitivity) and the FPR (false positive rate or 1 – specificity) of the decision rule. In a ROC curve TPR is plotted against FPR. All reference to c is suppressed.
- To understand how the ROC curve is constructed, consider the so-called "good" model of Fig. 2.

Fig. 3 The effect of changing c on model calibration statistics. The area under each curve is a probability
- If we select a low value for c, then nearly all true positive values will be predicted to be positive (TPR is approximately 1). At the same time most negative values will also be classified as positive. So the FPR will also be near 1.
- As c increases, the FPR will decrease faster than TPR. In fact, at least initially, FPR will decrease without TPR changing at all because we'll not yet be intersecting the Y = 1 curve (left graph in Fig. 3).
- Eventually c will increase enough that TPR will begin to decrease also (right side of Fig. 3).
- As c approaches 1 nearly all the observations will be classified as negative. As a result both the TPR and FPR will approach zero.
- If you repeat the above steps for the models labeled "bad" and "great" you'll see that the only thing that changes is the relative rates at which the TPR and FPR decrease to zero. In the "bad" model they decrease at nearly the same rate. In the "great" model, the FPR decreases almost to zero long before the TPR even starts to decrease. Fig. 4 summarizes these results.
- The examples shown in Fig. 4 were artificially generated to match the displays in Fig. 2. As a result the ranking of the models in Fig. 4 is unambiguous. The ROC curve of the "great" model always lies entirely above the ROC curves of the "good" and "bad" models.
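The ordering of the three curves can be reproduced with a small Python sketch. It assumes, purely for convenience as in Fig. 2, that scores for the two populations follow unit-variance normal curves that differ only in how far apart their means are. The separations 0.5, 1.5 and 3.0 are hypothetical values I picked to stand in for the "bad", "good" and "great" models; they are not the values used to draw the figures.

```python
from statistics import NormalDist

def tpr_at_fpr(fpr, separation):
    """TPR achieved at a given FPR when the two score curves are
    N(0, 1) for Y = 0 and N(separation, 1) for Y = 1."""
    absent = NormalDist(0.0, 1.0)
    present = NormalDist(separation, 1.0)
    c = absent.inv_cdf(1.0 - fpr)      # cut-off that yields this FPR
    return 1.0 - present.cdf(c)        # fraction of Y = 1 scores above c

# Hypothetical separations between the two bell-shaped curves.
models = {"bad": 0.5, "good": 1.5, "great": 3.0}

for fpr in (0.05, 0.10, 0.25, 0.50, 0.75):
    row = {name: round(tpr_at_fpr(fpr, d), 3) for name, d in models.items()}
    print(fpr, row)
```

At every false positive rate the "great" model has the highest true positive rate and the "bad" model the lowest, which is what it means for one ROC curve to lie entirely above another.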
- In practice the ranking of the ROC curves from different models will be less clear cut. Typically the ROC curves from different models will cross repeatedly making it difficult to say which model is best.
- A partial resolution to this problem lies in the following observation: better models have ROC curves that are closer to the left and top edges of the unit square. (See for example the "great" model in Fig. 4.) Put another way, the area under a ROC curve for a good model should be close to 1 (the area of the unit square). So, the area under the ROC curve (AUC) is a useful single number summary for comparing the ROC curves of different models. Although the ROC curves may cross, the ROC curve of the better model will enclose on average a greater area.
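Here is a minimal Python sketch of how a ROC curve and its AUC might be computed from fitted probabilities. It assumes the rule predicts 1 when the estimated probability exceeds c, that all fitted probabilities lie strictly between 0 and 1, and it uses the trapezoid rule for the area. The data and the function names are made up for illustration.

```python
def roc_points(y, p_hat):
    """(FPR, TPR) pairs as the cut-off c decreases from 1 to 0."""
    n1 = sum(y)                  # number of observed ones
    n0 = len(y) - n1             # number of observed zeros
    # Only cut-offs at the fitted probabilities (plus 0 and 1) change the table.
    cutoffs = sorted(set(p_hat) | {0.0, 1.0}, reverse=True)
    points = []
    for c in cutoffs:
        tp = sum(1 for yi, p in zip(y, p_hat) if yi == 1 and p > c)
        fp = sum(1 for yi, p in zip(y, p_hat) if yi == 0 and p > c)
        points.append((fp / n0, tp / n1))
    return points

def auc_trapezoid(points):
    """Area under the piecewise-linear curve through the ROC points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

# Hypothetical observed responses and fitted probabilities.
y     = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
p_hat = [0.9, 0.8, 0.7, 0.55, 0.3, 0.6, 0.45, 0.4, 0.2, 0.1]

points = roc_points(y, p_hat)
print(points)
print(auc_trapezoid(points))
```

The curve starts at (0, 0) when c = 1 and ends at (1, 1) when c = 0, and the cut-off itself never appears in the plotted points.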
- AUC can also be given a probabilistic interpretation.
- Suppose we have a data set in which the presence-absence variable consists of n1 ones and n0 zeros. Imagine constructing all possible n1 × n0 pairs of zeros and ones. Define the random variable Ui as follows.

$$U_i = \begin{cases} 1, & \hat{p}_i^{(1)} > \hat{p}_i^{(0)} \\ 0, & \text{otherwise} \end{cases}$$

Here $\hat{p}_i^{(1)}$ and $\hat{p}_i^{(0)}$ are the estimated probabilities (obtained from the logistic regression model) for the "presence" and "absence" observations in the ith pair. Thus Ui = 1 if the model assigns a higher probability to the "presence" observation than to the "absence" observation. When this happens the observations are said to be concordant, i.e., the model matches the data.
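- From this we can calculate the concordance index of the model, the fraction of the n1 × n0 pairs that are concordant.

$$C = \frac{1}{n_1 n_0} \sum_{i=1}^{n_1 n_0} U_i$$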
- It turns out the concordance index is equal to the AUC. Thus the AUC can be interpreted as the fraction of 0-1 pairs that the model orders correctly, i.e., pairs in which the "presence" observation receives the higher estimated probability. If AUC = 0.5 then our model is doing no better than random guessing.
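The equality of the concordance index and the AUC can be checked numerically. The Python sketch below uses made-up data with no tied probabilities (with ties, a tied pair is usually counted as one half for the equality to hold). It assumes the rule predicts 1 when the estimated probability exceeds c and computes the AUC by the trapezoid rule; the function names are my own.

```python
def concordance_index(y, p_hat):
    """Fraction of (presence, absence) pairs with the higher probability
    assigned to the presence observation."""
    pos = [p for yi, p in zip(y, p_hat) if yi == 1]
    neg = [p for yi, p in zip(y, p_hat) if yi == 0]
    u = [1 if p1 > p0 else 0 for p1 in pos for p0 in neg]  # one U per pair
    return sum(u) / len(u)

def trapezoid_auc(y, p_hat):
    """Area under the ROC curve traced out by all cut-offs from 1 down to 0."""
    n1 = sum(y)
    n0 = len(y) - n1
    cutoffs = sorted(set(p_hat) | {0.0, 1.0}, reverse=True)
    pts = []
    for c in cutoffs:
        fpr = sum(p > c for yi, p in zip(y, p_hat) if yi == 0) / n0
        tpr = sum(p > c for yi, p in zip(y, p_hat) if yi == 1) / n1
        pts.append((fpr, tpr))
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(pts, pts[1:]))

# Hypothetical observed responses and fitted probabilities (no ties).
y     = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
p_hat = [0.9, 0.8, 0.7, 0.55, 0.3, 0.6, 0.45, 0.4, 0.2, 0.1]

print(concordance_index(y, p_hat), trapezoid_auc(y, p_hat))
```

Here 21 of the 25 pairs are concordant, and the trapezoid area agrees with that fraction up to floating-point rounding.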
- A fairly arbitrary scale for interpreting AUC values has been proposed to assist in model calibration.

- Note: a wonderful online resource for visualizing ROC curves and their underlying relationship to population models is a site called The Magnificent ROC. Figures like those shown above can be found there along with applets that allow you to dynamically alter c in the decision rule and watch how the corresponding ROC curve changes.
Test the model against new data
- The third criticism of sensitivity and specificity as model metrics is that they are functions of both the model and the data used to build the model. An obvious way around this objection is to calculate these statistics using new data. There are two ways this is typically done.
- Split the data into two parts. Build the model on one part (the training set) and evaluate the model on the second part (the validation set). This approach is called validation. Its major drawback is that it does not use all of the data to build the model. Many researchers are unwilling to do this, particularly if the data are hard to come by.
- Fit the model in the usual way using all the data. Then divide the data into multiple parts called folds (10 is a common choice). Keeping the basic form of the model selected using all the data, leave out one fold at a time and re-estimate the parameters using the other nine folds combined. Use the refit model to calculate a statistic of interest, e.g., AUC, on the fold that was left out. Repeat this process until each fold has been used once for testing the model. Finally, average the value of the statistic over the different runs. (With enough folds it is also reasonable to compute a variance.) This method is called cross-validation.
- Cross-validation is very popular and there are a number of R functions in various packages available for this purpose. If the statistics calculated when using the training data do not change much when re-evaluated under validation or cross-validation, then we can be assured we have a robust model.
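The cross-validation recipe can be sketched without any packages. The Python code below is only an illustration under stated assumptions: the data are simulated, the model has a single predictor, the fit uses a hand-written Newton-Raphson routine rather than any particular R function, and the AUC is computed as the fraction of correctly ordered pairs. All names here are hypothetical.

```python
import math
import random

def expit(z):
    """Inverse logit, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def fit_logistic(x, y, iters=25):
    """Newton-Raphson fit of logit P(Y=1) = b0 + b1*x (one predictor)."""
    b0 = b1 = 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = expit(b0 + b1 * xi)
            w = p * (1.0 - p)
            g0 += yi - p               # score for b0
            g1 += (yi - p) * xi        # score for b1
            h00 += w                   # information matrix entries
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        if det < 1e-12:                # stop if the information is singular
            break
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

def auc(y, p_hat):
    """Fraction of (presence, absence) pairs ordered correctly; ties count 1/2."""
    pos = [p for yi, p in zip(y, p_hat) if yi == 1]
    neg = [p for yi, p in zip(y, p_hat) if yi == 0]
    if not pos or not neg:
        return None                    # undefined without both outcomes
    score = sum(1.0 if p1 > p0 else 0.5 if p1 == p0 else 0.0
                for p1 in pos for p0 in neg)
    return score / (len(pos) * len(neg))

def cross_validated_auc(x, y, k=10, seed=1):
    """Average held-out AUC over k folds, refitting on the other folds."""
    idx = list(range(len(y)))
    random.Random(seed).shuffle(idx)
    folds = [idx[j::k] for j in range(k)]
    aucs = []
    for test in folds:
        held_out = set(test)
        train = [i for i in idx if i not in held_out]
        b0, b1 = fit_logistic([x[i] for i in train], [y[i] for i in train])
        a = auc([y[i] for i in test], [expit(b0 + b1 * x[i]) for i in test])
        if a is not None:              # skip folds lacking one of the outcomes
            aucs.append(a)
    return sum(aucs) / len(aucs)

# Simulated data from a known logistic model (hypothetical coefficients).
rng = random.Random(0)
x = [rng.gauss(0.0, 1.0) for _ in range(200)]
y = [1 if rng.random() < expit(-0.5 + 1.5 * xi) else 0 for xi in x]

b0, b1 = fit_logistic(x, y)
apparent_auc = auc(y, [expit(b0 + b1 * xi) for xi in x])
cv_auc = cross_validated_auc(x, y)
print(round(apparent_auc, 3), round(cv_auc, 3))
```

Comparing the two printed numbers is the check described above: if the cross-validated AUC is close to the AUC computed on the training data, the statistic is not merely an artifact of evaluating the model on the data used to fit it.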
Latent variable models for binary variables
- Binary regression models were used long before the notion of a generalized linear model was introduced by Nelder and Wedderburn in 1972 and popularized by McCullagh and Nelder's book in the 1980s. The basic rationale was that underlying a dichotomous random variable Y is an unobserved continuous random variable $Y^c$. (Here the superscript c is just notation to denote the continuous version of Y.) Thus we assume the dichotomous random variable arose as follows.

$$Y = \begin{cases} 1, & Y^c > k \\ 0, & Y^c \le k \end{cases}$$

for some k. The unobserved variable $Y^c$ is also sometimes called a latent variable.
The probit model
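- Now suppose that an ordinary linear regression model holds for $Y^c$. To simplify notation suppose it is a simple linear regression model of the following form.

$$Y^c = \beta_0^c + \beta_1^c x + \varepsilon, \qquad \varepsilon \sim N(0, \sigma_c^2)$$

Here I use c as a subscript or superscript to remind us that this is a model constructed using the underlying continuous variable $Y^c$.
- We can model $P(Y = 1)$ as follows.

$$P(Y = 1) = P(Y^c > k) = P(\beta_0^c + \beta_1^c x + \varepsilon > k) = P\left(\frac{\varepsilon}{\sigma_c} > \frac{k - \beta_0^c - \beta_1^c x}{\sigma_c}\right)$$

- Observe that since $\varepsilon \sim N(0, \sigma_c^2)$, it follows that $Z = \varepsilon / \sigma_c \sim N(0, 1)$, a standard normal random variate. (We have here a normal random variable minus its mean divided by its standard deviation.) Because the standard normal distribution is symmetric about zero, $P(Z > -a) = \Phi(a)$, where $\Phi$ is the standard normal distribution function. Make the identifications $\beta_0 = (\beta_0^c - k)/\sigma_c$ and $\beta_1 = \beta_1^c/\sigma_c$ to yield the following.

$$P(Y = 1) = P\left(Z > -(\beta_0 + \beta_1 x)\right) = \Phi(\beta_0 + \beta_1 x), \qquad \text{i.e.,} \quad \Phi^{-1}\left(P(Y = 1)\right) = \beta_0 + \beta_1 x$$
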
This last expression is just the generalized linear model for a binomial random variable using a probit link.
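The latent variable argument for the probit model can be checked by simulation. The Python sketch below assumes that Y = 1 when the latent variable exceeds the threshold k, so that the probit coefficients are β0 = (β0c − k)/σc and β1 = β1c/σc. The parameter values and names are hypothetical, chosen only for illustration.

```python
import random
from statistics import NormalDist

# Hypothetical values for the latent regression.
beta0_c, beta1_c = 2.0, 0.5    # intercept and slope on the latent scale
sigma_c = 1.5                  # standard deviation of the normal errors
k = 3.0                        # threshold: Y = 1 when the latent value exceeds k

# Probit coefficients implied by the latent model (assuming Y = 1 above k).
beta0 = (beta0_c - k) / sigma_c
beta1 = beta1_c / sigma_c

def simulated_fraction(x, n=100_000, seed=42):
    """Fraction of simulated latent values above k at predictor value x."""
    rng = random.Random(seed)
    ones = 0
    for _ in range(n):
        y_latent = beta0_c + beta1_c * x + rng.gauss(0.0, sigma_c)
        if y_latent > k:
            ones += 1
    return ones / n

results = {}
for x in (-1.0, 0.0, 2.0, 4.0):
    probit_prob = NormalDist().cdf(beta0 + beta1 * x)   # Phi(beta0 + beta1*x)
    results[x] = (simulated_fraction(x), probit_prob)
    print(x, results[x])
```

At each x the simulated fraction of ones should agree with the probit probability up to simulation error. Note that only the ratios (β0c − k)/σc and β1c/σc matter, so the latent intercept, threshold, and error variance cannot be recovered separately from the binary data.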
The logistic model
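- It turns out there is something called a logistic probability distribution that in standard form has the density and distribution function shown below.

$$f(u) = \frac{e^{u}}{(1 + e^{u})^2}, \qquad F(u) = \frac{e^{u}}{1 + e^{u}}, \qquad -\infty < u < \infty$$

A logistic distribution looks very much like a normal distribution but with fatter tails. In standard form it has mean 0 and variance $\pi^2/3$. If we assume the errors of our simple linear regression model for the continuous response $Y^c$ have a logistic distribution with mean 0 and variance $\sigma_c^2$, we obtain the following.

$$P(Y = 1) = P(Y^c > k) = P\left(\frac{\varepsilon}{\sigma_c} > \frac{k - \beta_0^c - \beta_1^c x}{\sigma_c}\right)$$

where $\varepsilon/\sigma_c$ has a logistic distribution with mean 0 and variance 1. If we multiply this quantity by $\pi/\sqrt{3}$ the resulting random variable will have the standard logistic distribution. Writing $\beta_0 = \pi(\beta_0^c - k)/(\sqrt{3}\,\sigma_c)$ and $\beta_1 = \pi\beta_1^c/(\sqrt{3}\,\sigma_c)$, and using the symmetry $1 - F(-a) = F(a)$, we have the following.

$$P(Y = 1) = P\left(\frac{\pi}{\sqrt{3}} \cdot \frac{\varepsilon}{\sigma_c} > -(\beta_0 + \beta_1 x)\right) = 1 - F\left(-(\beta_0 + \beta_1 x)\right) = \frac{e^{\beta_0 + \beta_1 x}}{1 + e^{\beta_0 + \beta_1 x}}$$
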
- The last expression is the usual logistic regression model for the probability of a success. Thus both the logistic regression model and the probit model can be derived by assuming a continuous underlying distribution for the response in which the response has a logistic distribution or a normal distribution, respectively. Of course with the theory of generalized linear models at our disposal, there is no longer any need to make such assumptions.
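The logistic version of the latent variable argument can be checked by simulation in the same way. The Python sketch below assumes that Y = 1 when the latent variable exceeds the threshold k, and it uses the fact that a logistic distribution with variance σc² has scale parameter σc√3/π. The parameter values and names are hypothetical.

```python
import math
import random

# Hypothetical values for the latent regression.
beta0_c, beta1_c = 2.0, 0.5    # intercept and slope on the latent scale
sigma_c = 1.5                  # standard deviation of the logistic errors
k = 3.0                        # threshold: Y = 1 when the latent value exceeds k

# Rescaling by pi/sqrt(3) converts unit-variance errors to standard logistic.
scale = math.pi / (math.sqrt(3) * sigma_c)
beta0 = scale * (beta0_c - k)
beta1 = scale * beta1_c

def logistic_error(rng):
    """Draw a logistic error with mean 0 and variance sigma_c**2."""
    u = rng.random()
    while u == 0.0:            # random() is in [0, 1); avoid log(0)
        u = rng.random()
    return (sigma_c * math.sqrt(3) / math.pi) * math.log(u / (1.0 - u))

def simulated_fraction(x, n=100_000, seed=7):
    """Fraction of simulated latent values above k at predictor value x."""
    rng = random.Random(seed)
    ones = 0
    for _ in range(n):
        if beta0_c + beta1_c * x + logistic_error(rng) > k:
            ones += 1
    return ones / n

results = {}
for x in (-1.0, 0.0, 2.0, 4.0):
    eta = beta0 + beta1 * x
    logit_prob = math.exp(eta) / (1.0 + math.exp(eta))  # logistic regression model
    results[x] = (simulated_fraction(x), logit_prob)
    print(x, results[x])
```

The simulated fractions should match the logistic regression probabilities up to simulation error, just as the normal-error simulation matches the probit model.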