Lecture 13—Friday, February 3, 2006
What was covered?
- Observed and expected information for the Poisson model
- Information as curvature
- Likelihood ratio test
- Wald test
Terminology Defined
Example of calculating the information
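- Consider again the Poisson probability model introduced in Lecture 10. There we constructed the likelihood, loglikelihood, and the score function with the results shown below, where $x_1, \ldots, x_m$ is a random sample of size $m$.
$$L(\lambda) = \prod_{i=1}^{m} \frac{e^{-\lambda}\lambda^{x_i}}{x_i!}$$
$$\log L(\lambda) = -m\lambda + \log\lambda \sum_{i=1}^{m} x_i - \sum_{i=1}^{m} \log x_i!$$
$$\frac{d \log L}{d\lambda} = -m + \frac{1}{\lambda}\sum_{i=1}^{m} x_i$$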
- The Hessian is just the derivative of the score.
$$\frac{d^2 \log L}{d\lambda^2} = -\frac{1}{\lambda^2}\sum_{i=1}^{m} x_i$$
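- Observed information is just the negative of the Hessian evaluated at the MLE, $\hat\lambda = \bar{x}$.
$$I(\hat\lambda) = \frac{1}{\hat\lambda^2}\sum_{i=1}^{m} x_i = \frac{m\bar{x}}{\hat\lambda^2} = \frac{m}{\hat\lambda}$$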
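- Expected information is the expected value of the negative of the Hessian.
$$E\left[\frac{1}{\lambda^2}\sum_{i=1}^{m} X_i\right] = \frac{1}{\lambda^2}\sum_{i=1}^{m} E(X_i) = \frac{m\lambda}{\lambda^2} = \frac{m}{\lambda}$$
where I use the fact that the expected value of a Poisson random variable is λ.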
- If we evaluate the expected information at the maximum likelihood estimate $\hat\lambda$, we obtain
$$\frac{m}{\hat\lambda}$$
where $m$ is the sample size, so we see that the expected information and the observed information are identical in this example.
- In general, we expect the observed and expected information will be similar when sample size is large.
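The agreement between observed and expected information for the Poisson model can be checked numerically. The following is a minimal R sketch, assuming a small made-up vector of counts `x` (hypothetical, not the aphid data). It compares the closed-form observed and expected information with a finite-difference Hessian obtained from `optimHess`.

```r
# Hypothetical counts, for illustration only (not the aphid data)
x <- c(2, 4, 3, 5, 1, 3, 6, 2, 4, 3)
m <- length(x)

# The Poisson MLE is the sample mean
lambda.hat <- mean(x)

# Negative loglikelihood of the Poisson model
negloglik <- function(lambda) -sum(dpois(x, lambda, log = TRUE))

# Observed information: negative Hessian of the loglikelihood at the MLE
obs.info <- sum(x) / lambda.hat^2

# Expected information, m / lambda, evaluated at the MLE
exp.info <- m / lambda.hat

# Numerical check: finite-difference Hessian of the negative loglikelihood
num.info <- optimHess(lambda.hat, negloglik)[1, 1]

c(observed = obs.info, expected = exp.info, numerical = num.info)
```

All three values should agree, the numerical one only up to finite-difference error.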
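Calculating the variance of a maximum likelihood estimator
- For a scalar parameter the estimated variance of the MLE is the reciprocal of the information. For the Poisson model the information at the MLE is $m/\hat\lambda$, so
$$\widehat{\mathrm{Var}}(\hat\lambda) = \frac{1}{I(\hat\lambda)} = \frac{\hat\lambda}{m}$$
- Alternative calculation using other theory. Recall that for a Poisson distribution
$$E(X) = \mathrm{Var}(X) = \lambda$$
and for a random sample of size m, the variance of the sample mean is
$$\mathrm{Var}(\bar{X}) = \frac{\sigma^2}{m}$$
always. Thus, because the Poisson MLE is the sample mean, we have from first principles
$$\mathrm{Var}(\hat\lambda) = \mathrm{Var}(\bar{X}) = \frac{\lambda}{m}$$
which we would estimate with
$$\frac{\hat\lambda}{m}$$
- Thus the variance estimate obtained using likelihood methods is exactly the one we would expect from standard theory.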
- In Lecture 11 we maximized the Poisson likelihood of the aphid data set numerically and obtained a value of 14.44799 for the negative Hessian, from which we calculated the variance of the MLE by taking its reciprocal. It's interesting to compare this to the theoretical result above when using the exact value for $\hat\lambda$ of 3.46.
> out$hessian
[,1]
[1,] 14.44799
> #variance obtained numerically
> 1/out$hessian
[,1]
[1,] 0.06921377
> #theoretical variance
> 3.46/50
[1] 0.0692
Interpreting the information
- You may recall the concept of curvature, κ, from your calculus class. The formal definition of the curvature of a curve is the following.
$$\kappa = \frac{d\phi}{ds}$$
Here φ is the angle the tangent line makes with the x-axis and s is arc length. Thus curvature is the rate at which you turn (in radians per unit distance) as you walk along the curve. For a function given by the formula $y = f(x)$, its curvature (applying the definition to the function when written in parametric form) turns out to be the following.
$$\kappa = \frac{f''(x)}{\left[1 + f'(x)^2\right]^{3/2}}$$
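- What happens if we apply the curvature formula to the loglikelihood, $\log L(\theta)$? If the loglikelihood is a function of a single scalar parameter θ, then we have
$$\kappa(\theta) = \frac{\dfrac{d^2 \log L}{d\theta^2}}{\left[1 + \left(\dfrac{d \log L}{d\theta}\right)^2\right]^{3/2}}$$
- Next evaluate the curvature at the maximum likelihood estimate, $\hat\theta$.
$$\kappa(\hat\theta) = \left.\frac{\dfrac{d^2 \log L}{d\theta^2}}{\left[1 + \left(\dfrac{d \log L}{d\theta}\right)^2\right]^{3/2}}\right|_{\theta=\hat\theta}$$
- Recall how the MLE was obtained. We differentiated the loglikelihood and set the derivative equal to zero. Thus, $\hat\theta$ is the value of θ at which the score is zero, i.e.,
$$\left.\frac{d \log L}{d\theta}\right|_{\theta=\hat\theta} = 0$$
- Using this result in the curvature equation above we obtain the following.
$$\kappa(\hat\theta) = \left.\frac{d^2 \log L}{d\theta^2}\right|_{\theta=\hat\theta}$$
- Except for the sign, the Hessian at the MLE is just the observed information. If we ignore the sign (curvature can be positive or negative, but information is nonnegative) and just take the magnitude of the curvature, we see that the observed information is just the magnitude of the curvature when the curvature is evaluated at the MLE.
$$I(\hat\theta) = -\left.\frac{d^2 \log L}{d\theta^2}\right|_{\theta=\hat\theta} = \left|\kappa(\hat\theta)\right|$$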
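The link between curvature and information can be illustrated with the Poisson model. This is a rough R sketch, assuming the values $\hat\lambda = 3.46$ and a sample size of 50 quoted in these notes; the function names `score`, `hessian`, and `curvature` are hypothetical helpers introduced only for this illustration.

```r
lambda.hat <- 3.46          # MLE quoted in the notes
m <- 50                     # sample size quoted in the notes
sum.x <- m * lambda.hat     # total count implied by the MLE (MLE = sample mean)

# First and second derivatives of the Poisson loglikelihood
score <- function(lambda) sum.x / lambda - m
hessian <- function(lambda) -sum.x / lambda^2

# Signed curvature of the loglikelihood curve
curvature <- function(lambda) hessian(lambda) / (1 + score(lambda)^2)^(3/2)

# At the MLE the score vanishes, so the curvature reduces to the Hessian
kappa.hat <- curvature(lambda.hat)
obs.info <- -hessian(lambda.hat)
c(curvature = kappa.hat, information = obs.info)

# Away from the MLE the score is nonzero and the two quantities differ
c(curvature = curvature(3), neg.hessian = -hessian(3))
```

The curvature at the MLE is negative because the loglikelihood is concave there; its magnitude matches the observed information.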

So what does this tell us about the meaning of information?
- Consider the graph in Fig. 1 in which three different loglikelihoods are shown.
- They all are functions of a single parameter θ.
- They all are maximized at the same place, $\hat\theta$.
- After that though they look quite different.
- From red to black to blue we go from high to moderate to low curvature at the maximum likelihood estimate $\hat\theta$.
- Low curvature translates into a fairly flat loglikelihood. Thus in a neighborhood of $\hat\theta$ most values of θ have roughly the same loglikelihood and hence loglikelihood is not useful for discriminating one θ from another. Thus we have low information about the true value of θ.
- High curvature translates into a rapidly changing loglikelihood. Thus in a neighborhood of $\hat\theta$ even values of θ that differ by a small amount have very different loglikelihoods and hence are readily distinguishable from one another. In this case we have a lot of information about the true value of θ.
- A similar set of statements can be made about the variance of the estimator. Since for scalar θ the information and the variance are reciprocals of each other,
$$\mathrm{Var}(\hat\theta) \approx \frac{1}{I(\hat\theta)}$$
the following conclusions immediately follow.
- Low information means high variance of our estimator. Hence confidence intervals for θ are wide.
- High information means low variance of our estimator. Hence confidence intervals for θ are narrow.
- The table below summarizes these results in a more succinct form.

| Curvature | Information | Variance of $\hat\theta$ | Confidence interval for θ |
|---|---|---|---|
| high | high | low | narrow |
| low | low | high | wide |
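The pattern in the table can be illustrated numerically with the Poisson model, where the information at the MLE is the sample size divided by $\hat\lambda$. The following R sketch assumes $\hat\lambda = 3.46$ from these notes; only the sample size of 50 comes from the notes, while 10 and 250 are hypothetical values added for comparison.

```r
lambda.hat <- 3.46        # MLE quoted in the notes
m <- c(10, 50, 250)       # 50 is from the notes; 10 and 250 are hypothetical

info <- m / lambda.hat    # information (magnitude of curvature at the MLE)
variance <- 1 / info      # variance of the MLE is the reciprocal of the information

# Width of a 95% Wald confidence interval for lambda
ci.width <- 2 * qnorm(0.975) * sqrt(variance)

data.frame(m, info, variance, ci.width)
```

As the information increases down the rows, the variance and the interval width both shrink.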
Likelihood Ratio Test
- The likelihood ratio test is to likelihood analysis as ANOVA (more properly partial F-tests) is to ordinary linear regression. The likelihood ratio test is used to compare nested models. Nested models share the same probability generating mechanism and share all the same parameters, except in one of the models one or more of the parameters are set to specific values (usually zero), while in the other model the same parameters are estimated.
- So in the usual situation we have a restricted model (one in which some parameters are set to specific values) and a second less restricted model (one in which these same parameters are estimated). Let θ1 denote the set of estimated parameters from the less restricted model and θ2 denote the set of estimated parameters for the restricted model. Because the models are nested the parameter set θ2 is a subset of the parameter set θ1. The likelihood ratio test takes the following form, with each likelihood evaluated at its maximum likelihood estimates.
$$LR = 2\left[\log L(\hat\theta_1) - \log L(\hat\theta_2)\right] = -2 \log \frac{L(\hat\theta_2)}{L(\hat\theta_1)}$$
It turns out that, asymptotically and when the restricted model is correct,
$$LR \sim \chi^2_p$$
where the degrees of freedom p is the difference in the number of estimated parameters in the two models.
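- The chi-squared distribution with p degrees of freedom is a special type of gamma distribution in which the shape parameter is $p/2$ and the scale parameter is 2 (equivalently, the rate parameter is $1/2$).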
- Let's consider the special case when there is only a single parameter θ and the restricted model specifies a specific value for this parameter, θ0. (A common case would be when θ0 = 0.) Then we have
$$LR = 2\left[\log L(\hat\theta) - \log L(\theta_0)\right]$$
Fig. 2 illustrates the geometry of the test. Observe that the LR test measures closeness on the θ-axis by how close the values are on the loglikelihood axis (after being mapped there by the loglikelihood function). In the LR test two values of θ are close only if their loglikelihoods are close. The chi-squared distribution provides the absolute scale for measuring closeness on the loglikelihood axis.
- The role the loglikelihood curve plays becomes even clearer when we compare two different loglikelihoods, perhaps arising from different data sets, that yield the same maximum likelihood estimate of θ, but have different curvatures. Fig. 3 illustrates such a situation.
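- Two scenarios are illustrated. If we are interested in testing
$$H_0: \theta = \theta_0$$
then scenario B gives us far more information for rejecting the null hypothesis than does scenario A. Observe from Fig. 3 that
$$\log L_B(\hat\theta) - \log L_B(\theta_0) > \log L_A(\hat\theta) - \log L_A(\theta_0)$$
where $L_A$ and $L_B$ denote the likelihoods under scenarios A and B.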
- So even though the distance $|\hat\theta - \theta_0|$ is the same for both scenarios, the distances on the loglikelihood scale are different. Using the LR test we are far more likely to reject the null hypothesis under scenario B than under scenario A.
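For the single-parameter Poisson model the likelihood ratio statistic can be computed from summary statistics alone, because the factorial term of the loglikelihood cancels in the difference. The following is a minimal R sketch, assuming $\hat\lambda = 3.46$ and a sample size of 50 as quoted in these notes; the null value `lambda0 = 3` is hypothetical and chosen only for illustration.

```r
lambda.hat <- 3.46          # MLE quoted in the notes
m <- 50                     # sample size quoted in the notes
sum.x <- m * lambda.hat     # total count implied by the MLE (MLE = sample mean)
lambda0 <- 3                # hypothetical null value, for illustration only

# Poisson loglikelihood up to the constant -sum(log(x!)),
# which is the same for every lambda and cancels in the LR statistic
loglik <- function(lambda) sum.x * log(lambda) - m * lambda

# LR statistic: twice the loglikelihood distance between the MLE and the null value
LR <- 2 * (loglik(lambda.hat) - loglik(lambda0))

# Compare to a chi-squared distribution with 1 degree of freedom
p.value <- pchisq(LR, df = 1, lower.tail = FALSE)

c(LR = LR, p.value = p.value)
```

Because one parameter is fixed under the null and estimated under the alternative, the reference distribution has a single degree of freedom.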
Wald Test
- The Wald test attempts to use the distance $|\hat\theta - \theta_0|$ as a way of testing the null hypothesis
$$H_0: \theta = \theta_0$$
As we've seen in Fig. 3, though, this distance is not enough. We also need to take into account the curvature of the loglikelihood. In Fig. 3 the more informative scenario B is the one with the greater curvature. In the Wald test we weight the distance on the θ-axis by the curvature of the loglikelihood curve. Formally, the square of the Wald statistic, W, is the following.
$$W^2 = (\hat\theta - \theta_0)^2 \left|\kappa(\hat\theta)\right| = (\hat\theta - \theta_0)^2 \, I(\hat\theta)$$
where in the second equality I make use of the relationship between curvature and information described previously. Taking square roots of both sides we have the Wald statistic.
$$W = (\hat\theta - \theta_0)\sqrt{I(\hat\theta)}$$
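- Recall though that $\mathrm{Var}(\hat\theta) = 1/I(\hat\theta)$. Thus the Wald statistic W can also be written as
$$W = \frac{\hat\theta - \theta_0}{\sqrt{\mathrm{Var}(\hat\theta)}} = \frac{\hat\theta - \theta_0}{\mathrm{SE}(\hat\theta)}$$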
- Now under the null hypothesis, $\theta_0$ is the true value of θ. From theory we also know that the MLE is asymptotically unbiased. So asymptotically at least, if the null hypothesis is true then $E(\hat\theta) = \theta_0$. Thus asymptotically the expression for W given above is a z-score: a statistic minus its mean divided by its standard error.
- Since the MLE is asymptotically normally distributed (Lecture 12), it follows that under the null hypothesis W, being a z-score, has an asymptotic standard normal distribution. This provides the basis for the Wald test as well as the Wald confidence interval that we constructed in Lecture 11.
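The Wald test and Wald confidence interval can be sketched in a few lines of R for the Poisson model. This example assumes $\hat\lambda = 3.46$ and a sample size of 50 as quoted in these notes; the null value `lambda0 = 3` is hypothetical and used only for illustration.

```r
lambda.hat <- 3.46          # MLE quoted in the notes
m <- 50                     # sample size quoted in the notes
lambda0 <- 3                # hypothetical null value, for illustration only

# Information at the MLE and the implied standard error
info <- m / lambda.hat
se <- sqrt(1 / info)

# Wald statistic: distance on the parameter axis scaled by the standard error
W <- (lambda.hat - lambda0) / se

# Two-sided p-value from the standard normal distribution
p.value <- 2 * pnorm(-abs(W))

# 95% Wald confidence interval for lambda
ci <- lambda.hat + c(-1, 1) * qnorm(0.975) * se

c(W = W, p.value = p.value, lower = ci[1], upper = ci[2])
```

The test and the interval are two views of the same calculation: the null value falls inside the 95% interval exactly when the two-sided p-value exceeds 0.05.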