Lecture 12—Wednesday, February 1, 2006
What was covered?
- Properties of maximum likelihood estimators (MLEs)
- Asymptotic variance and distribution of maximum likelihood estimators
- The information matrix
Terminology Defined
Properties of maximum likelihood estimators (MLEs)
- The near-universal popularity of maximum likelihood estimation derives from the fact that the estimators it produces have good properties, and those properties improve as sample size increases. Contrast this with method of moments estimators (no guarantee of properties, good or bad) and least squares estimators (good properties only if we restrict ourselves to unbiased linear estimators, i.e., those constructed by taking linear combinations of the data, and a statistical theory that depends on being able to assume a data-generating process based on normal errors).
- Nearly all of the properties of maximum likelihood estimators are asymptotic, i.e., they only kick in once the sample size is sufficiently large. How large is large enough varies from case to case.
- In what follows I use the notation:
$\hat{\theta}_n$ = the maximum likelihood estimate of θ based on a sample of size n.
Some of the not so nice properties
- Maximum likelihood estimators are often biased (although not in the Poisson example done in lecture 10).
- Maximum likelihood estimators need not be unique.
- Maximum likelihood estimators may not exist.
- Maximum likelihood estimators can be difficult to derive. In all but the simplest cases they need to be approximated numerically. Numerical methods can be very sensitive to initial guesses for the parameter estimates and may fail to converge.
- Testing that the critical point corresponds to a maximum can be painful even for simple scenarios. The possibility of obtaining local maxima rather than global maxima is quite real.
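As a small illustration of non-uniqueness, here is a minimal sketch using a standard textbook case that is not from this lecture: observations drawn from a uniform distribution on the interval from θ to θ + 1. The function name `uniform_likelihood` and the sample values are made up for the example. Every observation contributes a density of 1 when it falls inside the interval, so the likelihood is flat at its maximum over a whole range of θ values.

```python
def uniform_likelihood(theta, data):
    """Likelihood of theta for i.i.d. data assumed to be Uniform(theta, theta + 1)."""
    # Each observation has density 1 inside [theta, theta + 1] and 0 outside,
    # so the likelihood is 1 if all observations fit in the interval, else 0.
    return 1.0 if all(theta <= x <= theta + 1 for x in data) else 0.0


data = [2.3, 2.5, 2.9]  # hypothetical sample, for illustration only

# Any theta between max(data) - 1 and min(data) attains the maximum likelihood.
lower, upper = max(data) - 1, min(data)

for theta in (1.8, 2.0, 2.2, 2.4):
    print(theta, uniform_likelihood(theta, data))
```

Both θ = 2.0 and θ = 2.2 give the maximal likelihood of 1, so there is no single maximum likelihood estimate for this sample.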
A few of the nice properties
This is an abbreviated list since many of the properties of MLEs would not make sense to you without additional statistical background. Even some of the ones I list here may seem puzzling to you. The most important properties for practitioners are the fourth and fifth, which give the asymptotic variance and the asymptotic distribution of maximum likelihood estimators.
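- $\hat{\theta}_n$ is a consistent estimator of θ. This means that for every $\varepsilon > 0$,

$$\lim_{n \to \infty} P\left(|\hat{\theta}_n - \theta| > \varepsilon\right) = 0$$

Thus the maximum likelihood estimate approaches the population value as sample size increases.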
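- $\hat{\theta}_n$ is asymptotically unbiased, i.e., $E[\hat{\theta}_n] \to \theta$ as $n \to \infty$. In other words, maximum likelihood estimators may be biased, but the bias disappears as the sample size increases. As an example, for $X_1, X_2, \ldots, X_n$ a random sample from a normal distribution with mean μ and variance $\sigma^2$, the maximum likelihood estimator of $\sigma^2$ is

$$\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n}\left(X_i - \bar{X}\right)^2$$

This estimator is biased, since $E[\hat{\sigma}^2] = \frac{n-1}{n}\sigma^2$, which is why we typically use the sample variance

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}\left(X_i - \bar{X}\right)^2$$

as the estimator instead because it is unbiased. But notice that $\hat{\sigma}^2 = \frac{n-1}{n}s^2$, so the difference between these two estimators becomes negligible as n gets large.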
- $\hat{\theta}_n$ is asymptotically efficient, i.e., among all asymptotically unbiased estimators it has the minimum asymptotic variance. In other words, maximum likelihood estimators tend to be the most precise estimators possible.
- The variance of $\hat{\theta}_n$ is known (at least asymptotically). For n large,

$$\mathrm{Var}(\hat{\theta}_n) \approx I_n(\theta)^{-1}$$

where $I_n(\theta)^{-1}$ is the inverse of the information matrix (based on a sample of size n). I explain what the information matrix is in the next section. The important fact here is that the standard error of a maximum likelihood estimator can be calculated.
- $\hat{\theta}_n$ is asymptotically normally distributed. So we even know what the sampling distribution of a maximum likelihood estimator looks like, at least for large n.
- Likelihood theory is one of the few places where Bayesians and frequentists agree on something. Both believe that likelihood is where it's at. They diverge in that frequentists focus on maximum likelihood estimation, while a Bayesian would be interested in constructing something called the posterior distribution.
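To make the fourth and fifth properties concrete, here is a minimal sketch for a Poisson sample, where the maximum likelihood estimate of the rate λ is the sample mean. It assumes the standard result that the information for n independent Poisson observations is n/λ, so the asymptotic variance is λ/n. The function name `poisson_mle_summary`, the counts, and the choice of a 95% normal-based interval (z = 1.96) are illustrative assumptions, not something taken from the lecture.

```python
import math


def poisson_mle_summary(counts, z=1.96):
    """Return (MLE, standard error, normal-based interval) for a Poisson rate."""
    n = len(counts)
    lam_hat = sum(counts) / n          # MLE of lambda is the sample mean
    info = n / lam_hat                 # information for n observations, evaluated at the MLE
    se = math.sqrt(1.0 / info)         # standard error = sqrt of inverse information
    # Asymptotic normality justifies an interval of the form estimate +/- z * SE.
    interval = (lam_hat - z * se, lam_hat + z * se)
    return lam_hat, se, interval


counts = [3, 5, 2, 4, 6, 1, 3, 4]      # hypothetical counts, for illustration only
lam_hat, se, interval = poisson_mle_summary(counts)
print(lam_hat, se, interval)
```

With only eight observations the normal approximation is rough at best; the interval is meant to show the mechanics, not to be trusted at this sample size.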
The information matrix
- We've already defined the score function as being the first derivative of the loglikelihood. If there is more than one parameter so that θ is a vector, then we speak of the score vector whose components are the first partial derivatives of the loglikelihood.
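- If θ is a vector of parameters, then the matrix of second partial derivatives of the loglikelihood is called the Hessian matrix. Writing $\log L(\theta)$ for the loglikelihood, its entry in row i and column j is

$$H(\theta)_{ij} = \frac{\partial^2 \log L(\theta)}{\partial \theta_i \, \partial \theta_j}$$

If there is only a single parameter θ, then the Hessian is a scalar function.

$$H(\theta) = \frac{d^2 \log L(\theta)}{d\theta^2}$$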
The information matrix is defined in terms of the Hessian.
- There are two commonly used information matrices: the observed information and the expected information.
1. The observed information is just the negative of the Hessian evaluated at the maximum likelihood estimate.
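The observed information can be checked numerically in a one-parameter case. The sketch below assumes a Poisson model with made-up counts, approximates the second derivative of the loglikelihood at the maximum likelihood estimate with a central finite difference, and compares the result with the analytic value: the Poisson loglikelihood has second derivative $-\sum x_i / \lambda^2$, so the observed information at $\hat{\lambda} = \bar{x}$ is $n / \hat{\lambda}$. The function names and the step size `h` are illustrative choices.

```python
import math


def poisson_loglik(lam, counts):
    """Poisson loglikelihood for rate lam given a list of counts."""
    n = len(counts)
    # log L = sum(x) * log(lam) - n * lam - sum(log(x!)), with log(x!) = lgamma(x + 1)
    return sum(counts) * math.log(lam) - n * lam - sum(math.lgamma(c + 1) for c in counts)


def observed_information(loglik, theta_hat, h=1e-4):
    """Negative second derivative of loglik at theta_hat (central finite difference)."""
    second = (loglik(theta_hat + h) - 2.0 * loglik(theta_hat) + loglik(theta_hat - h)) / h**2
    return -second


counts = [3, 5, 2, 4, 6, 1, 3, 4]            # hypothetical counts, for illustration only
lam_hat = sum(counts) / len(counts)          # Poisson MLE is the sample mean

numeric_info = observed_information(lambda lam: poisson_loglik(lam, counts), lam_hat)
analytic_info = sum(counts) / lam_hat**2     # equals n / lam_hat at the MLE

print(numeric_info, analytic_info)
```

The two values should agree to several decimal places. In multi-parameter problems the same idea applies entry by entry to the Hessian matrix, although in practice a numerical optimizer usually supplies the Hessian.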