Lecture 15—Wednesday, February 14, 2007
What was covered?
- The loglikelihood
- Properties of maximum likelihood estimators (MLEs)
- Asymptotic variance and distribution of maximum likelihood estimators
- The information matrix
Maximum likelihood estimation (continued)
- For both practical and theoretical reasons, it is preferable to work with the natural logarithm of the likelihood function, i.e., the loglikelihood. Starting with a generic probability model and proceeding to our independent Poisson model, the loglikelihood takes the following form. For a generic model with density f and independent observations x_1, ..., x_m,
log L(θ) = log ∏ f(x_i; θ) = Σ log f(x_i; θ),
and for the independent Poisson model,
log L(λ) = Σ [x_i log λ − λ − log(x_i!)] = (Σ x_i) log λ − mλ − Σ log(x_i!).
- Using R we can write the loglikelihood for the aphids data set as follows.
poisloglike <- function(lambda) sum(log(dpois(aphids[,1], lambda)))
or using sapply
poisloglike <- function(lambda) sum(sapply(aphids[,1], function(x) log(dpois(x, lambda))))
The two are equivalent here because lambda is a scalar, so dpois vectorizes over the column of counts. Generally the sapply version is safer when the function being applied is not guaranteed to be vectorized.
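As an aside, dpois also accepts a log argument that computes the log density directly. A minimal sketch, using made-up counts standing in for aphids[,1] (the actual data set is not reproduced here):

```r
# Hypothetical counts standing in for aphids[,1]
counts <- c(3, 0, 2, 5, 1, 4, 2, 3)

# dpois(..., log=TRUE) computes the log density directly, which avoids
# taking log() of a probability that may already have underflowed to 0
poisloglike <- function(lambda) sum(dpois(counts, lambda, log=TRUE))

poisloglike(2)             # loglikelihood at lambda = 2
poisloglike(mean(counts))  # the loglikelihood is largest at the sample mean
```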
- Recall some of the basic properties of the logarithm function. For positive numbers a and b, and real number n we have the following.
- log(ab) = log a + log b, so the logarithm turns multiplication into addition.
- log(a/b) = log a − log b, so the logarithm turns division into subtraction.
- log(a^n) = n log a, so the logarithm turns exponentiation into scalar multiplication.
Using these properties on our Poisson probability model we obtain the following.
log L(λ) = log ∏ (e^(−λ) λ^(x_i) / x_i!) = Σ [x_i log λ − λ − log(x_i!)]
- Why use the loglikelihood rather than the likelihood?
- Since the logarithm is a monotone increasing function, the likelihood and the loglikelihood will achieve their maximum at exactly the same place. So, nothing is lost by doing this.
- For hand calculations the loglikelihood is far easier to work with since it converts products into sums.
- All of the theoretical results concerning maximum likelihood estimators are based on the loglikelihood.
- Using loglikelihoods increases the numerical stability of parameter estimates. Because likelihoods arise from joint probabilities that, under independence, factor into a product of marginal probabilities, the magnitude of the likelihood can be quite small, often very close to zero. With a large number of observations this value can even approach the machine zero of the computing device being used, leading to numerical problems. Log-transforming the likelihood converts these tiny probabilities into moderately large negative numbers thus eliminating numerical instability.
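A quick illustration of the underflow problem, using simulated Poisson data (the particular sample size and lambda here are arbitrary choices for demonstration):

```r
set.seed(1)
x <- rpois(2000, lambda=3)  # simulated counts

# The likelihood as a raw product underflows to machine zero...
prod(dpois(x, 3))
# ...while the loglikelihood is a perfectly manageable negative number
sum(dpois(x, 3, log=TRUE))
```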
Maximizing the loglikelihood
- Maximizing a loglikelihood can be done in various ways.
- Graphically by plotting the loglikelihood and estimating where the peak occurs. (We'll do this in lab.)
- Algebraically by using calculus. This is an option only for fairly simple problems.
- Numerically using special optimization routines. (We'll also do this in lab using R's nlm function.)
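To preview the numerical route, here is a sketch of how nlm might be used on simulated Poisson data (the data and starting value are made up; details in lab). Since nlm minimizes, we hand it the negative loglikelihood:

```r
set.seed(10)
x <- rpois(50, lambda=4)  # simulated counts

# nlm minimizes, so supply the NEGATIVE loglikelihood
negloglike <- function(lambda) {
  if (lambda <= 0) return(Inf)  # keep the search away from invalid lambda
  -sum(dpois(x, lambda, log=TRUE))
}
out <- nlm(negloglike, p=1)
out$estimate  # numerical MLE
mean(x)       # agrees with the analytic MLE, the sample mean
```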
An illustration of maximizing the loglikelihood using calculus (Note: not done in class)
- Since this is a simple problem I proceed to maximize the loglikelihood using calculus.
- From calculus we know that all local maxima occur at points where the first derivative is equal to zero, the so-called critical points. So, I differentiate the loglikelihood constructed above with respect to λ and set the result equal to zero. The derivative of the loglikelihood is called the score function.
S(λ) = d/dλ log L(λ) = (Σ x_i)/λ − m
- Setting the score function to zero and solving for λ yields the maximum likelihood estimate of λ.
(Σ x_i)/λ − m = 0  ⟹  λ̂ = (1/m) Σ x_i = x̄
- To verify that this is a maximum we can check the sign of the second derivative at the putative maximum.
d²/dλ² log L(λ) = −(Σ x_i)/λ²
- Plugging in the value of λ we obtained above yields
−(Σ x_i)/λ̂² = −m²/(Σ x_i),
which is negative because x1, x2, ... , xm are counts, so as long as at least one count is positive we have Σ x_i > 0. Since the second derivative is negative at the critical point we know that the critical point corresponds to a local maximum. Because the second derivative is actually negative everywhere (for λ > 0) it follows that the loglikelihood is concave down everywhere with a single local maximum. Hence the local maximum is actually a global maximum.
- Thus λ̂ = x̄, the sample mean, is the maximum likelihood estimate of λ.
- Remark: Recall that λ is the mean of the Poisson distribution. The "natural" estimator (the plug-in or method of moments estimator) of λ is the sample mean. Thus it is reassuring that the maximum likelihood estimator of λ turned out to be the sample mean too. In fact it is often the case that maximum likelihood estimators are the natural estimators of the parameters in question.
The loglikelihood as calculated in statistical packages
- Observe that when the score function was constructed by differentiating the loglikelihood, any term that did not contain the parameter λ disappeared because its derivative was zero.
- The term that disappeared in our calculations was −Σ log(x_i!). This corresponds in the original likelihood to the term 1/(x_1! x_2! ⋯ x_m!).
- Since terms that don't contain model parameters contribute nothing to the process of finding the maximum likelihood estimate, they are often dropped. Formally we could write
L(θ) = k(x) · L*(θ)
and then just drop the term k(x) from further consideration, treating L*(θ) as if it were the actual likelihood. In our Poisson example above
log L*(λ) = (Σ x_i) log λ − mλ  with  k(x) = 1/(x_1! x_2! ⋯ x_m!),
and our solution for the maximum likelihood estimate of λ would not change if we had carried out all our calculations on L*(λ).
- SAS does this in most of its procedures (Procs). So the loglikelihood that is reported there is the value of the reduced loglikelihood, the one with the k(x) term dropped, at the maximum likelihood estimate. R on the other hand reports the value of the full loglikelihood.
- Clearly this doesn't matter for what we've done so far, but later on we'll want to use the loglikelihood to compare different models (using something called AIC).
- If the different models involve different data generating mechanisms (e.g., some assume a Poisson model and others assume a negative binomial model), then reported "loglikelihoods" that omit the k(x) term will not be comparable, because each probability model drops a different k(x). For model comparison across different kinds of probability models the k(x) term is crucial.
- If the models in question do share the same data generating mechanism (e.g., all models being compared are Poisson models) then the absence of k(x) in the likelihood won't matter.
- To distinguish these two situations we'll denote a likelihood that includes the k(x) term as the full likelihood.
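As a sketch of the distinction, the following compares the full Poisson loglikelihood with a reduced version that drops the term not involving λ (the counts here are made up):

```r
x <- c(3, 0, 2, 5, 1, 4)  # hypothetical counts

full    <- function(lambda) sum(dpois(x, lambda, log=TRUE))
# Reduced loglikelihood: the log(x!) term, which contains no lambda, is dropped
reduced <- function(lambda) sum(x)*log(lambda) - length(x)*lambda

# The two differ only by the constant log k(x) = -sum(log(x!)) ...
full(2) - reduced(2)
full(3) - reduced(3)  # same constant at any lambda

# ... so both peak at the same lambda (the sample mean, 2.5)
grid <- seq(0.5, 6, by=0.01)
grid[which.max(sapply(grid, full))]
grid[which.max(sapply(grid, reduced))]
```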
Properties of maximum likelihood estimators (MLEs)
- The near universal popularity of maximum likelihood estimation derives from the fact that the estimators we obtain have good properties, properties that get better as sample size increases. Contrast this with method of moments estimators (no guarantee of properties, good or bad) and least squares estimators (good properties only if we restrict ourselves to unbiased linear estimators, those constructed by taking linear combinations of the data, and whose statistical theory depends on being able to assume a data generating process based on normal errors).
- Nearly all of the properties of maximum likelihood estimators are asymptotic, i.e., they only kick in once sample size is sufficiently large. How large is large will vary on a case by case basis.
- In what follows I use the notation θ̂_n to represent the maximum likelihood estimate of θ based on a sample of size n.
Some of the not so nice properties
- Maximum likelihood estimators are often biased (although not in the Poisson example done in lecture 10).
- Maximum likelihood estimators need not be unique.
- Maximum likelihood estimators may not exist.
- Maximum likelihood estimators can be difficult to derive. In all but the simplest cases they need to be approximated numerically. Numerical methods can be very sensitive to initial guesses for the parameter estimates and may fail to converge.
- Testing that the critical point corresponds to a maximum can be painful even for simple scenarios. The possibility of obtaining local maxima rather than global maxima is quite real.
A few of the nice properties
This is an abbreviated list since many of the properties of MLEs would not make sense to you without additional statistical background. Even some of the ones I list here may seem puzzling to you. The most important properties for practitioners are the fourth and fifth, which give the asymptotic variance and the asymptotic distribution of maximum likelihood estimators.
- θ̂_n is a consistent estimator of θ. This means that for any ε > 0,
P(|θ̂_n − θ| > ε) → 0 as n → ∞.
Thus the maximum likelihood estimate approaches the population value as sample size increases.
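A quick simulation suggesting consistency in the Poisson case, where the MLE is the sample mean (the seed, sample sizes, and lambda are arbitrary, so this is only illustrative):

```r
set.seed(3)
x <- rpois(1e5, lambda=2)  # one long simulated Poisson sample

# The MLE computed from the first n observations settles near the true lambda = 2
sapply(c(10, 100, 1000, 1e5), function(n) mean(x[1:n]))
```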
- θ̂_n is asymptotically unbiased, i.e., E(θ̂_n) → θ as n → ∞. In other words, maximum likelihood estimators may be biased, but the bias disappears as the sample size increases. As an example, for a random sample from a normal distribution with mean μ and variance σ², the maximum likelihood estimator of σ² is
σ̂² = (1/n) Σ (x_i − x̄)².
This estimator is biased, which is why we typically use the sample variance
s² = (1/(n − 1)) Σ (x_i − x̄)²
as the estimator instead because it is unbiased. But notice that the difference between these two estimators becomes insignificant as n gets large, since σ̂² = ((n − 1)/n) s².
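A small check of this in R, using simulated normal data (the sample sizes and sd are arbitrary choices):

```r
set.seed(42)
mle.var <- function(x) mean((x - mean(x))^2)  # MLE of sigma^2: divides by n

x.small <- rnorm(10,  mean=0, sd=2)
x.large <- rnorm(1e4, mean=0, sd=2)

# var() divides by n - 1, so the gap between the two estimators is s^2/n
var(x.small) - mle.var(x.small)  # noticeable gap when n = 10
var(x.large) - mle.var(x.large)  # negligible gap when n = 10000
```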
- is asymptotically efficient, i.e., among all asymptotically unbiased estimators it has the minimum asymptotic variance. In other words, maximum likelihood estimators tend to be the most precise estimators possible.
- The variance of θ̂_n is known (at least asymptotically). For n large,
Var(θ̂_n) ≈ [I_n(θ)]⁻¹,
where [I_n(θ)]⁻¹ is the inverse of the information matrix (based on a sample of size n). I explain what the information matrix is in the next section. The important fact here is that the standard error of a maximum likelihood estimator can be calculated.
- is asymptotically normally distributed. So we even know what the sampling distribution of a maximum likelihood estimator looks like, at least for large n.
- Likelihood theory is one of the few places where Bayesians and frequentists agree on something. Both believe that likelihood is where it's at. They diverge in that frequentists focus on maximum likelihood estimation, while a Bayesian would be interested in constructing something called the posterior distribution.
The information matrix
- We've already defined the score function as being the first derivative of the loglikelihood. If there is more than one parameter so that θ is a vector, then we speak of the score vector whose components are the first partial derivatives of the loglikelihood.
- If θ is a vector of parameters, then the matrix of second partial derivatives of the loglikelihood is called the Hessian matrix.
H(θ) = [∂² log L(θ) / ∂θ_i ∂θ_j]
If there is only a single parameter θ, then the Hessian is a scalar function.
H(θ) = d² log L(θ)/dθ²
The information matrix is defined in terms of the Hessian.
- The observed information is just the negative of the Hessian evaluated at the maximum likelihood estimate.
I(θ̂) = −H(θ̂)
For the case of a single parameter θ we have
I(θ̂) = −d² log L(θ)/dθ² evaluated at θ = θ̂.
- As noted above to obtain the asymptotic variance of the maximum likelihood estimates we need to invert this quantity. Thus we'll need to take either the reciprocal of the negative Hessian (for a single parameter) or the inverse of the negative of the Hessian matrix (for a vector of parameters).
- Statistical software (for example, the nlm function in R) typically outputs the Hessian evaluated at the mle or what we're calling the observed information.
- Note: because nlm does minimization our objective function is the negative loglikelihood. Since the negative sign has already been introduced as part of the objective function, the Hessian produced by nlm already has the negative sign included. Because of this nlm returns I(θ̂), the observed information, in our notation.
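A sketch putting the pieces together with nlm on simulated Poisson data (hessian=TRUE asks nlm for the Hessian of the objective at the minimum; the data are made up):

```r
set.seed(7)
x <- rpois(200, lambda=3)  # simulated counts

negloglike <- function(lambda) -sum(dpois(x, lambda, log=TRUE))
out <- nlm(negloglike, p=mean(x), hessian=TRUE)

# Because the objective is the NEGATIVE loglikelihood, out$hessian is already
# the observed information; its inverse is the asymptotic variance
se.numeric  <- sqrt(1/out$hessian)
# For the Poisson model the observed information works out to n/lambda.hat,
# so the analytic standard error is sqrt(lambda.hat/n)
se.analytic <- sqrt(out$estimate/length(x))
c(se.numeric, se.analytic)
```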
Books and Articles on likelihood
- Azzalini, Adelchi. 1996. Statistical inference: based on the likelihood. New York: Chapman & Hall.
- Devore, Jay L. 1995. Probability and Statistics for Engineering and the Sciences. Pacific Grove, CA: Duxbury Press. General discussion of maximum likelihood estimation with examples, pp 265–271.
- Edwards, A. W. F. 1992. Likelihood. Baltimore: Johns Hopkins University Press.
- Eliason, Scott R. 1993. Maximum likelihood estimation: logic and practice. Newbury Park, CA: Sage Publications.
- Hilborn, Ray and Marc Mangel. 1997. The ecological detective: confronting models with data. Princeton, NJ: Princeton University Press.
- Krebs, Charles J. 1999. Ecological Methodology. Menlo Park, CA: Addison-Wesley. A number of fairly advanced applications of maximum likelihood estimation appear on pp 84–88, 91, 128–131, 525–529.
- Larsen, Richard J. and Marx, Morris L. 1981. An Introduction to Mathematical Statistics and Its Applications. Englewood Cliffs, New Jersey: Prentice-Hall. General discussion of maximum likelihood estimation with examples, pp 212–218.
- McCallum, Hamish. 2000. Population Parameters: Estimation of Ecological Models. Oxford, England: Blackwell Science. General discussion of maximum likelihood estimation with examples, pp 29–44.
- Myung, In Jae. 2001. Tutorial on maximum likelihood estimation. Journal of Mathematical Psychology 47: 90–100. UNC has access to this journal electronically but it is not listed on the UNC library e-journal pages. Instead go directly to www.sciencedirect.com to access it.
- Pawitan, Yudi. 2001. In all likelihood: statistical modelling and inference using likelihood. New York: Oxford University Press.
- Roff, Derek A. 2006. Introduction to Computer-Intensive Methods of Data Analysis in Biology. New York: Cambridge University Press. Chapter 2 covers maximum likelihood estimation. Uses S-Plus (code also works in R).
- Royall, Richard M. 1997. Statistical evidence: a likelihood paradigm. New York: Chapman & Hall.
- Severini, Thomas A. 2000. Likelihood methods in statistics. New York: Oxford University Press.
- Sorensen, Daniel. 2002. Likelihood, Bayesian and MCMC methods in quantitative genetics. New York: Springer-Verlag.