Lecture 10 - Monday, January 30, 2006
What was covered?
- Introduction to maximum likelihood estimation
Constructing the Likelihood
- I introduce the concept of likelihood through a motivating example. The data for this example appear in Krebs (1999), and we will carry out the example in the computer lab following the outline developed here.
- Suppose we obtain a random sample of m shoots. On each shoot we count the number of aphids present. The number of aphids observed on a given shoot is a random variable (it has a probability distribution). Denote it by the symbol X. In our random sample then we observe the values of m random variables, X1, X2, ... , Xm, one for each shoot in our sample.
- We observe x1 aphids on shoot 1, x2 aphids on shoot 2, etc. (I use the standard statistical convention of using capital letters for random variables and lowercase letters for their observed values.)
- What was the probability of obtaining the data we collected? The notation for this probability is
$$P(X_1 = x_1, X_2 = x_2, \ldots, X_m = x_m),$$
which is called a joint probability, the probability of simultaneously observing all m events. Another way of writing this is
$$P(X_1 = x_1 \cap X_2 = x_2 \cap \cdots \cap X_m = x_m).$$
- Since we have a random sample, these events are independent of one another. From elementary probability theory, if events A and B are independent then we have
$$P(A \cap B) = P(A)\,P(B).$$
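- Applying this to our data we have
$$P(X_1 = x_1, X_2 = x_2, \ldots, X_m = x_m) = P(X_1 = x_1)\,P(X_2 = x_2) \cdots P(X_m = x_m) = \prod_{i=1}^{m} P(X_i = x_i),$$
where I use product notation in the last step.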
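- For illustration assume each of the probability terms in this product is a Poisson probability with mean λ. Thus we will assume that a Poisson distribution is a sensible model for the counts of aphids on shoots. Plugging these in and regrouping terms yields the following.
$$\prod_{i=1}^{m} P(X_i = x_i) = \prod_{i=1}^{m} \frac{e^{-\lambda}\lambda^{x_i}}{x_i!} = \frac{e^{-m\lambda}\,\lambda^{\sum_{i=1}^{m} x_i}}{\prod_{i=1}^{m} x_i!}$$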
Maximum likelihood estimation
The likelihood

The product of Poisson probabilities derived above is the probability of our data. If we knew λ we could calculate the probability of obtaining any set of values x1, x2, ... , xm. Furthermore, for fixed λ, if we summed this expression over all possible values of x1, x2, ... , xm we would get 1.
- Now we adopt a different perspective. Since it's the data that are observed and the parameter that is unknown, it makes more sense to think of this probability as a function of λ for fixed data, i.e.,
$$L(\lambda) = L(\lambda \mid x_1, x_2, \ldots, x_m) = \prod_{i=1}^{m} \frac{e^{-\lambda}\lambda^{x_i}}{x_i!}.$$
Viewed this way it's no longer a probability. (For fixed data, if we integrate over all possible values of λ we will not in general get 1.) So instead we call this function the likelihood function. Keep in mind that it is still the joint probability function for our data under the assumed probability model, only by another name.
- One of R. A. Fisher's many major contributions to statistics was to realize that the likelihood function perspective was a vehicle for obtaining parameter estimates. He proposed what has become known as the maximum likelihood principle for parameter estimation.
- Maximum Likelihood Principle: Choose as your estimates of the parameters those values that make the data we actually obtained the most probable. In other words, we choose as our estimates the parameter values that maximize the value of the likelihood.
- For our aphid model, the maximum likelihood principle enjoins that we choose the value for λ that makes the likelihood as large as possible.
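The change of perspective described in this section can be sketched in a few lines of Python. This is only an illustration: the counts below are made up (they are not the Krebs data), and the function names poisson_pmf and likelihood are my own. The data are held fixed and the joint Poisson probability is evaluated at several candidate values of λ.

```python
import math

def poisson_pmf(x, lam):
    """P(X = x) for a Poisson random variable with mean lam."""
    return math.exp(-lam) * lam ** x / math.factorial(x)

def likelihood(lam, counts):
    """Joint probability of independent Poisson counts, viewed as a function of lam."""
    return math.prod(poisson_pmf(x, lam) for x in counts)

# Hypothetical aphid counts on m = 5 shoots (made up for illustration).
counts = [2, 0, 3, 1, 4]

# Hold the data fixed and evaluate the likelihood at a few candidate values of lambda.
grid = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5]
values = [likelihood(lam, counts) for lam in grid]
for lam, value in zip(grid, values):
    print(f"lambda = {lam:.1f}  likelihood = {value:.3e}")

# The maximum likelihood principle picks the candidate with the largest likelihood.
best_on_grid = grid[values.index(max(values))]
print("grid value with the largest likelihood:", best_on_grid)
```

A coarse grid like this only brackets the maximum; a finer grid or a proper optimizer is needed to locate it precisely.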
The loglikelihood
- For both practical and theoretical reasons, it is preferable to work with the natural logarithm of the likelihood function, i.e., the loglikelihood. Starting with a generic probability model and proceeding to our independent Poisson model, the loglikelihood takes the following form.
$$\log L(\lambda) = \log \prod_{i=1}^{m} P(X_i = x_i) = \log \prod_{i=1}^{m} \frac{e^{-\lambda}\lambda^{x_i}}{x_i!}$$
- Recall some of the basic properties of the logarithm function. For positive numbers a and b, and real number n we have the following.
$\log(ab) = \log a + \log b$, so the logarithm turns multiplication into addition.
$\log(a/b) = \log a - \log b$, so the logarithm turns division into subtraction.
$\log(a^n) = n \log a$, so the logarithm turns exponentiation into scalar multiplication.
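- Using these properties on our Poisson probability model we obtain the following.
$$\log L(\lambda) = \sum_{i=1}^{m} \left( -\lambda + x_i \log \lambda - \log x_i! \right) = -m\lambda + \log \lambda \sum_{i=1}^{m} x_i - \sum_{i=1}^{m} \log x_i!$$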
- Why use the loglikelihood rather than the likelihood?
- Since the logarithm is a monotone increasing function, the likelihood and the loglikelihood will achieve their maximum at exactly the same place. So, nothing is lost by doing this.
- For hand calculations the loglikelihood is far easier to work with since it converts products into sums.
- All of the theoretical results concerning maximum likelihood estimators are based on the loglikelihood.
- Using loglikelihoods increases the numerical stability of parameter estimates. Because likelihoods arise from joint probabilities that, under independence, factor into a product of marginal probabilities, the magnitude of the likelihood can be quite small, often very close to zero. With a large number of observations this value can fall below the smallest positive number the computer can represent (underflow), leading to numerical problems. Log-transforming the likelihood converts these tiny probabilities into moderately large negative numbers, which avoids the underflow problem.
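The numerical point can be seen directly in a small Python sketch. The data here are hypothetical (1000 made-up counts, not the Krebs data), λ = 2 is an arbitrary choice, and the function names are my own. The product of the individual Poisson probabilities underflows to zero in double precision, while the loglikelihood, computed either as a sum of logs or from the regrouped formula, remains an ordinary finite number.

```python
import math

def poisson_pmf(x, lam):
    """P(X = x) for a Poisson random variable with mean lam."""
    return math.exp(-lam) * lam ** x / math.factorial(x)

def loglik_sum_of_logs(lam, counts):
    """Loglikelihood as the sum of the log Poisson probabilities."""
    return sum(math.log(poisson_pmf(x, lam)) for x in counts)

def loglik_closed_form(lam, counts):
    """Loglikelihood as -m*lam + log(lam)*sum(x_i) - sum(log(x_i!))."""
    m = len(counts)
    # math.lgamma(x + 1) equals log(x!) for a nonnegative integer x.
    return (-m * lam + math.log(lam) * sum(counts)
            - sum(math.lgamma(x + 1) for x in counts))

# Hypothetical data: 1000 shoots with made-up counts (illustration only).
counts = [3, 1, 4, 2, 0] * 200
lam = 2.0  # arbitrary value at which to evaluate the functions

likelihood = math.prod(poisson_pmf(x, lam) for x in counts)
print("likelihood (product of probabilities):", likelihood)
print("loglikelihood (sum of logs):", loglik_sum_of_logs(lam, counts))
print("loglikelihood (closed form):", loglik_closed_form(lam, counts))
```

The two loglikelihood calculations agree up to rounding error, which is also a quick check on the algebra used to regroup the terms.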
Maximizing the loglikelihood
- Maximizing a likelihood can be done in various ways.
- Graphically by plotting the likelihood and estimating where the peak occurs. (We'll do this in lab.)
- Algebraically by using calculus. This is an option only for fairly simple problems.
- Numerically using special optimization routines. (We'll also do this in lab using R's nlm and optim functions.)
- Since this is a simple problem I proceed to maximize the loglikelihood using calculus.
- From calculus we know that, for a differentiable function, local maxima in the interior of the domain occur at points where the first derivative is equal to zero, the so-called critical points. So, I differentiate the loglikelihood constructed above with respect to λ and set the result equal to zero. The derivative of the loglikelihood is called the score function.
$$\frac{d \log L(\lambda)}{d\lambda} = -m + \frac{1}{\lambda} \sum_{i=1}^{m} x_i$$
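- Setting the score function to zero and solving for λ yields the maximum likelihood estimate of λ.
$$-m + \frac{1}{\hat{\lambda}} \sum_{i=1}^{m} x_i = 0 \quad \Longrightarrow \quad \hat{\lambda} = \frac{1}{m} \sum_{i=1}^{m} x_i = \bar{x}$$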
- To verify that this is a maximum we can check the sign of the second derivative at the putative maximum.
$$\frac{d^2 \log L(\lambda)}{d\lambda^2} = -\frac{1}{\lambda^2} \sum_{i=1}^{m} x_i$$
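- Plugging in the value of λ we obtained above, $\hat{\lambda} = \frac{1}{m}\sum_{i=1}^{m} x_i$, yields
$$\left. \frac{d^2 \log L(\lambda)}{d\lambda^2} \right|_{\lambda = \hat{\lambda}} = -\frac{\sum_{i=1}^{m} x_i}{\hat{\lambda}^2} = -\frac{m^2}{\sum_{i=1}^{m} x_i},$$
which is negative because x1, x2, ... , xm are counts, hence greater than or equal to zero, and at least one of them is positive (otherwise the estimate would be zero and this expression would be undefined). Since the second derivative is negative at the critical point, the critical point corresponds to a local maximum. Because the second derivative is in fact negative for every λ > 0, the loglikelihood is concave down everywhere with a single local maximum. Hence the local maximum is actually a global maximum.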
- Thus $\hat{\lambda} = \bar{x} = \frac{1}{m} \sum_{i=1}^{m} x_i$ is the maximum likelihood estimate of λ.
- Remark: Recall that λ is the mean of the Poisson distribution. The "natural" estimator (the plug-in or method of moments estimator) of λ is the sample mean. Thus it is reassuring that the maximum likelihood estimator of λ turned out to be the sample mean too. In fact it is often the case that maximum likelihood estimators are the natural estimators of the parameters in question.
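The calculus result can be cross-checked numerically. The sketch below is an illustration under stated assumptions: the counts are made up, the function names are my own, and the optimizer is a simple golden-section search written out by hand as a stand-in for a real optimization routine (it is not R's nlm or optim). It maximizes the Poisson loglikelihood numerically, compares the answer with the sample mean, and evaluates the score and the second derivative at the sample mean.

```python
import math

def loglik(lam, counts):
    """Poisson loglikelihood: -m*lam + log(lam)*sum(x_i) - sum(log(x_i!))."""
    m = len(counts)
    return (-m * lam + math.log(lam) * sum(counts)
            - sum(math.lgamma(x + 1) for x in counts))

def score(lam, counts):
    """First derivative of the loglikelihood with respect to lam."""
    return -len(counts) + sum(counts) / lam

def second_derivative(lam, counts):
    """Second derivative of the loglikelihood with respect to lam."""
    return -sum(counts) / lam ** 2

def golden_section_max(f, lo, hi, tol=1e-8):
    """Locate the maximizer of a unimodal function f on the interval [lo, hi]."""
    inv_phi = (math.sqrt(5) - 1) / 2
    a, b = lo, hi
    while b - a > tol:
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
        if f(c) > f(d):
            b = d  # the maximum lies in [a, d]
        else:
            a = c  # the maximum lies in [c, b]
    return (a + b) / 2

# Hypothetical aphid counts (made up for illustration).
counts = [2, 0, 3, 1, 4]
sample_mean = sum(counts) / len(counts)

# The search interval (0.01, 20) is an arbitrary bracket assumed to contain the maximum.
lam_hat = golden_section_max(lambda lam: loglik(lam, counts), 0.01, 20.0)

print("numerical maximizer:", round(lam_hat, 6))
print("sample mean:", sample_mean)
print("score at the sample mean:", score(sample_mean, counts))
print("second derivative at the sample mean:", second_derivative(sample_mean, counts))
```

Golden-section search relies on the loglikelihood having a single peak on the search interval, which the concavity argument above guarantees for this model.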
The likelihood as calculated in statistical packages
- Observe that when the score function was constructed by differentiating the loglikelihood, any term that did not contain the parameter λ disappeared because its derivative was zero.
- The term that disappeared in our calculations was $-\sum_{i=1}^{m} \log x_i!$. This corresponds in the original likelihood to the term $1 / \prod_{i=1}^{m} x_i!$.
- Since terms that don't contain model parameters contribute nothing to the process of finding the maximum likelihood estimate, they are often dropped. Formally we could write
$$L(\lambda) = k(x)\, L^*(\lambda)$$
and then just drop the term k(x) from further consideration, treating $L^*(\lambda)$ as if it were the actual likelihood. In our Poisson example above
$$k(x) = \frac{1}{\prod_{i=1}^{m} x_i!}, \qquad L^*(\lambda) = e^{-m\lambda}\, \lambda^{\sum_{i=1}^{m} x_i},$$
and our solution for the maximum likelihood estimate of λ would not change if we had carried out all our calculations on $L^*(\lambda)$.
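- SAS does this in most of its Procs. So the loglikelihood that is reported there is the value of $\log L^*(\lambda)$ (the loglikelihood with the k(x) term dropped) at the maximum likelihood estimate, not the value of $\log L(\lambda)$. R on the other hand reports the value of $\log L(\lambda)$.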
- Clearly this doesn't matter for what we've done so far, but later on we'll want to use the loglikelihood to compare different models (using something called AIC).
- If the different models involve different probability generating mechanisms (e.g., some assume a Poisson model and others assume a negative binomial model), then reported "loglikelihoods" that are really $\log L^*(\lambda)$, with the k(x) term dropped, will not be comparable. For model comparison across different kinds of probability models the k(x) term is crucial.
- If the models in question do share the same probability generating mechanism (e.g., all models being compared are Poisson models) then the absence of k(x) in the likelihood won't matter.
- To distinguish these two situations we'll refer to a likelihood that includes the k(x) term as the full likelihood.
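The effect of dropping k(x) can be illustrated with a short Python sketch. The counts are made up and the names full_loglik and kernel_loglik are my own labels for the loglikelihood with and without the k(x) term. The two functions differ by the same constant, log k(x), at every value of λ, so they peak at the same place even though the reported values differ.

```python
import math

def full_loglik(lam, counts):
    """Poisson loglikelihood including the parameter-free term log k(x)."""
    m = len(counts)
    return (-m * lam + math.log(lam) * sum(counts)
            - sum(math.lgamma(x + 1) for x in counts))

def kernel_loglik(lam, counts):
    """Poisson loglikelihood with the parameter-free term log k(x) dropped."""
    return -len(counts) * lam + math.log(lam) * sum(counts)

# Hypothetical aphid counts (made up for illustration).
counts = [2, 0, 3, 1, 4]

# log k(x) = -sum(log(x_i!)); it does not depend on lambda.
log_k = -sum(math.lgamma(x + 1) for x in counts)

# Compare the two functions on a grid of lambda values from 0.5 to 4.4.
grid = [0.5 + 0.1 * i for i in range(40)]
differences = [full_loglik(lam, counts) - kernel_loglik(lam, counts) for lam in grid]
best_full = max(grid, key=lambda lam: full_loglik(lam, counts))
best_kernel = max(grid, key=lambda lam: kernel_loglik(lam, counts))

print("log k(x):", log_k)
print("smallest and largest difference:", min(differences), max(differences))
print("grid maximizer, full loglikelihood:", best_full)
print("grid maximizer, k(x) dropped:", best_kernel)
```

Because log k(x) depends on the probability model as well as the data, a Poisson value with k(x) dropped and a negative binomial value with its own constant dropped are offset by different amounts, which is why such values cannot be compared across model families.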
Books and Articles on likelihood
- Azzalini, Adelchi. 1996. Statistical inference: based on the likelihood. New York: Chapman & Hall.
- Devore, Jay L. 1995. Probability and Statistics for Engineering and the Sciences. Pacific Grove, CA: Duxbury Press. General discussion of maximum likelihood estimation with examples, pp. 265-271.
- Edwards, A. W. F. 1992. Likelihood. Baltimore: Johns Hopkins University Press.
- Eliason, Scott R. 1993. Maximum likelihood estimation: logic and practice. Newbury Park, CA: Sage Publications.
- Hilborn, Ray and Marc Mangel. 1997. The ecological detective: confronting models with data. Princeton, NJ: Princeton University Press.
- Krebs, Charles J. 1999. Ecological Methodology. Menlo Park, CA: Addison-Wesley. A number of fairly advanced applications of maximum likelihood estimation appear on pp. 84-88, 91, 128-131, 525-529.
- Larsen, Richard J. and Marx, Morris L. 1981. An Introduction to Mathematical Statistics and Its Applications. Englewood Cliffs, New Jersey: Prentice-Hall. General discussion of maximum likelihood estimation with examples, pp. 212-218.
- McCallum, Hamish. 2000. Population Parameters: Estimation of Ecological Models. Oxford, England: Blackwell Science. General discussion of maximum likelihood estimation with examples, pp. 29-44.
- Myung, In Jae. 2003. Tutorial on maximum likelihood estimation. Journal of Mathematical Psychology 47: 90-100. UNC has access to this journal electronically but it is not listed on the UNC library e-journal pages. Instead go directly to www.sciencedirect.com to access it.
- Pawitan, Yudi. 2001. In all likelihood: statistical modelling and inference using likelihood. New York: Oxford University Press.
- Roff, Derek A. 2006. Introduction to Computer-Intensive Methods of Data Analysis in Biology. New York: Cambridge University Press. Chapter 2 covers maximum likelihood estimation. Uses S-Plus (code also works in R).
- Royall, Richard M. 1997. Statistical evidence: a likelihood paradigm. New York: Chapman & Hall.
- Severini, Thomas A. 2000. Likelihood methods in statistics. New York: Oxford University Press.
- Sorensen, Daniel. 2002. Likelihood, Bayesian and MCMC methods in quantitative genetics. New York: Springer-Verlag.