Lecture 14—Monday, February 12, 2007
What was covered?
- Introduction to maximum likelihood estimation
Fitting regression models when the data generating mechanism is not normal
- So far we've fit regression models with only two types of data generating mechanisms—normal and lognormal. To fit regression models in which we assume the response variable was normally distributed we used least squares. To fit regression models in which we assume the response variable was lognormally distributed, we used a trick.
- Using the fact that if the random variable Y is lognormally distributed, then log Y is normally distributed, we fit an ordinary regression model using log Y as the response and obtained parameter estimates using least squares. It turns out that the interpretation of the regression model obtained in this way is a little bit odd.
- Recall that the ordinary normal-based regression model estimates the mean of the response. This yields the regression curve about which individual observations are assumed to arise as random realizations from a normal distribution. This is the usual signal (regression curve) plus noise interpretation of the ordinary regression model.
- When the response variable is log Y instead of Y, then the ordinary regression model estimates the mean of log Y as a function of the predictors. Typically after the regression model is estimated, we back-transform the equation to obtain a model for Y. What is the proper interpretation of the back-transformed model?
- Suppose we have a finite number of observations y1, y2, ..., yn and we take the average of their logarithms. Using properties of logarithms we obtain the following.

(1/n)(log y1 + log y2 + ... + log yn) = (1/n) log(y1 × y2 × ... × yn) = log[(y1 × y2 × ... × yn)^(1/n)]

- Thus when we back-transform the mean of the logs using exponentiation we don't get the arithmetic mean of the original variable; instead we get what's called the geometric mean—the nth root of the product of the observations:

exp[(1/n)(log y1 + log y2 + ... + log yn)] = (y1 × y2 × ... × yn)^(1/n)
- The geometric mean has a very different interpretation from the arithmetic mean.
- The arithmetic mean answers the following question: if all n observations had the same value, what would that value have to be in order to yield the same total (sum) as the original observations?
- The geometric mean answers a different question: if all n observations had the same value, what would that value have to be in order to yield the same product as the original observations?
- It's a theorem in mathematics (the AM–GM inequality) that the geometric mean can never exceed the arithmetic mean, with equality only when all the observations are identical. So although the two means can be equal, typically the geometric mean is smaller than the arithmetic mean.
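The two claims above—back-transforming the mean of the logs yields the geometric mean, and the geometric mean never exceeds the arithmetic mean—can be checked numerically. Here is a short Python sketch using made-up positive observations (not the course data):

```python
import math

# Hypothetical sample of positive observations (illustrative values only)
y = [2.0, 5.0, 9.0, 14.0]
n = len(y)

# Arithmetic mean: the value that preserves the sum
arith_mean = sum(y) / n

# Geometric mean: the nth root of the product of the observations
geo_mean = math.prod(y) ** (1 / n)

# Back-transforming the mean of the logs recovers the geometric mean
mean_log = sum(math.log(v) for v in y) / n
back_transformed = math.exp(mean_log)

print(arith_mean)        # 7.5
print(geo_mean)          # about 5.96, smaller than the arithmetic mean
print(back_transformed)  # agrees with geo_mean
```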
- The point of this aside is just to note that although we have used the lognormal distribution as a data generating mechanism, the manner in which we did so yielded a very different sort of model: a model for the geometric mean rather than the arithmetic mean. The bottom line is that if we want to model the arithmetic mean while assuming a lognormal distribution for the response, then we'll need to use a methodology other than least squares.
- Similarly fitting regression models in which the response is assumed to have a binomial, Poisson, negative binomial, etc., distribution requires an estimation method other than least squares. In the modern frequentist approach to statistics the preferred alternative to least squares is maximum likelihood estimation.
- Maximum likelihood estimation obtains the values of the parameters that maximize something called the likelihood. I discuss the concept of likelihood next.
Constructing the Likelihood
- I introduce the concept of likelihood through a motivating example. The data for this example appears in Krebs (1999) and is the aphid data set used in Assignment 2.
- Suppose we obtain a random sample of m shoots. On each shoot we count the number of aphids present. The number of aphids observed on a given shoot is a random variable (it has a probability distribution). Denote it by the symbol X. In our random sample then we observe the values of m random variables, X1, X2, ... , Xm, one for each shoot in our sample.
- We observe x1 aphids on shoot 1, x2 aphids on shoot 2, etc. (I use the standard statistical convention of using capital letters for random variables and lower case letters for their values.)
- What was the probability of obtaining the data we collected? The notation for this probability is

P(X1 = x1, X2 = x2, ..., Xm = xm),

or what is called a joint probability, the probability of simultaneously observing all m events. Another way of writing this is

P(X1 = x1 ∩ X2 = x2 ∩ ... ∩ Xm = xm).
- Since we have a random sample, each of these events is independent of the others. From elementary probability theory, if events A and B are independent then we have

P(A ∩ B) = P(A) × P(B).
- Applying this to our data we have

P(X1 = x1, X2 = x2, ..., Xm = xm) = P(X1 = x1) × P(X2 = x2) × ... × P(Xm = xm) = ∏_{i=1}^{m} P(Xi = xi),

where I use product notation in the last step.
- For illustration assume each of the probability terms in this product is a Poisson probability. Thus we will assume that a Poisson distribution is a sensible model for the counts of aphids on stems. Plugging these in and regrouping terms yields the following.

∏_{i=1}^{m} [e^(−λ) λ^(xi) / xi!] = e^(−mλ) λ^(x1 + x2 + ... + xm) / (x1! × x2! × ... × xm!)
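The product form and the regrouped closed form above are the same number, which can be verified numerically. A short Python sketch, with a made-up λ and hypothetical counts (not the Krebs aphid data):

```python
import math

lam = 3.2            # assumed Poisson mean (illustrative value)
x = [2, 5, 1, 3, 4]  # hypothetical aphid counts on m = 5 shoots
m = len(x)

def dpois(k, lam):
    # Poisson probability P(X = k) = e^(-lam) * lam^k / k!
    return math.exp(-lam) * lam**k / math.factorial(k)

# Joint probability as a product of the individual Poisson probabilities
joint = math.prod(dpois(k, lam) for k in x)

# Regrouped closed form: e^(-m*lam) * lam^(sum of x) / product of x_i!
regrouped = (math.exp(-m * lam) * lam**sum(x)
             / math.prod(math.factorial(k) for k in x))

print(joint, regrouped)  # the two forms agree
```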
- Using R notation we would write the joint probability as follows.

prod(dpois(x, lambda))

Here x is the vector of observed counts, dpois computes Poisson probabilities, and prod multiplies them together.
Maximum likelihood estimation
The joint probability function is a function of the data, the x-values. The model parameter λ is assumed to be a fixed value in nature. To fix this idea we use the following notation, in which a semicolon separates the quantities that are random (the data) from the quantity that is fixed (the parameter of the probability model).

p(x1, x2, ..., xm; λ) = e^(−mλ) λ^(x1 + x2 + ... + xm) / (x1! × x2! × ... × xm!)
This is the probability of our data. If we knew λ we could calculate the probability of obtaining any set of values x1, x2, ... , xm. Furthermore, for fixed λ if we summed this expression over all possible values of x1, x2, ... , xm we would get 1.
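For a small case the sum-to-1 claim can be checked directly. A Python sketch with m = 2 observations, a made-up λ, and the infinite sums truncated at a cutoff where the Poisson tail is negligible:

```python
import math

def dpois(k, lam):
    # Poisson probability P(X = k)
    return math.exp(-lam) * lam**k / math.factorial(k)

lam = 2.0  # fixed parameter (illustrative value)

# With m = 2, sum the joint probability over all pairs (x1, x2).
# Truncate each infinite sum; beyond the cutoff the terms are negligible.
cutoff = 60
total = sum(dpois(x1, lam) * dpois(x2, lam)
            for x1 in range(cutoff) for x2 in range(cutoff))

print(total)  # approximately 1
```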
- Now we adopt a different perspective. Since it's the data that are observed and the parameter that is unknown, it makes more sense to think of this probability as a function of λ for fixed data, i.e.,

L(λ; x1, x2, ..., xm) = p(x1, x2, ..., xm; λ) = e^(−mλ) λ^(x1 + x2 + ... + xm) / (x1! × x2! × ... × xm!)
Viewed this way it's no longer a probability. (For fixed data, if we sum or integrate this expression over all possible values of λ we will not get 1.) So instead we call this function the likelihood function. Keep in mind that it is still the joint probability function for our data under the assumed probability model, only by another name.
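To see the change of perspective concretely, the Python sketch below (made-up counts, not the course data) holds the data fixed and evaluates the same joint-probability formula at several values of λ. The resulting numbers are relative supports for the different parameter values, not probabilities:

```python
import math

x = [2, 5, 1, 3, 4]  # hypothetical aphid counts (the fixed data)

def likelihood(lam, x):
    # The joint Poisson probability of the data, now viewed as a function of lam
    return math.prod(math.exp(-lam) * lam**k / math.factorial(k) for k in x)

# The same formula evaluated at several candidate parameter values
candidates = [1.0, 2.0, 3.0, 4.0, 5.0]
values = {lam: likelihood(lam, x) for lam in candidates}
for lam, L in values.items():
    print(lam, L)  # largest at lam = 3.0, the sample mean of x
```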
- One of R. A. Fisher's major contributions (one of many) to statistics was to realize that the likelihood function perspective was a vehicle for obtaining parameter estimates. He proposed what has become known as the maximum likelihood principle for parameter estimation.
- Maximum Likelihood Principle: Choose as your estimates of the parameters those values that make the data we actually obtained the most probable. In other words, we choose as our estimates the parameter values that maximize the value of the likelihood.
- For our aphid model, the maximum likelihood principle directs us to choose the value of λ that makes the likelihood as large as possible.
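As a numerical illustration of the principle (Python, made-up counts; the log of the likelihood is maximized, which is equivalent since log is monotone), a grid search over λ lands on the sample mean, the well-known maximum likelihood estimate for the Poisson model:

```python
import math

x = [2, 5, 1, 3, 4]  # hypothetical aphid counts

def log_likelihood(lam, x):
    # Log of the joint Poisson probability of the data
    return sum(-lam + k * math.log(lam) - math.log(math.factorial(k)) for k in x)

# Grid search for the lambda that maximizes the log-likelihood
grid = [i / 1000 for i in range(1, 10001)]  # lambda from 0.001 to 10.000
lam_hat = max(grid, key=lambda lam: log_likelihood(lam, x))

print(lam_hat)          # 3.0 -- the maximizing value
print(sum(x) / len(x))  # 3.0 -- the sample mean, which it equals
```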