Lecture 23 (lab 6) —February 21, 2006
What was covered?
- R topics
- general linear model function lm
- generalized linear model functions glm and glm.nb
- nonlinear least squares function nls
- summary, coef, predict, AIC, residuals, and logLik functions applied to model objects
- fitting Poisson, log-transformed, and negative binomial models for means as GLIMs
- fitting and comparing various species-area models
R functions and commands demonstrated
- AIC returns the AIC for a fitted model
- coef returns the coefficient estimates from a fitted model
- contrasts displays the coding scheme used for categorical variables in regression models
- glm is the generalized linear model function
- glm.nb extends generalized linear modeling to the negative binomial distribution (MASS package)
- lm is the function for fitting ordinary regression models
- logLik returns the loglikelihood of a fitted model
- nls is the nonlinear least squares function for fitting nonlinear models using least squares
- predict when applied to a model object returns the value of the linear predictor for each observation in a data set
- residuals returns the residuals from a fitted model
- summary when applied to a model object returns details of the model fit
R function options
- family= (argument to glm) specifies the member of the exponential family of distributions to use in the generalized linear model.
- link= (argument to family inside the glm function) can be used to specify a link different from the default canonical link
R packages used
- MASS for the function glm.nb The MASS library implements procedures described in Venables and Ripley (2002)
Using GLIMs to model the Crawley slug data set
- We revisit the slug data set that appears on Michael Crawley's web site for his text Statistical Computing (Crawley 2002).
> #obtain data
> slugs<-read.table( 'http://www.bio.ic.ac.uk/research/mjcraw/statcomp/data/slugsurvey.txt', header=TRUE)
> names(slugs)
[1] "slugs" "field"
- A number of the models we examined last time can also be fit as GLIMs. By taking advantage of the GLIM paradigm we can avoid writing the likelihood function ourselves. Recall that in a generalized linear model (GLIM) a link function g is used to connect a linear predictor η to the mean μ.

- Using the language of GLIMs all of the means models we fit last week used an identity link. Recall that all of our equations were of the form mu = p[1] + p[3]*field.dummy, so we were setting μ equal to the linear predictor, not some function of μ.
Poisson models
- Last time we fit two Poisson models—a common mean model and a separate means model. In the separate means model the slugs in the two field types were assumed to occur under rocks at different rates. In the common mean model the same rate parameter sufficed for both. Both of these models are examples of generalized linear models.
The "common mean Poisson model" as a GLIM.
- The generalized linear model function in R is glm. We fit the common mean model as a GLIM as follows.
> out1.glm<-glm(slugs~1, data=slugs, family=poisson)
- The first argument to glm is what's called a model formula.
- The left side of the formula is the response, in our case the individual raw counts of slugs.
- The right side of the formula is the linear predictor. In the common mean model the linear predictor is
, a model with only an intercept. In R the intercept is denoted by the numeral 1.
- The left and right sides of the formula are separated by a tilde, ~.
- The glm function has a data= option, so it is not necessary to specify the data set as part of the variable names. The following two function calls are equivalent: glm(slugs$slugs~1, family=poisson) and glm(slugs~1,data=slugs)
- To specify the random component of the GLIM use the family argument. Since we wish to carry out a Poisson regression, the family argument is poisson. If the family argument is omitted R assumes a Gaussian (normal) distribution.
- The canonical link for Poisson regression is the log link. So without additional information, this is what R assumes we want.
- To examine the results of the analysis use the summary function.
> summary(out1.glm)
Call:
glm(formula = slugs ~ 1, family = poisson, data = slugs)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.8841 -1.8841 -0.6343 0.8360 4.2574
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.57380 0.08391 6.838 8.03e-12 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 224.86 on 79 degrees of freedom
Residual deviance: 224.86 on 79 degrees of freedom
AIC: 355.68
Number of Fisher Scoring iterations: 5
- From the output we see that estimate for the intercept is 0.57380. Because a log link was used what we've obtained is the following.

In order to obtain an estimate of the mean slug density we need to exponentiate the result. We can extract the coefficient estimates from the GLIM with the coef function.
> coef(out1.glm)
(Intercept)
0.5738004
> exp(coef(out1.glm))
(Intercept)
1.775
- Observe that this is the same result we obtained last time.
> poi.1<-function(data,p) -sum(log(dpois(data$slugs,lambda=p)))
> nlm(function(p) poi.1(slugs,p),2)->out1
> out1$estimate
[1] 1.774999
- We could have obtained this result directly by using an identity link in the glm. The link function is specified in a link= argument contained in parentheses following the probability distribution specified in the family argument.
> out1.glm.id<-glm(slugs~1,data=slugs,family=poisson(link=identity))
> coef(out1.glm.id)
(Intercept)
1.775
- The model object produced by glm is a list variable with a number of components. We can see what's available by applying the names function to the model object.
names(out1.glm)
[1] "coefficients" "residuals" "fitted.values" "effects"
[5] "R" "rank" "qr" "family"
[9] "linear.predictors" "deviance" "aic" "null.deviance"
[13] "iter" "weights" "prior.weights" "df.residual"
[17] "df.null" "y" "converged" "boundary"
[21] "model" "call" "formula" "terms"
[25] "data" "offset" "control" "method"
[29] "contrasts" "xlevels"
- We can extract the deviance, degrees of freedom of the deviance, and the AIC using the usual list notation.
> out1.glm$deviance
[1] 224.8594
> out1.glm$df.residual
[1] 79
> out1.glm$aic
[1] 355.6766
These are also reported as part of the output from the summary function.
- For the Poisson model the deviance provides an approximate goodness of fit test. As explained in lecture 22, for Poisson models we expect the residual deviance divided by the degrees of freedom to be approximately 1.
> out1.glm$deviance/out1.glm$df.residual
[1] 2.846321
- For our data the ratio is almost 3 indicating that the model does not fit the data. Because the ratio is greater than 1 we conclude that the data are overdispersed. Overdispersion can be addressed by including additional predictors in the model or by assuming a different probability distribution for the random component.
- Other information about the model can be obtained by applying various functions to the model object. Two such functions are logLik, which extracts the loglikelihood, and AIC.
> logLik(out1.glm)
'log Lik.' -176.8383 (df=1)
> AIC(out1.glm)
[1] 355.6766
The degrees of freedom reported for the loglikelihood refers to the number of parameters that were estimated in fitting the model. The numbers reported here match the values we obtained last time by minimizing the negative loglikelihood directly.
> out1$minimum
[1] 176.8383
> my.aic(out1)
[1] 355.6766
The "separate means Poisson model" as a GLIM.
- The separate means Poisson model can be fit as a GLIM as follows.
> out2.glm<-glm(slugs~field,data=slugs,family=poisson)
> summary(out2.glm)
Call:
glm(formula = slugs ~ field, family = poisson, data = slugs)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.1331 -1.5969 -0.9519 0.4580 4.8727
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.2429 0.1400 1.735 0.082744 .
fieldRookery 0.5790 0.1749 3.310 0.000932 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 224.86 on 79 degrees of freedom
Residual deviance: 213.44 on 78 degrees of freedom
AIC: 346.26
Number of Fisher Scoring iterations: 6
- Notice that even though an intercept was not explicitly specified in the model formula slugs~field, an estimate for the intercept appears in the output. R always includes an intercept in a model unless it is told to remove it.
- Observe that both the residual deviance and AIC have decreased from the common mean model indicating that including field type as a predictor has led to a better model. The entry labeled "null deviance" is the deviance for a model containing only an intercept and its reported value, 224.86, is the same as the residual deviance of the common mean model we fit above.
- The column labeled z-value in the table of estimated coefficients is a Wald test for testing whether that coefficient is significantly different from zero. The column labeled Pr(>|z|) is the p-value for this test.
- To understand what these coefficients mean we have to understand exactly how R has included field in the model. Because field is a nominal variable in which the character strings "Nursery" and "Rookery" are used to identify the categories, the read.table function of R converted field to a factor variable when the data were read in. We can see this by printing out the contents of the variable.
> slugs$field
[1] Nursery Nursery Nursery Nursery Nursery Nursery Nursery Nursery Nursery
[10] Nursery Nursery Nursery Nursery Nursery Nursery Nursery Nursery Nursery
[19] Nursery Nursery Nursery Nursery Nursery Nursery Nursery Nursery Nursery
[28] Nursery Nursery Nursery Nursery Nursery Nursery Nursery Nursery Nursery
[37] Nursery Nursery Nursery Nursery Rookery Rookery Rookery Rookery Rookery
[46] Rookery Rookery Rookery Rookery Rookery Rookery Rookery Rookery Rookery
[55] Rookery Rookery Rookery Rookery Rookery Rookery Rookery Rookery Rookery
[64] Rookery Rookery Rookery Rookery Rookery Rookery Rookery Rookery Rookery
[73] Rookery Rookery Rookery Rookery Rookery Rookery Rookery Rookery
Levels: Nursery Rookery
The statement at the end of the output: "Levels: Nursery Rookery" indicates that R recognizes the variable field is a categorical variable (what R calls a factor) with two levels. Internally the variable field is represented differently. To see the internal coding scheme that R has used use the contrasts function.
> contrasts(slugs$field)
Rookery
Nursery 0
Rookery 1
- R has converted a nominal variable with character values into a factor variable with numerical values. It has assigned the level Nursery to have value 0 and the level Rookery to have value 1. From the column heading we see it has chosen a name for the new factor variable using the name of the level that is coded 1, here Rookery. This is also reflected in the output from the summary function where we see listed a coefficient estimate for a regressor that R calls fieldRookery. By default R sorts the levels alphabetically and assigns the first level the value 0 and the second level the value 1. Note that this is the same coding scheme we used last time when we created the 0-1 variable we called field.dummy.
- Thus we should interpret the output from the glm function as follows. R fit the mean model
which translates into the following two models depending on field type.

- Thus the parameter β1 in the second formula distinguishes the two field types. A statistical test of whether β1 is different from zero is a test of whether the mean slug density is different in the two fields. The Wald statistic that appears in the output from the summary function above suggests that the slug mean densities do differ (p = .000932).
- To obtain estimates of the mean densities in each field we can proceed as follows.
> exp(coef(out2.glm)[1])
(Intercept)
1.275
> exp(sum(coef(out2.glm)))
[1] 2.275
- We could also have fit this model using an identity link. With an identity link it is not necessary to exponentiate the results to obtain the mean estimates.
> out2.glm.alt<-glm(slugs~field,data=slugs,family=poisson(link=identity))
> coef(out2.glm.alt)
(Intercept) fieldRookery
1.275 1.000
> # mean density for rookery slugs
> sum(coef(out2.glm.alt))
[1] 2.275
- Observe that these are the same results we obtained last time by minimizing the negative loglikelihood of the following Poisson probability model.
> poi.2<-function(data,p) {
field.dummy<-as.numeric(data$field)-1
mylambda<-p[1]+p[2]*field.dummy
negloglike<- -sum(log(dpois(data$slugs,lambda=mylambda)))
negloglike
}
nlm(function(p) poi.2(slugs,p),c(1.2,1))->out2
> out2$estimate
[1] 1.2749997 0.9999997
- We can obtain estimates of the mean densities of slugs in the two fields directly, without having to carry out any additional arithmetic, by fitting a GLIM without an intercept. This is done by including a –1 on the right side of the model formula
> out2.glm.alt2<-glm(slugs~field-1, data=slugs, family=poisson(link=identity))
> summary(out2.glm.alt2)
Call:
glm(formula = slugs ~ field - 1, family = poisson(link = identity),
data = slugs)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.1331 -1.5969 -0.9519 0.4580 4.8727
Coefficients:
Estimate Std. Error z value Pr(>|z|)
fieldNursery 1.2750 0.1785 7.141 9.24e-13 ***
fieldRookery 2.2750 0.2385 9.539 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: Inf on 80 degrees of freedom
Residual deviance: 213.44 on 78 degrees of freedom
AIC: 346.26
Number of Fisher Scoring iterations: 3
Now the estimated coefficients are the means in the two fields, as their labels suggest. Obtaining this simplification has come at a cost though. Now the statistical tests that appear in the coefficient table are tests of whether the individual estimates of mean density are significantly different from 0. Since we found slugs in both fields we already know the mean density is different from zero; a significance test of this fact is just a test of the obvious. When the model is formulated with an intercept at least one of the tests is interesting because it tests whether the mean density is the same in the two fields.
- Of course we could use the reported standard errors to construct Wald confidence intervals from which we could approximate such a test.
Log-transform normal models
- Normal models can be fit as generalized linear models using glm and family=gaussian, or as ordinary regression models using the lm function. I illustrate the procedure by fitting the common means model.
> #fit as a generalized linear model
> glm(log(slugs+1)~1,data=slugs,family=gaussian)
Call: glm(formula = log(slugs + 1) ~ 1, family = gaussian, data = slugs)
Coefficients:
(Intercept)
0.7332
Degrees of Freedom: 79 Total (i.e. Null); 79 Residual
Null Deviance: 43.43
Residual Deviance: 43.43 AIC: 182.2
> #fit as an ordinary regression model
> lm(log(slugs+1)~1,data=slugs)
Call:
lm(formula = log(slugs + 1) ~ 1, data = slugs)
Coefficients:
(Intercept)
0.7332
- The reported loglikelihood (obtained with the logLik function) and AIC are correct only for comparison with models that also use a log-transformed response. If we want to compare the log-transformed results to those from models in which the response is not transformed we need to write our own loglikelihood function just as we did last time.
Negative binomial regression
- Although the negative binomial distribution is not a member of the exponential family of distributions, it is still possible to use it as the random component of a generalized linear model. The function glm.nb of the MASS package carries this out. The MASS package is part of the standard installation of R but as with all packages, it must first be loaded into memory using the library function. A log link is the link function used by default. There is no family argument because the glm.nb function only carries out negative binomial regression.
> library(MASS)
> out3.nb<-glm.nb(slugs~1,data=slugs)
> summary(out3.nb)
Call:
glm.nb(formula = slugs ~ 1, data = slugs, init.theta = 0.715567208373664,
link = log)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.3360 -1.3360 -0.3625 0.4198 1.8176
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.5738 0.1566 3.665 0.000247 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for Negative Binomial(0.7156) family taken to be 1)
Null deviance: 82.736 on 79 degrees of freedom
Residual deviance: 82.736 on 79 degrees of freedom
AIC: 292.80
Number of Fisher Scoring iterations: 1
Theta: 0.716
Std. Err.: 0.196
2 x log-likelihood: -288.796
- The dispersion parameter (the size parameter of the dnbinom function) is displayed near the end of the output. A value of θ = 0.716 should be compared to an infinite value for the Poisson distribution. The small value of θ we obtained is further evidence of the presence of overdispersion. I next fit a separate means model.
> out4.nb<-glm.nb(slugs~field,data=slugs)
> summary(out4.nb)
Call:
glm.nb(formula = slugs ~ field, data = slugs, init.theta = 0.785931336346756,
link = log)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.4619 -1.2310 -0.5296 0.2241 2.3430
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.2429 0.2268 1.071 0.2840
fieldRookery 0.5790 0.3069 1.886 0.0592 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for Negative Binomial(0.7859) family taken to be 1)
Null deviance: 86.818 on 79 degrees of freedom
Residual deviance: 83.256 on 78 degrees of freedom
AIC: 291.35
Number of Fisher Scoring iterations: 1
Correlation of Coefficients:
(Intercept)
fieldRookery -0.74
Theta: 0.786
Std. Err.: 0.224
2 x log-likelihood: -285.350
- The AIC has decreased slightly suggesting that this is a slightly better model than the common mean model. Observe that the Wald test for the field variable is not significant, p = 0.0592, although it is close to being significant. If we were to use a strict α = .05 criterion for statistical significance we would reject this model in favor of the common means model. This would disagree with the ranking of models obtained using AIC. It is not unusual for AIC and significance testing to disagree in their conclusions for predictors whose substantive effects are not large.
Fitting species area curves to the Galapagos Islands plant richness data set
- Johnson and Raven (1973) provide data on plant species richness for 29 islands in the Galapagos Islands archipelago. Their paper was presented as a rebuttal to Hamilton et al. (1963) who claimed that in the Galapagos Island archipelago a species-area effect is absent.
- Using an expanded data base, Johnson and Raven (1973) argued that the Galapagos Islands do exhibit a species-area effect. In their paper they considered only two forms for the species-area relationship:
- a form in which species richness is linearly related to area, with normally distributed errors, and
- a form in which log-transformed species richness is linearly related to log-transformed area, with normally-distributed errors.
- Today we revisit the analysis of Johnson and Raven but consider an expanded set of possible species-area relations using statistical methodology that was not available in 1973. We consider six models. In the model descriptions that follow S denotes species richness on an island and A denotes the area of that island.
- Linear model:
with identity link such that
.
- Gleason model:
with identity link such that
.
- Arrhenius model:
with identity link such that
.
- Log-Arrhenius model:
with identity link such that
.
- Poisson GLIM:
with log link such that
.
- Negative binomial GLIM:
with log link such that
.
- The data set included in Johnson and Raven (1973) was missing some elevation data. I found a corrected version of the data set with these values supplied at
http://www.ibiblio.org/pub/academic/biology/ecology+evolution/teaching/weisberg/galapago.dat.
I've placed a cleaned up version of the data set at our class web site. The file is space-delimited with the variable names appearing in the first row.
> gala<- read.table('http://www.unc.edu/courses/2006spring/ecol/145/001/data/lab6/galapagos.txt', header=TRUE)
> names(gala)
[1] "Island" "Species" "Endemics" "Area"
[5] "Elevation" "Nearest.dist" "Santacruz.dist" "Adjacent.isl.area"
- I begin by creating a series of plots that are relevant to the various models listed above. The variables of interest are Species and Area.

Fig. 1 Scatter plots of the species-area relation on different scales: S vs. A, S vs. log(A), and log(S) vs. log(A)
- A few features of the graphs stand out.
- On the original scale (left-hand plot) it's hard to imagine any relationship adequately modeling these data. Because of the huge scale difference on the Area axis, any fitted model when plotted on this graph will resemble a stick poking out of an ink blot.
- When area is log-transformed (middle plot), a distinct curvilinear trend emerges. It is also clear that the data are heteroscedastic on this scale, with variability in species richness increasing with log(Area).
- When both area and species richness are log-transformed (right-hand plot), the relationship looks truly linear. There is also no evidence of heteroscedasticity in this plot.
Model 1: Linear Model
- I fit a linear model on the original scale, model 1 above.
> model1<-lm(Species~Area,data=gala)
> logLik(model1)
'log Lik.' -171.5861 (df=3)
> AIC(model1)
[1] 349.1721
- I superimpose the fitted model on a scatter plot of the data.
> plot(gala$Area,gala$Species, xlab='Area', ylab='Species')
> abline(model1,col=2, lty=2)
> mtext('Model 1: linear model', side=3, line=.5)
From the scatter plot alone we would conclude that this is a terrible fit.
Model 2: Gleason Model
- The Gleason model requires a log-transformed predictor, but an untransformed response.
> model2<-lm(Species~log(Area),data=gala)
> logLik(model2)
'log Lik.' -164.5773 (df=3)
> AIC(model2)
[1] 335.1547
- Since the response in models 1 and 2 is the same, we can use the reported AIC to compare the Gleason and linear models. From the AIC the Gleason model is clearly an improvement. I superimpose the fitted Gleason model on a scatter plot of the data.
> plot(log(gala$Area), gala$Species, xlab='log(Area)', ylab='Species')
> abline(model2,col=2,lty=2)
> mtext( 'Model 2: Gleason model', side=3, line=.5)
- This too is clearly a bad model. Based on the scatter plot, if we were to calculate the residuals from the model we would find that they exhibit a systematic trend with increasing log(area)—first positive, then negative, then positive. We are clearly underestimating species richness at low and high areas, and overestimating it at intermediate areas.
- Perhaps even more troubling the Gleason model predicts negative values for species richness even over the range of the data being used to fit the model. When log(Area) drops below –2 species richness is predicted to be negative. This is truly a nonsensical prediction.
Model 3: Arrhenius Model
- The Arrhenius model is a nonlinear model and nonlinear models are trickier to estimate. The nls function (an acronym for nonlinear least squares) can be used to fit such models. The nls function requires initial estimates for the parameters. I choose initial estimates based on my understanding of power law models.
> model3<-nls(Species~b0*Area^b1, data=gala, start=list(b0=1,b1=.5))
> model3
Nonlinear regression model
model: Species ~ b0 * Area^b1
data: gala
b0 b1
33.3976021 0.2985702
residual sum-of-squares: 97323.73
> logLik(model3)
'log Lik.' -158.8675 (df=2)
> #incorrect AIC
> AIC(model3)
[1] 321.735
- The loglikelihood returned from the nls object is correct, but not the AIC. The AIC function only counts the estimated parameters β0 and β1, when in a likelihood framework it should also count σ2. In the correct calculation of AIC there are three parameters to account for.
> #correct AIC
> -2*logLik(model3)+2*(length(coef(model3))+1)
[1] 323.735
- I add the estimated Arrhenius function to a scatter plot of the data.
> arrhenius.func<-function(x) coef(model3)[1]*x^(coef(model3)[2])
> plot(gala$Area,gala$Species,xlab='Area',ylab='Species')
> lines(seq(0,5000,100),arrhenius.func(seq(0,5000,100)),lty=2,col=2)
> mtext('Model 3: Arrhenius model',side=3,line=.5)
- Because of the scale of the plot, the fit is difficult to judge visually. One thing is clear though, the scatter about the fitted curve is not constant; it increases with area. This is something that is not accounted for when a normal distribution is assumed for the random component of the model. (Recall that in a normal distribution the mean and variance are independent.) Having said this, notice that the AIC we obtained is the best of the three models thus far.
Model 4: Log-Arrhenius Model
- The log-Arrhenius model is just an ordinary regression model in which both the response and predictor are log-transformed.
> model4<-lm(log(Species)~log(Area), data=gala)
- Since the response variable has been transformed, we know that the loglikelihood and AIC that are returned by the logLik and AIC functions are not correct for comparing this model with the ones we've already fit. I modify the function we wrote last time for a similar situation to obtain the correct loglikelihood.
> norm.loglike<-function(data,model)
{
t.y<-log(data$Species)
sigma2<-(sum(residuals(model)^2))/dim(data)[1]
loglike<-sum(log(dnorm(t.y, mean=predict(model), sd=sqrt(sigma2))*1/(data$Species)))
out<-list(loglike, c(coef(model), sigma2))
out
}
- For the estimate of σ2 I use the maximum likelihood estimate rather than the usual unbiased estimate. The difference is in the denominator. The MLE divides the sum of squared residuals by n, while the unbiased estimate divides by n – p, where p is the number of estimated parameters.
- To obtain the mean of the normal distribution I use the predict function. predict when applied to a model returns the value of the linear predictor for each observations. I use residuals to obtain the residuals.
> norm.loglike(gala,model4)
[[1]]
[1] -134.1195
[[2]]
(Intercept) log(Area)
2.8338361 0.4042697 0.5346083
> norm.loglike(gala,model4)->model4.vals
> #correct AIC
> -2*model4.vals[[1]]+2*length(model4.vals[[2]])
[1] 274.2390
- This is the best value thus far. I plot the fitted model on a scatter plot of the data.
> plot(log(gala$Area), log(gala$Species), xlab='log(Area)', ylab='log(Species)')
> abline(model4,col=2,lty=2)
> mtext('Model 4: log-Arrhenius model', side=3, line=.5)
- The graph suggests a very good fit also.
Model 5: Poisson model
- I carry out Poisson regression with a log link using log(Area) as the predictor.
> model5<-glm(Species~log(Area),data=gala,family=poisson)
> logLik(model5)
'log Lik.' -398.0133 (df=2)
> AIC(model5)
[1] 800.0266
- From the AIC we can see the fit is quite poor. I don't bother to plot the fitted curve.
Model 6: Negative binomial model
- I carry out negative binomial regression with a log link using log(Area) as the predictor.
> model6<-glm.nb(Species~log(Area),data=gala)
> logLik(model6)
'log Lik.' -133.9963 (df=3)
> AIC(model6)
[1] 273.9926
> model6$theta
[1] 2.634311
- This is the lowest AIC obtained thus far. I plot the results. The value of theta suggests a modest amount of overdispersion explaining why the Poisson model provided such a poor fit.
> coef(model6)
(Intercept) log(Area)
3.1495864 0.3669952
> NB.func<-function(x) exp(coef(model6)[1]+ coef(model6)[2]*x)
> plot(log(gala$Area), gala$Species, xlab='log(Area)', ylab='Species')
> lines(seq(-5,9,.1), NB.func(seq(-5,9,.1)), lty=2, col=2)
> mtext('Model 6: negative binomial',side=3,line=.5)
A formal comparison of the models
- Because our sample size is small, n = 29, we should really be comparing models using AICc, rather than AIC. In truth because all but one of our models has the same number of parameters, this adjustment will not affect the results. I adapt our function from last time for carrying out the Burnham & Anderson (2002) protocol for model comparison. I begin by collecting the names of our models.
> model.names<-c('Linear','Gleason','Arrhenius',
'Log-Arrhenius','Poisson','Neg Binom')
- Next I concatenate all the calculated loglikelihoods together.
> loglike <-c(logLik(model1), logLik(model2), logLik(model3), model4.vals[[1]], logLik(model5), logLik(model6))
- Next I concatenate a list of the number of parameters estimated for each model. For the four normal-based models there are three parameters—β0, β1, and σ2. For the Poisson there are only two, β0 and β1. For the negative binomial there are three, β0, β1, and θ.
> numparms<-c(3,3,3,3,2,3)
- Finally I modify our function from last time to carry out the calculations.
> AIC.func<-function(LL,K,n,modelnames)
{
#LL is loglikelihood,
#K is number of estimated parameters
#n is the sample size
AIC<- -2*LL + 2*K
AICc<-AIC + 2*K*(K+1)/(n-K-1)
output<-cbind(LL,K,AIC,AICc)
colnames(output)<-c('LogL','K','AIC','AICc')
minAICc<-min(output[,"AICc"])
deltai<-output[,"AICc"]-minAICc
rel.like<-exp(-deltai/2)
wi<-round(rel.like/sum(rel.like),3)
out<-data.frame(modelnames,output,deltai,wi)
out
}
- The function is far simpler than last time. Rather than a loop I've vectorized most of the operations. The results of running the AIC.func function on the models we've fit in this exercise are shown below.
> AIC.func(loglike,numparms,dim(gala)[1],model.names)
modelnames LogL K AIC AICc deltai wi
1 Linear -171.5861 3 349.1721 350.1321 75.1794767 0.000
2 Gleason -164.5773 3 335.1547 336.1147 61.1620682 0.000
3 Arrhenius -158.8675 3 323.7350 324.6950 49.7423747 0.000
4 Log-Arrhenius -134.1195 3 274.2390 275.1990 0.2464048 0.469
5 Poisson -398.0133 2 800.0266 800.4882 525.5355434 0.000
6 Neg Binom -133.9963 3 273.9926 274.9526 0.0000000 0.531
- Based on the output we see there are only two models that have any empirical support, the log-Arrhenius model and the negative binomial model using log-transformed area. None of the other models are even players.
- While the negative binomial model just edges out the log-Arrhenius model, these two models are for all intents and purposes equivalent in terms of AIC. So using AIC alone I really have no basis for choosing one over the other.
- From a practical point of view the choice between these two models is a no-brainer.
- The negative binomial model operates within the original scale of the response. It predicts species richness.
- The log-Arrhenius model operates on a log-transformed scale of the response. While there is a temptation to just back-transform the log-Arrhenius model to the original scale of the response by exponentiating both sides of the equation, such an approach is technically invalid. Our fitted model is for the mean of the log of species richness. When we back transform we do not get the mean of species richness.

Cited References
- Burnham, K. P. and D. R. Anderson. 2002. Model Selection and Multimodel Inference. Springer-Verlag: New York.
- Crawley, Michael J. 2002. Statistical Computing: An Introduction to Data Analysis Using S-Plus. Wiley: New York.
- Hamilton, Terrell H., Ira Rubinoff, Robert H. Barth, Jr., and Guy L. Bush. 1963. Species abundance: natural regulation of insular variation. Science 142:1575–1576.
- Johnson, Michael P. and Peter H. Raven. 1973. Species number and endemism: The Galapagos archipelago revisited. Science 179: 893–895.
- Venables, W. N. and B. D. Ripley. 2002. Modern Applied Statistics with S. Springer-Verlag: New York.
Course Home Page