Lecture 23 (lab 6) —February 21, 2006

What was covered?

R functions and commands demonstrated

R function options

R packages used

Using GLIMs to model the Crawley slug data set

> #obtain data
> slugs<-read.table( 'http://www.bio.ic.ac.uk/research/mjcraw/statcomp/data/slugsurvey.txt', header=TRUE)
> names(slugs)
[1] "slugs" "field"

Poisson models

The "common mean Poisson model" as a GLIM.

> out1.glm<-glm(slugs~1, data=slugs, family=poisson)

> summary(out1.glm)

Call:
glm(formula = slugs ~ 1, family = poisson, data = slugs)

Deviance Residuals:
    Min      1Q  Median     3Q    Max
-1.8841 -1.8841 -0.6343 0.8360 4.2574

Coefficients:
           Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.57380    0.08391   6.838 8.03e-12 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for poisson family taken to be 1)

Null deviance:     224.86 on 79 degrees of freedom
Residual deviance: 224.86 on 79 degrees of freedom
AIC: 355.68

Number of Fisher Scoring iterations: 5

In order to obtain an estimate of the mean slug density we need to exponentiate the result. We can extract the coefficient estimates from the GLIM with the coef function.

> coef(out1.glm)
(Intercept)
0.5738004

> exp(coef(out1.glm))
(Intercept)
      1.775

> poi.1<-function(data,p) -sum(log(dpois(data$slugs,lambda=p)))
> nlm(function(p) poi.1(slugs,p),2)->out1
> out1$estimate

[1] 1.774999

> out1.glm.id<-glm(slugs~1,data=slugs,family=poisson(link=identity))
> coef(out1.glm.id)

(Intercept)
      1.775

names(out1.glm)
 [1]      "coefficients" "residuals" "fitted.values"       "effects"
 [5]                 "R"      "rank"            "qr"        "family"
 [9] "linear.predictors"  "deviance"           "aic" "null.deviance"
[13]              "iter"   "weights" "prior.weights"   "df.residual"
[17]           "df.null"         "y"     "converged"      "boundary"
[21]             "model"      "call"       "formula"         "terms"
[25]              "data"    "offset"       "control"        "method"
[29]         "contrasts"   "xlevels"

> out1.glm$deviance
[1] 224.8594
> out1.glm$df.residual
[1] 79
> out1.glm$aic
[1] 355.6766

These are also reported as part of the output from the summary function.

> out1.glm$deviance/out1.glm$df.residual
[1] 2.846321

> logLik(out1.glm)
'log Lik.' -176.8383 (df=1)
> AIC(out1.glm)
[1] 355.6766

The degrees of freedom reported for the loglikelihood refers to the number of parameters that were estimated in fitting the model. The numbers reported here match the values we obtained last time by minimizing the negative loglikelihood directly.

> out1$minimum
[1] 176.8383
> my.aic(out1)
[1] 355.6766

The "separate means Poisson model" as a GLIM.

> out2.glm<-glm(slugs~field,data=slugs,family=poisson)
> summary(out2.glm)

Call:
glm(formula = slugs ~ field, family = poisson, data = slugs)

Deviance Residuals:
    Min      1Q  Median     3Q    Max
-2.1331 -1.5969 -0.9519 0.4580 4.8727

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
 (Intercept)   0.2429     0.1400  1.735 0.082744 .
fieldRookery   0.5790     0.1749  3.310 0.000932 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for poisson family taken to be 1)

Null deviance:     224.86 on 79 degrees of freedom
Residual deviance: 213.44 on 78 degrees of freedom
AIC: 346.26

Number of Fisher Scoring iterations: 6

> slugs$field
 [1] Nursery Nursery Nursery Nursery Nursery Nursery Nursery Nursery Nursery
[10] Nursery Nursery Nursery Nursery Nursery Nursery Nursery Nursery Nursery
[19] Nursery Nursery Nursery Nursery Nursery Nursery Nursery Nursery Nursery
[28] Nursery Nursery Nursery Nursery Nursery Nursery Nursery Nursery Nursery
[37] Nursery Nursery Nursery Nursery Rookery Rookery Rookery Rookery Rookery
[46] Rookery Rookery Rookery Rookery Rookery Rookery Rookery Rookery Rookery
[55] Rookery Rookery Rookery Rookery Rookery Rookery Rookery Rookery Rookery
[64] Rookery Rookery Rookery Rookery Rookery Rookery Rookery Rookery Rookery
[73] Rookery Rookery Rookery Rookery Rookery Rookery Rookery Rookery
Levels: Nursery Rookery

The statement at the end of the output: "Levels: Nursery Rookery" indicates that R recognizes the variable field is a categorical variable (what R calls a factor) with two levels. Internally the variable field is represented differently. To see the internal coding scheme that R has used use the contrasts function.

> contrasts(slugs$field)
        Rookery
Nursery       0
Rookery       1

> exp(coef(out2.glm)[1])
(Intercept)
      1.275

> exp(sum(coef(out2.glm)))
[1] 2.275

> out2.glm.alt<-glm(slugs~field,data=slugs,family=poisson(link=identity))
> coef(out2.glm.alt)
(Intercept) fieldRookery
      1.275        1.000

> # mean density for rookery slugs
> sum(coef(out2.glm.alt))
[1] 2.275

> poi.2<-function(data,p) {
field.dummy<-as.numeric(data$field)-1
mylambda<-p[1]+p[2]*field.dummy
negloglike<- -sum(log(dpois(data$slugs,lambda=mylambda)))
negloglike
}

nlm(function(p) poi.2(slugs,p),c(1.2,1))->out2

> out2$estimate
[1] 1.2749997 0.9999997

> out2.glm.alt2<-glm(slugs~field-1, data=slugs, family=poisson(link=identity))
> summary(out2.glm.alt2)

Call:
glm(formula = slugs ~ field - 1, family = poisson(link = identity),
data = slugs)

Deviance Residuals:
    Min      1Q  Median     3Q    Max
-2.1331 -1.5969 -0.9519 0.4580 4.8727

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
fieldNursery   1.2750     0.1785   7.141 9.24e-13 ***
fieldRookery   2.2750     0.2385   9.539 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for poisson family taken to be 1)

Null deviance:        Inf on 80 degrees of freedom
Residual deviance: 213.44 on 78 degrees of freedom
AIC: 346.26

Number of Fisher Scoring iterations: 3

Now the estimated coefficients are the means in the two fields, as their labels suggest. Obtaining this simplification has come at a cost though. Now the statistical tests that appear in the coefficient table are tests of whether the individual estimates of mean density are significantly different from 0. Since we found slugs in both fields we already know the mean density is different from zero; a significance test of this fact is just a test of the obvious. When the model is formulated with an intercept at least one of the tests is interesting because it tests whether the mean density is the same in the two fields.

Log-transform normal models

> #fit as a generalized linear model
> glm(log(slugs+1)~1,data=slugs,family=gaussian)

Call: glm(formula = log(slugs + 1) ~ 1, family = gaussian, data = slugs)

Coefficients:
(Intercept)
     0.7332

Degrees of Freedom: 79 Total (i.e. Null); 79 Residual
Null Deviance:     43.43
Residual Deviance: 43.43 AIC: 182.2


> #fit as an ordinary regression model
> lm(log(slugs+1)~1,data=slugs)

Call:
lm(formula = log(slugs + 1) ~ 1, data = slugs)

Coefficients:
(Intercept)
     0.7332

Negative binomial regression

> library(MASS)
> out3.nb<-glm.nb(slugs~1,data=slugs)
> summary(out3.nb)

Call:
glm.nb(formula = slugs ~ 1, data = slugs, init.theta = 0.715567208373664,
link = log)

Deviance Residuals:
    Min      1Q  Median     3Q    Max
-1.3360 -1.3360 -0.3625 0.4198 1.8176

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)   0.5738     0.1566   3.665 0.000247 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for Negative Binomial(0.7156) family taken to be 1)

Null deviance:     82.736 on 79 degrees of freedom
Residual deviance: 82.736 on 79 degrees of freedom
AIC: 292.80

Number of Fisher Scoring iterations: 1

Theta: 0.716
Std. Err.: 0.196

2 x log-likelihood: -288.796

> out4.nb<-glm.nb(slugs~field,data=slugs)
> summary(out4.nb)

Call:
glm.nb(formula = slugs ~ field, data = slugs, init.theta = 0.785931336346756,
link = log)

Deviance Residuals:
    Min      1Q  Median     3Q    Max
-1.4619 -1.2310 -0.5296 0.2241 2.3430

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept)    0.2429     0.2268   1.071 0.2840
fieldRookery   0.5790     0.3069   1.886 0.0592 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for Negative Binomial(0.7859) family taken to be 1)

Null deviance:     86.818 on 79 degrees of freedom
Residual deviance: 83.256 on 78 degrees of freedom
AIC: 291.35

Number of Fisher Scoring iterations: 1

Correlation of Coefficients:
(Intercept)
fieldRookery -0.74

Theta: 0.786
Std. Err.: 0.224

2 x log-likelihood: -285.350

Fitting species area curves to the Galapagos Islands plant richness data set

  1. a form in which species richness is linearly related to area, with normally distributed errors, and
  2. a form in which log-transformed species richness is linearly related to log-transformed area, with normally-distributed errors.
  1. Linear model: with identity link such that .
  2. Gleason model: with identity link such that .
  3. Arrhenius model: with identity link such that .
  4. Log-Arrhenius model: with identity link such that .
  5. Poisson GLIM: with log link such that .
  6. Negative binomial GLIM: with log link such that .

http://www.ibiblio.org/pub/academic/biology/ecology+evolution/teaching/weisberg/galapago.dat.

I've placed a cleaned up version of the data set at our class web site. The file is space-delimited with the variable names appearing in the first row.

> gala<- read.table('http://www.unc.edu/courses/2006spring/ecol/145/001/data/lab6/galapagos.txt', header=TRUE)
> names(gala)
[1]    "Island"      "Species"       "Endemics"              "Area"
[5] "Elevation" "Nearest.dist" "Santacruz.dist" "Adjacent.isl.area"

     

Fig. 1  Scatter plots of the species-area relation on different scales: S vs. A, S vs. log(A), and log(S) vs. log(A)

Model 1: Linear Model

> model1<-lm(Species~Area,data=gala)
> logLik(model1)
'log Lik.' -171.5861 (df=3)
> AIC(model1)
[1] 349.1721

> plot(gala$Area,gala$Species, xlab='Area', ylab='Species')
> abline(model1,col=2, lty=2)
> mtext('Model 1: linear model', side=3, line=.5)

From the scatter plot alone we would conclude that this is a terrible fit.

Model 2: Gleason Model

Fig. 3  Model 2

> model2<-lm(Species~log(Area),data=gala)
> logLik(model2)
'log Lik.' -164.5773 (df=3)
> AIC(model2)
[1] 335.1547

> plot(log(gala$Area), gala$Species, xlab='log(Area)', ylab='Species')
> abline(model2,col=2,lty=2)
> mtext( 'Model 2: Gleason model', side=3, line=.5)

Model 3: Arrhenius Model

Fig. 4  Model 3

> model3<-nls(Species~b0*Area^b1, data=gala, start=list(b0=1,b1=.5))
> model3
Nonlinear regression model
model: Species ~ b0 * Area^b1
data: gala
        b0        b1
33.3976021 0.2985702
residual sum-of-squares: 97323.73

> logLik(model3)
'log Lik.' -158.8675 (df=2)
> #incorrect AIC
> AIC(model3)
[1] 321.735

> #correct AIC
> -2*logLik(model3)+2*(length(coef(model3))+1)
[1] 323.735

> arrhenius.func<-function(x) coef(model3)[1]*x^(coef(model3)[2])
> plot(gala$Area,gala$Species,xlab='Area',ylab='Species')
> lines(seq(0,5000,100),arrhenius.func(seq(0,5000,100)),lty=2,col=2)
> mtext('Model 3: Arrhenius model',side=3,line=.5)

Model 4: Log-Arrhenius Model

Fig. 5  Model 4

> model4<-lm(log(Species)~log(Area), data=gala)

> norm.loglike<-function(data,model)
{
t.y<-log(data$Species)
sigma2<-(sum(residuals(model)^2))/dim(data)[1]
loglike<-sum(log(dnorm(t.y, mean=predict(model), sd=sqrt(sigma2))*1/(data$Species)))
out<-list(loglike, c(coef(model), sigma2))
out
}

> norm.loglike(gala,model4)
[[1]]
[1] -134.1195

[[2]]
(Intercept) log(Area)
2.8338361 0.4042697 0.5346083

> norm.loglike(gala,model4)->model4.vals
> #correct AIC
> -2*model4.vals[[1]]+2*length(model4.vals[[2]])
[1] 274.2390

> plot(log(gala$Area), log(gala$Species), xlab='log(Area)', ylab='log(Species)')
> abline(model4,col=2,lty=2)
> mtext('Model 4: log-Arrhenius model', side=3, line=.5)

Model 5: Poisson model

> model5<-glm(Species~log(Area),data=gala,family=poisson)
> logLik(model5)
'log Lik.' -398.0133 (df=2)
> AIC(model5)
[1] 800.0266

Fig. 6  Model 6

Model 6: Negative binomial model

> model6<-glm.nb(Species~log(Area),data=gala)
> logLik(model6)
'log Lik.' -133.9963 (df=3)
> AIC(model6)
[1] 273.9926
> model6$theta
[1] 2.634311

> coef(model6)
(Intercept) log(Area)
3.1495864 0.3669952

> NB.func<-function(x) exp(coef(model6)[1]+ coef(model6)[2]*x)
> plot(log(gala$Area), gala$Species, xlab='log(Area)', ylab='Species')
> lines(seq(-5,9,.1), NB.func(seq(-5,9,.1)), lty=2, col=2)
> mtext('Model 6: negative binomial',side=3,line=.5)

A formal comparison of the models

> model.names<-c('Linear','Gleason','Arrhenius',
'Log-Arrhenius','Poisson','Neg Binom')

> loglike <-c(logLik(model1), logLik(model2), logLik(model3), model4.vals[[1]], logLik(model5), logLik(model6))

> numparms<-c(3,3,3,3,2,3)

> AIC.func<-function(LL,K,n,modelnames)
{
#LL is loglikelihood,
#K is number of estimated parameters
#n is the sample size
AIC<- -2*LL + 2*K
AICc<-AIC + 2*K*(K+1)/(n-K-1)
output<-cbind(LL,K,AIC,AICc)
colnames(output)<-c('LogL','K','AIC','AICc')
minAICc<-min(output[,"AICc"])
deltai<-output[,"AICc"]-minAICc
rel.like<-exp(-deltai/2)
wi<-round(rel.like/sum(rel.like),3)
out<-data.frame(modelnames,output,deltai,wi)
out
}

> AIC.func(loglike,numparms,dim(gala)[1],model.names)
     modelnames      LogL K      AIC     AICc      deltai    wi
1        Linear -171.5861 3 349.1721 350.1321  75.1794767 0.000
2       Gleason -164.5773 3 335.1547 336.1147  61.1620682 0.000
3     Arrhenius -158.8675 3 323.7350 324.6950  49.7423747 0.000
4 Log-Arrhenius -134.1195 3 274.2390 275.1990   0.2464048 0.469
5       Poisson -398.0133 2 800.0266 800.4882 525.5355434 0.000
6     Neg Binom -133.9963 3 273.9926 274.9526   0.0000000 0.531

Cited References

Course Home Page


Jack Weiss
Phone: (919) 962-5930
E-Mail: jack_weiss@unc.edu
Address: Curriculum in Ecology, Box 3275, University of North Carolina, Chapel Hill, 27516
Copyright © 2006
Last Revised--March 13, 2006
URL: http://www.unc.edu/courses/2006spring/ecol/145/001/docs/lectures/lecture23.htm