Lecture 7 (lab 2) —Tuesday, January 24, 2006

What was covered?

R functions and commands demonstrated

R function options

Modeling the Mean-Variance Relation in Data

#read in data from Web
> corals<-read.table('http://www.unc.edu/courses/2006spring/ecol/145/001/data/lab1/corals.csv',
header=TRUE,sep=',')

Fig. 1 Scatter plot: Prevalence vs. WSSTA

> table(corals$WSSTA)

0 1 2 3 4 5 6 7 8 9 10 11 13
3 4 5 9 3 6 5 5 1 1  2  2  1

> quantile(corals$WSSTA)
0%  25% 50% 75% 100%
0.0 2.5 4.0 6.5 13.0

From the output we see that 25% of the data have a WSSTA 2.5 or below, 50% have a WSSTA value of 4.0 or below, etc. In choosing the categories we need to balance the need to have enough observations in each category to calculate a variance with the need to have enough categories to estimate a regression model for the variance. With only 47 observations the options are slim. I opt for five categories and so use R's quantile function to calculate quintiles.

> quantile(corals$WSSTA,seq(0,1,.2))
0% 20% 40% 60% 80% 100%
 0   2   3   5   7   13

> seq(0,1,.2)
[1] 0.0 0.2 0.4 0.6 0.8 1.0

> cut(corals$WSSTA,quantile(corals$WSSTA,seq(0,1,.2)))
 [1] (2,3] (5,7] <NA> (5,7] (0,2] (5,7] (0,2] (7,13] (5,7] (5,7] (3,5]
[12] (2,3] (7,13] (5,7] (5,7] (3,5] (5,7] (7,13] (7,13] (7,13] (7,13] (2,3]
[23] (2,3] (3,5] (3,5] (3,5] (3,5] (2,3] (7,13] (5,7] (3,5] (2,3] (0,2]
[34] (0,2] (3,5] (5,7] (0,2] (2,3] (3,5] (2,3] (0,2] (2,3] (0,2] <NA>
[45] (0,2] (0,2] <NA>
Levels: (0,2] (2,3] (3,5] (5,7] (7,13]

> cut(corals$WSSTA, quantile(corals$WSSTA,seq(0,1,.2)), include.lowest=TRUE)
 [1] (2,3] (5,7] [0,2] (5,7] [0,2] (5,7] [0,2] (7,13] (5,7] (5,7] (3,5]
[12] (2,3] (7,13] (5,7] (5,7] (3,5] (5,7] (7,13] (7,13] (7,13] (7,13] (2,3]
[23] (2,3] (3,5] (3,5] (3,5] (3,5] (2,3] (7,13] (5,7] (3,5] (2,3] [0,2]
[34] [0,2] (3,5] (5,7] [0,2] (2,3] (3,5] (2,3] [0,2] (2,3] [0,2] [0,2]
[45] [0,2] [0,2] [0,2]
Levels: [0,2] (2,3] (3,5] (5,7] (7,13]

Now things were done correctly. We can also see the numerical coding for the categories by applying the as.numeric function to the categories.

> as.numeric(cut(corals$WSSTA, quantile(corals$WSSTA, seq(0,1,.2)), include.lowest=TRUE))
 [1] 2 4 1 4 1 4 1 5 4 4 3 2 5 4 4 3 4 5 5 5 5 2 2 3 3 3 3 2 5 4 3 2 1 1 3 4 1 2 3 2
[41] 1 2 1 1 1 1 1

> table(cut(corals$WSSTA, quantile(corals$WSSTA, seq(0,1,.2)), include.lowest=TRUE))

[0,2] (2,3] (3,5] (5,7] (7,13]
   12    9     9     10      7

The counts are not balanced but it is probably the best that can be done with discrete data. Let's save these categories in a variable. I call it WSSTA.quints.

> WSSTA.quints<-cut(corals$WSSTA, quantile(corals$WSSTA, seq(0,1,.2)), include.lowest=TRUE)

Fig. 2 Variance vs Mean

#obtain means and variances by category
> tapply(corals$PREV_1, WSSTA.quints,mean)->means
> tapply(corals$PREV_1, WSSTA.quints,var)->vars
> plot(means, vars)

Math Symbols

> demo(plotmath)

This runs you through a series of screens in which all the possibilities involving mathematical expressions are displayed.

paste("Mean ",mu)

expression(paste("Mean ",mu))

which causes R to treat everything in the parentheses of expression as an expression rather than as separate objects. When added to a graph mu will get translated into the Greek letter μ. Note: you can't see the results of this code in the console window; it will only be translated correctly when it is added as text to a graph.

expression(paste("Variance ",sigma^2))

Using the standard English spelling of the Greek letters will correctly identify them to R so that sigma becomes σ. The ^2 creates the exponent. Notice that I've included a space at the end of the text string to ensure proper spacing between the text and the Greek letter. By default, R inserts spaces between elements being pasted together (unless you tell it otherwise with the sep argument), but I find that this does not seem to work with mathematical expressions so I specify the spacing explicitly as part of the quoted string.

> plot(means, vars, xlab=expression(paste("Mean ", mu)), ylab=expression(paste("Variance ", sigma^2)))

Customizing the Axes

Fig. 3 Graph with labels

#improve axis labels
> plot(means, vars, axes=FALSE, xlab=expression(paste("Mean ",mu)), ylab=expression(paste("Variance ",sigma^2)),xlim=c(0,150), cex=1.5)
> axis(1,cex.axis=.9)
> axis(2,cex.axis=.9)
> box()

Fitting the Models

which is the same answer we get when the exponent is omitted. R simply ignores the arithmetic we are trying to do when the I function is not used.

> lm(vars~offset(means)+means-1)

Call:
lm(formula = vars ~ offset(means) + means - 1)

Coefficients:
means
169.5

> lm(vars~means-1)

Call:
lm(formula = vars ~ means - 1)

Coefficients:
means
170.5

Observe that the estimated coefficient is now 1 more than it was in the model with an offset.

> lm((vars-means)~I(means^2)-1)

Call:
lm(formula = (vars - means) ~ I(means^2) - 1)

Coefficients:
I(means^2)
1.292

Plotting the Curves

> quad.coef<-coef(lm(vars~offset(means)+I(means^2)-1))
> quad.coef
I(means^2)
1.291726

It is always better to extract information from models directly rather than copying the numbers that appear on the screen and entering them by hand. The extracted numbers will have full precision (unlike the rounded values that are shown on the screen).

  1. A name for the function. I call mine quad.func
  2. A left pointing assignment arrow <-
  3. The keyword function followed by a pair of parentheses which contains a list of arguments separated by commas.
  4. The formula for the function.

Here's how I construct the quadratic function that corresponds to the negative binomial model. Observe that the coefficient of x2 is the regression coefficient that I previously saved.

> quad.func<-function(x) x+quad.coef*x^2
> quad.func(1)
I(means^2)
2.291726

> quad.func(1:2)
[1] 2.291726 7.166905

#NB2 model
> lines(seq(10,150,10), quad.func(seq(10,150,10)), col=1, lty=1, lwd=2)

Fig. 4 Estimated Models

> lm(vars~means-1)

Call:
lm(formula = vars ~ means - 1)

Coefficients:
means
170.5

I use the abline function to draw the regression line, specifying the intercept as zero and using the coef function to extract the slope from the model. I draw it as a red dotted line.

#add quasi-Poisson (NB1)
> abline(0, coef(lm(vars~means-1)), col=2, lty=2, lwd=2)

#add Poisson
> abline(0,1,col=3,lty=1,lwd=2)

Adding a legend

Fig. 5 Estimated Models with Legend

> legend(0,30000, c('Poisson', 'quasi-Poisson (NB1)', 'negative binomial (NB2)'), col=c(3,2,1), lwd=c(2,1,2), lty=c(1,2,1), cex=c(.8,.8,.8), bty='n')

> mtext('Mean-Variance Relationship for Disease Prevalence', side=3, line=.5, font=2, cex=.9)

Generating Data from Probability Distributions

> ?dpois

Density, distribution function, quantile function and random generation for the Poisson distribution with parameter lambda.

Usage
dpois(x, lambda, log = FALSE)
ppois(q, lambda, lower.tail = TRUE, log.p = FALSE)
qpois(p, lambda, lower.tail = TRUE, log.p = FALSE)
rpois(n, lambda)

> rpois(10,1)
[1] 0 0 0 1 1 0 2 0 1 4

Changing this to λ = 8 yields

> rpois(10,8)
[1] 7 8 10 12 8 6 9 4 10 10

If I try λ = 1 again, I get a completely different set of observations, as should be the case.

> rpois(10,1)
[1] 2 1 0 2 0 2 0 2 1 1

> set.seed(100)
> rpois(10,1)
[1] 0 0 1 0 1 1 2 1 1 0
> set.seed(100)
> rpois(10,1)
[1] 0 0 1 0 1 1 2 1 1 0

Fig. 6 Histogram of simulation data

Observe that by specifying the same seed, 100, before using the rpois function again, we get exactly the same set of "random" numbers.

> out.pois<-rpois(1000,2)

> hist(out.pois)

> table(out.pois)
out.pois
  
0   1   2   3  4  5  6 7 8
124 268 268 185 90 44 14 5 2

Fig. 7 Bar plot of simulated data

> barplot(table(out.pois))

> barplot(table(out.pois), ylim=c(0,300))
> box()

> dpois(0:8,2)
[1] 0.1353352832 0.2706705665 0.2706705665 0.1804470443 0.0902235222 0.0360894089
[7] 0.0120298030 0.0034370866 0.0008592716

> sum(dpois(0:8,2))
[1] 0.9997626

> expected.freq<-1000*dpois(0:9,2)
> expected.freq
[1] 135.3352832 270.6705665 270.6705665 180.4470443 90.2235222 36.0894089
[7]  12.0298030   3.4370866   0.8592716

> coords<-barplot(table(out.pois), ylim=c(0,300))
> coords
[,1]
[1,] 0.7
[2,] 1.9
[3,] 3.1
[4,] 4.3
[5,] 5.5
[6,] 6.7
[7,] 7.9
[8,] 9.1
[9,] 10.3

Fig. 8 Observed and expected frequencies

> points(coords, expected.freq ,pch=16, cex=1.5)
> lines(coords, expected.freq)
> legend(6, 300, 'expected frequencies', lty=1, pch=16, col=1, cex=.8, bty='n')

> mtext(expression(paste('1000 simulations from a Poisson ', (lambda==2), ' distribution')), side=3, line=.5)

Comparing Negative Binomial Distributions for Different μ and θ

> ?dnbinom

dnbinom(x, size, prob, mu, log = FALSE)

The alternative parametrization (often used in ecology) is by the mean mu, and size, the dispersion parameter, where prob = size/(size+mu). The variance is mu + mu^2/size in this parametrization or n (1-p)/p^2 in the first one.

> dnbinom(0:10,size=1,mu=1)
[1] 0.5000000000 0.2500000000 0.1250000000 0.0625000000 0.0312500000 0.0156250000
[7] 0.0078125000 0.0039062500 0.0019531250 0.0009765625 0.0004882812

> dnbinom(0:10,size=1,mu=5)
[1] 0.16666667 0.13888889 0.11574074 0.09645062 0.08037551 0.06697960 0.05581633
[8] 0.04651361 0.03876134 0.03230112 0.02691760

> plot(0:10, dnbinom(0:10, size=.1, mu=1), type='n', xlab='k', ylab='P(X=k)')

type option
Plot style
type="p"
Points (the default)
type="l"
Lines connecting the data
type="b"
Both (points and lines between points)
type="h"
Height bars (vertical lines connecting point to x-axis)
type="o"
Overlaid points and connecting lines
type="s"
Stairsteps
type="n"
Nothing

Since I specify type='n' I'm choosing to plot nothing. In doing so I'm just using the plot command to set up the limits on the x- and y-axes, draw the axes, and display the labels.

> points(0:10,dnbinom(0:10,size=.1,mu=1),pch=16)
> lines(0:10,dnbinom(0:10,size=.1,mu=1))
> points(0:10,dnbinom(0:10,size=10,mu=1),pch=16,col=2)
> lines(0:10,dnbinom(0:10,size=10,mu=1),col=2)
> points(0:10,dnbinom(0:10,size=.1,mu=5),pch=21)
> lines(0:10,dnbinom(0:10,size=.1,mu=5),lty=2)
> points(0:10,dnbinom(0:10,size=10,mu=5),pch=21,col=2)
> lines(0:10,dnbinom(0:10,size=10,mu=5),col=2,lty=2)

Fig. 9 Negative binomial distributions

> legend(6,.8, c(expression(paste(mu==1, ', ', theta==.1)),
expression(paste(mu==1,', ', theta==10)),
expression(paste(mu==5,', ', theta==.1)),
expression(paste(mu==5,', ', theta==10))), col=c(1,2,1,2),
pch=c(16,16,21,21), lty=c(1,1,2,2), cex=rep(.9,4), bty='n')

> rep(.9,4)
[1] 0.9 0.9 0.9 0.9

> mtext(expression(paste('Negative binomial distributions for varying ',mu,' and ',theta)), side=3, line=.5)

Course Home Page


Jack Weiss
Phone: (919) 962-5930
E-Mail: jack_weiss@unc.edu
Address: Curriculum in Ecology, Box 3275, University of North Carolina, Chapel Hill, 27516
Copyright © 2006
Last Revised--Jan 26, 2006
URL: http://www.unc.edu/courses/2006spring/ecol/145/001/docs/lectures/lecture7.htm