Lecture 7 (lab 2) —Tuesday, January 24, 2006
What was covered?
- R topics
- turning a continuous variable into a categorical one
- writing functions
- adding legends to graphs
- modifying axes on graphs
- incorporating mathematical expressions and Greek letters in graphics text
- bar plots
- probability functions
- modeling mean-variance relations in data
- generating data from probability distributions
- the influence of θ and μ on the negative binomial probability mass function
R functions and commands demonstrated
- axis graphics function that is used to modify features of graphics axes
- barplot produces a bar plot of frequency data
- box places a box around a graph
- coef extracts the coefficients from a regression model object
- cut defines categories for a continuous variable using user-specified cut points. By default categories are half-open intervals (open on left) and the first cut point (which defines the minimum value) is not part of the first interval
- demo is used to run various R demonstration scripts. demo(plotmath) illustrates samples and syntax of mathematical expressions
- dnbinom probability mass function for a negative binomial distribution
- dpois probability mass function for a Poisson distribution
- expression creates an object of type expression. Useful for incorporating mathematical expressions and Greek letters as part of text labels
- function keyword in defining a function
- hist draws a histogram
- I is the identity function. Used in formulas to force R to perform arithmetic on a variable before fitting the model
- legend adds a legend to the currently active graph
- offset is the offset function. It is used in model formula to prevent the coefficient of a term from being estimated, instead constraining the term to have coefficient 1
- paste used to concatenate disparate pieces of text, numeric values of variables, and mathematical expressions into a single string of text
- quantile calculates quantiles of a variable at user-specified values. By default, it returns quartiles, min, and max.
- rep generates a vector all of whose elements are the same
- rpois generates random numbers from a Poisson distribution
- set.seed can be used to specify the initial seed for random number generation. Useful if you wish to be able to generate the same "random" stream again
- seq used to generate a sequence of numbers starting from and ending at specified values and increasing (or decreasing) by specified increments
- tapply very useful function for calculating a statistic (mean, variance, etc.) for a variable separately for the groups specified by a grouping variable
- var calculates the variance of its argument
- -> the reverse assignment operator in R, a dash followed by a greater than symbol, that is supposed to symbolize an arrow. The arrow points in the direction of assignment.
R function options
- axes= (argument to plot) takes on values TRUE or FALSE. Specifying axes=FALSE prevents printing of axes and tick marks (but not axis labels) and the graphics box. Customized axes can be then added using the axis function
- bty= (argument to legend) specifies the type of box to be drawn around the legend. bty='n' suppresses the box
- cex= (argument to many graphics functions) specifies the character expansion ratio. Default value is 1. It controls size of plotting symbols in the functions plot and points. When used as an option in legend it controls the size of everything
- cex.axis= (argument to axis) adjusts size of labels of tick marks. Default value is 1
- include.lowest= (argument to cut) takes on values TRUE or FALSE. Alters first (or last, if right=FALSE) interval created by the cut function so that the interval is closed (rather than half-open)
- legend= (argument to legend) the text to appear in the legend
- mu= (argument to dnbinom) the μ parameter of the negative binomial distribution
- size= (argument to dnbinom) the θ parameter of the negative binomial distribution
- type= (argument to plot) controls what is plotted. type='p' are points, type='l' are lines, type='b' are both lines and points, type='o' overlays points on lines, type='n' plots nothing and just sets up the axes
- x= (argument to legend) the x-coordinate of the left edge of the legend
- xlim= (argument of plot) specifies the range on the x-axis for plotting, e.g., xlim=c(0,100)
- y= (argument to legend) the y-coordinate of the top edge of the legend
Modeling the Mean-Variance Relation in Data
- Read in the corals data set we used last time.
#read in data from Web
> corals<-read.table('http://www.unc.edu/courses/2006spring/ecol/145/001/data/lab1/corals.csv',
header=TRUE,sep=',')
- As before our interest is in modeling the relationship between disease prevalence (PREV_1) and the temperature metric developed by the researchers, WSSTA. As we noted last time, no matter how we model this relationship it's clear that the variability of prevalence about this regression line will not be constant. This is apparent from the triangular-shaped scatter of points in Fig. 1 with variability increasing from left to right. Thus the variability of prevalence at each level of WSSTA changes depending on the value of WSSTA. This is called heteroscedasticity (in contrast to homoscedasticity where the variance is constant).
- Of the discrete probability distributions we've studied we now know that each implicitly is also a model for heteroscedasticity when framed in terms of the mean of the response variable. In the Poisson distribution the mean and variance of the response are equal. In the negative binomial distribution (NB2) the variance of the response is related to the square of its mean.
- To evaluate these models of heteroscedasticity using actual data we need to be able to group the response in some way so that the mean and variance can be calculated. In simple linear regression the mean of the response is described as a function of the predictor. Thus if the regression model of interest for the corals data set is a reasonable one, we should be able to use the levels of WSSTA to categorize the response. The only requirement is that we have enough observations in each category to obtain a stable estimate of the variance. If we look at the ungrouped levels of WSSTA we see that this is not the case.
> table(corals$WSSTA)
0 1 2 3 4 5 6 7 8 9 10 11 13
3 4 5 9 3 6 5 5 1 1 2 2 1
- The categories that have only 1, 2, or 3 observations will give very poor estimates of the variance in that category. (A category with only one observation has no variance at all!)
- So we need to group these observations. We can do this haphazardly or systematically. A systematic approach that should yield roughly equal numbers of observations in each category is to group by quantiles. Quantiles are observational values that cut off a desired percentage of the data set. By default the quantile function in R returns quartiles.
> quantile(corals$WSSTA)
0% 25% 50% 75% 100%
0.0 2.5 4.0 6.5 13.0
From the output we see that 25% of the data have a WSSTA 2.5 or below, 50% have a WSSTA value of 4.0 or below, etc. In choosing the categories we need to balance the need to have enough observations in each category to calculate a variance with the need to have enough categories to estimate a regression model for the variance. With only 47 observations the options are slim. I opt for five categories and so use R's quantile function to calculate quintiles.
> quantile(corals$WSSTA,seq(0,1,.2))
0% 20% 40% 60% 80% 100%
0 2 3 5 7 13
- I use the seq function to generate the values at which to compute the quantiles. seq generates a vector of numbers starting at a specified minimum (first argument), stopping at a specified maximum (second argument), and moving in increments specified by the third argument. Thus seq(0,1,.2) starts at 0, ends at 1, and moves in increments of 0.2
> seq(0,1,.2)
[1] 0.0 0.2 0.4 0.6 0.8 1.0
- An equivalent way of generating quintiles would have been the following.
> quantile(corals$WSSTA,c(0,.2,.4,.6,.8,1))
0% 20% 40% 60% 80% 100%
0 2 3 5 7 13
- Next I use the quintiles to create the categories of WSSTA . The cut function of R can be used for this purpose. It's first argument is the variable to be cut and its second argument is a list of the places at which the cuts should be made. R will create intervals based on this list, so the first value in the list should be the left hand endpoint of the first interval and the last value of the list should be the right hand endpoint of the last interval. In the function call below I use the quintiles as the cut points.
> cut(corals$WSSTA,quantile(corals$WSSTA,seq(0,1,.2)))
[1] (2,3] (5,7] <NA> (5,7] (0,2] (5,7] (0,2] (7,13] (5,7] (5,7] (3,5]
[12] (2,3] (7,13] (5,7] (5,7] (3,5] (5,7] (7,13] (7,13] (7,13] (7,13] (2,3]
[23] (2,3] (3,5] (3,5] (3,5] (3,5] (2,3] (7,13] (5,7] (3,5] (2,3] (0,2]
[34] (0,2] (3,5] (5,7] (0,2] (2,3] (3,5] (2,3] (0,2] (2,3] (0,2] <NA>
[45] (0,2] (0,2] <NA>
Levels: (0,2] (2,3] (3,5] (5,7] (7,13]
- From the output we see that something has gone wrong. NA is R's missing code so some of the observations were not classified as belonging to any interval. The category labels tell you how the categories were formed. Each category is a half open interval, open on the left. The first interval is (0,2], which excludes zero, but from the output of the table function above we know that there are three observations for which WSSTA = 0. These correspond to the missing categories above. To fixe this we need to add an additional argument to the cut function, include.lowest=TRUE, which instructs R to make the first interval a closed one.
> cut(corals$WSSTA, quantile(corals$WSSTA,seq(0,1,.2)), include.lowest=TRUE)
[1] (2,3] (5,7] [0,2] (5,7] [0,2] (5,7] [0,2] (7,13] (5,7] (5,7] (3,5]
[12] (2,3] (7,13] (5,7] (5,7] (3,5] (5,7] (7,13] (7,13] (7,13] (7,13] (2,3]
[23] (2,3] (3,5] (3,5] (3,5] (3,5] (2,3] (7,13] (5,7] (3,5] (2,3] [0,2]
[34] [0,2] (3,5] (5,7] [0,2] (2,3] (3,5] (2,3] [0,2] (2,3] [0,2] [0,2]
[45] [0,2] [0,2] [0,2]
Levels: [0,2] (2,3] (3,5] (5,7] (7,13]
Now things were done correctly. We can also see the numerical coding for the categories by applying the as.numeric function to the categories.
> as.numeric(cut(corals$WSSTA, quantile(corals$WSSTA, seq(0,1,.2)), include.lowest=TRUE))
[1] 2 4 1 4 1 4 1 5 4 4 3 2 5 4 4 3 4 5 5 5 5 2 2 3 3 3 3 2 5 4 3 2 1 1 3 4 1 2 3 2
[41] 1 2 1 1 1 1 1
- To see how this worked, I apply the table function to the categories.
> table(cut(corals$WSSTA, quantile(corals$WSSTA, seq(0,1,.2)), include.lowest=TRUE))
[0,2] (2,3] (3,5] (5,7] (7,13]
12 9 9 10 7
The counts are not balanced but it is probably the best that can be done with discrete data. Let's save these categories in a variable. I call it WSSTA.quints.
> WSSTA.quints<-cut(corals$WSSTA, quantile(corals$WSSTA, seq(0,1,.2)), include.lowest=TRUE)
- The next step is to find the mean and variance of PREV_1 in each of these categories. A function ideally suited to this task is R's tapply function. tapply stands for "table apply" and it calculates statistics of one variable (first argument) separately for categories defined by a second variable (second argument). The third argument of tapply is the function, which can be any R function that returns a single number, or it can be a function of your own that you construct on the fly. Below I calculate the mean and variance of PREV_1 for each of the categories defined by WSSTA.quints. The variance function in R is var. I then plot the results (Fig. 2).
#obtain means and variances by category
> tapply(corals$PREV_1, WSSTA.quints,mean)->means
> tapply(corals$PREV_1, WSSTA.quints,var)->vars
> plot(means, vars)
- Observe that I chose to place the assignment operator at the end of the expression rather than at the beginning. When placed at the end of an expression the assignment arrow should point from left to right (-> , a dash followed by a greater than symbol) rather than from right to left.
Math Symbols
- Before fitting some models, let's spruce up the labels. R supports mathematical typesetting making it possible to include subscripts, superscripts, Greek letters, etc. in the text labels of graphs. To see what's available type the following.
> demo(plotmath)
This runs you through a series of screens in which all the possibilities involving mathematical expressions are displayed.
- On the x-axis I want the label to be "Mean μ" and on the y-axis I want it to be "Variance σ2". This requires combining text with mathematical symbols. R's paste function is used to combine objects of different types into a single text string. So to get, Mean μ, I might try the following.
paste("Mean ",mu)
- This would combine the string "Mean " with mu. As written, R would look in the current workspace for an object with the name mu. No such object exists so we would get an error message. What's missing is that we need to sandwich this function inside the expression function, as follows,
expression(paste("Mean ",mu))
which causes R to treat everything in the parentheses of expression as an expression rather than as separate objects. When added to a graph mu will get translated into the Greek letter μ. Note: you can't see the results of this code in the console window; it will only be translated correctly when it is added as text to a graph.
- To get the label I want on the y-axis a similar construction will work.
expression(paste("Variance ",sigma^2))
Using the standard English spelling of the Greek letters will correctly identify them to R so that sigma becomes σ. The ^2 creates the exponent. Notice that I've included a space at the end of the text string to ensure proper spacing between the text and the Greek letter. By default, R inserts spaces between elements being pasted together (unless you tell it otherwise with the sep argument), but I find that this does not seem to work with mathematical expressions so I specify the spacing explicitly as part of the quoted string.
- To implement all this I need to reissue a plot command with the above text inserted as the values of xlab and ylab.
> plot(means, vars, xlab=expression(paste("Mean ", mu)), ylab=expression(paste("Variance ", sigma^2)))
Customizing the Axes
- Before producing the plot I want to do one more thing. Observe that in Fig. 2 not all the tick marks on the y-axis have labels. This is because the labeling text is so long that it causes text at neighboring tick marks to interfere with each other. R "solves" the problem by not printing it. To get all the tick mark labels to appear we need to customize the axes. This is done through the use of the axis function. The process proceeds in four steps.
- Issue a plot command but include as one of the arguments axes=FALSE. This prevents the default x-axis and y-axis from being drawn. It also suppresses the default box that is drawn around the plotting area. Points are still plotted and labels still appear, but no axes are drawn.
- Specify axis(1) to draw the x-axis (1 denotes the x-axis). Place a comma after the 1 and follow it by whatever options you want to customize the x-axis.
- Specify axis(2) to draw the y-axis (2 denotes the y-axis). Place a comma after the 2 and follow it by whatever options you want to customize the y-axis.
- Specify box() to draw a box around the graph.
- The customization I choose to perform is to reduce the size of tick mark labels so that all of them will appear. The argument for this is cex.axis and I specify cex.axis=.9 to obtain labels that are 90% of normal size.
- In addition I modify the limits on the x-axis so that 0 is included. This uses the xlim option of plot. I specify xlim=c(0,150) as an argument to plot.
- I also specify cex=1.5 in the plot statement to make the plotting symbols 50% larger than normal.
- Below are all the commands that carry out the modifications described above. Fig. 3 shows the graph that results.
#improve axis labels
> plot(means, vars, axes=FALSE, xlab=expression(paste("Mean ",mu)), ylab=expression(paste("Variance ",sigma^2)),xlim=c(0,150), cex=1.5)
> axis(1,cex.axis=.9)
> axis(2,cex.axis=.9)
> box()
Fitting the Models
- From lecture we discovered that in a negative binomial probability model the variance is related to the square of the mean by the formula below.

- As a regression model this corresponds to
where we have the constraints that
and
. Thus we need to fit a model that has no intercept and in which the coefficient of x (corresponding to the coefficient of μ) is constrained to be 1. The only coefficient to be estimated is the coefficient of the quadratic term (
whose coefficient is the unspecified value
).
- To remove the intercept we specify –1 in the model formula.
- To constrain the coefficient of μ to be 1 we use the offset function. offset forces a predictor to be added to a model with a coefficient that is not estimated but is constrained to be 1.
- To include a quadratic term we can either create it before fitting the model or generate it on the fly in the model formula using the identity function I. I illustrate both methods below.
#Method 1: create a squared mean variable separately
> mean2<-means^2
> lm(vars~offset(means)+mean2-1)
Call:
lm(formula = vars ~ offset(means) + mean2 - 1)
Coefficients:
mean2
1.292
#Method 2: create the squared mean in the model formula
> lm(vars~offset(means)+I(means^2)-1)
Call:
lm(formula = vars ~ offset(means) + I(means^2) - 1)
Coefficients:
I(means^2)
1.292
- Observe that in Method 2 the I construction is absolutely essential. Observe what happens if it is omitted.
> lm(vars~offset(means)+means^2-1)
Call:
lm(formula = vars ~ offset(means) + means^2 - 1)
Coefficients:
means
169.5
which is the same answer we get when the exponent is omitted. R simply ignores the arithmetic we are trying to do when the I function is not used.
> lm(vars~offset(means)+means-1)
Call:
lm(formula = vars ~ offset(means) + means - 1)
Coefficients:
means
169.5
- In this last construction we can see the effect of the offset by refitting the model without it.
> lm(vars~means-1)
Call:
lm(formula = vars ~ means - 1)
Coefficients:
means
170.5
Observe that the estimated coefficient is now 1 more than it was in the model with an offset.
- It is also possible to fit the model without using the offset function at all. To constrain the coefficient of x to be 1 in the model
just subtract x from both sides of the equation and thus fit the model
. Fitting a model with a constrained coefficient is equivalent to fitting a model with a new response, one in which the variable with the constrained coefficient is subtracted from the original response.
> lm((vars-means)~I(means^2)-1)
Call:
lm(formula = (vars - means) ~ I(means^2) - 1)
Coefficients:
I(means^2)
1.292
Plotting the Curves
- To plot the estimated regression model the abline function we've used before will not work here because the model we need to plot is not a line. Instead we have to construct our own function, evaluate it at a sequence of x-values, and then plot it using the lines function.
- I begin by fitting the quadratic model and extracting the estimated coefficient using the coef function. The coef function when applied to a regression model object returns the vector of estimated coefficients. I save the coefficient in the variable quad.coef.
> quad.coef<-coef(lm(vars~offset(means)+I(means^2)-1))
> quad.coef
I(means^2)
1.291726
It is always better to extract information from models directly rather than copying the numbers that appear on the screen and entering them by hand. The extracted numbers will have full precision (unlike the rounded values that are shown on the screen).
- Next we write a function that uses the estimated regression model to calculate the reponse at whatever values we input. User-defined functions in R consist of the following four parts.
- A name for the function. I call mine quad.func
- A left pointing assignment arrow <-
- The keyword function followed by a pair of parentheses which contains a list of arguments separated by commas.
- The formula for the function.
Here's how I construct the quadratic function that corresponds to the negative binomial model. Observe that the coefficient of x2 is the regression coefficient that I previously saved.
> quad.func<-function(x) x+quad.coef*x^2
> quad.func(1)
I(means^2)
2.291726
> quad.func(1:2)
[1] 2.291726 7.166905
- To add the curve to the graph I generate a set of x-values using the seq function, enough so that a smooth curve can be drawn through them, and apply the function quad.func to these x-values. I do this all in arguments to the lines function.
#NB2 model
> lines(seq(10,150,10), quad.func(seq(10,150,10)), col=1, lty=1, lwd=2)
- I next estimate the NB1 model of the mean-variance relationship. In this model the variance is a linear function of the mean with intercept 0:
. The NB1 model is also called a quasi-Poisson model. The model can be fit as follows.
> lm(vars~means-1)
Call:
lm(formula = vars ~ means - 1)
Coefficients:
means
170.5
I use the abline function to draw the regression line, specifying the intercept as zero and using the coef function to extract the slope from the model. I draw it as a red dotted line.
#add quasi-Poisson (NB1)
> abline(0, coef(lm(vars~means-1)), col=2, lty=2, lwd=2)
- Finally I add the Poisson model,
, to the graph using the abline function specifying an intercept of 0 and a slope of 1. I draw it as a solid green line. Fig. 4 shows the results.
#add Poisson
> abline(0,1,col=3,lty=1,lwd=2)
Adding a legend
- Clearly the graph needs a legend to distinguish the different models that are displayed. The legend function in R can be used to add a legend to a graph. It has the following arguments (not all of which are required).
- x= the x-coordinate of the left edge of the legend.
- y= the y-coordinate of the top edge of the legend.
- legend= the text that should appear in the legend. This will be a vector whose individual values are the text elements describing the different objects in the graph.
- pch= followed by a list of plotting characters. I won't use this option here because I only want to distinguish the curves.
- lty= followed by a list of line types used
- lwd= followed by a list of line widths used
- col= followed by a list of the colors used
- cex= followed by a list magnification factors. I use this to make the legend text smaller than standard.
- bty= followed by a code for the type of box to be drawn around the legend. I prefer not to have a box so I specify bty='n' for no box.
- The vectors that specify the values of legend, pch, lwd, col, and cex must all have the same length. Legends can only appear in the interior of graphs unless one of the default graphics parameters, xpd, is changed using the par function. For now we'll restrict ourselves to legends that appear in the interior of graphs.
- I place the legend at the top left corner of the graph (where there is plenty of white space) as follows.
- Note: If you enter arguments in their default order, you don't need to label them. The default order in legend for the first three arguments is x= , y=, legend=. In the code below I use this order and hence I don't need to specify the first three argument names. For the remaining legend arguments I do use their names.
> legend(0,30000, c('Poisson', 'quasi-Poisson (NB1)', 'negative binomial (NB2)'), col=c(3,2,1), lwd=c(2,1,2), lty=c(1,2,1), cex=c(.8,.8,.8), bty='n')
> mtext('Mean-Variance Relationship for Disease Prevalence', side=3, line=.5, font=2, cex=.9)
- Fig. 5 shows the final result.
- The R2 values for both models are high but higher for NB2. Of course, NB2 has one additional term and therefore we would expect its R2 to be higher so we should to adjust for this fact.
- There are formal ways to compare these two models, and we'll explore some of these later. For the moment I would say there is support for both models, although visually the NB2 model would seem to provide a closer fit than does the NB1 model. Of course with only five data points it would be difficult to argue fervently for one model over the other. What is obvioius though is that we should reject the Poisson. My sole objective at this point is to provide some evidence to justify fitting a negative binomial regression model to these data (this would be next step) and I think we've done that. Ideally we should probably fit both an NB1 and an NB2 model to these data and then use other criteria for choosing between them.
Generating Data from Probability Distributions
- R provides probability functions for many different probability distributions. The basic form of these functions is the same no matter what the distribution is. We can see the basic form by using the Poisson as an example. Issue the command ?dpois or help(dpois) to bring up the help window for dpois. The first part of the help window is shown below.
> ?dpois
Density, distribution function, quantile function and random generation for the Poisson distribution with parameter lambda.
Usage
dpois(x, lambda, log = FALSE)
ppois(q, lambda, lower.tail = TRUE, log.p = FALSE)
qpois(p, lambda, lower.tail = TRUE, log.p = FALSE)
rpois(n, lambda)
- All probability functions in R start with one of the same four one letter prefixes: d, p, q, or r. The arguments for these functions may change slightly depending on how the probability distribution is parameterized.
- d specifies the density or probability mass function. dpois(k,λ) yields
, the probability that a Poisson random variable with parameter λ is equal to k.
- p specifies the cumulative distribution function. For the Poisson qpois(k,λ) yields
, the probability that a Poisson random variable with parameter λ is less than or equal to k.
- q refers to quantiles. For the Poisson you give qpois a cumulative probability and it returns the value of k that corresponds to this value. It is essentially the inverse function of ppois. The quantile functions are more useful for continuous probability distributions.
- r refers to random sample. Using rpois(n, λ) you can obtain a random sample of n observations from a Poisson distribution with parameter λ.
- So, to obtain 10 random observations from a Poisson(λ = 1) distribution do the following.
> rpois(10,1)
[1] 0 0 0 1 1 0 2 0 1 4
Changing this to λ = 8 yields
> rpois(10,8)
[1] 7 8 10 12 8 6 9 4 10 10
If I try λ = 1 again, I get a completely different set of observations, as should be the case.
> rpois(10,1)
[1] 2 1 0 2 0 2 0 2 1 1
- Sometimes it's useful to be able to regenerate exactly the same set of "random" observations. To do this we need to specify the seed that initializes the random number generator ourselves. In R this can be done with the set.seed function whose argument is an integer that specifies the seed.
> set.seed(100)
> rpois(10,1)
[1] 0 0 1 0 1 1 2 1 1 0
> set.seed(100)
> rpois(10,1)
[1] 0 0 1 0 1 1 2 1 1 0
Observe that by specifying the same seed, 100, before using the rpois function again, we get exactly the same set of "random" numbers.
- Let's see how closely the random numbers generated by R match the theoretical distribution. I generate 1000 observations from a Poisson(λ = 2) distribution.
> out.pois<-rpois(1000,2)
- To display the results we can use a histogram (Fig. 6).
> hist(out.pois)
- From the plot it's not clear how the cuts were made. For example, it's not clear whether x = 1 is in the first or second bar. Histograms are better suited to continuous data. With discrete data a bar plot is probably a better choice.
- R's barplot function expects the observations to be grouped categories in which the numeric values are the frequencies for those categories. Thus we need to preprocess the simulation data using the table function.
> table(out.pois)
out.pois
0 1 2 3 4 5 6 7 8
124 268 268 185 90 44 14 5 2
- From the output of the table function we can see that the histogram did group the zeros and ones into a single category. Next we construct a bar plot using the output of the table function.
> barplot(table(out.pois))
- This yields a far more reasonable display. Lastly I change the limits on the y-axis and add a box around the plot. Fig. 7 shows the results.
> barplot(table(out.pois), ylim=c(0,300))
> box()
- When comparing observed data to the results predicted by a model, it's useful to display both on the same plot. If we treat the simulation data as our observations, then the theoretical model would be the expected frequencies from a Poisson (λ = 2) distribution. To compute these we calculate the probabilities
when X ~ Poisson(2) and multiply these probabilities by 1000, the number of observations in our sample. We can obtain the theoretical probabilities with the dpois function. The observed count categories range from 0 through 8 so I calculate the theoretical probabilities for these.
> dpois(0:8,2)
[1] 0.1353352832 0.2706705665 0.2706705665 0.1804470443 0.0902235222 0.0360894089
[7] 0.0120298030 0.0034370866 0.0008592716
> sum(dpois(0:8,2))
[1] 0.9997626
- Observe that these probabilities don't sum to 1 meaning that there is a nonzero probability of observing a value bigger than 8. But this probability is very small (about 0.00024) and will not show up on the graph at the scale we are using it, so I ignore it.
- Note: if we were carrying out a formal goodness of fit test we should definitely not ignore these "tail" probabilities! A better solution is to group the tail probabilities into the last category which we would then label "8 and above".)
- I next calculate the expected frequencies.
> expected.freq<-1000*dpois(0:9,2)
> expected.freq
[1] 135.3352832 270.6705665 270.6705665 180.4470443 90.2235222 36.0894089
[7] 12.0298030 3.4370866 0.8592716
- I want to add these values to the graph as points connected by line segments. In order to do this I need to know the x-coordinates of the centers of the bars. You might think these are just the labels that appear below the bars, but in fact the labels shown need not correspond to the coordinate system R has used. Fortunately the coordinates used are returned by R when you assign the results of the barplot function to an object.
> coords<-barplot(table(out.pois), ylim=c(0,300))
> coords
[,1]
[1,] 0.7
[2,] 1.9
[3,] 3.1
[4,] 4.3
[5,] 5.5
[6,] 6.7
[7,] 7.9
[8,] 9.1
[9,] 10.3
- I use these to add the expected frequencies as points on each of the bars. I connect them with line segments and add a legend.
> points(coords, expected.freq ,pch=16, cex=1.5)
> lines(coords, expected.freq)
> legend(6, 300, 'expected frequencies', lty=1, pch=16, col=1, cex=.8, bty='n')
- Finally I add a title to the graph. Since I want to insert the string (λ = 2) in the text I have to use the expression and paste construction as before. I start with a string of text, followed by the math expression, followed by a string of text. I explicitly include a space at the end of the first string and at the beginning of the second string. Notice that to get the = sign in the mathematical expression I enter it as a double equals sign ==.
> mtext(expression(paste('1000 simulations from a Poisson ', (lambda==2), ' distribution')), side=3, line=.5)
Comparing Negative Binomial Distributions for Different μ and θ
- From the help window of dnbinom, the probability mass function for the negative binomial distribution, we learn the following.
> ?dnbinom
dnbinom(x, size, prob, mu, log = FALSE)
The alternative parametrization (often used in ecology) is by the mean mu, and size, the dispersion parameter, where prob = size/(size+mu). The variance is mu + mu^2/size in this parametrization or n (1-p)/p^2 in the first one.
- In terms of the second (ecological) parameterization we studied, the R equivalents are μ = mu and θ = size. As with the Poisson, dnbinom returns the probabilities
. I try a few values.
> dnbinom(0:10,size=1,mu=1)
[1] 0.5000000000 0.2500000000 0.1250000000 0.0625000000 0.0312500000 0.0156250000
[7] 0.0078125000 0.0039062500 0.0019531250 0.0009765625 0.0004882812
> dnbinom(0:10,size=1,mu=5)
[1] 0.16666667 0.13888889 0.11574074 0.09645062 0.08037551 0.06697960 0.05581633
[8] 0.04651361 0.03876134 0.03230112 0.02691760
- To get any sense of what's going on we need to plot these. After some experimentation I settle on μ =1 and 5, and θ = .1 and 10. The combination μ = 1, θ = 0.1 yields the largest probability (at k = 0), so I use it to define the limits of the graph.
> plot(0:10, dnbinom(0:10, size=.1, mu=1), type='n', xlab='k', ylab='P(X=k)')
- Notice the use of type='n'. The type argument specifies what to plot. The choices available are listed below.
type option |
Plot style |
type="p" |
Points (the default) |
type="l" |
Lines connecting the data |
type="b" |
Both (points and lines between points) |
type="h" |
Height bars (vertical lines connecting point to x-axis) |
type="o" |
Overlaid points and connecting lines |
type="s" |
Stairsteps |
type="n" |
Nothing |
Since I specify type='n' I'm choosing to plot nothing. In doing so I'm just using the plot command to set up the limits on the x- and y-axes, draw the axes, and display the labels.
- Next I add a points and lines statement for each combination of μ and θ.
- Note: it's probably easiest at this point to type these lines in a text processor, such as Notepad (Windows) or TextEdit (Mac). Then you can just copy and paste the lines repeatedly going back to make changes in colors, symbol types, and line types.
- I choose to use filled circles and solid lines for μ=1 and open circles and dashed lines for μ = 5. I use color (black and red) to distinguish different θ values (0.1 and 10).
> points(0:10,dnbinom(0:10,size=.1,mu=1),pch=16)
> lines(0:10,dnbinom(0:10,size=.1,mu=1))
> points(0:10,dnbinom(0:10,size=10,mu=1),pch=16,col=2)
> lines(0:10,dnbinom(0:10,size=10,mu=1),col=2)
> points(0:10,dnbinom(0:10,size=.1,mu=5),pch=21)
> lines(0:10,dnbinom(0:10,size=.1,mu=5),lty=2)
> points(0:10,dnbinom(0:10,size=10,mu=5),pch=21,col=2)
> lines(0:10,dnbinom(0:10,size=10,mu=5),col=2,lty=2)
- For the legend I again want to include mathematical expressions. This time I want the value of μ followed by a comma followed by the value of θ. To minimize errors I enter things in the same order I plotted them. Since there are four sets of objects on my graph, every component of the legend needs to be a vector of 4 elements. Once again entry of the code is done most easily in a text processor where I can copy and paste things repeatedly.
> legend(6,.8, c(expression(paste(mu==1, ', ', theta==.1)),
expression(paste(mu==1,', ', theta==10)),
expression(paste(mu==5,', ', theta==.1)),
expression(paste(mu==5,', ', theta==10))), col=c(1,2,1,2),
pch=c(16,16,21,21), lty=c(1,1,2,2), cex=rep(.9,4), bty='n')
- I used one new function in constructing the legend, rep. The rep function repeats a value the number of times specified thus creating a vector. In the code I use rep(.9,4) which does the following.
> rep(.9,4)
[1] 0.9 0.9 0.9 0.9
- Finally I add a title once again mixing text and Greek letters. So I use expression and paste again to link together the different elements of the title.
> mtext(expression(paste('Negative binomial distributions for varying ',mu,' and ',theta)), side=3, line=.5)
- Observe the very different effect changing the value of μ has on the probability mass function depending on the value of θ. I'll ask you to comment on this in the homework.
Course Home Page