Lecture 3 (lab) —Tuesday, January 17, 2006
What was covered?
R functions and commands demonstrated
- abline can be used to add a regression line, horizontal line, or vertical line to the currently active plot
- as.character converts data values (numeric or date) to character values
- as.date (from date package) converts character values of the proper form to dates
- as.numeric converts character, factor, and date values to numeric values
- attach adds a data frame to the default search path so that variables can be specified without reference to the data frame in which they reside
- c the catenation function that turns the elements making up its arguments into a single vector
- detach undoes attach, removes a data frame from the search path
- history opens up a history window in which you can view previously issued commands. For example, history(50) would display up to the last 50 commands issued.
- library loads a previously installed package into memory
- lines draws line segments between points (points are connected in the order given) to the currently active plot
- lm fits an ordinary least squares regression model
- lowess estimates a nonparametric regression curve; the output is the set of x- and y-coordinates of the fitted points
- mtext adds marginal text to the currently active plot
- names displays names of variables in a data frame
- objects displays all created objects currently in workspace
- par is used to set graphics parameters
- plot produces a scatter plot and/or sets up the options for a scatter plot
- points adds individual points to the currently active plot
- qq.plot (from car package) displays a normal probability plot of the given variable and adds the y=x line and confidence bands
- qqnorm displays a normal probability plot of a variable
- read.table reads in text data from an external file
- residuals extracts residuals from a regression object (such as one created by lm)
- sum calculates the sum of all entries of a vector or matrix
- summary displays summary information about regression objects, summary statistics of variables, etc.
- table displays a frequency table of all the unique values of a variable; it can also be used to construct contingency tables
- # indicates a given line of code is a comment and should be ignored
- <- the assignment operator in R, a less than symbol followed by a dash, that is supposed to symbolize an arrow. The arrow points in the direction of assignment.
- [ ] used for specifying elements of vectors or portions of data frames and matrices
- $ list notation symbol that can be used to reference columns of a data frame
- ~ symbol used in defining expressions in R for model fitting
R function options
- col= (argument to many graphics functions) specifies the color to use in plotting points and/or line segments
- data= (argument to lm and many other model functions) specifies the data set containing variables. Should not be used if the data frame containing the variables is included as part of a variable's name when specifying the model.
- f= (argument to lowess) changes the default fraction from 2/3 to a user-specified value
- h= (argument to abline) draws a horizontal line at the value specified, e.g., abline(h=0) draws a horizontal line at y=0.
- header= (argument to read.table) takes on values TRUE or FALSE, indicates whether the first line of a text file contains the variables names (TRUE) or not (FALSE)
- line= (argument to mtext) specifies the margin line on which the text should appear, line=0 is the plot edge
- lty= (argument to many graphics functions) specifies the line type to use in drawing line segments. lty=1 is the default, a solid line. Values 2, 3, 4, 5, 6 yield different kinds of dashed and dotted lines
- lwd= (argument to many graphics functions) specifies the line width relative to the default width (which is lwd=1)
- mfrow= (argument of par) as in par(mfrow=c(m,n)) which sets the graphic window so that plots are arranged in m rows and n columns; default is par(mfrow=c(1,1))
- pch= stands for print character and is used to designate the plotting symbol for use in various plotting functions: plot, points, etc.
- sep= (argument to read.table) specifies the character that is used to separate fields in the data set, e.g., sep=',' indicates that entries are comma-delimited
- side= (argument to mtext) specifies the side of the graph on which the text should appear: 1 is the bottom, 2 is left, 3 is top, and 4 is the right side
- xlab= (argument of plot) a user-specified value to be used as the label for the x-axis, e.g., xlab="WSSTA"
- ylab= (argument of plot) a user-specified value to be used as the label for the y-axis, e.g., ylab="Disease Prevalence "
- ylim= (argument of plot) allows specification of the range on the y-axis to plot, e.g., ylim=c(0,100)
R packages used
- car special regression functions. We used qq.plot
- date package with specialized data functions. We used as.date
Overview
- In this course we will be running R from AFS space. There is a shortcut to the program on the desktop of the lab computers. The path specified by the shortcut is J:\isis.unc.edu\pc-pkg\r-211\program\bin\Rgui.exe
- For your home computer you will need to get your own copy of R, or run it off AFS space yourself. R is free and can be downloaded from http://www.r-project.org.
- When at the web site from the menu on the left choose CRAN under Download.
- Choose a CRAN mirror site in the next window (there is one here at UNC).
- Select the operating system of your computer from the Precompiled Binary Distributions section.
- In the next window choose base.
- The latest release for Windows is R-2.2.1. You can download the installer program for this version or choose an older version by clicking old. The version we're running in the lab is R 2.1.1.
- After downloading it you'll need to double click the downloaded file to run the installer program.
- One of the advantages of running R from AFS space is that all the user-contributed packages available at the CRAN site have already been installed. As we use some of these packages you will need to install copies of them from the R site on your home computer also. This process is described below.
Data Entry and Manipulation
- To read data into R from a text file, use the read.table function. Today we'll read in a file directly from the class web site. The file is set up in such a way that the first row of the file contains the names of the variables and the subsequent rows contain the data values. Each individual data value is separated from the next by a comma. We can read it into R as follows. (Note: R treats any line that begins with a # symbol as a comment and ignores it. I will include occasionally mix comments in with the code to annotate what's being done and color them in green.
#read in data from Web
> corals<-read.table('http://www.unc.edu/courses/2006spring/ecol/145/001/data/lab1/corals.csv',
header=TRUE, sep=',')
- corals is the name I've chosen to give this data set in R and is the name I'll need to use when referring to these data in R.
- <- is the assignment operator in R. It points in the direction of the assignment. In this case we are assigning the output of the read.table function to the variable corals.
- Functions in R use the f(x) notation, i.e., a function name followed by parentheses that contain a list of arguments separated by commas. Here read.table is the function name.
- The first argument to read.table in the code above is the web location for the data set. Notice that this argument is enclosed in quotes. This is required. You may use single or double quotes, but you must not mix them as part of the same argument.
- Windows: If the file is located on your personal machine using the Windows operating system, then you must use a similar designation as shown for the web address, with one caveat, DO NOT USE A SINGLE BACKSLASH CHARACTER to separate the folders in the path designation. For example, if the file corals.csv is located on the c: drive in the folder ecol_145 of my hard drive, I could designate it with forward slashes as in 'c:/ecol_145/corals.csv' or using double backslashes as in 'c:\\ecol_145\\corals.csv', but not single backslashes as in 'c:\ecol_145\corals.csv'. The single backslash in R is a return code that allows you to modify how text enclosed in quotes will print to the screen.
- Macintosh: On a Macintosh the syntax is a little different. Suppose I have the file in a folder I call ecol145 in the Documents folder on my hard disk. The correct file designation for me would be: '/Users/jackweiss/Documents/ecol145/corals.csv'. You can see this path by clicking on the folder name at top of its window while holding down the command (apple) key (Fig. 1). It is not necessary to include any details about the path for anything above the Users folder.
- The argument header=TRUE tells R that the first row of the file contains the variable names. Notice the use of all capitals in TRUE. R is case-sensitive and all-caps for TRUE or FALSE is required.
- The last argument sep=',' tells R that the data values are separated by commas.
- The object corals in R is called a data frame. R does not have a spreadsheet interface to data frames. To look at data you have to print it to the screen. By default if we enter the name of an object in R and press the return key, R will print its contents. For large data sets we'll only want to look at selective portions. There are two different ways to do this.
- Method 1: The data frame can be treated as a matrix in which we display some subset of rows and columns. In this method the data frame name is followed by a pair of brackets inside of which is the list of rows followed by the list of columns desired.
The collection of rows is separated from the collection of columns by a comma.
- To print the first 5 rows and all columns of the data frame corals enter corals[1:5,]. Notice the use of the comma followed by nothing. When no columns are specified the assumption is you want all columns. (The same would work for rows if you wanted to see only certain columns but all rows.)
> corals[1:5,]
REEF_NAME SECTOR SHELF LAT_DD LON_DD DATE YEAR YEARII WEEK WSSTA TSASUMS_SM PREV_1 CORAL_COVE INC
1 MACGILLIV CL M -14.64833 145.4910 7/24/2002 2002 3-Feb 30 3 0 2 20.08802 2
2 MARTIN (1 CL I -14.75530 145.3722 7/26/2002 2002 3-Feb 30 6 1 32 19.47236 31
3 LIZARD IS CL M -14.69147 145.4692 7/28/2002 2002 3-Feb 31 0 0 3 15.71672 2
4 NORTH DIR CL M -14.74355 145.5158 7/28/2002 2002 3-Feb 31 7 2 9 24.63439 7
5 NO NAME CL O -14.62787 145.6453 7/30/2002 2002 3-Feb 31 2 1 75 49.20807 26
- To view only columns 2 and 3 of rows 1 through 5, enter corals[1:5,2:3]
> corals[1:5,2:3]
SECTOR SHELF
1 CL M
2 CL I
3 CL M
4 CL M
5 CL O
- If the rows or columns are not consecutive, use the c operator to enclose the list of rows or columns desired. For example, to get rows 1 through 5 and columns 2 and 4, use the following: corals[1:5,c(2,4)]
> corals[1:5,c(2,4)]
SECTOR LAT_DD
1 CL -14.64833
2 CL -14.75530
3 CL -14.69147
4 CL -14.74355
5 CL -14.62787
- Instead of using numbers, rows and columns can be specified by their names, enclosed in quotes. We can duplicate the above output with the following: corals[1:5,c("SECTOR","LAT_DD")]
> corals[1:5,c("SECTOR","LAT_DD")]
SECTOR LAT_DD
1 CL -14.64833
2 CL -14.75530
3 CL -14.69147
4 CL -14.74355
5 CL -14.62787
- Method 2: The columns of the data frame can be accessed using what's called list notation in which each column is treated as the element of a list. Using this method we specify the data frame name followed by a dollar sign, $, followed by a column name. As an example to look at the variable called SECTOR we could enter corals$SECTOR.
> corals$SECTOR
[1] CL CL CL CL CL CL CL CL CL SW SW SW SW SW SW SW CB CB CB CB CA CA CA CA CA CA CA CA CA CA WH
[32] WH WH WH WH WH WH WH WH TO TO TO TO TO TO TO TO
Levels: CA CB CL SW TO WH
- The variables inside the data frame corals are not accessible to R without also specifying the name of the data frame (unless a variable name happens to match a name already defined in R in which case you'll get the wrong thing anyway).
> SECTOR
Error: Object "SECTOR" not found
- You can change this condition by using the attach function.
> attach(corals)
> SECTOR
[1] CL CL CL CL CL CL CL CL CL SW SW SW SW SW SW SW CB CB CB CB CA CA CA CA CA CA CA CA CA CA WH
[32] WH WH WH WH WH WH WH WH TO TO TO TO TO TO TO TO
Levels: CA CB CL SW TO WH
- The function attach has added the data frame corals to the default search path R uses in searching for objects. I do not use attach in my work because I often have multiple data sets loaded at once even working with multiple copies of the same variable. To undo the action of attach, use the detach function.
> detach(corals)
> SECTOR
Error: Object "SECTOR" not found
Plotting Data
- The researchers responsible for the corals data set we're using are interested in determining if the emergence of the disease white syndrome is related to warming ocean waters. They write the following in their as yet unpublished manuscript.
We investigated whether white syndrome, an emerging disease of corals on the GBR was related to the number of warm temperature anomalies. WSSTA defines an anomaly as a week in which the ocean temperature exceeds the long-term (1985-2003) local mean temperature value for that calendar week by 1°C. Thus, a WSSTA is both a site- and week-specific measure of thermal stress. We quantified the thermal history of each reef by calculating the number of WSSTAs during the 52 weeks preceding each disease survey date. Our metric then allowed us to determine the number of weeks that the temperature was unusually warm for a particular reef in the year prior to the survey date.
The geographical locations of the reefs that make up their sample can be seen at the following link: australia.pdf. The graph you see there was created in R using the read.shape function from the maptools package and the plot function of base R. Source materials for the map are varioius shape and dbase data files I downloaded from an Australian government web site.
- To see the names of the variables contained in a data frame use the names function.
> names(corals)
[1] "REEF_NAME" "SECTOR" "SHELF" "LAT_DD" "LON_DD" "DATE" "YEAR"
[8] "YEARII" "WEEK" "WSSTA" "TSASUMS_SM" "PREV_1" "CORAL_COVE" "INC"
- The variable PREV_1 counts the number of diseased coral colonies observed in transects of standard length conducted on individual reefs. The variable WSSTA is the temperature metric created by the authors as a measure of thermal stress. The data frame we're examining includes the transect results for one year of obsservations with one observation made on each reef.
- To examine the relationship between WSSTA and PREV_1 we'll first plot the data. R's plot command requires at minimum two arguments: the variable to be plotted on the x-axis followed by the variable to be plotted on the y-axis (separated by a comma). Since we wish to predict disease using the temperature metric, WSSTA is the x-variable and PREV_1 is the y-variable.
> plot(corals$WSSTA,corals$PREV_1)
- By default R has used the arguments of the plot function as labels on the axes. We can fix that by using the arguments xlab and ylab to specify our own labels.
> plot(corals$WSSTA,corals$PREV_1,
xlab='WSSTA',ylab='Disease Prevalence')
- From the plot it would appear that there is some sort of relationship but it is fairly noisy. Let's fit an ordinary linear regression to the data. The R function for doing this is lm. The minimal arguments required are the response variable followed by a tilde, ~, followed by the predictor. An expression containing a ~ is called a formula in R.
> lm(corals$PREV_1~corals$WSSTA)
Call:
lm(formula = corals$PREV_1 ~ corals$WSSTA)Coefficients:
(Intercept) corals$WSSTA
-24.20 15.30
- The default printed output is pretty minimal. We can get more details by sandwiching the regression call inside the summary function.
> summary(lm(corals$PREV_1~corals$WSSTA))
Call:
lm(formula = corals$PREV_1 ~ corals$WSSTA)Residuals:
Min 1Q Median 3Q Max
-141.121 -34.460 -5.403 23.597 198.879 Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -24.201 19.124 -1.265 0.212
corals$WSSTA 15.302 3.403 4.496 4.82e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 72.53 on 45 degrees of freedom
Multiple R-Squared: 0.31, Adjusted R-squared: 0.2946
F-statistic: 20.21 on 1 and 45 DF, p-value: 4.821e-05
- Here we see all the usual output for regressions. Note that the regression is statistically significant. (Either the F-test reported in the last line of the output or the t-test for WSSTA in the table of coefficient estimates provides the evidence.) The Multiple R-Squared is just R2, the proportion of variance in the response explained by the predictor. The reported value is 0.31 meaning 31% of the variability of disease prevalence is explained by knowing the value of WSSTA. 31% is pretty good for ecology.
- Let's add the regression line to the plot. First refit the model this time assigning the results to a variable. In the code below I store the regression results in model1.
> model1<-lm(corals$PREV_1~corals$WSSTA)
- The lm function has a data argument that allows you to specify the data frame separate from the variables used in the model, so another way of fitting the above model is by issuing the following.
> model1<-lm(PREV_1~WSSTA,data=corals)
- To add the regression line use the abline function. The plot function is a primary graphics command. If a new plot command is issued R will redraw the screen (this default behavior can be changed). The function abline on the other hand is a secondary graphics command; it adds its output to the current plot without redrawing the plot. If abline is given a simple linear regression model as an argument it will find the intercept and slope estimates and plot the regression line.
> abline(model1)
Evaluating the Regression Model—Lowess
- The plot above suggests the model may have some problems. Of primary concern is whether we have specified the relationship between disease prevalence and WSSTA correctly. One way to address this is to add a nonparametric smooth to the graph. R provides two separate implementations of what is essentially the same nonparametric smooth in the functions lowess and loess. We'll only discuss lowess here. Lowess stands for "locally weighted smooth" and it essentially calculates a local estimate of the regression curve at each point by using only a neighborhood of points then weighting these points by how close they are to the point in question. If we enter help(lowess) at the R prompt, we get the following description.
This function performs the computations for the LOWESS smoother (see the reference below). lowess returns a list containing components x and y which give the coordinates of the smooth. The smooth should be added to a plot of the original points with the function lines.
- One useful option to lowess is f= that specifies the fraction of points to be used in calculating the individual local regression line estimates. The default value is 2/3. Values closer to 1 yield smoother curves while values closer to 0 yield noisier curves.
- The numeric output from lowess is uninteresting and we'll follow the above instructions and use it as input to the lines function. The lines function connects a set of ordered pairs by line segments in the order the ordered pairs appear as arguments to the function. Thus if the x-values are not sorted correctly, the result can be a mess. Fortunately lowess has already sorted things for us.
> lines(lowess(corals$PREV_1~corals$WSSTA),lwd=2,col=3)
- The syntax for lowess is the same as for lm, except the data= option is not supported. So lowess provides the (x,y) coordinates needed by the lines function. In the argument lwd=2 I specify a line width that is twice the normal width. The argument col=3 causes the line to be drawn using color code 3, which is green. Like abline, lines is a secondary graphics command that adds its output to the current active plot.
- There are 8 colors that can be specified by number and 657 colors that can be specified by name. The 8 numeric colors are shown below.
Fig. 6 R's numeric color codes
- We can see the lowess curve better if we limit the range on the plot. To do this we can specify the ylim argument in the plot function. The syntax is ylim=c(minval,maxval). I choose to plot from y=0 to y=50. We'll need to start from scratch with a new plot statement. Use the up arrow key on the keyboard to recall previous lines until you get to the plot command. Add the ylim option, ylim=c(0,50) as shown below. Then recall the abline and lines commands using the up arrow key to add the regression line and lowess curves to the plot.
> plot(corals$WSSTA,corals$PREV_1, xlab='WSSTA',
ylab='Disease Prevalence', ylim=c(0,50))
> abline(model1)
> lines(lowess(corals$PREV_1~corals$WSSTA), lwd=2, col=3)
- It's clear that the lowess and ordinary regression lines are in strong disagreement as to the underlying relationship between PREV_1 and WSSTA. The lowess curve suggests perhaps a quadratic model is more appropriate.
Evaluating the Regression Model—Residual Analysis
- Another useful tool in assessing model fit is residual analysis. Residuals are essentially the response after the relationship described by the model has been removed. The residuals
estimate the error
. Recall that the basic ordinary linear regression model with a single predictor has the following form:

Here i is the index of the observation.
- If Disease Prevalence has a significant linear relationship with WSSTA and our model is well-specified, the following should be true.
- The residuals should show no additional relationship with WSSTA.
- The residuals should be approximately normally distributed.
- The variance of the residuals should be same for all values of WSSTA
- We assess each of these assumptions in turn.
Constant Variance and Mispecification
- The residuals function when applied to a regression object will extract the residuals from it (the raw residuals by default). I plot them against WSSTA. I add a lowess curve and also include a horizontal line at y = 0.
> plot(corals$WSSTA,residuals(model1))
> lines(lowess(residuals(model1)~corals$WSSTA), lwd=2,col=3)
> abline(h=0,lty=2,col=4)
- In the abline function the argument h=0 creates a horizontal line at y = 0. The argument lty=2 specifies the line type. The default is 1 (solid) and specifying 2 for lty yields a dotted line. There are also line types 3, 4, 5, and 6 as well as user-created varieties. Fig. 9 shows the various line types available in R.

Fig. 9 Line type values, lty=
- The lowess curve shows a distinct trend to the residual plot, suggesting the model has been mispsecified. But there are other problems revealed by the plot. The residuals should show just random scatter about the line y = 0. What we see instead is a systematic increase in the scatter as we move from left to right. The term used to describe such a pattern is heteroscedasticity. Notice that the heteroscedasticity was quite obvious in the original plot (Figs. 1–4) also. There we saw the scatter about the regression line increase from left to right. The presence of heteroscedasticity means the constant variance assumption has been violated.
Normality
> library(car)
> qq.plot(residuals(model1))
- If the residuals arise from a normal distribution they should closely follow the the straight line drawn in the graph. In particular, nearly all of the points should at least plot within the confidence bands drawn on each side of the line. Clearly this is not the case here. The assumption of normality appears to be violated.
Accounting for Heteroscedasticity and Lack of Normality
- As we'll see later, there are some discrete probability distributions that would be good candidates to consider here as substitutes for the normal distribution, as well what is called weighted regression, or generalized least squares. But for now, let's use the time-honored method of transforming the response. Sometimes a transformation can solve both a problem of heteroscedasticity and a lack of normality. The problem with transformations from a practical standpoint is that they fundamentally change the relationships between the modeled variables and you need to be aware of these changes. We'll discuss this problem more later.
- A log transform is often used with count data but if we try it we immediately encounter a problem.
> log(corals$PREV_1)
[1] 0.6931472 3.4657359 1.0986123 2.1972246 4.3174881 4.4998097 3.4011974
[8] 3.8918203 0.0000000 2.9957323 3.4339872 3.5263605 1.0986123 2.0794415
[15] 1.3862944 5.0039463 5.3981627 5.8171112 5.8377304 5.7525726 0.6931472
[22] 2.1972246 3.0910425 2.9444390 4.0943446 4.5217886 4.6634391 2.7080502
[29] -Inf -Inf 3.0910425 2.4849066 1.0986123 0.0000000 2.7725887
[36] 0.0000000 3.3322045 2.9957323 3.1354942 2.5649494 0.0000000 -Inf
[43] -Inf 0.6931472 1.0986123 1.3862944 -Inf
- Notice the -Inf in the output. If I use the log-transformed response in a regression model, I get a fairly ominous looking error message.
> lm(log(corals$PREV_1)~corals$WSSTA)
Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) :
NA/NaN/Inf in foreign function call (arg 4)
- The problem is that the log of zero is undefined and there are zero values for the response in the data set yielding negative infinity. This is causing the lm function to fail. To examine the distribution of the variable PREV_1 among its different values I use the table function.
> table(corals$PREV_1)
0 1 2 3 4 8 9 12 13 15 16 19 20 22 23 28 30 31 32 34
5 4 3 4 2 1 2 1 1 1 1 1 2 2 1 1 1 1 1 1
49 60 75 90 92 106 149 221 315 336 343
1 1 1 1 1 1 1 1 1 1 1
- Observe that there are only a few zero values. If there were many zero values (a condition that is called zero-inflation), a log transformation would be an especially unwise choice. A somewhat unsatisfactory solution when there are zero values is to add a small constant to all the data values, that is, compute log(y+c) for some choice of c. Popular choices for c are 1 or 0.5. If you do this it's probably worthwhile to check that the choice of c doesn't affect the results.
> model2<- lm(log(corals$PREV_1+1)~ corals$WSSTA)
> summary(model2)
Call:
lm(formula = log(corals$PREV_1 + 1) ~ corals$WSSTA)
Residuals:
Min 1Q Median 3Q Max
-3.2626 -1.1688 0.3321 1.2968 2.3470
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.66543 0.41904 3.974 0.000253 ***
corals$WSSTA 0.19965 0.07457 2.677 0.010325 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.589 on 45 degrees of freedom
Multiple R-Squared: 0.1374, Adjusted R-squared: 0.1182
F-statistic: 7.167 on 1 and 45 DF, p-value: 0.01033
- The regression is significant but not markedly so and the R2 is rather small. Let's plot the data along with the regression line to see how things look. I try two different levels of smoothing in the nonparametric smooth (green and blue curves in Fig. 11). The small wiggles in the lowess curve are probably nothing to worry about. The trend of the nonparametric smooth looks pretty linear.
> plot(corals$WSSTA,log(corals$PREV_1+1), xlab='WSSTA', ylab='Log(Prevalence)')
> abline(model2)
> lines(lowess(log(corals$PREV_1+1)~corals$WSSTA), lwd=2,col=3)
> lines(lowess(log(corals$PREV_1+1)~corals$WSSTA,f=.9), lwd=2,col=4)
- Below I produce a normal probability plot of the residuals and also plot the model residuals against WSSTA. Both plots look good, in any case far better than what we saw with the untransformed data. The cubic trend shown by the lowess is probably nothing to worry about.
> qq.plot(residuals(model2))
> plot(corals$WSSTA,residuals(model2))
> lines(lowess(residuals(model2)~corals$WSSTA), lwd=2,col=3)
> abline(h=0,lty=2,col=4)

Fig. 12 Residual plots for the log-transformed response
Data Structure
- Having chosen a probability generating mechanism for our data, technically lognormal, the next step is to incorporate any additional structure that exists. By structure I mean patterns or relationships that exist between observations solely due to the way the data were collected. Since we've obtained the data secondhand we'll have to discover the presence of any structure on our own.
- Obvious sources for potential structure with ecological data are space and time. We've already seen in the map of the sampled reefs that they are not located randomly in space but in fact occur in six clusters that are identified in the data set by the variable SECTOR. There is also the question of whether there is any temporal structure. The variable that contains the sample time is DATE.
- At the moment R thinks DATE is a factor variable, a variable whose distinct values merely label the categories that this variable records. To make use of it as a date we will need to use R functions to convert it to that form.
- Although there are a number of built-in R date functions, I prefer to use those in the date package. The date package is also not part of the standard R installation and needs to be first downloaded and installed (See above for instructions.) After installing it load it into memory using the library function.
library(date)
- The function in the date package that we'll use is the as.date function. This function takes character values of the format 'mm/dd/yyyy' (where mm is the numeric month, dd is the numeric day, and yyyy is the year) and turns them into numerical dates. The problem for us is that R has already turned DATE in the corals data frame into a factor variable. We need to first turn it back into a character variable before the as.date function will work. We can do this with the as.character function.
- We can see the various conversions taking place in the series of commands below.
> corals$DATE
[1] 7/24/2002 7/26/2002 7/28/2002 7/28/2002 7/30/2002 7/31/2002 7/31/2002
[8] 8/1/2002 8/2/2002 9/5/2002 9/6/2002 9/7/2002 9/8/2002 9/9/2002
[15] 9/10/2002 9/11/2002 9/16/2002 9/17/2002 9/18/2002 9/19/2002 10/24/2002
[22] 10/25/2002 10/26/2002 10/28/2002 11/1/2002 11/2/2002 11/3/2002 11/5/2002
[29] 11/7/2002 11/8/2002 11/25/2002 11/26/2002 11/27/2002 11/28/2002 11/30/2002
[36] 12/1/2002 12/3/2002 12/4/2002 12/5/2002 3/7/2003 3/8/2003 3/10/2003
[43] 3/12/2003 3/15/2003 3/18/2003 3/19/2003 3/21/2003
45 Levels: 10/24/2002 10/25/2002 10/26/2002 10/28/2002 11/1/2002 ... 9/9/2002
> as.character(corals$DATE)
[1] "7/24/2002" "7/26/2002" "7/28/2002" "7/28/2002" "7/30/2002" "7/31/2002"
[7] "7/31/2002" "8/1/2002" "8/2/2002" "9/5/2002" "9/6/2002" "9/7/2002"
[13] "9/8/2002" "9/9/2002" "9/10/2002" "9/11/2002" "9/16/2002" "9/17/2002"
[19] "9/18/2002" "9/19/2002" "10/24/2002" "10/25/2002" "10/26/2002" "10/28/2002"
[25] "11/1/2002" "11/2/2002" "11/3/2002" "11/5/2002" "11/7/2002" "11/8/2002"
[31] "11/25/2002" "11/26/2002" "11/27/2002" "11/28/2002" "11/30/2002" "12/1/2002"
[37] "12/3/2002" "12/4/2002" "12/5/2002" "3/7/2003" "3/8/2003" "3/10/2003"
[43] "3/12/2003" "3/15/2003" "3/18/2003" "3/19/2003" "3/21/2003"
> as.date(as.character(corals$DATE))
[1] 24Jul2002 26Jul2002 28Jul2002 28Jul2002 30Jul2002 31Jul2002 31Jul2002 1Aug2002
[9] 2Aug2002 5Sep2002 6Sep2002 7Sep2002 8Sep2002 9Sep2002 10Sep2002 11Sep2002
[17] 16Sep2002 17Sep2002 18Sep2002 19Sep2002 24Oct2002 25Oct2002 26Oct2002 28Oct2002
[25] 1Nov2002 2Nov2002 3Nov2002 5Nov2002 7Nov2002 8Nov2002 25Nov2002 26Nov2002
[33] 27Nov2002 28Nov2002 30Nov2002 1Dec2002 3Dec2002 4Dec2002 5Dec2002 7Mar2003
[41] 8Mar2003 10Mar2003 12Mar2003 15Mar2003 18Mar2003 19Mar2003 21Mar2003
> as.numeric(as.date(as.character(corals$DATE)))
[1] 15545 15547 15549 15549 15551 15552 15552 15553 15554 15588 15589 15590 15591
[14] 15592 15593 15594 15599 15600 15601 15602 15637 15638 15639 15641 15645 15646
[27] 15647 15649 15651 15652 15669 15670 15671 15672 15674 15675 15677 15678 15679
[40] 15771 15772 15774 15776 15779 15782 15783 15785
> plot(as.date(as.character(corals$DATE)), corals$LAT_DD, xlab='Date', ylab='Latitude')
- From the plot we see there is a definite problem. Not only are reefs clustered in space they are also clustered in time. What's worse is that the spatial clustering and the temporal clustering are identical. Reefs that were near each other spatially were all sampled at roughly the same time of year.
- Now in practice it may turn out that this is not a problem if the variables of interest here, disease prevalence and WSSTA, do not also show a relationship with space and time. We investigate that question next.
- Let's examine the distribution of the variable WSSTA. The table function of R is useful here. The command table(corals$WSSTA) gives the frequency distribution of the variable WSSTA. It lists how many observations there are for each distinct value of WSSTA. I also use the sum function to add up the counts to find out how many reefs there are.
> table(corals$WSSTA)
0 1 2 3 4 5 6 7 8 9 10 11 13
3 4 5 9 3 6 5 5 1 1 2 2 1
> sum(table(corals$WSSTA))
[1] 47
So we see there are 47 reefs total whose values range from 0 to 13 for the variable WSSTA.
- Let's see when and where the reefs with the largest values of WSSTA occur. Somewhat arbitrarily, let's look only at reefs that scored 8 or above on WSSTA. An expression such as corals$WSSTA>=8 is an example of Boolean expression; in R it evaluates to either TRUE or FALSE.
> corals$WSSTA>=8
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE TRUE
[14] FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE
[27] FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[40] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
- We can make use of a Boolean expression to subset the data set. If I enter the expression corals$LAT_DD[corals$WSSTA>=8], what's returned are only the values of corals$LAT_DD for which the expression corals$WSSTA>=8 evalutes to TRUE. The values for which the Boolean expression is FALSE are not displayed.
> corals$LAT_DD[corals$WSSTA>=8]
[1] -14.78517 -21.46982 -23.48365 -23.88350 -23.24957 -16.77532 -16.38352
- How I'll use this is to color the points on the graph for which WSSTA>=8. To make our lives easier and simplify typing I first create a new variable in the data frame corals whose values are the numeric dates.
> #simplify our life and create a true date variable
> corals$numdate<-as.date(as.character(corals$DATE))
> names(corals)
[1] "REEF_NAME" "SECTOR" "SHELF" "LAT_DD" "LON_DD" "DATE"
[7] "YEAR" "YEARII" "WEEK" "WSSTA" "TSASUMS_SM" "PREV_1"
[13] "CORAL_COVE" "INC" "numdate"
- Notice that there is now a new variable in corals called numdate. Because I included corals$ as the first part of the name of numdate, the variable was created in the data frame corals.
- I next use the points function to color the points. Since points is a secondary graphics command, its output is added to the currently active graph.
> points(corals$numdate[corals$WSSTA>=8],corals$LAT_DD[corals$WSSTA>=8],col=2,pch=16)
- In the code above the option pch=16 asks for print character 16 which corresponds to filled circles. There are 25 different symbol types available in R. Fig. 15 lists them all. In addition it is possible to use any of the keyboard characters as well as special characters from various font libraries as plot symbols.

Fig. 15 Options for pch= in plots
- Observe that the "hot points" are at the southern and northern ends of the reef and occur early in the sample period. Next we check if disease prevalence shows the same pattern.
> table(corals$PREV_1)
0 1 2 3 4 8 9 12 13 15 16 19 20 22 23 28 30 31 32 34 49
5 4 3 4 2 1 2 1 1 1 1 1 2 2 1 1 1 1 1 1 1
60 75 90 92 106 149 221 315 336 343
1 1 1 1 1 1 1 1 1 1
- Choosing WSSTA>=8 above ended up choosing 7 reefs. If we selected the 7 reefs here with the largest value of PREV_1 we would end up using 92 as a cut-off. Since a break between 90 and 92 is somewhat arbitrary I elect to include reefs with PREV_1>=60. I use the points command again to add the points to the graph this time coloring them blue.
> points(corals$numdate[corals$PREV_1>=60], corals$LAT_DD[corals$PREV_1>=60], col=4, pch=16)
- Some of the points are superimposed on top of each other. We can get a better understanding of what's going on by doing two scatter plots side by side. The par function in R controls all sorts of graphics settings. One setting is how plots are arranged in the graphics window. An option that controls this is mfrow. The default setting is mfrow=c(1,1) meaning one row and one column. We'll change this to mfrow=c(1,2) to obtain one row and two columns. Then we'll do each plot in succession. We need to use two plot commands to get the two panels of the graphics window.
Here are the commands.
> #now let's put them side-by-side
> par(mfrow=c(1,2))
> #first temperature
> plot(as.date(as.character(corals$DATE)), corals$LAT_DD, xlab='Date', ylab='Latitude')
> points(corals$numdate[corals$WSSTA>=8],corals$LAT_DD[corals$WSSTA>=8], col=2, pch=16)
> mtext('Temperature Anomalies',line=.5,side=3)
> #next disease
> plot(as.date(as.character(corals$DATE)), corals$LAT_DD, xlab='Date', ylab='Latitude')
> points(corals$numdate[corals$PREV_1>=60], corals$LAT_DD[corals$PREV_1>=60], col=4, pch=16)
> mtext('Disease Anomalies',line=.5,side=3)
> #restore the default setting
> par(mfrow=c(1,1))
- I use the mtext graphics function to add a small title at the top of each graph. There is another function for titles, not surprisingly called title, but I find it uses it up too much of the graph window so I use mtext instead. mtext stands for "marginal text" and the function is used for adding text to any of the four margins of a graph. The sides of the graph are numbered from the bottom clockwise so that side=3 corresponds to the top. The option line=.5 controls how far from the border of the graph the text will appear. The two graphs are shown below.
- Windows: You may need to resize the graphics window to get everything to display correctly. Everything in the graph will usually resize correctly but sometimes not everything ends up being ideal. Try resubmitting all of the commands after resizing the window.
- Macintosh: The quartz device window cannot be resized manually. To resize it go to R>Preferences and click on the Quartz icon at the top of the window that appears. Check the box "Override R Quartz width/height parameters" and enter your own values for width and height. For the figure shown below I used width 8 inches and height 4.5 inches (the default height). These settings will not apply to the current window. Instead you must open a new quartz device window. Choose Window > New Quartz Device Window. Run the graphics commands listed above in the console window and the new graph will appear in the new window at the new size settings.

Fig. 17 Comparing temperature and disease anomalies
- Observe that both sets of anomalies occur in the same geographic sectors and at the same time of year. Curiously except for the southern-most sectors, the reefs with the high temperature anomalies are not the same reefs that have the high disease anomalies. They turn out to be their neighbors!
- The logical dilemma posed by the confounding of the relationship of interest with space and time is that we now have competing hypotheses that we are unable to distinguish. In particular we don't know if the pattern we observe is due to a causal relation between temperature and disease, or a correlative relation due to similar geography, or a correlative relation due to time of the year. We can't alter location but we could have altered the sample time. It's impossible to predict what we would have seen had the two middle latitude sectors been sampled say earlier in 2002 than at the times they were.
- To see what objects have been created during our R session use the objects command.
> objects()
[1] "corals" "model1" "model2"
- The collection of objects created during an R session can be saved by saving the R workspace. Then at a later time this workspace can be loaded and the objects restored for use.
- Windows: Select File>Save Workspace... and choose a name and place for storing the file. When R is started up again you can load this workspace by choosing File>Load Workspace. Alternatively you can start up R by double clicking on the icon of the workspace file that was previously created. The advantage of loading the workspace in this second method is that the folder that workspace is in becomes the default location for saving all graphs and files while R is running.
- Macintosh: Choose Workspace>Save Workspace File. To open this file when starting a new R session choose Workspace>Load Workspace File.
- When quitting R you will be prompted to save the workspace. If you've already saved the workspace there is no need to do this again. By default, R will store the current workspace as the default workspace. When you open R again this is the workspace that will be opened. I prefer to always keep this workspace empty so that I can start new projects with a clean slate, so I always save the workspace in before quitting R.
- Old commands can be viewed by opening the history window. The command to do this is history(n) where n is the number of old command lines you wish to see. Eventually the history buffer gets filled and old commands are lost, so you should issue this command frequently if you want to record your work. I usually paste the commands from the history window into a Notepad document (or a Textedit document on the Mac).
Course Home Page