# Lecture 42 (lab)—Friday, April 27, 2007

### What was covered?

• R topics covered
• Using the mantel function of the vegan package to carry out Mantel tests
• Using the gls function from the nlme package to account for residual correlation in regression models
• Estimating and fitting variogram models using the geoR and nlme packages
• Adding to regression models using the update function
• Carrying out Mantel tests
• Estimating an empirical semivariogram
• Fitting theoretical models to the empirical semivariogram
• Accounting for spatial correlation in regression models

### R functions and commands demonstrated

• as.geodata (from geoR) creates a geoR object
• dist calculates the set of pairwise dissimilarities for observations. It returns an object of type distance
• gls fits regression models using generalized least squares
• mantel (from vegan package) carries out Mantel tests on distance matrices
• variofit (from geoR) fits a theoretical semivariogram model to an empirical semivariogram
• variog (from geoR) estimates an empirical semivariogram
• Variogram (from nlme) estimates an empirical semivariogram

### R function options

• corr= (argument to gls) for specifying a correlation structure for the residuals
• cov.model= (argument to variofit) specifies the theoretical variogram model to be fit
• ini= (argument to variofit) specifies initial values for the partial sill and range of the theoretical variogram
• max.dist= (argument to variog) largest distance to use in constructing the empirical semivariogram
• nugget= (argument to variofit) is given values TRUE or FALSE and determines whether a nugget is estimated
• option= (argument to variog) the type of semivariogram desired. Choices included bin, cloud, and smooth.
• uvec= (argument to variog) specifies the endpoints of the bins for pooling the squared differences to produce a smooth semivariogram
• vario= (argument to variofit) specifies the variog object that is then used in fitting a theoretical variogram

### R packages used

• geoR for the as.geodata, variog, and variofit functions for constructing the semivariogram
• nlme for gls and Variogram functions. Also source of the Wheat2 data set.
• vegan for the mantel function

### Preliminaries

• Spatial statistics is a big topic and we'll only scratch the surface today. A large number of packages are available in R for spatial analysis and most of these overlap extensively in their features. We'll focus on two: the geoR package for variogram analysis and the vegan package for its Mantel test. The nlme package that we've used previously for multilevel modeling is also relevant here in that it allows the user to specify a spatial structure for regression models.
• A good source of information on R packages and what they contain can be found at the CRAN task views site. The spatial section at this site lists 40 different R packages for performing spatial analysis.
• In this session we'll learn how to check for spatial correlation in the residuals from regression. We'll also examine ways to account for such correlation in regression models.

### Checking for spatial correlation with a Mantel test

• The data we'll use for illustration is the Wheat2 data set in the nlme package taken from Stroup et al. (1994). It's described as an agronomic wheat trial in which the yields of 56 varieties of wheat were compared. The trials were arranged in four geographic blocks in which each variety was planted once. Thus the block variable is an attempt to control for heterogeneity in the data due to the large spatial extent of the experiment.

> library(nlme)
> data(Wheat2)
> names(Wheat2)

[1] "Block" "variety" "yield" "latitude" "longitude"

Fig. 1  Spatial arrangement of plots

• To understand the experimental layout I plot latitude against longitude coloring the points according to their block.

> plot(latitude~longitude, col=as.numeric(Block), data=Wheat2)

• As the figure shows the blocking is an attempt to control for the effect of latitude. It's not clear to me what the units for latitude and longitude represent. If the numbers are interpreted as degrees then the study region would have to be in Africa and encompass nearly half the continent. My understanding from reading Stroup et al. (1994) is that the study was done in Nebraska.
• Resolving this issue is important for calculating distances as we'll later be doing. For distant points on the earth's surface the proper way to calculate distances is as great circle distances for which latitude and longitude provide the spherical coordinates of points. For nearby points planar projections are satisfactory approximations of great circle distances.
• Stroup et al. (1994) provide no further explanation on how latitude and longitude were recorded. The Wheat2 data set is commonly used as an example in the documentation for mixed model software. For example, SAS Proc Mixed and the Splus nlme library both use it. In both cases the latitude and longitude values were used as raw numbers for calculating distances. Lacking information to do otherwise I will follow their lead.
• We can obtain a crude measure of spatial correlation by carrying out a Mantel test. The Mantel test works with dissimilarity matrices. With n observations there are distinct pairs of observations and hence the same number of dissimilarities. For each pair of observations we calculate their dissimilarity in two distinct ways and then compare the results. The Mantel test calculates the Pearson correlation coefficient of the two sets of dissimilarities and uses a permutation test to assess its statistical significance.
• Usually one set of dissimilarities is the set of geographic distances between the points. The other set uses the absolute difference in some attribute measured at each location.
• The permutation test is carried out by simultaneously permuting the rows and columns of one of the dissimilarity matrices. It is necessary to permute both rows and columns in order to preserve the distance structure of the data. This is equivalent to randomly reassigning attribute values to locations.
• To begin we need to calculate the distance matrices. We can use the R function dist for this purpose. The first argument to dist should be a vector or a matrix where each row corresponds to an observation.

#geographic distances
> dist1<-dist(cbind(Wheat2\$latitude, Wheat2\$longitude))
#differences in yield
> dist2<-dist(Wheat2\$yield)

• The package vegan contains the mantel function for carrying out the Mantel test. It by default uses 1000 permutations for the permutation test. This should be adequate for a crude estimate of statistical significance.

> library(vegan)
> mantel(dist1,dist2)

Mantel statistic based on Pearson's product-moment correlation

Call:
mantel(xdis = dist1, ydis = dist2)

Mantel statistic r: 0.2005
Significance: < 0.001

Empirical upper confidence limits of r:
90%    95%  97.5%    99%
0.0338 0.0416 0.0500 0.0596

Based on 1000 permutations

• The reported Mantel correlation is 0.20 and the p-value is 0.001. Since there were 1000 permutations this means that the observed Mantel correlation was greater than the Mantel correlation obtained with any of the data permutations.
• Of course it's not news that agricultural yield varies with geography. What's important statistically is that such spatial correlation is accounted for in our model, either explicitly or implicitly. We investigate whether this is the case by fitting a series of models that attempt to relate yield to variety controlling for geography. The simplest attempt at this is to add block as as a predictor in the regression model.

> model2<-lm(yield~variety+Block, data=Wheat2)
> dist3<-dist(residuals(model2))
> mantel(dist1,dist3)

Mantel statistic based on Pearson's product-moment correlation

Call:
mantel(xdis = dist1, ydis = dist3)

Mantel statistic r: 0.1061
Significance: < 0.001

Empirical upper confidence limits of r:
90%    95%  97.5%    99%
0.0302 0.0415 0.0472 0.0560

Based on 1000 permutations

• From the output we see that including block (and variety) as predictors has accounted for some of the original spatial variation in yield, but the residual spatial correlation is still significantly greater than zero.
• We can take this further by incorporating latitude and longitude directly in the model. Here are the results for a model that is additive in latitude and longitude.

> model4<-lm(yield~variety+Block+latitude+longitude, data=Wheat2)
> dist4<-dist(residuals(model4))
> mantel(dist1,dist4)

Mantel statistic based on Pearson's product-moment correlation

Call:
mantel(xdis = dist1, ydis = dist4)

Mantel statistic r: 0.04557
Significance: 0.024

Empirical upper confidence limits of r:
90%    95%  97.5%    99%
0.0313 0.0384 0.0444 0.0571

Based on 1000 permutations

• This has reduced the correlation even further, although it's still statistically significant.
• Notice that in all the cases we've looked at the Mantel correlation has been fairly small, yet still statistically significant. The significance arises at least partly from the large amount of data we have. The 224 observations translate into 24,976 distances.
• This Mantel correlation in the last model 0.045. An ordinary Pearson correlation this small would never be considered important regardless of its statistical significance. But given that we're looking at distances rather than the actual observational values it's unclear how to interpret correlations such as this.
• At this point we could continue to try to construct models involving latitude and longitude perhaps including interactions, quadratic terms, etc., but such modeling is clearly ad hoc and unmotivated by theory. Whatever is causing the spatial correlation is probably not related to latitude and longitude in any simple fashion. Instead of trying to account for the spatial correlation by further enhancing the deterministic portion of the model, a more parsimonious approach is to modify the random portion of the model. We do this by specifying a correlation structure for the errors based on the inter-observational distances. I illustrate this approach next.

### Fitting the empirical semivariogram

• The Mantel test is a crude, all or nothing, sort of tool that does little to help understand the nature of the spatial correlation except to say that it exists. To make further progress we need a way to visualize and model the correlation as a function of inter-observational distances. For geostatistical data the semivariogram is the preferred tool for doing this. While the nlme package has a Variogram function, for serious modeling I prefer to use dedicated spatial packages such as geoR to develop geostatistical models. The geoR package is a comprehensive spatial analysis package written by a geographer. It works well except when given extremely large data sets. In those cases I've found the more limited gstat package to be a good alternative. The variogram functions in gstat can be up to ten times faster in estimating the empirical semivariogram than the functions in geoR.
• For comparability with the models that will follow I first refit the randomized block design model using the gls function of the nlme package.

> gmodel2<-gls(yield~variety+Block, data=Wheat2, method='ML')

• The acronym gls stands for "generalized least squares" and it extends ordinary least squares estimation to error structures other than independence. If no structure is specified the model output is identical to that returned by the lm function. The feature of gls I wish to use is the ability to extract normalized residuals.
• I begin by creating a matrix that contains as columns each observation's latitude, its longitude, and its residual (normalized) from the regression model in that order. This is the order required by geoR. The response variable used in the semivariogram should appear in the third column while the vectors used in calculating the distances should appear in columns 1 and 2.

> newdat<-data.frame(Wheat2\$latitude, Wheat2\$longitude, residuals(gmodel2, type='n'))

• The geoR package like many R packages requires that data frames be converted into specialized objects before analysis. (Recall that the survey package had the same requirement. There the first step was to create a survey object.) In geoR this conversion is done with the as.geodata function.

> library(geoR)
> geodat<-as.geodata(newdat)

• The geoR function variog is used to estimate the empirical semivariogram. It accepts a large number of arguments of which I illustrate the following.
• geodata= specifies the geoR object previously created with the as.geodata function
• uvec= specifies a sequence of values that define the centers of the bins for binning the squared differences. An alternative to specifying uvec is to specify the argument breaks, the endpoints of the binning intervals
• max.dist= specifies the maximum distance used in constructing the variogram
• option= specifies the type of variogram, bin, cloud, or smooth.
• After some experimentation I settle on the following specifications.

> variog(geodat, uvec=1:30, max.dist=30, option='bin')->geodat.v1

• There are a number of useful results returned by variog. The component n records the number of distances that contributed to the smoothed estimate in each bin.

> names(geodat.v1)
[1] "u"                "v"              "n"
[4] "sd"               "bins.lim"       "ind.bin"
[7] "var.mark"         "beta.ols"       "output.type"
[10] "max.dist"         "estimator.type" "n.data"
[13] "lambda"           "trend"          "pairs.min"
[16] "nugget.tolerance" "direction"      "tolerance"
[19] "uvec"             "call"

> geodat.v1\$n
[1] 210  199  766 538 832 462  434 1406 695 633  589 1556 1000 813
[15] 628 1207 1200 837 731 474 1619  668 558 428 1131  547  465 196
[29] 165

• If we calculate the semivariogram out to a distance of 50, the number of observations in each bin drops off precipitously.

> variog(geodat, uvec=1:50, max.dist=50, option='bin')->geodat.v1
variog: computing omnidirectional variogram
> geodat.v1\$n
[1] 210  199  766 538 832 462  434 1406 695 633  589 1556 1000 813
[15] 628 1207 1200 837 731 474 1619  668 558 428 1131  547  465 196
[29] 855  480  322 192 417 449  259  124 117 386  130  122   58 106
[43]  56   30   23  13   9   5

Fig. 2  Empirical semivariogram for residuals of yield model

• The usual recommendation is that there should be at least 30 observations in each bin. It is also usually recommended that the variogram not be estimated beyond half the maximal distance in the data set. These recommendations are the basis of my choice of 30 for the limits. Based on the number of observations in each bin we clearly could choose narrower bins (and more of them), but as we'll see the pattern of correlation is already fairly clear from the data.
• The variog function has a plot method. Thus if we pass the variog object to the plot function we will obtain a plot of the list components v (the semivariogram estimate) against u (the distance) with the axes appropriately labeled.

> variog(geodat, uvec=1:50, max.dist=30, option='bin')-> geodat.v1
> plot(geodat.v1)

• The plot shows a fairly typical semivariogram. The semivariance increases with distance (meaning that the correlation decreases) and although there is considerable scatter it appears to level off somewhere around a distance of 20 or 25. Observe that the semivariance is nonzero near the origin. Based on the terminology discussed in lecture 41 we would estimate there to be a nugget effect of 0.4, a sill at 1.2 (giving a partial sill of 0.8), and a range somewhere between 20–25.

### Modeling spatial correlation with the semivariogram

• To incorporate spatial correlation into our regression model we need to estimate a theoretical semivariogram model. The nlme package offers five spatial correlation structures for this purpose: corExp (exponential), corGaus (Gaussian), corLin (linear), corRatio (rational quadratic), and corSpher (spherical). All except the rational quadratic can be fit in geoR using its variofit function.
• Fig. 3  Semivariogram models

• I specify four arguments to variofit
• vario= specifies a variog object
• ini.cov.pars= specifies a 2-vector of initial estimates for the partial sill and the range. The partial sill is the difference between the sill and the nugget.
• cov.model= specifies the spatial covariance model desired
• nugget= specifies an initial estimate of the nugget.
• I fit the four covariance models in succession and add them to the plotted empirical semivariogram with the lines function.

> ols2<-variofit(geodat.v1, ini=c(.8,30), nugget=.4, cov.model='exponential')
> lines(ols2,col=2)
> ols3<-variofit(geodat.v1, ini=c(.8,30), nugget=.4, cov.model='gaussian')
> lines(ols3,col=4)
> ols4<-variofit(geodat.v1, ini=c(.8,30), nugget=.4, cov.model='spherical')
> lines(ols4,col=3)
> ols5<-variofit(geodat.v1, ini=c(.8,30), nugget=.4, cov.model='linear')
> lines(ols5,col='gray70')
> legend(18,.5, c('exponential', 'Gaussian', 'spherical', 'linear'), col=c(2:4, 'gray70'), lty=rep(1,4), cex=.8, bty='n')

• As is clear from the plot, any of the semivariogram models, even the linear, could be a suitable model. In such a situation it makes sense to fit them all and compare the results using AIC.
• Formulas for the five spatial correlation structures available in nlme are shown below.
 Model Formula Exponential Gaussian Linear Rational quadratic Spherical

Here I( ) is the indicator function such that .

• A nugget effect is introduced into these functions as follows.

The approach in nlme is different than in geoR (where the nugget is an additive term rather than a multiplicative one).

• To add a correlation structure to our generalized least squares model I add the corr argument to the model specification. A simpler way than respecifying the model in its entirety again is to use the update function. The update function retains all the old features of the model and includes whatever new arguments are added.

#original model
> gmodel2<-gls(yield~variety+Block, data=Wheat2, method='ML')
> out1a<-update(gmodel2, corr=corSpher(c(20,.4), form=~latitude+longitude, nugget=T))

• The value given the corr argument is what's referred to as a correlation structure class, although it has the appearance of a function. For the model out1a I specify the corSpher class that produces a spherical correlation structure.
• The first argument to the corSpher class is a vector specifying initial values for the range and nugget (in a multiplicative sense).
• The second argument is a one-sided formula that specifies the coordinates of the position vector of an observation.
• The last argument is a Boolean value for the nugget argument that determines whether or not a nugget effect is estimated.
• I next fit models for the remaining four spatial correlation structures.

> out1b <-update(gmodel2, corr=corExp(c(20,.4), form=~latitude+longitude, nugget=T))
> out1c<-update(gmodel2, corr=corGaus(c(20,.4), form=~latitude+longitude, nugget=T))
> out1d<-update(gmodel2, corr=corLin(c(20,.4), form=~latitude+longitude, nugget=T))
> out1e<-update(gmodel2, corr=corRatio(c(20,.4), form=~latitude+longitude, nugget=T))

• The model with the linear correlation structure failed to converge. Since this was clearly the worst of the spatial models (Fig. 3) I just drop it from consideration. I compare the remaining models to the original model using AIC.

> sapply(list(gmodel2,out1a,out1b,out1c,out1e),AIC)
[1] 1561.622 1370.701 1372.923 1369.888 1369.571

• It's clear that all four models that include a spatial correlation structure are a huge improvement over the original model. In truth the rational quadratic correlation structure just edges out the Gaussian structure.
• We can compare this result to the model in which we attempted to account for the spatial correlation by including latitude and longitude as predictors.

> gmodel4<-update(gmodel2, .~.+latitude+longitude)
> AIC(gmodel4)

[1] 1459.622

• The spatial correlation models clearly beat this model. Thus including a spatial correlation in the specification of residual error structure has led to a model that is better than one in which the spatial correlation was modeled in an ad hoc fashion using the spatial coordinates as predictors.
• Note: in the update function above I specified the original model formula with the notation .~. to which I then added the two new predictors latitude and longitude.
• Finally we need to assess whether the spatial correlation in the residuals that we initially detected has been eliminated. We could extract the residuals from model out1e, cbind them to the latitude and longitude variables, and run them through geoR again, but we can also use the Variogram function in the nlme package. This function requires a model object, a formula specifying the spatial coordinates, and the type of residual to plot (here specified to be 'n' for normalized). The Variogram function has a plot method so I feed it directly to the plot function.

# plot variogram for the residuals original model
> plot(Variogram(gmodel2, form=~latitude+longitude, maxDist=30, resType='n'))
# plot variogram for residuals of final model
> plot(Variogram(out1e, form=~latitude+longitude, resType='n'))

 (a) (b) Fig. 4 (a) Semivariogram for residuals from an ordinary regression model in which the residuals are assumed to be uncorrelated (b) Semivariogram for residuals from a regression model in which the residuals are given a spatial correlation structure
• Alternatively we can generate the same plot in geoR.

> newdat2<-data.frame(Wheat2\$latitude, Wheat2\$longitude, residuals(out1e, type='n'))
> geodat<-as.geodata(newdat2)
> variog(geodat,uvec=1:30,max.dist=30,option='bin')->geodat.v2

> plot(geodat.v2)

### Cited reference

Stroup, W. W., P. S. Baenziger, and D. K. Mulitze. 1994. Removing spatial variation from wheat yield trials: a comparison of methods. Crop Science 86: 62–66.

 Jack Weiss Phone: (919) 962-5930 E-Mail: jack_weiss@unc.edu Address: Curriculum in Ecology, Box 3275, University of North Carolina, Chapel Hill, 27516 Copyright © 2007 Last Revised--May 1, 2007 URL: http://www.unc.edu/courses/2007spring/enst/562/001/docs/lectures/lecture42.htm