Geostatistics with genetic data : the R package ggene (part 3: anisotropy)

This third post about geostatistical analysis of genetic data deals with anisotropy. Anisotropy is the property of being directionally dependent. It occurs if the spatial pattern differs, when measured along different axes, either in extent or intensity (Goovaerts, 1997).

We illustrate here how the package ggene can be used to compute directional variograms and variogram maps.

 

Directional variograms

svariog computes omnidirectional variograms by default. The arguments direction, tolerance and unit.angle define the main direction, the tolerance and the unit of measure for the angle.

We use the simulated dataset aniso to illustrate this feature :

data(aniso)
va <- svariog(X=aniso, plot=TRUE)

plot of chunk unnamed-chunk-2

The omni-directional variogram reveals a clear spatial genetic structure. We then consider 4 directions : 0, 45, 90 and 135 degrees and use a tolerance of 22.5° in each case.

Does the spatial genetic structure differ according to direction?

# compute variograms in different directions
d0_225 <- svariog(X=aniso,direction=0, tolerance=22.5, unit.angle="degrees")
d45_225 <- svariog(X=aniso,direction=45, tolerance=22.5, unit.angle="degrees")
d90_225 <- svariog(X=aniso,direction=90, tolerance=22.5, unit.angle="degrees")
d135_225 <- svariog(X=aniso,direction=135, tolerance=22.5, unit.angle="degrees")

We then plot the resulting variograms:

plot(va$svario$u, va$svario$v, type="b", ylim=range(c(va$svario$v, d0_225$svario$v, d45_225$svario$v, d90_225$svario$v, d135_225$svario$v)) , xlab="distance", ylab="semivariance")

points(d0_225$svario$u, d0_225$svario$v, type="b", lty=2)
points(d45_225$svario$u, d45_225$svario$v, type="b", col="red", lty=2)
points(d90_225$svario$u, d90_225$svario$v, type="b", col="blue", lty=2)
points(d135_225$svario$u, d135_225$svario$v, type="b", col="green", lty=2)

legend("topleft", legend=c("omnidirectional", expression(0 * degree), 
 expression(45 * degree), expression(90 * degree), 
 expression(135 * degree)), lty=c(1,2,2,2,2,2), 
 col=c("black","black","red","blue","green"), bty="n")

plot of chunk unnamed-chunk-4

The variogram computed in the direction 45° differs from the omni-directional variogram as well as from the other directional variograms.

 

Variogram maps

Directional variograms are useful to describe the spatial variation in a given direction but variogram maps are more efficient to search for anisotropy when no a priori knowledge of the direction itself is available (Isaaks & Srivastava, 1989; Goovaerts, 1997).

In ggene variogram maps are computed using the function svarmap.

data(aniso)
map <- svarmap(X=aniso,cutoff=20, width=1)
plot(map)

plot of chunk unnamed-chunk-5

There is a clear directionality in the semivariance values that are lower in the direction 45°.

References

Goovaerts, P., 1997. Geostatistics for Natural Resources Evaluation. Oxford University Press

Isaaks, E.H., Srivastava, R.M., 1989. Applied geostatistics. Oxford University Press.

Advertisements

Managing missing values in ggene

Missing values is a big problem because they are so common in microsatellite datasets that removing them sometimes amounts to loose all the information (individuals or alleles).

ggene cannot handle NAs and thus users must remove individuals/alleles with missing data or convert them to something… else!

One common strategy is to replace NAs by the mean value of the corresponding column or by 0 (see the package adegenet). I won’t discuss these options and simply provide a simple solution to users who cannot remove NAs in their data and have to replace them.

We define below two functions NAtoMean and NAto0 which replace NA by the mean of the column (allele) or by 0, respectively.

NAtoMean <- function(x){
    x[is.na(x)] <- mean(x, na.rm = TRUE)
    return(x)}

NAto0 <- function(x){ 
    x[is.na(x)] <- 0
    return(x)}

Now we create a missing data in the $tab of a ggene object:

data(crypho)
crypho$tab[1,1]
## [1] 0
crypho$tab[1,1] <- NA
crypho$tab[1,1]
## [1] NA

If we try to compute the variogram with this altered dataset we get an error message:

var <- svariog(crypho, plot=TRUE)
## Error in FUN(X[[i]], ...): NA/NaN/Inf dans un appel à une fonction externe (argument 4)

Replacing NAs by the mean of the column in the data table is achieved using apply:

crypho$tab <- apply(X=crypho$tab, MARGIN=2, FUN=NAtoMean)
crypho$tab[1,1]
## [1] 0.2872727

The svariog can now compute the variogram:

var <- svariog(crypho, plot=TRUE)

plot of chunk unnamed-chunk-6

NAtoMean and NAto0 can easily be customised.

Geostatistics with genetic data : the R package ggene (part 2: variogram envelop)

In this second item of a set of posts introducing the package ggene, we address how departure form spatial randomness can be examined.

 

Variogram envelop

The function randsvariog allows to compute the variogram envelope by means of randomizations. Each randomization corresponds to a random reallocation of genotypes among individuals followed by the recomputation of the variogram. Readers are referred to Legendre and Legendre (1998) for details about randomization tests in the context of spatial analyses.

randsvariog randomizes the genetic data associated to each individual but genomes are not changed, only their spatial distribution are permuted between individuals, hence the sampling design is kept unchanged.

 

First example

We illustrate this feature using the dataset larix1350 where a set of 189 trees were genotyped (Nardin et al. 2015). Can we detect a spatial structure within this dataset?

First we load the dataset and have a look at the sampling design.

data(larix1300)
# check sample spatial features
plot(larix2300$coord[,1],larix2300$coord[,2], asp=1, xlab="x", ylab="y")

plot of chunk unnamed-chunk-2

Then we compute the variogram and plot it. The variogram shows very little structuration, if any.

# compute variogram
va <- svariog(X=larix1350, uvec=distlag(dist=larix1350$coord, dmin=0, distance.lag=3), plot=TRUE)

plot of chunk unnamed-chunk-3

We now compute the statistical envelop using randsvariog with a limited number of randomizations in this simple example (a large number is otherwise necessary).

# compute statistical envelope
env <- randsvariog(var=va, X=larix1350, nsim=30, bounds=c(0.025, 0.975), save.sim=FALSE)
## ..............................
## done
# plot results
plot(env$svario$u, env$svario$v, ylim=range(env$env), xlab="distance (m)",
     ylab="semivariance")
points(env$svario$u, env$env[,1], type="l")
points(env$svario$u, env$env[,2], type="l")

plot of chunk unnamed-chunk-4

The statistical envelope indicates that most of the semivariance estimates lie between the 0.025 and 0.975 quantiles showing the absence of spatial genetic structure.

 

Second example

In this second example we perform similar analyses using the dataset larix2300:

data(larix2300)

# compute variogram
va <- svariog(X=larix2300, uvec=distlag(dist=larix2300$coord, dmin=0,
  distance.lag=3), plot=FALSE)

# compute statistical envelope
env <- randsvariog(var=va, X=larix2300, nsim=30, bounds=c(0.025, 0.975),
  save.sim=FALSE)
## ..............................
## done
# plot results
plot(env$svario$u, env$svario$v, ylim=range(env$env), xlab="distance (m)",
     ylab="semivariance")
points(env$svario$u, env$env[,1], type="l")
points(env$svario$u, env$env[,2], type="l")

plot of chunk unnamed-chunk-5

In this case, the variogram indicates a clear departure from spatial randomness.

 

References

Legendre, P., Legendre, L., 2012. Numerical Ecology. Elsevier.

Nardin, M., Musch, B., Rousselle, Y., Guérin, V., Sanchez, L., Rossi, J.-P., Gerber, S., Marin, S., Pâques, L.E., Rozenberg, P., 2015. Genetic differentiation of European larch along an altitudinal gradient in the French Alps. Annals of Forest Science 72, 517-527.

Geostatistics with genetic data : the R package ggene (part 1)

Geostatistics is a branch of statistics focusing on spatial and spatiotemporal analyses. It was originally developed in mining geology and slowly percolated in numerous fields of life sciences such as soil science, soil ecology, oceanography and many more.

In 2005, Wagner et al. introduced the use of variogram (an important structure function in geostatistics) with microsatellite datasets but to our knowledge neither software nor R package were available to compute variograms or perform any of the main operations of geostatitics such as model fitting or anisotropy analysis.

We therefore developed the package ggene. It was released in early June 2016 through R-Forge and is currently available from https://r-forge.r-project.org/R/?group_id=2143

ggene allows the computation variogram of gene diversity and model fitting. In addition it offers two techniques to search for anisotropies in spatial genetic structure namely directional variograms and variogram maps. Both haploid and diploid datasets can be analysed and the package provides some functions allowing to weighting for repeated genotypes.

The purpose of the present post is to give an overview of the package and show how to install ggene. The package comes with 2 vignettes and several datasets that make learning easier if not attractive…

 

Downloading and installation

At the present time, ggene is not available from R CRAN and the current version can be installed from the R-Forge repository.

Mac OS users:

install.packages("ggene", repos="http://R-Forge.R-project.org", type="source")

Windows or Linux users:

install.packages("ggene", repos="http://R-Forge.R-project.org")

Once ggene is installed you can visualize the thematic documentation by typing in the R console:

library(ggene)
vignette("ggene_introduction")
vignette("ggene_datasets")

 

Reading haploid dataset

The chunk of code below shows how to read both coordinates and microsatellite data for an haploid dataset.

library(ggene)
# read genetic data
sim <- read.csv(system.file("extdata/sim_01.csv", package="ggene"), header=FALSE)
# read spatial coordinates
xy.sim <- read.csv(system.file("extdata/xysim_01.csv", package="ggene"), header=FALSE)
# create a ggene object
dat.sim <- tab2geo(X=sim, coord=xy.sim)
## The number of individuals is  625 
## The number of locus is  20

 

Reading diploid dataset

The function gene2geo operates the same way but relies on a genind object which are created with the package adegenet from various data input formats such as genepop or genetix (see the documentation of the package for details). Note that ggene has no function allowing to read data directly from files.

Here is an example where we firstly read the genetic data from a file in genepop format:

library(adegenet)
dat <- read.genepop(system.file("extdata/sim_03.gen", package="ggene"), ncode = 3)
## 
##  Converting data from a Genepop .gen file to a genind object... 
## 
## 
## File description:  Simulated data 
## 
## ...done.

The resulting object has various characteristics:

dat
## /// GENIND OBJECT /////////
## 
##  // 625 individuals; 20 loci; 502 alleles; size: 1.3 Mb
## 
##  // Basic content
##    @tab:  625 x 502 matrix of allele counts
##    @loc.n.all: number of alleles per locus (range: 15-36)
##    @loc.fac: locus factor for the 502 columns of @tab
##    @all.names: list of allele names for each locus
##    @ploidy: ploidy of each individual  (range: 2-2)
##    @type:  codom
##    @call: read.genepop(file = system.file("extdata/sim_03.gen", package = "ggene"), 
##     ncode = 3)
## 
##  // Optional content
##    @pop: population of each individual (group size range: 625-625)

There are 625 individuals.

Let’s now read the geographic coordinates:

xy <- read.csv(system.file("extdata/xysim_01.csv", package="ggene"), header=FALSE)
dim(xy)
## [1] 625   2

The package ggene requires objects of class ggene that are created using the function gene2geo:

data <- gene2geo(X=dat, coord=xy)
## The number of individuals is  625 
## The number of locus is  20
class(data)
## [1] "list"  "ggene"

 

Your first variogram!

svariog computes the variogram for the gene diversity and also returns the locus-by-locus semivariance and the conventional estimator of the gene diversity (Wagner et al. 2005).

The chunks of code below shows how to compute the variogram for the larix dataset (Nardin et al. 2015) using default options. It is good practice to check the sampling scheme before any analysis:

data(larix2300)
# check sample spatial features
plot(larix2300$coord[,1],larix2300$coord[,2], asp=1, xlab="x", ylab="y")

plot of chunk unnamed-chunk-10

If plot is set to TRUE svariog produces a plot of the variogram. The dashed line shows the conventional estimator of the gene diversity.

# compute variogram
va <- svariog(X=larix2300, plot=TRUE)

plot of chunk unnamed-chunk-11

The distance interval can easily be changed to explore how the variogram is affected. Increasing to distance interval tends to smooth out the signal and to better recognize the large-scale patterns. This is easily done using the function distlag:

d <- distlag(dist=larix2300$coord, dmin=0, distance.lag=1)
d
##  [1]  0.5  1.5  2.5  3.5  4.5  5.5  6.5  7.5  8.5  9.5 10.5 11.5 12.5 13.5
## [15] 14.5 15.5 16.5 17.5 18.5 19.5 20.5 21.5 22.5 23.5 24.5 25.5 26.5 27.5
## [29] 28.5 29.5 30.5 31.5 32.5 33.5 34.5 35.5 36.5 37.5 38.5 39.5 40.5 41.5
## [43] 42.5 43.5 44.5 45.5 46.5 47.5 48.5 49.5 50.5 51.5 52.5 53.5 54.5 55.5
## [57] 56.5 57.5 58.5 59.5 60.5 61.5 62.5 63.5 64.5 65.5 66.5 67.5 68.5 69.5
## [71] 70.5 71.5 72.5 73.5 74.5 75.5 76.5 77.5 78.5 79.5 80.5 81.5 82.5 83.5
## [85] 84.5 85.5 86.5 87.5 88.5 89.5 90.5 91.5 92.5 93.5 94.5 95.5 96.5 97.5
## [99] 98.5

The variogram is computed with uvec=d:

va <- svariog(X=larix2300, uvec=d, plot=TRUE)

plot of chunk unnamed-chunk-13

The resulting variogram reveals a very clear spatial genetic structure mostly occuring at distances < ca. 18 m.

 

References:

Nardin, M., Guerin, V., Musch, B., Rousselle, Y., Sanchez, L., Rossi, J.-P., Gerber, S., Paques, L., Rozenberg, P. 2015. Genetic differentiation of European larch along an altitudinal gradient in the French Alps. Annals of Forest Science 72, 517-527.

Journel, A.G., and C.J. Huijbregts. 1978. Mining Geostatistics. Academic Press.

Wagner, H.H., R. Holderegger, S. Werth, F. Gugerli, S.E. Hoebee, and C. Scheidegger. 2005. Variogram Analysis of the Spatial Genetic Structure of Continuous Populations Using Multilocus Microsatellite Data. Genetics 169: 1739–52.

Extracting circular buffers from a raster in R

31 January 2014 1 comment

Cropping circular buffers centered on a point is pretty easy in R, at least when you’re dealing with rasters. Two packages allow to do this task, if not more. Let’s see how we can do with the package raster.

Note that in the following example, we assume that the package adehabitat is properly installed.


library(raster)
f<-system.file("ascfiles/elevation.asc", package="adehabitat")
raster<-rasterbis<-raster(f)# read example file from adehabitat folder
plot(raster)# have a look
rasterbis[]<-NA # empty one of the rasters
points<-data.frame(704000, 3162000);names(points)<-c("x","y")# create coordinates of a sample point
cell<-cellFromXY(object=raster,xy=points[1,])# get the cell number corresponding to that sample point
radius<-1000 # set the buffer radius
rasterbis[cell]<-1 # set the value of the corresponding cell to 1 (the others remaining NAs)
plot(raster)
points(points$x,points$y,pch=3)# one more look...
buf <- buffer(rasterbis, width=radius) # buffer extraction per se
plot(buf)
p<-which(buf[]==1) # identify which cells belong to the buffer
buf[p]<-raster[p]# pick up original values
# Now we crop the buffer :
xmin<-points$x[1] - 2*radius
xmax<-points$x[1] + 2*radius
ymin<-points$y[1] - 2*radius
ymax<-points$y[1] + 2*radius
ev<-c(xmin, xmax, ymin, ymax)
e<-extent(ev)
buffer<-crop(buf,e) # that's it :
plot(buffer)
points(points$x,points$y,pch=3)

I hope this might be helpful !
Cheers.
Next post : same thing with the package adehabitat

————————————–
EDIT: Robert Hijmans send me two sets of R commands which are more compact and very efficient ! Thank you Robert ! I add this code below so interested readers can make use of it. Thanks also to Curlew for the contribution (see comments below).

————————————–

library(raster)
library(rgeos)
points <- SpatialPoints(cbind(704000, 3162000))
pbuf <- gBuffer(points, widt=1000)
buf <- mask(raster, pbuf)
buffer <- trim(buf, pad=2)

library(raster)
raster <- raster(system.file(“ascfiles/elevation.asc”, package=”adehabitat”))
r1 <- raster(f)
points <- cbind(704000, 3162000)
r2 <- rasterize(points, r1, field=1)
pbuf <- buffer(r2, width=1000)
buf <- mask(raster, pbuf)
buffer <- trim(buf, pad=2)

 

Categories: Non classé, R

When Google Street View helps studying species geographical distribution

This post is about the way we could use the Google street view (GSV) data base to gather data allowing to describe species geographical distribution. There have been some attempts at using the panoramic imagery provided by GSV in social science [1] and preventive medicine [2] but to my knowledge, very few ecological applications have been published so far.

A pine processionary moth silk nest. Photo by Jérôme rousselet.

A pine processionary moth silk nest. Photo by Jérôme rousselet.

Obviously, only organisms that can be reliably detected by road sampling can be assessed using street imagery. We have recently published our first results [Rousselet et al 2013 PLOS ONE e74918] showing how the GSV imagery could be used to perform in silico sampling of species occurrences. Our biological model was the pine processionary moth Thaumetopoea pityocampa, a species easily visible from the roads during winter because it builds white silk nests in its host tree foliage. Readers can get the paper for free from the journal that is an open access publication.

Reference:
Rousselet, J., Imbert, C.-E., Dekri, A., Garcia, J., Goussard, F., Vincent, B., Denux, O., Robinet, C., Dorkeld, F., Roques, A., Rossi, J.-P., 2013, Assessing Species Distribution Using Google Street View: A Pilot Study with the Pine Processionary Moth, PLoS One 8(10):e74918.

A press release was issued today (in French) and is available from my institute’s web site.

Abstract
Mapping species spatial distribution using spatial inference and prediction requires a lot of data. Occurrence data are generally not easily available from the literature and are very time-consuming to collect in the field. For that reason, we designed a survey to explore to which extent large-scale databases such as Google maps and Google street view could be used to derive valid occurrence data. We worked with the Pine Processionary Moth (PPM) Thaumetopoea pityocampa because the larvae of that moth build silk nests that are easily visible. The presence of the species at one location can therefore be inferred from visual records derived from the panoramic views available from Google street view. We designed a standardized procedure allowing evaluating the presence of the PPM on a sampling grid covering the landscape under study. The outputs were compared to field data. We investigated two landscapes using grids of different extent and mesh size. Data derived from Google street view were highly similar to field data in the large-scale analysis based on a square grid with a mesh of 16 km (96% of matching records). Using a 2 km mesh size led to a strong divergence between field and Google-derived data (46% of matching records). We conclude that Google database might provide useful occurrence data for mapping the distribution of species which presence can be visually evaluated such as the PPM. However, the accuracy of the output strongly depends on the spatial scales considered and on the sampling grid used. Other factors such as the coverage of Google street view network with regards to sampling grid size and the spatial distribution of host trees with regards to road network may also be determinant.


1. Odgers CL, Caspi A, Bates CJ, Sampson RJ, Moffitt TE (2012) Systematic social observation of children’s neighborhoods using Google street view: a reliable and cost-effective method. J Child Psychol Psychiatry 53: 1009-1017.
2. Rundle AG, Bader MDM, Richards CA, Neckerman KM, Teitler JO (2011) Using Google street view to audit neighborhood environments. Am J Prev Med 40: 94-100.

Launching of the SESAME project (landScapE dynamicS and pest ManagEment)

We have recently launched our project SESAME (for landScapE dynamicS and pest ManagEment) and published an associated web site so as to make available our methodology and results. At the moment the web site is only available in French (my native langage) but I hope I will find the time to provide a translation in English.

Links to the web site:
landScapE dynamicS and pest ManagEment

SESAME overview:
SESAME (landScapE dynamicS and pest ManagEment) est projet pluridisciplinaire intégrant des approches d’épidémiologie du paysage, de génétique des populations, de modélisation et d’écologie afin de comprendre comment la dynamique des paysages affecte la distribution, la dispersion et l’adaptation de différents types de bioagresseurs.

Categories: Project