Archive

Archive for the ‘package’ Category

Geostatistics with genetic data : the R package ggene (part 3: anisotropy)

This third post about geostatistical analysis of genetic data deals with anisotropy. Anisotropy is the property of being directionally dependent. It occurs if the spatial pattern differs, when measured along different axes, either in extent or intensity (Goovaerts, 1997).

We illustrate here how the package ggene can be used to compute directional variograms and variogram maps.

 

Directional variograms

svariog computes omnidirectional variograms by default. The arguments direction, tolerance and unit.angle define the main direction, the tolerance and the unit of measure for the angle.

We use the simulated dataset aniso to illustrate this feature :

data(aniso)
va <- svariog(X=aniso, plot=TRUE)

plot of chunk unnamed-chunk-2

The omni-directional variogram reveals a clear spatial genetic structure. We then consider 4 directions : 0, 45, 90 and 135 degrees and use a tolerance of 22.5° in each case.

Does the spatial genetic structure differ according to direction?

# compute variograms in different directions
d0_225 <- svariog(X=aniso,direction=0, tolerance=22.5, unit.angle="degrees")
d45_225 <- svariog(X=aniso,direction=45, tolerance=22.5, unit.angle="degrees")
d90_225 <- svariog(X=aniso,direction=90, tolerance=22.5, unit.angle="degrees")
d135_225 <- svariog(X=aniso,direction=135, tolerance=22.5, unit.angle="degrees")

We then plot the resulting variograms:

plot(va$svario$u, va$svario$v, type="b", ylim=range(c(va$svario$v, d0_225$svario$v, d45_225$svario$v, d90_225$svario$v, d135_225$svario$v)) , xlab="distance", ylab="semivariance")

points(d0_225$svario$u, d0_225$svario$v, type="b", lty=2)
points(d45_225$svario$u, d45_225$svario$v, type="b", col="red", lty=2)
points(d90_225$svario$u, d90_225$svario$v, type="b", col="blue", lty=2)
points(d135_225$svario$u, d135_225$svario$v, type="b", col="green", lty=2)

legend("topleft", legend=c("omnidirectional", expression(0 * degree), 
 expression(45 * degree), expression(90 * degree), 
 expression(135 * degree)), lty=c(1,2,2,2,2,2), 
 col=c("black","black","red","blue","green"), bty="n")

plot of chunk unnamed-chunk-4

The variogram computed in the direction 45° differs from the omni-directional variogram as well as from the other directional variograms.

 

Variogram maps

Directional variograms are useful to describe the spatial variation in a given direction but variogram maps are more efficient to search for anisotropy when no a priori knowledge of the direction itself is available (Isaaks & Srivastava, 1989; Goovaerts, 1997).

In ggene variogram maps are computed using the function svarmap.

data(aniso)
map <- svarmap(X=aniso,cutoff=20, width=1)
plot(map)

plot of chunk unnamed-chunk-5

There is a clear directionality in the semivariance values that are lower in the direction 45°.

References

Goovaerts, P., 1997. Geostatistics for Natural Resources Evaluation. Oxford University Press

Isaaks, E.H., Srivastava, R.M., 1989. Applied geostatistics. Oxford University Press.

Advertisements

Managing missing values in ggene

Missing values is a big problem because they are so common in microsatellite datasets that removing them sometimes amounts to loose all the information (individuals or alleles).

ggene cannot handle NAs and thus users must remove individuals/alleles with missing data or convert them to something… else!

One common strategy is to replace NAs by the mean value of the corresponding column or by 0 (see the package adegenet). I won’t discuss these options and simply provide a simple solution to users who cannot remove NAs in their data and have to replace them.

We define below two functions NAtoMean and NAto0 which replace NA by the mean of the column (allele) or by 0, respectively.

NAtoMean <- function(x){
    x[is.na(x)] <- mean(x, na.rm = TRUE)
    return(x)}

NAto0 <- function(x){ 
    x[is.na(x)] <- 0
    return(x)}

Now we create a missing data in the $tab of a ggene object:

data(crypho)
crypho$tab[1,1]
## [1] 0
crypho$tab[1,1] <- NA
crypho$tab[1,1]
## [1] NA

If we try to compute the variogram with this altered dataset we get an error message:

var <- svariog(crypho, plot=TRUE)
## Error in FUN(X[[i]], ...): NA/NaN/Inf dans un appel à une fonction externe (argument 4)

Replacing NAs by the mean of the column in the data table is achieved using apply:

crypho$tab <- apply(X=crypho$tab, MARGIN=2, FUN=NAtoMean)
crypho$tab[1,1]
## [1] 0.2872727

The svariog can now compute the variogram:

var <- svariog(crypho, plot=TRUE)

plot of chunk unnamed-chunk-6

NAtoMean and NAto0 can easily be customised.

Geostatistics with genetic data : the R package ggene (part 2: variogram envelop)

In this second item of a set of posts introducing the package ggene, we address how departure form spatial randomness can be examined.

 

Variogram envelop

The function randsvariog allows to compute the variogram envelope by means of randomizations. Each randomization corresponds to a random reallocation of genotypes among individuals followed by the recomputation of the variogram. Readers are referred to Legendre and Legendre (1998) for details about randomization tests in the context of spatial analyses.

randsvariog randomizes the genetic data associated to each individual but genomes are not changed, only their spatial distribution are permuted between individuals, hence the sampling design is kept unchanged.

 

First example

We illustrate this feature using the dataset larix1350 where a set of 189 trees were genotyped (Nardin et al. 2015). Can we detect a spatial structure within this dataset?

First we load the dataset and have a look at the sampling design.

data(larix1300)
# check sample spatial features
plot(larix2300$coord[,1],larix2300$coord[,2], asp=1, xlab="x", ylab="y")

plot of chunk unnamed-chunk-2

Then we compute the variogram and plot it. The variogram shows very little structuration, if any.

# compute variogram
va <- svariog(X=larix1350, uvec=distlag(dist=larix1350$coord, dmin=0, distance.lag=3), plot=TRUE)

plot of chunk unnamed-chunk-3

We now compute the statistical envelop using randsvariog with a limited number of randomizations in this simple example (a large number is otherwise necessary).

# compute statistical envelope
env <- randsvariog(var=va, X=larix1350, nsim=30, bounds=c(0.025, 0.975), save.sim=FALSE)
## ..............................
## done
# plot results
plot(env$svario$u, env$svario$v, ylim=range(env$env), xlab="distance (m)",
     ylab="semivariance")
points(env$svario$u, env$env[,1], type="l")
points(env$svario$u, env$env[,2], type="l")

plot of chunk unnamed-chunk-4

The statistical envelope indicates that most of the semivariance estimates lie between the 0.025 and 0.975 quantiles showing the absence of spatial genetic structure.

 

Second example

In this second example we perform similar analyses using the dataset larix2300:

data(larix2300)

# compute variogram
va <- svariog(X=larix2300, uvec=distlag(dist=larix2300$coord, dmin=0,
  distance.lag=3), plot=FALSE)

# compute statistical envelope
env <- randsvariog(var=va, X=larix2300, nsim=30, bounds=c(0.025, 0.975),
  save.sim=FALSE)
## ..............................
## done
# plot results
plot(env$svario$u, env$svario$v, ylim=range(env$env), xlab="distance (m)",
     ylab="semivariance")
points(env$svario$u, env$env[,1], type="l")
points(env$svario$u, env$env[,2], type="l")

plot of chunk unnamed-chunk-5

In this case, the variogram indicates a clear departure from spatial randomness.

 

References

Legendre, P., Legendre, L., 2012. Numerical Ecology. Elsevier.

Nardin, M., Musch, B., Rousselle, Y., Guérin, V., Sanchez, L., Rossi, J.-P., Gerber, S., Marin, S., Pâques, L.E., Rozenberg, P., 2015. Genetic differentiation of European larch along an altitudinal gradient in the French Alps. Annals of Forest Science 72, 517-527.

Geostatistics with genetic data : the R package ggene (part 1)

Geostatistics is a branch of statistics focusing on spatial and spatiotemporal analyses. It was originally developed in mining geology and slowly percolated in numerous fields of life sciences such as soil science, soil ecology, oceanography and many more.

In 2005, Wagner et al. introduced the use of variogram (an important structure function in geostatistics) with microsatellite datasets but to our knowledge neither software nor R package were available to compute variograms or perform any of the main operations of geostatitics such as model fitting or anisotropy analysis.

We therefore developed the package ggene. It was released in early June 2016 through R-Forge and is currently available from https://r-forge.r-project.org/R/?group_id=2143

ggene allows the computation variogram of gene diversity and model fitting. In addition it offers two techniques to search for anisotropies in spatial genetic structure namely directional variograms and variogram maps. Both haploid and diploid datasets can be analysed and the package provides some functions allowing to weighting for repeated genotypes.

The purpose of the present post is to give an overview of the package and show how to install ggene. The package comes with 2 vignettes and several datasets that make learning easier if not attractive…

 

Downloading and installation

At the present time, ggene is not available from R CRAN and the current version can be installed from the R-Forge repository.

Mac OS users:

install.packages("ggene", repos="http://R-Forge.R-project.org", type="source")

Windows or Linux users:

install.packages("ggene", repos="http://R-Forge.R-project.org")

Once ggene is installed you can visualize the thematic documentation by typing in the R console:

library(ggene)
vignette("ggene_introduction")
vignette("ggene_datasets")

 

Reading haploid dataset

The chunk of code below shows how to read both coordinates and microsatellite data for an haploid dataset.

library(ggene)
# read genetic data
sim <- read.csv(system.file("extdata/sim_01.csv", package="ggene"), header=FALSE)
# read spatial coordinates
xy.sim <- read.csv(system.file("extdata/xysim_01.csv", package="ggene"), header=FALSE)
# create a ggene object
dat.sim <- tab2geo(X=sim, coord=xy.sim)
## The number of individuals is  625 
## The number of locus is  20

 

Reading diploid dataset

The function gene2geo operates the same way but relies on a genind object which are created with the package adegenet from various data input formats such as genepop or genetix (see the documentation of the package for details). Note that ggene has no function allowing to read data directly from files.

Here is an example where we firstly read the genetic data from a file in genepop format:

library(adegenet)
dat <- read.genepop(system.file("extdata/sim_03.gen", package="ggene"), ncode = 3)
## 
##  Converting data from a Genepop .gen file to a genind object... 
## 
## 
## File description:  Simulated data 
## 
## ...done.

The resulting object has various characteristics:

dat
## /// GENIND OBJECT /////////
## 
##  // 625 individuals; 20 loci; 502 alleles; size: 1.3 Mb
## 
##  // Basic content
##    @tab:  625 x 502 matrix of allele counts
##    @loc.n.all: number of alleles per locus (range: 15-36)
##    @loc.fac: locus factor for the 502 columns of @tab
##    @all.names: list of allele names for each locus
##    @ploidy: ploidy of each individual  (range: 2-2)
##    @type:  codom
##    @call: read.genepop(file = system.file("extdata/sim_03.gen", package = "ggene"), 
##     ncode = 3)
## 
##  // Optional content
##    @pop: population of each individual (group size range: 625-625)

There are 625 individuals.

Let’s now read the geographic coordinates:

xy <- read.csv(system.file("extdata/xysim_01.csv", package="ggene"), header=FALSE)
dim(xy)
## [1] 625   2

The package ggene requires objects of class ggene that are created using the function gene2geo:

data <- gene2geo(X=dat, coord=xy)
## The number of individuals is  625 
## The number of locus is  20
class(data)
## [1] "list"  "ggene"

 

Your first variogram!

svariog computes the variogram for the gene diversity and also returns the locus-by-locus semivariance and the conventional estimator of the gene diversity (Wagner et al. 2005).

The chunks of code below shows how to compute the variogram for the larix dataset (Nardin et al. 2015) using default options. It is good practice to check the sampling scheme before any analysis:

data(larix2300)
# check sample spatial features
plot(larix2300$coord[,1],larix2300$coord[,2], asp=1, xlab="x", ylab="y")

plot of chunk unnamed-chunk-10

If plot is set to TRUE svariog produces a plot of the variogram. The dashed line shows the conventional estimator of the gene diversity.

# compute variogram
va <- svariog(X=larix2300, plot=TRUE)

plot of chunk unnamed-chunk-11

The distance interval can easily be changed to explore how the variogram is affected. Increasing to distance interval tends to smooth out the signal and to better recognize the large-scale patterns. This is easily done using the function distlag:

d <- distlag(dist=larix2300$coord, dmin=0, distance.lag=1)
d
##  [1]  0.5  1.5  2.5  3.5  4.5  5.5  6.5  7.5  8.5  9.5 10.5 11.5 12.5 13.5
## [15] 14.5 15.5 16.5 17.5 18.5 19.5 20.5 21.5 22.5 23.5 24.5 25.5 26.5 27.5
## [29] 28.5 29.5 30.5 31.5 32.5 33.5 34.5 35.5 36.5 37.5 38.5 39.5 40.5 41.5
## [43] 42.5 43.5 44.5 45.5 46.5 47.5 48.5 49.5 50.5 51.5 52.5 53.5 54.5 55.5
## [57] 56.5 57.5 58.5 59.5 60.5 61.5 62.5 63.5 64.5 65.5 66.5 67.5 68.5 69.5
## [71] 70.5 71.5 72.5 73.5 74.5 75.5 76.5 77.5 78.5 79.5 80.5 81.5 82.5 83.5
## [85] 84.5 85.5 86.5 87.5 88.5 89.5 90.5 91.5 92.5 93.5 94.5 95.5 96.5 97.5
## [99] 98.5

The variogram is computed with uvec=d:

va <- svariog(X=larix2300, uvec=d, plot=TRUE)

plot of chunk unnamed-chunk-13

The resulting variogram reveals a very clear spatial genetic structure mostly occuring at distances < ca. 18 m.

 

References:

Nardin, M., Guerin, V., Musch, B., Rousselle, Y., Sanchez, L., Rossi, J.-P., Gerber, S., Paques, L., Rozenberg, P. 2015. Genetic differentiation of European larch along an altitudinal gradient in the French Alps. Annals of Forest Science 72, 517-527.

Journel, A.G., and C.J. Huijbregts. 1978. Mining Geostatistics. Academic Press.

Wagner, H.H., R. Holderegger, S. Werth, F. Gugerli, S.E. Hoebee, and C. Scheidegger. 2005. Variogram Analysis of the Spatial Genetic Structure of Continuous Populations Using Multilocus Microsatellite Data. Genetics 169: 1739–52.