Home > ggene, package, R > Geostatistics with genetic data : the R package ggene (part 1)

Geostatistics with genetic data : the R package ggene (part 1)

Geostatistics is a branch of statistics focusing on spatial and spatiotemporal analyses. It was originally developed in mining geology and slowly percolated in numerous fields of life sciences such as soil science, soil ecology, oceanography and many more.

In 2005, Wagner et al. introduced the use of variogram (an important structure function in geostatistics) with microsatellite datasets but to our knowledge neither software nor R package were available to compute variograms or perform any of the main operations of geostatitics such as model fitting or anisotropy analysis.

We therefore developed the package ggene. It was released in early June 2016 through R-Forge and is currently available from https://r-forge.r-project.org/R/?group_id=2143

ggene allows the computation variogram of gene diversity and model fitting. In addition it offers two techniques to search for anisotropies in spatial genetic structure namely directional variograms and variogram maps. Both haploid and diploid datasets can be analysed and the package provides some functions allowing to weighting for repeated genotypes.

The purpose of the present post is to give an overview of the package and show how to install ggene. The package comes with 2 vignettes and several datasets that make learning easier if not attractive…

 

Downloading and installation

At the present time, ggene is not available from R CRAN and the current version can be installed from the R-Forge repository.

Mac OS users:

install.packages("ggene", repos="http://R-Forge.R-project.org", type="source")

Windows or Linux users:

install.packages("ggene", repos="http://R-Forge.R-project.org")

Once ggene is installed you can visualize the thematic documentation by typing in the R console:

library(ggene)
vignette("ggene_introduction")
vignette("ggene_datasets")

 

Reading haploid dataset

The chunk of code below shows how to read both coordinates and microsatellite data for an haploid dataset.

library(ggene)
# read genetic data
sim <- read.csv(system.file("extdata/sim_01.csv", package="ggene"), header=FALSE)
# read spatial coordinates
xy.sim <- read.csv(system.file("extdata/xysim_01.csv", package="ggene"), header=FALSE)
# create a ggene object
dat.sim <- tab2geo(X=sim, coord=xy.sim)
## The number of individuals is  625 
## The number of locus is  20

 

Reading diploid dataset

The function gene2geo operates the same way but relies on a genind object which are created with the package adegenet from various data input formats such as genepop or genetix (see the documentation of the package for details). Note that ggene has no function allowing to read data directly from files.

Here is an example where we firstly read the genetic data from a file in genepop format:

library(adegenet)
dat <- read.genepop(system.file("extdata/sim_03.gen", package="ggene"), ncode = 3)
## 
##  Converting data from a Genepop .gen file to a genind object... 
## 
## 
## File description:  Simulated data 
## 
## ...done.

The resulting object has various characteristics:

dat
## /// GENIND OBJECT /////////
## 
##  // 625 individuals; 20 loci; 502 alleles; size: 1.3 Mb
## 
##  // Basic content
##    @tab:  625 x 502 matrix of allele counts
##    @loc.n.all: number of alleles per locus (range: 15-36)
##    @loc.fac: locus factor for the 502 columns of @tab
##    @all.names: list of allele names for each locus
##    @ploidy: ploidy of each individual  (range: 2-2)
##    @type:  codom
##    @call: read.genepop(file = system.file("extdata/sim_03.gen", package = "ggene"), 
##     ncode = 3)
## 
##  // Optional content
##    @pop: population of each individual (group size range: 625-625)

There are 625 individuals.

Let’s now read the geographic coordinates:

xy <- read.csv(system.file("extdata/xysim_01.csv", package="ggene"), header=FALSE)
dim(xy)
## [1] 625   2

The package ggene requires objects of class ggene that are created using the function gene2geo:

data <- gene2geo(X=dat, coord=xy)
## The number of individuals is  625 
## The number of locus is  20
class(data)
## [1] "list"  "ggene"

 

Your first variogram!

svariog computes the variogram for the gene diversity and also returns the locus-by-locus semivariance and the conventional estimator of the gene diversity (Wagner et al. 2005).

The chunks of code below shows how to compute the variogram for the larix dataset (Nardin et al. 2015) using default options. It is good practice to check the sampling scheme before any analysis:

data(larix2300)
# check sample spatial features
plot(larix2300$coord[,1],larix2300$coord[,2], asp=1, xlab="x", ylab="y")

plot of chunk unnamed-chunk-10

If plot is set to TRUE svariog produces a plot of the variogram. The dashed line shows the conventional estimator of the gene diversity.

# compute variogram
va <- svariog(X=larix2300, plot=TRUE)

plot of chunk unnamed-chunk-11

The distance interval can easily be changed to explore how the variogram is affected. Increasing to distance interval tends to smooth out the signal and to better recognize the large-scale patterns. This is easily done using the function distlag:

d <- distlag(dist=larix2300$coord, dmin=0, distance.lag=1)
d
##  [1]  0.5  1.5  2.5  3.5  4.5  5.5  6.5  7.5  8.5  9.5 10.5 11.5 12.5 13.5
## [15] 14.5 15.5 16.5 17.5 18.5 19.5 20.5 21.5 22.5 23.5 24.5 25.5 26.5 27.5
## [29] 28.5 29.5 30.5 31.5 32.5 33.5 34.5 35.5 36.5 37.5 38.5 39.5 40.5 41.5
## [43] 42.5 43.5 44.5 45.5 46.5 47.5 48.5 49.5 50.5 51.5 52.5 53.5 54.5 55.5
## [57] 56.5 57.5 58.5 59.5 60.5 61.5 62.5 63.5 64.5 65.5 66.5 67.5 68.5 69.5
## [71] 70.5 71.5 72.5 73.5 74.5 75.5 76.5 77.5 78.5 79.5 80.5 81.5 82.5 83.5
## [85] 84.5 85.5 86.5 87.5 88.5 89.5 90.5 91.5 92.5 93.5 94.5 95.5 96.5 97.5
## [99] 98.5

The variogram is computed with uvec=d:

va <- svariog(X=larix2300, uvec=d, plot=TRUE)

plot of chunk unnamed-chunk-13

The resulting variogram reveals a very clear spatial genetic structure mostly occuring at distances < ca. 18 m.

 

References:

Nardin, M., Guerin, V., Musch, B., Rousselle, Y., Sanchez, L., Rossi, J.-P., Gerber, S., Paques, L., Rozenberg, P. 2015. Genetic differentiation of European larch along an altitudinal gradient in the French Alps. Annals of Forest Science 72, 517-527.

Journel, A.G., and C.J. Huijbregts. 1978. Mining Geostatistics. Academic Press.

Wagner, H.H., R. Holderegger, S. Werth, F. Gugerli, S.E. Hoebee, and C. Scheidegger. 2005. Variogram Analysis of the Spatial Genetic Structure of Continuous Populations Using Multilocus Microsatellite Data. Genetics 169: 1739–52.

Advertisements
  1. No comments yet.
  1. No trackbacks yet.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: