Archive

Archive for the ‘R’ Category

Geostatistics with genetic data : the R package ggene (part 3: anisotropy)

This third post about geostatistical analysis of genetic data deals with anisotropy. Anisotropy is the property of being directionally dependent. It occurs if the spatial pattern differs, when measured along different axes, either in extent or intensity (Goovaerts, 1997).

We illustrate here how the package ggene can be used to compute directional variograms and variogram maps.

 

Directional variograms

svariog computes omnidirectional variograms by default. The arguments direction, tolerance and unit.angle define the main direction, the tolerance and the unit of measure for the angle.

We use the simulated dataset aniso to illustrate this feature :

data(aniso)
va <- svariog(X=aniso, plot=TRUE)

plot of chunk unnamed-chunk-2

The omni-directional variogram reveals a clear spatial genetic structure. We then consider 4 directions : 0, 45, 90 and 135 degrees and use a tolerance of 22.5° in each case.

Does the spatial genetic structure differ according to direction?

# compute variograms in different directions
d0_225 <- svariog(X=aniso,direction=0, tolerance=22.5, unit.angle="degrees")
d45_225 <- svariog(X=aniso,direction=45, tolerance=22.5, unit.angle="degrees")
d90_225 <- svariog(X=aniso,direction=90, tolerance=22.5, unit.angle="degrees")
d135_225 <- svariog(X=aniso,direction=135, tolerance=22.5, unit.angle="degrees")

We then plot the resulting variograms:

plot(va$svario$u, va$svario$v, type="b", ylim=range(c(va$svario$v, d0_225$svario$v, d45_225$svario$v, d90_225$svario$v, d135_225$svario$v)) , xlab="distance", ylab="semivariance")

points(d0_225$svario$u, d0_225$svario$v, type="b", lty=2)
points(d45_225$svario$u, d45_225$svario$v, type="b", col="red", lty=2)
points(d90_225$svario$u, d90_225$svario$v, type="b", col="blue", lty=2)
points(d135_225$svario$u, d135_225$svario$v, type="b", col="green", lty=2)

legend("topleft", legend=c("omnidirectional", expression(0 * degree), 
 expression(45 * degree), expression(90 * degree), 
 expression(135 * degree)), lty=c(1,2,2,2,2,2), 
 col=c("black","black","red","blue","green"), bty="n")

plot of chunk unnamed-chunk-4

The variogram computed in the direction 45° differs from the omni-directional variogram as well as from the other directional variograms.

 

Variogram maps

Directional variograms are useful to describe the spatial variation in a given direction but variogram maps are more efficient to search for anisotropy when no a priori knowledge of the direction itself is available (Isaaks & Srivastava, 1989; Goovaerts, 1997).

In ggene variogram maps are computed using the function svarmap.

data(aniso)
map <- svarmap(X=aniso,cutoff=20, width=1)
plot(map)

plot of chunk unnamed-chunk-5

There is a clear directionality in the semivariance values that are lower in the direction 45°.

References

Goovaerts, P., 1997. Geostatistics for Natural Resources Evaluation. Oxford University Press

Isaaks, E.H., Srivastava, R.M., 1989. Applied geostatistics. Oxford University Press.

Managing missing values in ggene

Missing values is a big problem because they are so common in microsatellite datasets that removing them sometimes amounts to loose all the information (individuals or alleles).

ggene cannot handle NAs and thus users must remove individuals/alleles with missing data or convert them to something… else!

One common strategy is to replace NAs by the mean value of the corresponding column or by 0 (see the package adegenet). I won’t discuss these options and simply provide a simple solution to users who cannot remove NAs in their data and have to replace them.

We define below two functions NAtoMean and NAto0 which replace NA by the mean of the column (allele) or by 0, respectively.

NAtoMean <- function(x){
    x[is.na(x)] <- mean(x, na.rm = TRUE)
    return(x)}

NAto0 <- function(x){ 
    x[is.na(x)] <- 0
    return(x)}

Now we create a missing data in the $tab of a ggene object:

data(crypho)
crypho$tab[1,1]
## [1] 0
crypho$tab[1,1] <- NA
crypho$tab[1,1]
## [1] NA

If we try to compute the variogram with this altered dataset we get an error message:

var <- svariog(crypho, plot=TRUE)
## Error in FUN(X[[i]], ...): NA/NaN/Inf dans un appel à une fonction externe (argument 4)

Replacing NAs by the mean of the column in the data table is achieved using apply:

crypho$tab <- apply(X=crypho$tab, MARGIN=2, FUN=NAtoMean)
crypho$tab[1,1]
## [1] 0.2872727

The svariog can now compute the variogram:

var <- svariog(crypho, plot=TRUE)

plot of chunk unnamed-chunk-6

NAtoMean and NAto0 can easily be customised.

Geostatistics with genetic data : the R package ggene (part 2: variogram envelop)

In this second item of a set of posts introducing the package ggene, we address how departure form spatial randomness can be examined.

 

Variogram envelop

The function randsvariog allows to compute the variogram envelope by means of randomizations. Each randomization corresponds to a random reallocation of genotypes among individuals followed by the recomputation of the variogram. Readers are referred to Legendre and Legendre (1998) for details about randomization tests in the context of spatial analyses.

randsvariog randomizes the genetic data associated to each individual but genomes are not changed, only their spatial distribution are permuted between individuals, hence the sampling design is kept unchanged.

 

First example

We illustrate this feature using the dataset larix1350 where a set of 189 trees were genotyped (Nardin et al. 2015). Can we detect a spatial structure within this dataset?

First we load the dataset and have a look at the sampling design.

data(larix1300)
# check sample spatial features
plot(larix2300$coord[,1],larix2300$coord[,2], asp=1, xlab="x", ylab="y")

plot of chunk unnamed-chunk-2

Then we compute the variogram and plot it. The variogram shows very little structuration, if any.

# compute variogram
va <- svariog(X=larix1350, uvec=distlag(dist=larix1350$coord, dmin=0, distance.lag=3), plot=TRUE)

plot of chunk unnamed-chunk-3

We now compute the statistical envelop using randsvariog with a limited number of randomizations in this simple example (a large number is otherwise necessary).

# compute statistical envelope
env <- randsvariog(var=va, X=larix1350, nsim=30, bounds=c(0.025, 0.975), save.sim=FALSE)
## ..............................
## done
# plot results
plot(env$svario$u, env$svario$v, ylim=range(env$env), xlab="distance (m)",
     ylab="semivariance")
points(env$svario$u, env$env[,1], type="l")
points(env$svario$u, env$env[,2], type="l")

plot of chunk unnamed-chunk-4

The statistical envelope indicates that most of the semivariance estimates lie between the 0.025 and 0.975 quantiles showing the absence of spatial genetic structure.

 

Second example

In this second example we perform similar analyses using the dataset larix2300:

data(larix2300)

# compute variogram
va <- svariog(X=larix2300, uvec=distlag(dist=larix2300$coord, dmin=0,
  distance.lag=3), plot=FALSE)

# compute statistical envelope
env <- randsvariog(var=va, X=larix2300, nsim=30, bounds=c(0.025, 0.975),
  save.sim=FALSE)
## ..............................
## done
# plot results
plot(env$svario$u, env$svario$v, ylim=range(env$env), xlab="distance (m)",
     ylab="semivariance")
points(env$svario$u, env$env[,1], type="l")
points(env$svario$u, env$env[,2], type="l")

plot of chunk unnamed-chunk-5

In this case, the variogram indicates a clear departure from spatial randomness.

 

References

Legendre, P., Legendre, L., 2012. Numerical Ecology. Elsevier.

Nardin, M., Musch, B., Rousselle, Y., Guérin, V., Sanchez, L., Rossi, J.-P., Gerber, S., Marin, S., Pâques, L.E., Rozenberg, P., 2015. Genetic differentiation of European larch along an altitudinal gradient in the French Alps. Annals of Forest Science 72, 517-527.

Geostatistics with genetic data : the R package ggene (part 1)

Geostatistics is a branch of statistics focusing on spatial and spatiotemporal analyses. It was originally developed in mining geology and slowly percolated in numerous fields of life sciences such as soil science, soil ecology, oceanography and many more.

In 2005, Wagner et al. introduced the use of variogram (an important structure function in geostatistics) with microsatellite datasets but to our knowledge neither software nor R package were available to compute variograms or perform any of the main operations of geostatitics such as model fitting or anisotropy analysis.

We therefore developed the package ggene. It was released in early June 2016 through R-Forge and is currently available from https://r-forge.r-project.org/R/?group_id=2143

ggene allows the computation variogram of gene diversity and model fitting. In addition it offers two techniques to search for anisotropies in spatial genetic structure namely directional variograms and variogram maps. Both haploid and diploid datasets can be analysed and the package provides some functions allowing to weighting for repeated genotypes.

The purpose of the present post is to give an overview of the package and show how to install ggene. The package comes with 2 vignettes and several datasets that make learning easier if not attractive…

 

Downloading and installation

At the present time, ggene is not available from R CRAN and the current version can be installed from the R-Forge repository.

Mac OS users:

install.packages("ggene", repos="http://R-Forge.R-project.org", type="source")

Windows or Linux users:

install.packages("ggene", repos="http://R-Forge.R-project.org")

Once ggene is installed you can visualize the thematic documentation by typing in the R console:

library(ggene)
vignette("ggene_introduction")
vignette("ggene_datasets")

 

Reading haploid dataset

The chunk of code below shows how to read both coordinates and microsatellite data for an haploid dataset.

library(ggene)
# read genetic data
sim <- read.csv(system.file("extdata/sim_01.csv", package="ggene"), header=FALSE)
# read spatial coordinates
xy.sim <- read.csv(system.file("extdata/xysim_01.csv", package="ggene"), header=FALSE)
# create a ggene object
dat.sim <- tab2geo(X=sim, coord=xy.sim)
## The number of individuals is  625 
## The number of locus is  20

 

Reading diploid dataset

The function gene2geo operates the same way but relies on a genind object which are created with the package adegenet from various data input formats such as genepop or genetix (see the documentation of the package for details). Note that ggene has no function allowing to read data directly from files.

Here is an example where we firstly read the genetic data from a file in genepop format:

library(adegenet)
dat <- read.genepop(system.file("extdata/sim_03.gen", package="ggene"), ncode = 3)
## 
##  Converting data from a Genepop .gen file to a genind object... 
## 
## 
## File description:  Simulated data 
## 
## ...done.

The resulting object has various characteristics:

dat
## /// GENIND OBJECT /////////
## 
##  // 625 individuals; 20 loci; 502 alleles; size: 1.3 Mb
## 
##  // Basic content
##    @tab:  625 x 502 matrix of allele counts
##    @loc.n.all: number of alleles per locus (range: 15-36)
##    @loc.fac: locus factor for the 502 columns of @tab
##    @all.names: list of allele names for each locus
##    @ploidy: ploidy of each individual  (range: 2-2)
##    @type:  codom
##    @call: read.genepop(file = system.file("extdata/sim_03.gen", package = "ggene"), 
##     ncode = 3)
## 
##  // Optional content
##    @pop: population of each individual (group size range: 625-625)

There are 625 individuals.

Let’s now read the geographic coordinates:

xy <- read.csv(system.file("extdata/xysim_01.csv", package="ggene"), header=FALSE)
dim(xy)
## [1] 625   2

The package ggene requires objects of class ggene that are created using the function gene2geo:

data <- gene2geo(X=dat, coord=xy)
## The number of individuals is  625 
## The number of locus is  20
class(data)
## [1] "list"  "ggene"

 

Your first variogram!

svariog computes the variogram for the gene diversity and also returns the locus-by-locus semivariance and the conventional estimator of the gene diversity (Wagner et al. 2005).

The chunks of code below shows how to compute the variogram for the larix dataset (Nardin et al. 2015) using default options. It is good practice to check the sampling scheme before any analysis:

data(larix2300)
# check sample spatial features
plot(larix2300$coord[,1],larix2300$coord[,2], asp=1, xlab="x", ylab="y")

plot of chunk unnamed-chunk-10

If plot is set to TRUE svariog produces a plot of the variogram. The dashed line shows the conventional estimator of the gene diversity.

# compute variogram
va <- svariog(X=larix2300, plot=TRUE)

plot of chunk unnamed-chunk-11

The distance interval can easily be changed to explore how the variogram is affected. Increasing to distance interval tends to smooth out the signal and to better recognize the large-scale patterns. This is easily done using the function distlag:

d <- distlag(dist=larix2300$coord, dmin=0, distance.lag=1)
d
##  [1]  0.5  1.5  2.5  3.5  4.5  5.5  6.5  7.5  8.5  9.5 10.5 11.5 12.5 13.5
## [15] 14.5 15.5 16.5 17.5 18.5 19.5 20.5 21.5 22.5 23.5 24.5 25.5 26.5 27.5
## [29] 28.5 29.5 30.5 31.5 32.5 33.5 34.5 35.5 36.5 37.5 38.5 39.5 40.5 41.5
## [43] 42.5 43.5 44.5 45.5 46.5 47.5 48.5 49.5 50.5 51.5 52.5 53.5 54.5 55.5
## [57] 56.5 57.5 58.5 59.5 60.5 61.5 62.5 63.5 64.5 65.5 66.5 67.5 68.5 69.5
## [71] 70.5 71.5 72.5 73.5 74.5 75.5 76.5 77.5 78.5 79.5 80.5 81.5 82.5 83.5
## [85] 84.5 85.5 86.5 87.5 88.5 89.5 90.5 91.5 92.5 93.5 94.5 95.5 96.5 97.5
## [99] 98.5

The variogram is computed with uvec=d:

va <- svariog(X=larix2300, uvec=d, plot=TRUE)

plot of chunk unnamed-chunk-13

The resulting variogram reveals a very clear spatial genetic structure mostly occuring at distances < ca. 18 m.

 

References:

Nardin, M., Guerin, V., Musch, B., Rousselle, Y., Sanchez, L., Rossi, J.-P., Gerber, S., Paques, L., Rozenberg, P. 2015. Genetic differentiation of European larch along an altitudinal gradient in the French Alps. Annals of Forest Science 72, 517-527.

Journel, A.G., and C.J. Huijbregts. 1978. Mining Geostatistics. Academic Press.

Wagner, H.H., R. Holderegger, S. Werth, F. Gugerli, S.E. Hoebee, and C. Scheidegger. 2005. Variogram Analysis of the Spatial Genetic Structure of Continuous Populations Using Multilocus Microsatellite Data. Genetics 169: 1739–52.

Extracting circular buffers from a raster in R

31 January 2014 1 comment

Cropping circular buffers centered on a point is pretty easy in R, at least when you’re dealing with rasters. Two packages allow to do this task, if not more. Let’s see how we can do with the package raster.

Note that in the following example, we assume that the package adehabitat is properly installed.


library(raster)
f<-system.file("ascfiles/elevation.asc", package="adehabitat")
raster<-rasterbis<-raster(f)# read example file from adehabitat folder
plot(raster)# have a look
rasterbis[]<-NA # empty one of the rasters
points<-data.frame(704000, 3162000);names(points)<-c("x","y")# create coordinates of a sample point
cell<-cellFromXY(object=raster,xy=points[1,])# get the cell number corresponding to that sample point
radius<-1000 # set the buffer radius
rasterbis[cell]<-1 # set the value of the corresponding cell to 1 (the others remaining NAs)
plot(raster)
points(points$x,points$y,pch=3)# one more look...
buf <- buffer(rasterbis, width=radius) # buffer extraction per se
plot(buf)
p<-which(buf[]==1) # identify which cells belong to the buffer
buf[p]<-raster[p]# pick up original values
# Now we crop the buffer :
xmin<-points$x[1] - 2*radius
xmax<-points$x[1] + 2*radius
ymin<-points$y[1] - 2*radius
ymax<-points$y[1] + 2*radius
ev<-c(xmin, xmax, ymin, ymax)
e<-extent(ev)
buffer<-crop(buf,e) # that's it :
plot(buffer)
points(points$x,points$y,pch=3)

I hope this might be helpful !
Cheers.
Next post : same thing with the package adehabitat

————————————–
EDIT: Robert Hijmans send me two sets of R commands which are more compact and very efficient ! Thank you Robert ! I add this code below so interested readers can make use of it. Thanks also to Curlew for the contribution (see comments below).

————————————–

library(raster)
library(rgeos)
points <- SpatialPoints(cbind(704000, 3162000))
pbuf <- gBuffer(points, widt=1000)
buf <- mask(raster, pbuf)
buffer <- trim(buf, pad=2)

library(raster)
raster <- raster(system.file(“ascfiles/elevation.asc”, package=”adehabitat”))
r1 <- raster(f)
points <- cbind(704000, 3162000)
r2 <- rasterize(points, r1, field=1)
pbuf <- buffer(r2, width=1000)
buf <- mask(raster, pbuf)
buffer <- trim(buf, pad=2)

 

Categories: Non classé, R

Getting and managing occurrence data from gbif.org

The R package dismo by Hijman et al. offers a very powerful function named gbif that allows to get occurence data from the gbif dabase.

For example if we are to map the geographical range of the Mediterranean fruit fly, Ceratitis capitata we simply do:

library(dismo)
distr <- gbif(genus="Ceratitis", species="capitata*", geo=TRUE,removeZeros = TRUE)

Note that “capitata” and “capitata*” won’t lead to similar results because the species name is often associated to author’s name in the database i.e. Ceratitis capitata or Ceratitis capitata (Wiedemann, 1824).
The results can be plotted:

plot(distr$lon,distr$lat, pch=3, cex=1, col="red", asp=1, xlab="longitude", ylab="latitude")
map("world", resolution = 0.5, add=T)

gbif occurrences for Ceratitis capitata

gbif occurrences for Ceratitis capitata

Alas! many occurrences correspond to the same geographical location i.e. there are many duplicated points which might be a source of trouble in subsequent data analyses and distribution modeling. This problem can be easily solved using the R package spatstat by Baddeley and Turner (2005) and its function duplicated.ppp.

We first have to create a spatial point pattern object from the occurrences points using the function ppp and a window created with the ripras function. Then duplicated.ppp will identify duplicated points and we will be able to remove them. The code is:

library(spatstat)
x<-distr$lon ; y<-distr$lat
w <- ripras(x,y)
wp<-ppp(x,y,window=w)
dupv<-duplicated.ppp(wp)
x2<-x[which(dupv==FALSE)] ; y2<-y[which(dupv==FALSE)] # coordinates of points with no duplicates

In the case of Ceratitis capitata gbif returned a total of 1279 occurrences that included not less that 982 duplicated points.


References
A. Baddeley and R. Turner (2005). Spatstat: an R package for analyzing spatial point patterns. Journal of Statistical Software 12 (6), 1-42. ISSN: 1548-7660. URL: http://www.jstatsoft.org

Robert J. Hijmans, Steven Phillips, John Leathwick and Jane Elith (2012). dismo: Species distribution modeling. R package version 0.7-23. http://CRAN.R-project.org/package=dismo



Categories: R Tags: ,

Retrieve spatial coordinates from location’s name

It is a common task to retrieve the spatial coordinates (longitude and latitude) from the sole name of a place. This occurs for example when one wants to map spatial occurrences of a species that are reported in the literature in the form of a list of places.

This work is strongly facilitated by the R package titled gooJSON developped by Christopher Steven Marcum.

Let’s say that we want to retrieve the coordinates of a place reported as Fort Myers in Florida where J. C. Denmark reported the presence of Homalodisca vitripennis Germar, 1821 in 1957.


library(gooJSON)
a<-gooadd(address = list("Fort Myers","Florida"))
b<-goomap(a)
b$Placemark[[1]]$Point$coordinates

[1] -81.87231 26.64063 0.00000

Here we are, Google API tells us (through R) that Fort Myers in Florida is located at -81.87231 longitude and 26.64063 latitude.

Now imagine that we want to repeat that operation 500 times because we have a bunch of places where our study species has been recorded. The goomap function can be put into a loop and the Google API questioned 500 times, but here comes the twist: Google will not answer 500 times in a row and will send no data for some records. This problem can simply be solved by slowing your loop so that Google API won’t receive your calling to quickly and will answer to all queries. This is done by adding a call to the function Sys.sleep in the loop to suspend execution of R expressions for a given number of seconds. I used Sys.sleep(0.2) and it worked perfectly!


Reference
Christopher Steven Marcum (2012). gooJSON: Google JSON Data Interpreter for R. R package version 1.0.01.http://CRAN.R-project.org/package=gooJSON



Categories: R Tags: , ,