Bias in Occurrence Data

Most modeling assumes random samples of data. In Biodiversity informatics that is rarely the case. Humans tend to go where it is easy for us to get and/or where we are going to go anyways. For that reason spatial occurrence data tends to be clustered towards the more commonly visited places.

Today we will explore visualization tools for examining patterns in occurrence data samples.

rasterExtras & RSpatial

rasterExtras GitHub rasterExtras Tutorials

RSpatial GitHub

Install the R package rasterExtras using

devtools::install_github('rsh249/rasterExtras')
devtools::install_github('oshea-patrick/RSpatial')

Sampling Density

Start with the species that you downloaded data on for homework and load the WorldClim data again.

library(raster)
## Loading required package: sp
library(ggplot2)
library(rasterExtras)
## Warning: replacing previous import 'parallel::clusterApplyLB' by
## 'snow::clusterApplyLB' when loading 'rasterExtras'
## Warning: replacing previous import 'parallel::parApply' by 'snow::parApply'
## when loading 'rasterExtras'
## Warning: replacing previous import 'parallel::stopCluster' by
## 'snow::stopCluster' when loading 'rasterExtras'
## Warning: replacing previous import 'parallel::parCapply' by
## 'snow::parCapply' when loading 'rasterExtras'
## Warning: replacing previous import 'parallel::parRapply' by
## 'snow::parRapply' when loading 'rasterExtras'
## Warning: replacing previous import 'parallel::parSapply' by
## 'snow::parSapply' when loading 'rasterExtras'
## Warning: replacing previous import 'parallel::clusterEvalQ' by
## 'snow::clusterEvalQ' when loading 'rasterExtras'
## Warning: replacing previous import 'parallel::makeCluster' by
## 'snow::makeCluster' when loading 'rasterExtras'
## Warning: replacing previous import 'parallel::clusterSplit' by
## 'snow::clusterSplit' when loading 'rasterExtras'
## Warning: replacing previous import 'parallel::parLapply' by
## 'snow::parLapply' when loading 'rasterExtras'
## Warning: replacing previous import 'parallel::clusterMap' by
## 'snow::clusterMap' when loading 'rasterExtras'
## Warning: replacing previous import 'parallel::clusterExport' by
## 'snow::clusterExport' when loading 'rasterExtras'
## Warning: replacing previous import 'parallel::splitIndices' by
## 'snow::splitIndices' when loading 'rasterExtras'
## Warning: replacing previous import 'parallel::clusterApply' by
## 'snow::clusterApply' when loading 'rasterExtras'
## Warning: replacing previous import 'parallel::clusterCall' by
## 'snow::clusterCall' when loading 'rasterExtras'
## Warning: replacing previous import 'raster::density' by 'stats::density'
## when loading 'rasterExtras'
## Warning: replacing previous import 'raster::weighted.mean' by
## 'stats::weighted.mean' when loading 'rasterExtras'
## Warning: replacing previous import 'raster::predict' by 'stats::predict'
## when loading 'rasterExtras'
## Warning: replacing previous import 'raster::aggregate' by
## 'stats::aggregate' when loading 'rasterExtras'
## Warning: replacing previous import 'raster::quantile' by 'stats::quantile'
## when loading 'rasterExtras'
## Warning: replacing previous import 'raster::update' by 'stats::update' when
## loading 'rasterExtras'
library(RSpatial)
library(spocc)
## Warning in fun(libname, pkgname): rgeos: versions of GEOS runtime 3.7.1-CAPI-1.11.1
## and GEOS at installation 3.7.0-CAPI-1.11.0differ
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:raster':
## 
##     intersect, select, union
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
wc = getData('worldclim', var='bio', res = 5)
ext = extent(-125, -55, 20, 60)
wc = crop(wc, ext)

wc_df = as.data.frame(wc, xy=TRUE)

taxon = 'Vaccinium angustifolium'

spdist <- occ(query=taxon, limit=6500)
## Registered S3 method overwritten by 'crul':
##   method                 from
##   as.character.form_file httr
sp_df = occ2df(spdist)

sp_df = sp_df %>% filter(longitude>=ext[1], longitude<=ext[2], latitude>=ext[3], latitude <=ext[4]) #dplyr filter points to study area

gk = gkde(wc[[1]], 
          sp_df[,c('longitude', 'latitude')], 
          parallel=T, 
          nclus = 12, 
          dist.method='Haversine', 
          maxram=20, 
          bw=50)
## BEGIN PARALLEL COMPUTATION
## Core count:  12 
## Cells/iteration:  5513 of 403200 
## Iterations total:  73.13622 
## Points:  3381 
## Maximum RAM per proc.:  1.666667 
## Distance Method:  Haversine 
## 
## Time elapsed:  33.757
gk_df=as.data.frame(gk, xy=TRUE)
ggplot() +
  geom_raster(data = gk_df, aes(x = x, y = y, fill = layer)) +
  #geom_point(data=sp_df, aes(x=longitude, y=latitude), col='green', cex=0.1) +
  coord_quickmap() +
  theme_bw() + 
  scale_fill_gradientn(colours=c('darkred', 'grey', 'navy'),
  na.value = "black")

ggplot() +
  geom_raster(data = gk_df, aes(x = x, y = y, fill = layer^0.1)) +
  coord_quickmap() +
  theme_bw() + 
  scale_fill_gradientn(colours=c('darkred', 'grey', 'navy'),
  na.value = "black")

SAVE YOUR MAP

I like the last density map (the 10th root one). To save any plot in R:

dplot = ggplot() +
  geom_raster(data = gk_df, aes(x = x, y = y, fill = layer^0.1)) +
  coord_quickmap() +
  theme_bw() + 
  scale_fill_gradientn(colours=c('darkred', 'grey', 'navy'),
  na.value = "black")

ggsave(dplot, filename = paste("figure/", taxon, "_density.png", sep=""))
## Saving 7 x 5 in image
#or
ggsave(dplot, filename = "figure/Vaccinium_density.png")
## Saving 7 x 5 in image
#and you can change the size and resolution (dpi) if you want a higher quality figure.
ggsave(dplot, 
       filename = paste("figure/", taxon, "_density.png", sep=""),
       height=7.25, width = 7.25, units='in',
       dpi = 300)

To save the DATA you generated as a raster file (which could be reused in R later):

writeRaster(gk, filename = paste('data/', taxon, '_density', sep=''), overwrite=TRUE)
## class      : RasterLayer 
## dimensions : 480, 840, 403200  (nrow, ncol, ncell)
## resolution : 0.08333333, 0.08333333  (x, y)
## extent     : -125, -55, 20, 60  (xmin, xmax, ymin, ymax)
## crs        : +proj=longlat +datum=WGS84 +ellps=WGS84 +towgs84=0,0,0 
## source     : /home/rharbert/GitHub/bioinformatics/docs/data/Vaccinium angustifolium_density.grd 
## names      : layer 
## values     : 0, 0.000569195  (min, max)

Minimum Distance

Alternate visualization: The minimum distance to the nearest occurrence point. Identifies undersampled areas by mapping the distance to a point in the dataset.

d2pt = dist2point(wc[[1]], sp_df[,c('longitude', 'latitude')], parallel=TRUE, nclus = 12, dist.method='Haversine', maxram = 20)
## BEGIN PARALLEL COMPUTATION
## Core count:  12 
## Cells/iteration:  5513 of 403200 
## Iterations total:  73.13622 
## Points:  3381 
## Maximum RAM per proc.:  1.666667 
## Distance Method:  Haversine 
## 
## Time elapsed:  19.96
d2_df = as.data.frame(d2pt, xy=T)

ggplot() +
  geom_raster(data = d2_df, aes(x = x, y = y, fill = sqrt(layer))) +
  #geom_point(data=sp_df, aes(x=longitude, y=latitude), col='green', cex=0.2) +
  coord_quickmap() +
  theme_bw() + 
  scale_fill_gradientn(colours=c('darkred', 'grey', 'navy'),
  na.value = "black")

Save it

distplot = ggplot() +
  geom_raster(data = d2_df, aes(x = x, y = y, fill = sqrt(layer))) +
  #geom_point(data=sp_df, aes(x=longitude, y=latitude), col='green', cex=0.2) +
  coord_quickmap() +
  theme_bw() + 
  scale_fill_gradientn(colours=c('darkred', 'grey', 'navy'),
  na.value = "black")

ggsave(distplot, 
       filename = paste("figure/", taxon, "_mindist.png", sep=""),
       height=7.25, width = 7.25, units='in',
       dpi = 300)

writeRaster(gk, filename = paste('data/', taxon, '_mindist', sep=''), overwrite=TRUE )
## class      : RasterLayer 
## dimensions : 480, 840, 403200  (nrow, ncol, ncell)
## resolution : 0.08333333, 0.08333333  (x, y)
## extent     : -125, -55, 20, 60  (xmin, xmax, ymin, ymax)
## crs        : +proj=longlat +datum=WGS84 +ellps=WGS84 +towgs84=0,0,0 
## source     : /home/rharbert/GitHub/bioinformatics/docs/data/Vaccinium angustifolium_mindist.grd 
## names      : layer 
## values     : 0, 0.000569195  (min, max)

Fixing bias: Spatial Thinning

One way to reduce bias in your sample is to thin the data. That is, by removing points from the dataset that fall too close together we can smooth out the sampling.

#from RSpatial
library(RSpatial)
occ2thin = poThin(
  df = sp_df[c('longitude', 'latitude')],
  spacing = 25,
  dimension = nrow(sp_df),
  lon = 'longitude',
  lat = 'latitude'
)

sp_df = sp_df[-occ2thin,]

gk = gkde(wc[[1]], 
          sp_df[,c('longitude', 'latitude')], 
          parallel=T, 
          nclus = 12, 
          dist.method='Haversine', 
          maxram=20, 
          bw=50)
## BEGIN PARALLEL COMPUTATION
## Core count:  12 
## Cells/iteration:  27657 of 403200 
## Iterations total:  14.57859 
## Points:  674 
## Maximum RAM per proc.:  1.666667 
## Distance Method:  Haversine 
## 
## Time elapsed:  26.8
gk_df=as.data.frame(gk, xy=TRUE)
ggplot() +
  geom_raster(data = gk_df, aes(x = x, y = y, fill = layer)) +
  #geom_point(data=sp_df, aes(x=longitude, y=latitude), col='green', cex=0.1) +
  coord_quickmap() +
  theme_bw() + 
  scale_fill_gradientn(colours=c('darkred', 'grey', 'navy'),
  na.value = "black")

Homework

Change the extent of your study area to focus on New England and generate density and distance to point maps for your species.

In your dataset whata areas look undersampled? What patterns in the distribution do not look biological?

HINT:

ext = extent(-74, -69, 40, 45)
#crop

#Or filter points to study area and run gkde() and dist2point() again

home