Primary Biodiversity Data

Observations of the occurrence of a species are a fundamental unit of biodiversity data. In this unit we will explore where to look for open-access occurrence data, how to access those sources from R, and tools for visualizing point distributions of species.

Library ‘spocc’

spocc is a great tool from the rOpenSci consortium (a group of developers building R capacity for open science).

Package details on GitHub

Tutorial here

We should all have spocc installed, but if not, try:

install.packages('spocc')

With spocc installed we can try a simple query of the GBIF database that we have seen briefly before.

library(spocc)
## Warning in fun(libname, pkgname): rgeos: versions of GEOS runtime 3.7.1-CAPI-1.11.1
## and GEOS at installation 3.7.0-CAPI-1.11.0differ
spdist <- occ(query='Crotalus horridus', from='gbif')
## Registered S3 method overwritten by 'crul':
##   method                 from
##   as.character.form_file httr

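The data are returned as an “S3 class” object, which here is a nested list with a class label. Somewhere in there is a tidyverse tibble, which behaves much like a data frame but prints differently and is stricter about subsetting.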

print(spdist) ## Not obvious what or where the data are
View(spdist)

Maybe it’s still not obvious how we get in. To view an element of the returned object we use the “$” operator and refer to each element by name. In general it’s easier to convert the records to a regular R data frame, since not everything we want to do with these data is compatible with the tidyverse/spocc formatting.

df = as.data.frame(occ2df(spdist$gbif))

#Also try:
#head(df)
#colnames(df) #!! That's a lot of columns!!
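
If the nesting is still hard to picture, a toy stand-in may help. The sketch below builds a small list with the same general shape as the object returned by occ(): one element per data source, each holding metadata and a list of record tables named by query. The object toy and all of its values are invented for illustration; they are not real GBIF records.

```r
# Toy stand-in for the nested list that occ() returns.
# All values are made up for illustration; they are not GBIF records.
toy <- list(
  gbif = list(
    meta = list(source = "gbif", returned = 3),
    data = list(
      Crotalus_horridus = data.frame(
        name      = rep("Crotalus horridus", 3),
        longitude = c(-73.9, -80.1, NA),
        latitude  = c(41.2, 37.5, NA)
      )
    )
  )
)

names(toy)        # top level: one element per data source
names(toy$gbif)   # each source holds metadata and data

# Drill down with `$`, one name at a time, to reach the record table.
recs <- toy$gbif$data$Crotalus_horridus
class(recs)
nrow(recs)
```

On the real object, str(spdist, max.level = 2) gives a similar overview without printing every record.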

mapr: Exploratory interactive mapping of species distribution data.

To create interactive graphics showing species occurrence locations and some metadata we can use ‘mapr’. This library uses a JavaScript library known as Leaflet and OpenStreetMap services (and others!) to create interactive maps that you can navigate, clicking on points to pop up metadata about each occurrence.

If not already done:

install.packages('mapr')

Then call map_leaflet() either on the spocc object:

library(mapr)
map_leaflet(spdist)

OR with the data.frame:

map_leaflet(df)

‘mapr’ shows the data for the first few columns in each pop-up tab. We can control what is shown there by only passing some columns to map_leaflet().

map_leaflet(df[,c('name', 'longitude', 'latitude', 'stateProvince', 'country', 'year', 'occurrenceID')])

Specifying columns makes it much easier to sift through large amounts of data to check sources and look for patterns of bias.

NOTE: mapr only works with data formatted by spocc and related libraries.

More with spocc queries.

Do you notice something odd when you run:

nrow(df)
## [1] 500

Check how many records are returned for the same search on the GBIF website.

Our query returned only the first 500 records because 500 is the default value of the limit argument to the occ() function.

We can fix that:

spdist2 <- occ(query='Crotalus horridus', from='gbif', limit=2500)
map_leaflet(spdist2)
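
To see whether a query was truncated without visiting the website, one option is to compare how many records the source found with how many it returned. The sketch below assumes that the metadata for each source carries found and returned entries (check names(spdist$gbif$meta) on your own result to confirm). The helper record_counts() and the toy object are hypothetical and use made-up numbers.

```r
# Hypothetical helper: report records found vs. returned for one source.
# Assumes the source's metadata list has `found` and `returned` entries.
record_counts <- function(x, source = "gbif") {
  meta <- x[[source]]$meta
  c(found = meta$found, returned = meta$returned)
}

# Toy object with invented counts, shaped like the result of occ().
toy_result <- list(gbif = list(meta = list(found = 12000, returned = 500)))

counts <- record_counts(toy_result)
counts

# TRUE means the limit cut the query short and should be raised.
counts[["found"]] > counts[["returned"]]
```

With a real result this would be called as record_counts(spdist).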

Meeting up with WorldClim

R gives us the tools to plot spatial point data like our occurrence records over a geospatial raster object like the WorldClim data. R also gives us tools to access the raster data for our occurrence points.

Load WorldClim again. (If you are working in the same folder, the data will not need to be downloaded again, so this should be quick.)

library(raster)
## Loading required package: sp
wc = getData('worldclim', var='bio', res = 5)

Plot the raster with the occurrence points on top:

library(ggplot2)
ext = extent(-125, -55, 20, 60)
wc = crop(wc, ext)

wc_df = as.data.frame(wc, xy=TRUE)
sp_df = occ2df(spdist2)
ggplot() +
  geom_raster(data = wc_df, aes(x = x, y = y, fill = bio1/10)) +
  geom_point(data=sp_df, aes(x=longitude, y=latitude), col='green') +
  coord_quickmap() +
  theme_bw() +
  scale_fill_gradientn(colours = c('navy', 'white', 'darkred'),
                       na.value = "black")
## Warning: Removed 1085 rows containing missing values (geom_point).
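
The warning most likely comes from records that have no coordinates, which geom_point() cannot draw. A minimal sketch of counting and dropping such records before plotting is shown here on a toy table; occ_toy and its values are invented for illustration, and only the longitude and latitude column names match the real data.

```r
# Toy occurrence table; values are invented for illustration.
occ_toy <- data.frame(
  name      = c("a", "b", "c", "d"),
  longitude = c(-73.9, NA, -80.1, -85.0),
  latitude  = c(41.2, 40.0, NA, 35.3)
)

# TRUE for rows where both coordinates are present.
has_xy <- complete.cases(occ_toy[, c("longitude", "latitude")])

# Keep only the rows that can be placed on a map.
occ_xy <- occ_toy[has_xy, ]

sum(!has_xy)   # number of records that would be dropped
nrow(occ_xy)   # number of records that can be plotted
```

Applying the same filter to sp_df before calling ggplot() should make the dropped records explicit rather than leaving them to a warning.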

Extracting climate data

The ‘raster’ library has a function, extract(), that looks up the climate values stored in our raster at every occurrence point. These values are the primary input to our species distribution models.

extr = extract(wc, sp_df[,c('longitude', 'latitude')])
summary(extr)
##       bio1           bio2            bio3            bio4      
##  Min.   : 56    Min.   : 93.0   Min.   :26.00   Min.   : 4885  
##  1st Qu.:118    1st Qu.:122.0   1st Qu.:34.00   1st Qu.: 6850  
##  Median :153    Median :127.0   Median :38.00   Median : 7612  
##  Mean   :149    Mean   :126.1   Mean   :37.11   Mean   : 7631  
##  3rd Qu.:181    3rd Qu.:132.0   3rd Qu.:40.00   3rd Qu.: 8374  
##  Max.   :213    Max.   :149.0   Max.   :48.00   Max.   :11244  
##  NA's   :1088   NA's   :1088    NA's   :1088    NA's   :1088   
##       bio5            bio6              bio7            bio8      
##  Min.   :234.0   Min.   :-169.00   Min.   :253.0   Min.   : 16.0  
##  1st Qu.:294.0   1st Qu.: -58.25   1st Qu.:317.0   1st Qu.:122.0  
##  Median :320.0   Median : -22.00   Median :337.0   Median :184.0  
##  Mean   :314.8   Mean   : -23.46   Mean   :338.3   Mean   :171.9  
##  3rd Qu.:334.0   3rd Qu.:  17.00   3rd Qu.:361.0   3rd Qu.:224.0  
##  Max.   :363.0   Max.   :  75.00   Max.   :453.0   Max.   :274.0  
##  NA's   :1088    NA's   :1088      NA's   :1088    NA's   :1088   
##       bio9            bio10           bio11            bio12     
##  Min.   :-91.00   Min.   :168.0   Min.   :-91.00   Min.   : 804  
##  1st Qu.: 28.75   1st Qu.:222.0   1st Qu.: 10.75   1st Qu.:1086  
##  Median :138.50   Median :249.0   Median : 50.00   Median :1209  
##  Mean   :122.33   Mean   :243.8   Mean   : 46.77   Mean   :1227  
##  3rd Qu.:199.00   3rd Qu.:267.0   3rd Qu.: 91.00   3rd Qu.:1358  
##  Max.   :286.00   Max.   :286.0   Max.   :146.00   Max.   :2001  
##  NA's   :1088     NA's   :1088    NA's   :1088     NA's   :1088  
##      bio13           bio14            bio15          bio16      
##  Min.   : 94.0   Min.   : 20.00   Min.   : 7.0   Min.   :274.0  
##  1st Qu.:120.0   1st Qu.: 58.00   1st Qu.:14.0   1st Qu.:323.0  
##  Median :134.0   Median : 71.00   Median :18.0   Median :359.0  
##  Mean   :139.5   Mean   : 70.03   Mean   :19.9   Mean   :374.4  
##  3rd Qu.:160.2   3rd Qu.: 79.00   3rd Qu.:23.0   3rd Qu.:422.0  
##  Max.   :202.0   Max.   :150.00   Max.   :48.0   Max.   :547.0  
##  NA's   :1088    NA's   :1088     NA's   :1088   NA's   :1088   
##      bio17           bio18           bio19      
##  Min.   : 72.0   Min.   :181.0   Min.   : 72.0  
##  1st Qu.:206.0   1st Qu.:283.0   1st Qu.:228.0  
##  Median :243.0   Median :315.0   Median :269.0  
##  Mean   :239.8   Mean   :323.2   Mean   :286.7  
##  3rd Qu.:269.0   3rd Qu.:346.0   3rd Qu.:367.0  
##  Max.   :468.0   Max.   :547.0   Max.   :512.0  
##  NA's   :1088    NA's   :1088    NA's   :1088
extr=as.data.frame(na.omit(extr))

ggplot(data=extr) +
  geom_histogram(aes(x=bio12), bins=30) + 
  theme_bw()
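
The temperature values in the summary look large because the WorldClim version 1.4 layers downloaded by getData() store temperature as degrees Celsius multiplied by 10, which is why the map divided bio1 by 10. A small sketch of the conversion follows, using a toy table with the same column names; the rows are illustrative values picked from the range of the summary, not real extracted records.

```r
# Toy rows with the same column names as the extracted WorldClim table.
# Values are illustrative, chosen from the range shown in the summary.
clim <- data.frame(
  bio1  = c(56, 153, 213),     # annual mean temperature, degrees C x 10
  bio12 = c(804, 1209, 2001)   # annual precipitation, mm
)

# Convert bio1 to degrees C by dividing by 10.
clim$bio1_degC <- clim$bio1 / 10

# bio12 is already in mm, so it needs no rescaling.
clim
```

The same division applies to the other temperature layers stored this way before their values are reported or interpreted.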

Challenge/Homework:

Choose a species observed in the Stonehill Biodiversity iNaturalist Project. Query GBIF using ‘spocc’ for occurrence records. Create a ggplot-based map using bio1 as a background to show the distribution of this species.

Post the script and your map to your GitHub repo (you may want to create a repository for this course on your account). When that is up, post a link to your repository on Slack using the biodiversity channel.

Reading:

By Monday, read the ENMeval paper. To go with this, write a reflection of about one page on the paper and submit it by email before class.
