Species Distribution Modeling

A key method in ecology, evolutionary biology, conservation, biogeography, and climate change biology (among other fields!) is the Species Distribution Model (SDM). SDMs are a tool for understanding where a species can occur based on presence-only correlative modeling of where a species is known to occur. These models can be used to infer niche overlap between species, assess conservation risk, identify past ranges, or study invasive potential.

Today we will be using the common “Maxent” method to estimate the current distribution of your species of interest.

Focus activity

Bring up data for your species of interest in R, filter it to our the study area using dplyr, thin the data using RSpatial, and plot those points over a WorldClim climate layer using ggplot.

library(raster)
library(ggplot2)
library(rasterExtras)
library(RSpatial)
library(spocc)
library(dplyr)

taxon = 'Vaccinium angustifolium'

wc = getData('worldclim', var='bio', res = 5)
ext = extent(-125, -55, 20, 60)
wc = crop(wc, ext)
wc_df = as.data.frame(wc, xy=TRUE) # for plotting

#downloading
spdist <- occ(query=taxon, limit=6500) # check limit for your species
sp_df = occ2df(spdist)

#filtering
sp_df = sp_df %>% filter(longitude>=ext[1], longitude<=ext[2], latitude>=ext[3], latitude <=ext[4]) #dplyr filter points to study area

#thinning
occ2thin = poThin(
  df = sp_df[c('longitude', 'latitude')],
  spacing = 25, #minimum distance between points in thinned data
  dimension = nrow(sp_df),
  lon = 'longitude',
  lat = 'latitude'
)

sp_df = sp_df[-occ2thin,] #thin using index returned from occ2thin

#plot
ggplot() +
  geom_raster(data = wc_df, aes(x = x, y = y, fill = bio12)) +
  geom_point(data=sp_df, aes(x=longitude, y=latitude), col='green', cex=0.1) +
  coord_quickmap() +
  theme_bw() + 
  scale_fill_gradientn(colours=c('darkred', 'grey', 'navy', 'green'),
  na.value = "black")

ENMeval

Load ENMeval

library(ENMeval)
## Loading required package: dismo

If ENMeval is not installed already, run:

install.packages('ENMeval')

Picking predictive variables

Selecting environmental parameters that actually impact the distribution of a species is a significant challenge to SDM methods. Here we will use bio1, bio2, bio3, bio12, and bio16. These are not necessarily the best combination of cliamte parameters for your species and may require some tweaking.

predvars = c(1,2,3,12,16)
preds = wc[[predvars]]

Model Testing

ENMeval runs the Maxent SDM software with many potential model settings, data partitioning, and model assessments. We will work through just one possible way to use these tools here. Depending on your use case more model testing or different model evaluation methods may be useful.

We will be testing two Maxent feature classes “Linear” (L), “Quadratic” (Q), and the hybrid (LQ) model. Higher degree model features can lead to overfitting, but ENMeval allows any combination of features from the set: “Linear” (L), “Quadratic” (Q), “Hinge” (H), “Product” (P), and “Threshold” (T). (e.g., “LQHPT” is a valid feature class). Maxent tests these model features across each set of data to build composite models with high fit to the data. The simplest model is the “Linear” feature, these are very similar mathematically to Generalized Linear Models (GLM).

ENMeval will also test a range of smoothing parameters in the Regularization Multiplier (RM) argument (values 0.5, 1, 1.5, 2). The higher the RMvalue the smoother the model prediction. Lower RMvalues are prone to overfitting.

Setting up ENMeval works as follows:

eval = ENMevaluate(occ=sp_df[,c('longitude', 'latitude')], env = preds, method='randomkfold', kfolds=10, parallel=TRUE, numCores = 12, fc=c("L", "Q", "LQ"), RMvalues=seq(0.5, 2, 0.5), rasterPreds=T)
## *** Running ENMevaluate using maxnet v.0.1.2 ***
## Doing random k-fold evaluation groups...
## Of 64 total cores using 12
## Running in parallel...
## ENMeval completed in 0 minutes 33.5 seconds.

There’s a lot of moving parts in ENMevaluate. Here are a few of the highlights explained:

  • occ .-> occurrence points as a 2 column data frame containing longitude then latitude
  • env .-> environmental predictor variables (raster object). Using WorldClim bioclim here.
  • fc .-> Maxent feature classes to use. These are model types that Maxent has access to in the model fitting stage. Options are any combination of ‘L’ for linear, ‘Q’ for quadratic, ‘H’ for hinge, ‘P’ for power, and ‘T’ for threshold but common and recommended settings here are c(“L”, “LQ”, “LQH”). For more explanation see the paper “A statistical explanation of MaxEnt for ecologists“.
  • RMvalues .-> A smoothing parameter. The higher the number the smoother your model. Low values can lead to overfitting and low transferability to other times and spaces.

Picking a “good” model

ENMeval has now built and tested several model combinations in Maxent. But which is the ‘best’?

Look at the eval@results object for more. Pay close attention to the AICc column and the avg.test.AUC columns. Can you find which model (1st column) or models minimizes AICc and maximizes AUC? Is it the same model settings for both?

eval@results

Which model features/RMvalues made for your best model?

bestmod = which(eval@results$AICc==min(eval@results$AICc))
eval@results[bestmod,]

Plotting heatmap model output

ENMeval has already calculated a predicted distribution map for each model setting. We can plot the best model(s) with:

Best model AICc: #2

# make prediction

pr = predict(preds, eval@models[[bestmod]], type = 'cloglog')
pr_df = as.data.frame(pr, xy=T)

#heatmap
ggplot() +
  geom_raster(data = pr_df, aes(x = x, y = y, fill = layer)) +
   geom_point(data=sp_df, aes(x=longitude, y=latitude), col='red', cex=0.05) +
  coord_quickmap() +
  theme_bw() + 
  scale_fill_gradientn(colours=viridis::viridis(99),
  na.value = "black")

Predicting Species’ Range

One way we can use this output is to create a potential range for your species. We can do this by evaluating model thresholds for presence/absence designation.

To do this we need access to the background points and the Maxent model objects. Both of which we get from ENMeval.

#extract model estimated suitability for occurrence localities
est.loc = extract(pr,  eval@occ.pts)
#extract model estimated suitability for background
est.bg = extract(pr, eval@bg.pts)
#evaluate predictive ability of model
ev = evaluate(est.loc, est.bg)
#detect possible thresholds 
thr = threshold(ev)
#plot using "equal sensitivity and specificity" criteria
pr_thr = pr>thr$sensitivity
pr_thr_df = as.data.frame(pr_thr, xy=TRUE)

ggplot() +
  geom_raster(data = pr_thr_df, aes(x = x, y = y, fill = layer)) +
   geom_point(data=sp_df, aes(x=longitude, y=latitude), col='red', cex=0.2) +
  scale_fill_manual(values = c('black', 'blue')) +
  coord_quickmap() +
  theme_bw() 
## Warning: Removed 151391 rows containing missing values (geom_raster).

There are many possible ways to establish these kinds of maps. Here we work through one common approach but it will not work in every case.

Homework

Create a potential distribution for your species. Record all commands in an R script and save plots to your Git repository. Post links to #biodiversity

THEN, download future climate data for one modeling scenario (see ?getData).

e.g., :

future = getData('CMIP5', var='bio', res=5, rcp = 85, model='AC', year=70, path='tmp/')

Crop to your region of interest and use the predict() (see above) function to predict your species’ future distribution.

Post questions, issues, and future projected maps (both range and heatmap) with a short description of what you found to the #biodiversity.

Home