rasterExtras: An R library for supporting parallel calculations of high-density geographic data.

Outline: rasterExtras is a set of functions and a development library that is one solution to errors related to large vector allocation when performing raster operations with large datasets. Right now the only two functions that are implemented are gkde(), a 2-dimensional spatial kernel density estimator, and dist2point() a function to calculate the minimum distance of each raster cell to a set of coordinates. These support paralellization and rudimentary RAM management that allows the analysis of very large raster grids and/or matrices containing many coordinates (up to millions of unique coordinates).

RAM issues

The problem with these calculations arises because both require calculating a distance matrix between every cell in the raster and every point in the coordinates matrix. For example, the 2.5 arcminute WorldClim raster has 31,104,000 cells. If you have even a modest matrix of 10,000 coordinates this will generate a distance matrix of 31,104,000 * 10,000 = 3.11*10^11 (300 billion values) and will require vectors that are too long for most modern systems.

require(raster)
require(rasterExtras)
grid = raster::raster(nrows=3600, ncols=8600, xmn=-180, xmx=180, ymn=-60, ymx=80, vals=NULL) #dimensions of the Worldclim 2.5 arcminute grid
grid = raster::setValues(grid,values=(as.vector(seq(1:raster::ncell(grid)))))
points = cbind(
       c(seq(xmin(grid), xmax(grid), length.out=5000),
                seq(xmax(grid), xmin(grid), length.out=5000)),
       c(seq(ymin(grid), ymax(grid), length.out=100),
                seq(ymin(grid), ymax(grid), length.out=100))
                )


latlon=latlonfromcell(extent = as.vector(extent(grid)), cells=as.vector(seq(1:ncell(grid))), nrow=nrow(grid), ncol=ncol(grid)) #get lat and lon from cell position
d = distance(points, latlon)  #calculate the distance matrix manually. Next this will be done inside the main functions

Ok, that didn’t work because the vectors in that distance matrix maxed out. The error I got back was “Error: vector memory exhausted (limit reached?)”. It will be this or the “cannot allocate vector of size X” version.

So what can we do? Parallelize… Or at least chop the job up and use an apply function linearly. Fortunately all that is wrapped into some functions here.

Calculate the minimum distance to points: dist2point

For the sake of my laptop lets work with a smaller example and then show a bigger case cooking-show-style at the end with figures.

#The smaller example
require(raster)
## Loading required package: raster
## Loading required package: sp
require(rasterExtras)
## Loading required package: rasterExtras
## 
## Attaching package: 'rasterExtras'
## The following object is masked from 'package:raster':
## 
##     distance
grid = raster::raster(nrows=360, ncols=860, xmn=-180, xmx=180, ymn=-60, ymx=80, vals=NULL)
grid = raster::setValues(grid,values=(as.vector(seq(1:raster::ncell(grid)))))
points = cbind(
       c(seq(xmin(grid), xmax(grid), length.out=5000),
                seq(xmax(grid), xmin(grid), length.out=5000)),
       c(seq(ymin(grid), ymax(grid), length.out=1000),
                seq(ymin(grid), ymax(grid), length.out=1000))
                )

di = dist2point(grid, points, parallel=TRUE, maxram = 2, nclus = 4, dist.method='Haversine')
## BEGIN PARALLEL COMPUTATION
## Core count:  4 
## Cells/iteration:  6711 of 309600 
## Iterations total:  46.13321 
## Points:  10000 
## Maximum RAM per proc.:  0.5 
## Distance Method:  Haversine 
## 
## Time elapsed:  121.2
plot(di, col = viridis::viridis(9))
points(points, cex=0.01, col='lightgrey') #points are close together, will show up as a faint line

The figure above is a plot of the minimum distance of each position in this raster to a point in the coordinates matrix ‘points’ (the grey line).

Spatial density estimation: gkde

Here is what it looks like if we are calculating a probability surface instead of minimum distances.

den = gkde(grid, points, parallel=TRUE, maxram = 2, nclus = 4, dist.method='Haversine', bw=1000)
## BEGIN PARALLEL COMPUTATION
## Core count:  4 
## Cells/iteration:  6711 of 309600 
## Iterations total:  46.13321 
## Points:  10000 
## Maximum RAM per proc.:  0.5 
## Distance Method:  Haversine 
## 
## Time elapsed:  154.218
plot(den, col = viridis::viridis(9))
points(points, cex=0.01, col='lightgrey') #points are close together, will show up as a faint line

eBird

eBird is a massive database of bird observations run out of the Cornell Lab of Ornithology. I was interested in how these functions operate with millions of observations and a relatively high-resolution raster (2.5 arcminutes).

The eBird database was trimmed to look only at the western US and northern Mexico, but still included ~64 million observations.

These jobs were run on a cluster using 24 cores and allocating 128Gb of RAM. Even so each took several days to run.

Distance to eBird observation:

Distance to eBird observation scaled (0,1):

Density of eBird: *note that the coloring only shows the highest density localities, mostly cities.

Log-likelihood of the eBird sampling probability density: Taking the log of the probability surface allows us to see on an order-of-magnitude scale the pattern of sampling.

Technical notes:

The rasterExtras repository can be found at https://github.com/rsh249/rasterExtras.git and can be downloaded with devtools:

devtools::install_git('https://github.com/rsh249/rasterExtras')

Part of what helps rasterExtras run efficiently is a set of C++ functions that do the heavy lifting with calculating distance matrices. This also limits the type of calculations that can be done to Pythagorean and Haversine distances.

There is much to be done with this code still. RAM usage is still a problem in some cases. I want to support MPI for more cluster compute options.

Summary

rasterExtras is still under development but feel free to make copies for testing on your own data. I have am sure that there are use-cases that I have not accounted for yet and I have not really pushed the limits on dataset size just yet (i.e., Global density estimates for all GBIF records). However, I hope from these examples some will find this code useful.