Many computational jobs can be split into small, repeatable chunks. Modern computing systems give us access to plenty of processing power, so it is often a good approach to split a job into its individual pieces and run them at the same time.
Think about the ENMevaluate() function we have been using. It gives us the option to run the code in parallel, and we have already seen how that speeds up overall execution time for ENMeval. If we look at the source code (https://github.com/bobmuscarella/ENMeval/blob/master/R/ENMevaluate.R), we can see that it sends each model to a different process and runs as many at a time as we allow with numCores.
#Example:
thinres = ENMevaluate(
  occ = newthin[, 2:3],
  env = Env,
  method = 'block',
  parallel = T,   #This tells the function to turn on the parallel code option
  numCores = 4,   #And how many parallel processes to run
  bg.coords = bg2,
  fc = c("L", "LQ", "H"),
  RMvalues = seq(1, 4, 1),
  rasterPreds = F
)
R has a number of useful libraries for writing parallel code.
Logical flow of parallel computing (sketched in code below):

1. Open a CPU cluster (size n)
2. Split the job into discrete ‘queues’
3. Run the job queues across each process (1:n)
4. Collect the resulting data objects
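A minimal skeleton of that flow, using the parallel package (the object names here are just for illustration; the timed examples below fill these steps in with real work):

library(parallel)
n = detectCores() - 1              #1. open a CPU cluster, leaving one core for the system
cl = makeCluster(n)
work = seq(1, 1000)
queues = clusterSplit(cl, work)    #2. split the job into discrete 'queues', one per process
res = parLapply(cl, queues, sqrt)  #3. run the queues across each process (1:n)
out = unlist(res)                  #4. collect the results back into a single data object
stopCluster(cl)                    #   always shut the cluster down when finished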
We can build some example code with parallel processing using “apply” type functions:
#A simple per-element function we could hand to an apply call (the timing examples below use sqrt instead)
factorial = function(x){ prod(seq_len(x)) }
require(parallel)
## Loading required package: parallel
do = seq(1, 10000000)          #ten million numbers to work on
p = proc.time()
l_works = sapply(do, sqrt)     #serial baseline: square root of each element
proc.time() - p
## user system elapsed
## 17.074 0.344 17.532
nclus = 4
cl = makeCluster(nclus, type ='FORK'); #"Forking" is the most basic type of parallelization, using the system "fork" call (not available on Windows)
p = proc.time()
splits = clusterSplit(cl, do)           #break the job into one chunk per worker
p_works2 = parSapply(cl, splits, sqrt)  #each worker takes the square root of its whole chunk
proc.time() - p
## user system elapsed
## 2.953 0.732 3.760
stopCluster(cl)
You do want to make sure that you are sending a job big enough to justify the parallel overhead; it is very easy to make things worse. Below we hand parSapply() the full ten-million-element vector instead of the pre-split chunks, so the function is applied element by element and the communication overhead eats most of the gain.
nclus = 4
cl = makeCluster(nclus, type ='FORK'); #"Forking" is the most basic type of parallelization, using the system "fork" call
p = proc.time()
p_works2 = parSapply(cl, do, sqrt)
proc.time() - p; #Hardly any faster than the non-parallel version
## user system elapsed
## 11.320 0.927 14.271
stopCluster(cl)
Socket (SOCK) clusters can be a little more efficient here because each worker is a fresh R session that receives only the data it is explicitly sent; FORK workers inherit the entire R environment of the parent session.
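For example (a sketch with made-up object names), anything a socket worker needs beyond the arguments it is handed has to be exported or loaded explicitly:

nclus = 4
cl = makeCluster(nclus, type = 'PSOCK')   #'PSOCK' is the socket type built into the parallel package
big_vec = rnorm(1000)                     #an object that exists only in the master session
clusterExport(cl, "big_vec")              #copy just this object to every worker
clusterEvalQ(cl, library(raster))         #load any packages the workers will need
res = parSapply(cl, 1:nclus, function(i) mean(big_vec))  #the workers can now see big_vec
stopCluster(cl)

The timing example below does not need any exports because the split data are passed straight in as the function argument; the Maxent example at the end of the section uses clusterExport() in exactly this way to send the Env raster stack to the workers.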
nclus = 4
cl = makeCluster(nclus, type ='SOCK'); #'SOCK' is the snow socket type; 'PSOCK' is the equivalent built into the parallel package
p = proc.time()
splits = clusterSplit(cl, do)
p_works2 = parSapply(cl, splits, sqrt)
proc.time() - p; #About 25% quicker?
## user system elapsed
## 2.910 0.165 3.233
stopCluster(cl)
NOTE: because we handed parSapply() the pre-split list, it returns a list of vectors, one per chunk. The list therefore has nclus elements and needs to be unlisted to recover the full result:
length(p_works2)
## [1] 4
length(unlist(p_works2))
## [1] 10000000
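As a quick sanity check (a sketch using the objects created above), unlisting the parallel result should reproduce the serial one, because clusterSplit() keeps the elements in their original order:

all.equal(unlist(p_works2), l_works)   #should return TRUE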
#To use the par*apply functions on our Maxent workflow we need a wrapper function that takes a single argument (the species name)
wrap_max = function(x){
  require(spocc)
  require(raster)
  occ = data.frame(occ2df(occ(x, limit=3000, from='gbif')))  #download up to 3000 GBIF records for species x
  max = dismo::maxent(Env, occ[, 2:3])                       #fit a Maxent model using the lon/lat columns
  return(max)
}
library(raster)
## Loading required package: sp
Env = raster::stack('~/bio.gri')       #the bioclim predictor layers
ext = extent(c(-100, -40, 25, 45))
Env = crop(Env, ext)                   #crop to eastern North America
splist = c("Quercus virginiana", "Quercus geminata")
p=proc.time()
max_lin = lapply(splist, wrap_max)
## Loading required package: spocc
## Warning in C_valid_tz(tzone): System timezone name is unknown. Please set
## environment variable TZ.
## Warning in .local(x, p, ...): 7 (2.22%) of the presence points have NA
## predictor values
## Loading required namespace: rJava
## Warning in .local(x, p, ...): 3 (2.91%) of the presence points have NA
## predictor values
proc.time() - p
## user system elapsed
## 51.545 0.726 56.975
library(parallel)
nclus = 2
cl = makeCluster(nclus, type ='SOCK');
clusterExport(cl, list("Env"))   #send the cropped raster stack to each worker; wrap_max() uses it
p = proc.time()
max_par = parLapply(cl, splist, wrap_max); #We want parLapply this time because the fitted MaxEnt objects don't simplify to a vector or matrix
proc.time() - p; #the master process does almost no work itself, so only the elapsed time is meaningful here
## user system elapsed
## 0.035 0.013 36.828
stopCluster(cl)
print(max_par)
## [[1]]
## class : MaxEnt
## variables: bio1 bio2 bio3 bio4 bio5 bio6 bio7 bio8 bio9 bio10 bio11 bio12 bio13 bio14 bio15 bio16 bio17 bio18 bio19
##
## [[2]]
## class : MaxEnt
## variables: bio1 bio2 bio3 bio4 bio5 bio6 bio7 bio8 bio9 bio10 bio11 bio12 bio13 bio14 bio15 bio16 bio17 bio18 bio19
## output html file no longer exists
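As a brief follow-up sketch (assuming the same dismo/rJava setup used above), parLapply() returns its results in the same order as the input, so the fitted models can be named by species and used like any other model list:

library(dismo)
names(max_par) = splist                                #results come back in the order of splist
pred = predict(max_par[["Quercus virginiana"]], Env)   #suitability surface from the fitted Maxent model
plot(pred)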