home

Advanced R

Repeating actions with loops

Many of the functions we have seen so far will automatically apply the calculation across a vector of data (vector, column of matrix or data frame). But, if you need to do some kind of action across these objects that does not work that way you will need to understand loops.

Loops are programming structures that control the execution of some code in a cycle for a set number of rounds (called iterations).

Example:

for(i in 1:10){
  print(i)
} #Will print the value of i ten times and iterate i from 1 to 10 with each round
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
## [1] 6
## [1] 7
## [1] 8
## [1] 9
## [1] 10

Instead of doing this:

#Generate 5 random samples of numbers between 1 and 1000
sample1=sample(1:1000,10, replace=F)
sample2=sample(1:1000,10, replace=F)
sample3=sample(1:1000,10, replace=F)
sample4=sample(1:1000,10, replace=F)
sample5=sample(1:1000,10, replace=F)
sam.mat = cbind(sample1, sample2, sample3, sample4, sample5)

Use a loop:

sam.mat = matrix(ncol=5, nrow=10)
for (i in 1:5){
  sam.mat[,i] = sample(1:1000, 10, replace=F);##<- This code is only written ONCE in a loop. Less chance for error.
}

Challenge: Loops

Write a loop that counts even numbers backwards from 1000.

Write a loop that generates 100 random samples of integers and then calculates and stores the median value.

The ‘apply’ functions

Loops but better*.

*Better == more efficient in most cases

See: apply, lapply, tapply, sapply for various data types.

sapply for simple operations on vectors

s=c(1:10000)
sq=sapply(s, sqrt)

apply for tables

m = matrix(nrow=length(s), ncol=5)
for(i in 1:ncol(m)){
  m[,i]=s
}
head(m)
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1    1    1    1    1
## [2,]    2    2    2    2    2
## [3,]    3    3    3    3    3
## [4,]    4    4    4    4    4
## [5,]    5    5    5    5    5
## [6,]    6    6    6    6    6
sum.row = apply(m, 1, sum)
sum.col = apply(m, 2, sum)

tapply - for ‘ragged’ arrays –> arrays with grouping factors

census = read.csv('/usr/share/data/kaggle/census.csv')
head(census)
state.mean.poverty = tapply(census$Poverty, census$State,mean)

Parallel Processing

R has a number of useful libraries for writing parallel code. At the end of the day these are just fancy loops

Logical flow of parallel computing:

Open a CPU cluster (R daughter processes pool of size n) Split job into discrete ‘queues’ Run job queues across each process (1:n) Collect results data objects

We can build some example code with parallel processing using “apply” type functions:

require(parallel)
## Loading required package: parallel
do = seq(1, 10000000)
p = proc.time()
l_works = sapply(do, sqrt)
proc.time() - p
##    user  system elapsed 
##   10.44    0.28   10.75
nclus = 4
cl = makeCluster(nclus, type ='SOCK'); 
  p = proc.time()
  splits = clusterSplit(cl, do)
  p_works2 = parSapply(cl, splits, sqrt)
  proc.time() - p
##    user  system elapsed 
##    1.33    0.36    1.83
stopCluster(cl)

You do want to make sure that you are sending a job big enough to justify the parallel overhead. It is very easy to make things worse.

nclus = 4
cl = makeCluster(nclus, type ='SOCK'); 
  p = proc.time()
  p_works2 = parSapply(cl, do, sqrt)
  proc.time() - p; #No faster than non-parallel
##    user  system elapsed 
##    9.28    5.44   20.03
stopCluster(cl)

NOTE: parSapply returns a list of vectors. The list has nclus elements:

length(p_works2)
## [1] 10000000
length(unlist(p_works2))
## [1] 10000000

If/else: Conditional statements

Often we want to tell R to execute some command on our data if a certain criteria is met. (i.e., IF we have data for state X we want to calculate some summary value)

To do this we use logical operators within ‘if’ statements. Often within loops.

citizenRatio = vector()
for(i in 1:nrow(census)){
  if(census[i,'State']=='Arizona'){
    citizenRatio[i]=census[i,'Citizen']/census[i,'TotalPop']
  }
}
summary(citizenRatio)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  0.1992  0.6194  0.7060  0.6918  0.7768  1.0000    1354

Maybe we also want to do something in the cases where that condition is NOT met. For that we use ‘else’

AZcitizenRatio = vector()
OtherRatio=vector()
for(i in 1:nrow(census)){
  if(census[i,'State']=='Arizona'){
    AZcitizenRatio[i]=census[i,'Citizen']/census[i,'TotalPop']
  } else {
    OtherRatio[i]=census[i,'Citizen']/census[i,'TotalPop']
  }
}
summary(AZcitizenRatio)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  0.1992  0.6194  0.7060  0.6918  0.7768  1.0000    1354
summary(OtherRatio)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  0.0000  0.6698  0.7372  0.7147  0.7813  1.0000    2210

And still other times we may want to test MULTIPLE conditions and apply different actions to each:

AZcitizenRatio = vector()
CAcitizenRatio = vector()
OtherRatio=vector()
for(i in 1:nrow(census)){
  if(census[i,'State']=='Arizona'){
    AZcitizenRatio[i]=census[i,'Citizen']/census[i,'TotalPop']
  } else if(census[i,"State"]=='California') {
    CAcitizenRatio[i]=census[i,'Citizen']/census[i,'TotalPop']
  } else {
    OtherRatio[i]=census[i,'Citizen']/census[i,'TotalPop']
  }
}
summary(AZcitizenRatio)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  0.1992  0.6194  0.7060  0.6918  0.7768  1.0000    1354
summary(CAcitizenRatio)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.140   0.558   0.665   0.642   0.742   1.000    3605
summary(OtherRatio)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   0.683   0.742   0.724   0.784   1.000   10222

Warning:

Multiple layers of if/else statements can get confusing and difficult to understand. If you end up having an error inside this code it can be a nightmare!

Functions

Any time you write a bit of code that you find yourself repeating more than a couple of times you should consider writing that as a function instead. The more complicated the code that’s repeating is, the more likely you will benefit from a function.

To declare a function:

citratio=function(x){
  #calculate the ratio of citizens to the total pop and return the median
  #pass census to this function 'citratio(census)'
  #then census is refered to within as 'x'
  
  ratio = x$Citizen/x$TotalPop
  med=median(ratio, na.rm=T) #Ignore missing values
  return(med)
}

citratio(census)
## [1] 0.7366694
citratio(census[census$State=='Virginia',])
## [1] 0.7452957

Then you can use this function anywhere after it’s declaration:

AZcitizenRatio = vector()
CAcitizenRatio = vector()
OtherRatio=vector()
for(i in 1:nrow(census)){
  if(census[i,'State']=='Arizona'){
    AZcitizenRatio[i]=citratio(census[i,])
  } else if(census[i,"State"]=='California') {
    CAcitizenRatio[i]=census[i,'Citizen']/census[i,'TotalPop']
  } else {
    OtherRatio[i]=census[i,'Citizen']/census[i,'TotalPop']
  }
}

Did that work as expected?

Simulations:

For your blog this week develop a simulation using R to model one of the following:

  • Population size over several years/generations (Model births/deaths/migrants per year), consider reproductive success (# offspring per mating pair), resources, predation, etc.
  • Simulate a card game. Draw samples from a ‘deck’ of 52 cards and use conditional statements to determine the next step or outcome.
  • Simulate DNA sequencing reads from a template sequence. That is, start with a template of 10,000 bases and sample 50bp reads for a predetermined number of reads.
  • Simulate or model some other phenomenon of your choosing AFTER getting it approved by your instructor.

Prof. Harbert will help you get set up for one of these that you choose

home