Many of the functions we have seen so far will automatically apply the calculation across a vector of data (vector, column of matrix or data frame). But, if you need to do some kind of action across these objects that does not work that way you will need to understand loops.
Loops are programming structures that control the execution of some code in a cycle for a set number of rounds (called iterations).
Example:
for(i in 1:10){
print(i)
} #Will print the value of i ten times and iterate i from 1 to 10 with each round
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
## [1] 6
## [1] 7
## [1] 8
## [1] 9
## [1] 10
Instead of doing this:
#Generate 5 random samples of numbers between 1 and 1000
sample1=sample(1:1000,10, replace=F)
sample2=sample(1:1000,10, replace=F)
sample3=sample(1:1000,10, replace=F)
sample4=sample(1:1000,10, replace=F)
sample5=sample(1:1000,10, replace=F)
sam.mat = cbind(sample1, sample2, sample3, sample4, sample5)
Use a loop:
sam.mat = matrix(ncol=5, nrow=10)
for (i in 1:5){
sam.mat[,i] = sample(1:1000, 10, replace=F);##<- This code is only written ONCE in a loop. Less chance for error.
}
Write a loop that counts even numbers backwards from 1000.
Write a loop that generates 100 random samples of integers and then calculates and stores the median value.
Loops but better*.
*Better == more efficient in most cases
See: apply, lapply, tapply, sapply for various data types.
sapply for simple operations on vectors
s=c(1:10000)
sq=sapply(s, sqrt)
apply for tables
m = matrix(nrow=length(s), ncol=5)
for(i in 1:ncol(m)){
m[,i]=s
}
head(m)
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1 1 1 1 1
## [2,] 2 2 2 2 2
## [3,] 3 3 3 3 3
## [4,] 4 4 4 4 4
## [5,] 5 5 5 5 5
## [6,] 6 6 6 6 6
sum.row = apply(m, 1, sum)
sum.col = apply(m, 2, sum)
tapply - for ‘ragged’ arrays –> arrays with grouping factors
census = read.csv('/usr/share/data/kaggle/census.csv')
head(census)
state.mean.poverty = tapply(census$Poverty, census$State,mean)
R has a number of useful libraries for writing parallel code. At the end of the day these are just fancy loops
Logical flow of parallel computing:
Open a CPU cluster (R daughter processes pool of size n) Split job into discrete ‘queues’ Run job queues across each process (1:n) Collect results data objects
We can build some example code with parallel processing using “apply” type functions:
require(parallel)
## Loading required package: parallel
do = seq(1, 10000000)
p = proc.time()
l_works = sapply(do, sqrt)
proc.time() - p
## user system elapsed
## 10.44 0.28 10.75
nclus = 4
cl = makeCluster(nclus, type ='SOCK');
p = proc.time()
splits = clusterSplit(cl, do)
p_works2 = parSapply(cl, splits, sqrt)
proc.time() - p
## user system elapsed
## 1.33 0.36 1.83
stopCluster(cl)
You do want to make sure that you are sending a job big enough to justify the parallel overhead. It is very easy to make things worse.
nclus = 4
cl = makeCluster(nclus, type ='SOCK');
p = proc.time()
p_works2 = parSapply(cl, do, sqrt)
proc.time() - p; #No faster than non-parallel
## user system elapsed
## 9.28 5.44 20.03
stopCluster(cl)
NOTE: parSapply returns a list of vectors. The list has nclus elements:
length(p_works2)
## [1] 10000000
length(unlist(p_works2))
## [1] 10000000
Often we want to tell R to execute some command on our data if a certain criteria is met. (i.e., IF we have data for state X we want to calculate some summary value)
To do this we use logical operators within ‘if’ statements. Often within loops.
citizenRatio = vector()
for(i in 1:nrow(census)){
if(census[i,'State']=='Arizona'){
citizenRatio[i]=census[i,'Citizen']/census[i,'TotalPop']
}
}
summary(citizenRatio)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.1992 0.6194 0.7060 0.6918 0.7768 1.0000 1354
Maybe we also want to do something in the cases where that condition is NOT met. For that we use ‘else’
AZcitizenRatio = vector()
OtherRatio=vector()
for(i in 1:nrow(census)){
if(census[i,'State']=='Arizona'){
AZcitizenRatio[i]=census[i,'Citizen']/census[i,'TotalPop']
} else {
OtherRatio[i]=census[i,'Citizen']/census[i,'TotalPop']
}
}
summary(AZcitizenRatio)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.1992 0.6194 0.7060 0.6918 0.7768 1.0000 1354
summary(OtherRatio)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0000 0.6698 0.7372 0.7147 0.7813 1.0000 2210
And still other times we may want to test MULTIPLE conditions and apply different actions to each:
AZcitizenRatio = vector()
CAcitizenRatio = vector()
OtherRatio=vector()
for(i in 1:nrow(census)){
if(census[i,'State']=='Arizona'){
AZcitizenRatio[i]=census[i,'Citizen']/census[i,'TotalPop']
} else if(census[i,"State"]=='California') {
CAcitizenRatio[i]=census[i,'Citizen']/census[i,'TotalPop']
} else {
OtherRatio[i]=census[i,'Citizen']/census[i,'TotalPop']
}
}
summary(AZcitizenRatio)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.1992 0.6194 0.7060 0.6918 0.7768 1.0000 1354
summary(CAcitizenRatio)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.140 0.558 0.665 0.642 0.742 1.000 3605
summary(OtherRatio)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 0.683 0.742 0.724 0.784 1.000 10222
Multiple layers of if/else statements can get confusing and difficult to understand. If you end up having an error inside this code it can be a nightmare!
Any time you write a bit of code that you find yourself repeating more than a couple of times you should consider writing that as a function instead. The more complicated the code that’s repeating is, the more likely you will benefit from a function.
To declare a function:
citratio=function(x){
#calculate the ratio of citizens to the total pop and return the median
#pass census to this function 'citratio(census)'
#then census is refered to within as 'x'
ratio = x$Citizen/x$TotalPop
med=median(ratio, na.rm=T) #Ignore missing values
return(med)
}
citratio(census)
## [1] 0.7366694
citratio(census[census$State=='Virginia',])
## [1] 0.7452957
Then you can use this function anywhere after it’s declaration:
AZcitizenRatio = vector()
CAcitizenRatio = vector()
OtherRatio=vector()
for(i in 1:nrow(census)){
if(census[i,'State']=='Arizona'){
AZcitizenRatio[i]=citratio(census[i,])
} else if(census[i,"State"]=='California') {
CAcitizenRatio[i]=census[i,'Citizen']/census[i,'TotalPop']
} else {
OtherRatio[i]=census[i,'Citizen']/census[i,'TotalPop']
}
}
Did that work as expected?
For your blog this week develop a simulation using R to model one of the following:
Prof. Harbert will help you get set up for one of these that you choose