We will be working on a teaching cluster maintained by the SICG. You will get printed instructions on how to log in. We have pre-installed most of the R libraries we will need for this course, but where possible I will include installation details for libraries we are using here.
NOTE: Usernames have been defined as your first initial and last name. e.g., ‘rharbert’
ssh youruser@lyell.amnh.org
This will prompt you to set up DUO authentication access.
Looking around Lyell:
pwd
ls
cd /data/spbio
ls #Should find ebird, and climgrids folders
There are fundamental command line tools native to Unix that make working with large text files common to bioinformatics relatively easy. There are other options, like relational databases, that efficiently handle large data sets, but these are no substitue for unix command line tools for exploratory analyses. We will work through some of this in the next section.
The best tool for moving large data files arount is rsync.
##From your laptop, access Lyell to grab ebd_trim.csv
# Make sure you know where you are:
pwd
#rsync syntax:
# a - archive (get everything within directories)
# v - verbose (print messages)
# z - compress data before transfer
# ./ is the path to your current directory.
rsync -avz youruser@lyell.amnh.org:/data/spbio/ebd_trim.csv ./
R is very handy for it’s interactive command line interface. To get started type: ‘R’ at your command line.
What version of R do you have?
Or
version
## _
## platform x86_64-apple-darwin14.5.0
## arch x86_64
## os darwin14.5.0
## system x86_64, darwin14.5.0
## status
## major 3
## minor 4.1
## year 2017
## month 06
## day 30
## svn rev 72865
## language R
## version.string R version 3.4.1 (2017-06-30)
## nickname Single Candle
We can easily get started with the R command promp open.
x=2
print(x) ##Print method
## [1] 2
class(x)
## [1] "numeric"
x=seq(1:10) # Create a vector
class(x)
## [1] "integer"
print(x)
## [1] 1 2 3 4 5 6 7 8 9 10
print(x[1]) # First index of vector
## [1] 1
print(x[1:5])
## [1] 1 2 3 4 5
y = matrix(nrow=5, ncol=5) # create a 5x5 matrix
print(y)
## [,1] [,2] [,3] [,4] [,5]
## [1,] NA NA NA NA NA
## [2,] NA NA NA NA NA
## [3,] NA NA NA NA NA
## [4,] NA NA NA NA NA
## [5,] NA NA NA NA NA
class(y)
## [1] "matrix"
y[1,1] = 5
print(y)
## [,1] [,2] [,3] [,4] [,5]
## [1,] 5 NA NA NA NA
## [2,] NA NA NA NA NA
## [3,] NA NA NA NA NA
## [4,] NA NA NA NA NA
## [5,] NA NA NA NA NA
y[,1] = x[1:5]
print(y)
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1 NA NA NA NA
## [2,] 2 NA NA NA NA
## [3,] 3 NA NA NA NA
## [4,] 4 NA NA NA NA
## [5,] 5 NA NA NA NA
class(y[,1])
## [1] "numeric"
y = cbind(seq(1:5),
seq(1:5),
seq(1:5),
seq(1:5),
seq(1:5))
class(y)
## [1] "matrix"
read.table() read.csv() read.delim()
cars = read.table('../data/mtcars.csv', header=T, sep = ',') # Read a comma separated values file
head(cars)
## model mpg cyl disp hp drat wt qsec vs am gear carb
## 1 Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## 2 Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## 3 Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## 4 Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## 5 Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## 6 Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
cars = read.csv('../data/mtcars.csv')
cars = read.csv2('../data/mtcars.csv') ## Interesting behavior here, will be somewhat faster
cars = read.delim('../data/mtcars.csv', sep=',')
fread - Faster read
library(data.table)
ebd = fread('../data/ebd_trim.csv', sep = '\t')
Repeating tasks using for loops
for(i in 1:10) {
print(i)
}
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
## [1] 6
## [1] 7
## [1] 8
## [1] 9
## [1] 10
Catch loop output
li = vector()
for(i in 1:10){
li[[i]]=log(i)
}
The Apply functions in R provide efficient repetition that usually out-performs for loops.
print(y) #our matrix from earlier
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1 1 1 1 1
## [2,] 2 2 2 2 2
## [3,] 3 3 3 3 3
## [4,] 4 4 4 4 4
## [5,] 5 5 5 5 5
y = as.data.frame(y)
li1 = apply(y, 1, sum) # row-wise
li2 = apply(y, 2, sum) # column-wise
li2 = lapply(y[,1], log) #returns list
li2 = sapply(y[,1], log) #returns vector
#replicate an operation, a wrapper for sapply
rep = replicate(10, log(y[,1]), simplify='array')
The syntax for R functions:
square = function(x){
return(x**2)
}
square(64)
## [1] 4096
Can take as many arguments as you want
sumit = function(x,y){
return(x+y)
}
sumit(3,5)
## [1] 8
And you can give the function default variable values:
logbase=function(x, base=10){
l = log(x, base)
return(l)
}
logbase(2) #log10
## [1] 0.30103
logbase(2, base=exp(1)) #natural log
## [1] 0.6931472
the ggplot library is part of the ‘tidyverse’. We will not use tidy protocols exclusively in this course, but the tools for making high quality figures in ggplot are worth touching on here.
# install.packages("tidyverse") # if not yet installed
library(tidyverse)
## ── Attaching packages ─────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 2.2.1 ✔ purrr 0.2.4
## ✔ tibble 1.4.2 ✔ dplyr 0.7.4
## ✔ tidyr 0.8.0 ✔ stringr 1.3.0
## ✔ readr 1.1.1 ✔ forcats 0.3.0
## ── Conflicts ────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
#import data as before:
## **The following code and data are shamelessly stolen from Chase Nelson's (cnelson@amnh.org) SICG workshop at the AMNH on ggplot from February 2018.**
virus_data <- read.delim("../data/R_workshop_data.txt") # import SNP data
# examine
head(virus_data) # view the first 6 rows
## host host_type position reference allele count coverage gene
## 1 B10 host 903 T C 40 1437 PrM
## 2 B10 host 2170 A G 15 1298 E
## 3 B10 host 4460 T C 16 1240 NS2b
## 4 B10 host 4957 A T 33 1371 NS3
## 5 B10 host 5523 G A 16 1330 NS3
## 6 B10 host 7050 C T 30 1593 NS4b
## coding_change amino_acid_change
## 1 TRUE FALSE
## 2 TRUE TRUE
## 3 TRUE TRUE
## 4 TRUE TRUE
## 5 TRUE FALSE
## 6 TRUE FALSE
head(virus_data, n = 10) # view the first n rows
## host host_type position reference allele count coverage gene
## 1 B10 host 903 T C 40 1437 PrM
## 2 B10 host 2170 A G 15 1298 E
## 3 B10 host 4460 T C 16 1240 NS2b
## 4 B10 host 4957 A T 33 1371 NS3
## 5 B10 host 5523 G A 16 1330 NS3
## 6 B10 host 7050 C T 30 1593 NS4b
## 7 B10 host 8263 T C 168 1610 NS5
## 8 B10 host 8382 A G 252 1747 NS5
## 9 B10 host 8433 A C 52 1499 NS5
## 10 B10 host 10984 C T 9 362 3UTR
## coding_change amino_acid_change
## 1 TRUE FALSE
## 2 TRUE TRUE
## 3 TRUE TRUE
## 4 TRUE TRUE
## 5 TRUE FALSE
## 6 TRUE FALSE
## 7 TRUE TRUE
## 8 TRUE FALSE
## 9 TRUE FALSE
## 10 FALSE FALSE
str(virus_data) # describe the STRucture of the data.frame
## 'data.frame': 516 obs. of 10 variables:
## $ host : Factor w/ 20 levels "B10","B2","B3",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ host_type : Factor w/ 2 levels "host","vector": 1 1 1 1 1 1 1 1 1 1 ...
## $ position : int 903 2170 4460 4957 5523 7050 8263 8382 8433 10984 ...
## $ reference : Factor w/ 4 levels "A","C","G","T": 4 1 4 1 3 2 4 1 1 2 ...
## $ allele : Factor w/ 5 levels "-","A","C","G",..: 3 4 3 5 2 5 3 4 3 5 ...
## $ count : int 40 15 16 33 16 30 168 252 52 9 ...
## $ coverage : int 1437 1298 1240 1371 1330 1593 1610 1747 1499 362 ...
## $ gene : Factor w/ 13 levels "2K","3UTR","5UTR",..: 13 5 8 9 9 11 12 12 12 2 ...
## $ coding_change : logi TRUE TRUE TRUE TRUE TRUE TRUE ...
## $ amino_acid_change: logi FALSE TRUE TRUE TRUE FALSE FALSE ...
# add a new column
virus_data$frequency <- virus_data$count / virus_data$coverage # add col with SNP frequencies
virus_data$frequency <- with(virus_data, count / coverage) # same thing a different way
head(virus_data) # view again to check that it worked
## host host_type position reference allele count coverage gene
## 1 B10 host 903 T C 40 1437 PrM
## 2 B10 host 2170 A G 15 1298 E
## 3 B10 host 4460 T C 16 1240 NS2b
## 4 B10 host 4957 A T 33 1371 NS3
## 5 B10 host 5523 G A 16 1330 NS3
## 6 B10 host 7050 C T 30 1593 NS4b
## coding_change amino_acid_change frequency
## 1 TRUE FALSE 0.02783577
## 2 TRUE TRUE 0.01155624
## 3 TRUE TRUE 0.01290323
## 4 TRUE TRUE 0.02407002
## 5 TRUE FALSE 0.01203008
## 6 TRUE FALSE 0.01883239
# generate summary statistics
summary(virus_data$coverage)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 281 1319 1444 1368 1514 2031
# SCATTERPLOTS
# view SNP frequencies
ggplot(data = virus_data) +
geom_point(mapping = aes(x = position, y = frequency))
# change x and y axis titles
ggplot(data = virus_data) +
geom_point(mapping = aes(x = position, y = frequency)) +
xlab("Genome Position") + ylab("SNP Frequency")
# use transparency to better visualize clusters of points
ggplot(data = virus_data) +
geom_point(mapping = aes(x = position, y = frequency), alpha = 0.5) +
xlab("Genome Position") + ylab("SNP Frequency")
# Facetting -- multiple plots
ggplot(data = filter(virus_data, host_type == 'host')) +
geom_point(mapping = aes(x = position, y = frequency)) +
xlab("Genome Position") + ylab("SNP Frequency") +
facet_wrap(~ host)
Also in the tidyverse is a great packages for easy generation of maps.
library(tidyverse)
library(mapdata)
## Loading required package: maps
##
## Attaching package: 'maps'
## The following object is masked from 'package:purrr':
##
## map
library(maps)
library(ggmap)
library(magrittr)
##
## Attaching package: 'magrittr'
## The following object is masked from 'package:ggmap':
##
## inset
## The following object is masked from 'package:purrr':
##
## set_names
## The following object is masked from 'package:tidyr':
##
## extract
If any of these fail try and install packages.
One of the best parts of these tools is the built in access to Google maps aerial imagery.
loc = cbind(-73.973917, 40.781799)
loc = as.data.frame(loc)
colnames(loc) = c('lon', 'lat')
bkmap <- get_map(location = loc, maptype = "satellite", source = "google", zoom =14)
## Map from URL : http://maps.googleapis.com/maps/api/staticmap?center=40.781799,-73.973917&zoom=14&size=640x640&scale=2&maptype=satellite&language=en-EN&sensor=false
ggmap(bkmap) +
geom_point(data = loc,
color = "red",
size =4)
bkmap3 <- get_map(location = loc , maptype = "terrain", source = "google", zoom = 12)
## Map from URL : http://maps.googleapis.com/maps/api/staticmap?center=40.781799,-73.973917&zoom=12&size=640x640&scale=2&maptype=terrain&language=en-EN&sensor=false
ggmap(bkmap3) +
geom_point(data = loc,
color = "red",
size =4)
bkmap4 <- get_map(location = loc , maptype = "toner-lite", source = "google", zoom = 10)
## maptype = "toner-lite" is only available with source = "stamen".
## resetting to source = "stamen"...
## Map from URL : http://maps.googleapis.com/maps/api/staticmap?center=40.781799,-73.973917&zoom=10&size=640x640&scale=2&maptype=terrain&sensor=false
## Map from URL : http://tile.stamen.com/toner-lite/10/300/383.png
## Map from URL : http://tile.stamen.com/toner-lite/10/301/383.png
## Map from URL : http://tile.stamen.com/toner-lite/10/302/383.png
## Map from URL : http://tile.stamen.com/toner-lite/10/300/384.png
## Map from URL : http://tile.stamen.com/toner-lite/10/301/384.png
## Map from URL : http://tile.stamen.com/toner-lite/10/302/384.png
## Map from URL : http://tile.stamen.com/toner-lite/10/300/385.png
## Map from URL : http://tile.stamen.com/toner-lite/10/301/385.png
## Map from URL : http://tile.stamen.com/toner-lite/10/302/385.png
ggmap(bkmap4) +
geom_point(data = loc,
color = "red",
size =4)