R Basics

Outline:

  • Setting up on Lyell
  • Interactive R - Data I/O refresher, calling functions, libraries, data visualization.
  • apply functions
  • Defining custom R functions
  • Data visualization with ggplot primer

Lyell

We will be working on a teaching cluster maintained by the SICG. You will get printed instructions on how to log in. We have pre-installed most of the R libraries we will need for this course, but where possible I will include installation details for libraries we are using here.

NOTE: Usernames have been defined as your first initial and last name. e.g., ‘rharbert’

ssh youruser@lyell.amnh.org

This will prompt you to set up DUO authentication access.

Looking around Lyell:

pwd
ls
cd /data/spbio
ls #Should find ebird, and climgrids folders

Unix

There are fundamental command line tools native to Unix that make working with large text files common to bioinformatics relatively easy. There are other options, like relational databases, that efficiently handle large data sets, but these are no substitue for unix command line tools for exploratory analyses. We will work through some of this in the next section.

Shared course data:

Navigate to /data/sp_bio and find the file ebd.csv

Let’s explore this file via some handy Unix command line tools.

cd /data/sp_bio
ls -l
du -h
du -h ebd_relMay-2017.txt
head ebd_relMay-2017.txt
head -50 ebd_relMay-2017.txt #print the 1st 50 lines
wc -l ebd_relMay-2017.txt #get the line count 

grep - Regular expression searching.

grep "Cardinalis" ebd_relMay-2017.txt | wc -l

grep "New York" ebd_relMay-2017.txt | wc -l

# Let's put some of that into a file: NYS birds --

grep "New York" ebd_relMay-2017.txt > ebd_trim.csv

awk is useful for simple scripting and file filtering

head -1 ebd_relMay-2017.txt #column headings

#awk 'conditional { do } ' 
#It can be useful to pair awk with head for exploration

head ebd_trim.csv | awk '{print $1}'
head ebd_trim.csv | awk '{print $2}'

# The default delimiter for awk is any white space (space/tab)
# This can be set explicitly
head ebd_trim.csv | awk -F "\t" '{print $2}'
head ebd_trim.csv | awk -F "\t" '{print $5}' # these should line up with the column header now.

#Column 6 should be the scientific name
# Get a list of unique entries in this field
head ebd_trim.csv | awk -F "\t" '{print $6}' | sort | uniq

rsync – Command line file transfer

The best tool for moving large data files arount is rsync.

##From your laptop, access Lyell to grab ebd_trim.csv 
# Make sure you know where you are:
pwd

#rsync syntax:
# a - archive (get everything within directories)
# v - verbose (print messages)
# z - compress data before transfer
# ./ is the path to your current directory. 
rsync -avz youruser@lyell.amnh.org:/data/spbio/ebd_trim.csv ./

Interactive R

R is very handy for it’s interactive command line interface. To get started type: ‘R’ at your command line.

What version of R do you have?

Or

version
##                _                           
## platform       x86_64-apple-darwin14.5.0   
## arch           x86_64                      
## os             darwin14.5.0                
## system         x86_64, darwin14.5.0        
## status                                     
## major          3                           
## minor          4.1                         
## year           2017                        
## month          06                          
## day            30                          
## svn rev        72865                       
## language       R                           
## version.string R version 3.4.1 (2017-06-30)
## nickname       Single Candle

We can easily get started with the R command promp open.

x=2
print(x) ##Print method
## [1] 2
class(x)
## [1] "numeric"
x=seq(1:10) # Create a vector
class(x)
## [1] "integer"
print(x)
##  [1]  1  2  3  4  5  6  7  8  9 10
print(x[1]) # First index of vector
## [1] 1
print(x[1:5])
## [1] 1 2 3 4 5
y = matrix(nrow=5, ncol=5) # create a 5x5 matrix
print(y)
##      [,1] [,2] [,3] [,4] [,5]
## [1,]   NA   NA   NA   NA   NA
## [2,]   NA   NA   NA   NA   NA
## [3,]   NA   NA   NA   NA   NA
## [4,]   NA   NA   NA   NA   NA
## [5,]   NA   NA   NA   NA   NA
class(y)
## [1] "matrix"
y[1,1] = 5
print(y)
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    5   NA   NA   NA   NA
## [2,]   NA   NA   NA   NA   NA
## [3,]   NA   NA   NA   NA   NA
## [4,]   NA   NA   NA   NA   NA
## [5,]   NA   NA   NA   NA   NA
y[,1]  = x[1:5]
print(y)
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1   NA   NA   NA   NA
## [2,]    2   NA   NA   NA   NA
## [3,]    3   NA   NA   NA   NA
## [4,]    4   NA   NA   NA   NA
## [5,]    5   NA   NA   NA   NA
class(y[,1])
## [1] "numeric"
y = cbind(seq(1:5), 
          seq(1:5),
          seq(1:5),
          seq(1:5),
          seq(1:5))

class(y)
## [1] "matrix"

Data I/O

Read table/tab/csv/txt text files:

read.table() read.csv() read.delim()

cars = read.table('../data/mtcars.csv', header=T, sep = ',') # Read a comma separated values file
head(cars)
##               model  mpg cyl disp  hp drat    wt  qsec vs am gear carb
## 1         Mazda RX4 21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## 2     Mazda RX4 Wag 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## 3        Datsun 710 22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## 4    Hornet 4 Drive 21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## 5 Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## 6           Valiant 18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
cars = read.csv('../data/mtcars.csv')
cars = read.csv2('../data/mtcars.csv') ## Interesting behavior here, will be somewhat faster

cars = read.delim('../data/mtcars.csv', sep=',')

fread - Faster read

library(data.table)
ebd = fread('../data/ebd_trim.csv', sep = '\t')

Loops

Repeating tasks using for loops

for(i in 1:10) {
  print(i)
}
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
## [1] 6
## [1] 7
## [1] 8
## [1] 9
## [1] 10

Catch loop output

li = vector()
for(i in 1:10){
  li[[i]]=log(i)
}

apply family functions

The Apply functions in R provide efficient repetition that usually out-performs for loops.

print(y) #our matrix from earlier
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1    1    1    1    1
## [2,]    2    2    2    2    2
## [3,]    3    3    3    3    3
## [4,]    4    4    4    4    4
## [5,]    5    5    5    5    5
y = as.data.frame(y)
li1 = apply(y, 1, sum) # row-wise
li2 = apply(y, 2, sum) # column-wise

li2 = lapply(y[,1], log) #returns list
li2 = sapply(y[,1], log) #returns vector

#replicate an operation, a wrapper for sapply
rep = replicate(10, log(y[,1]), simplify='array')

Writing Functions

The syntax for R functions:

square = function(x){
  return(x**2)
}
square(64)
## [1] 4096

Can take as many arguments as you want

sumit = function(x,y){
  return(x+y)
}
sumit(3,5)
## [1] 8

And you can give the function default variable values:

logbase=function(x, base=10){
  l = log(x, base)
  return(l)
}

logbase(2) #log10
## [1] 0.30103
logbase(2, base=exp(1)) #natural log
## [1] 0.6931472

ggplot

the ggplot library is part of the ‘tidyverse’. We will not use tidy protocols exclusively in this course, but the tools for making high quality figures in ggplot are worth touching on here.

# install.packages("tidyverse") # if not yet installed

library(tidyverse)
## ── Attaching packages ─────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 2.2.1     ✔ purrr   0.2.4
## ✔ tibble  1.4.2     ✔ dplyr   0.7.4
## ✔ tidyr   0.8.0     ✔ stringr 1.3.0
## ✔ readr   1.1.1     ✔ forcats 0.3.0
## ── Conflicts ────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
#import data as before:
## **The following code and data are shamelessly stolen from Chase Nelson's (cnelson@amnh.org) SICG workshop at the AMNH on ggplot from February 2018.**

virus_data <- read.delim("../data/R_workshop_data.txt") # import SNP data

# examine
head(virus_data) # view the first 6 rows
##   host host_type position reference allele count coverage gene
## 1  B10      host      903         T      C    40     1437  PrM
## 2  B10      host     2170         A      G    15     1298    E
## 3  B10      host     4460         T      C    16     1240 NS2b
## 4  B10      host     4957         A      T    33     1371  NS3
## 5  B10      host     5523         G      A    16     1330  NS3
## 6  B10      host     7050         C      T    30     1593 NS4b
##   coding_change amino_acid_change
## 1          TRUE             FALSE
## 2          TRUE              TRUE
## 3          TRUE              TRUE
## 4          TRUE              TRUE
## 5          TRUE             FALSE
## 6          TRUE             FALSE
head(virus_data, n = 10) # view the first n rows
##    host host_type position reference allele count coverage gene
## 1   B10      host      903         T      C    40     1437  PrM
## 2   B10      host     2170         A      G    15     1298    E
## 3   B10      host     4460         T      C    16     1240 NS2b
## 4   B10      host     4957         A      T    33     1371  NS3
## 5   B10      host     5523         G      A    16     1330  NS3
## 6   B10      host     7050         C      T    30     1593 NS4b
## 7   B10      host     8263         T      C   168     1610  NS5
## 8   B10      host     8382         A      G   252     1747  NS5
## 9   B10      host     8433         A      C    52     1499  NS5
## 10  B10      host    10984         C      T     9      362 3UTR
##    coding_change amino_acid_change
## 1           TRUE             FALSE
## 2           TRUE              TRUE
## 3           TRUE              TRUE
## 4           TRUE              TRUE
## 5           TRUE             FALSE
## 6           TRUE             FALSE
## 7           TRUE              TRUE
## 8           TRUE             FALSE
## 9           TRUE             FALSE
## 10         FALSE             FALSE
str(virus_data) # describe the STRucture of the data.frame
## 'data.frame':    516 obs. of  10 variables:
##  $ host             : Factor w/ 20 levels "B10","B2","B3",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ host_type        : Factor w/ 2 levels "host","vector": 1 1 1 1 1 1 1 1 1 1 ...
##  $ position         : int  903 2170 4460 4957 5523 7050 8263 8382 8433 10984 ...
##  $ reference        : Factor w/ 4 levels "A","C","G","T": 4 1 4 1 3 2 4 1 1 2 ...
##  $ allele           : Factor w/ 5 levels "-","A","C","G",..: 3 4 3 5 2 5 3 4 3 5 ...
##  $ count            : int  40 15 16 33 16 30 168 252 52 9 ...
##  $ coverage         : int  1437 1298 1240 1371 1330 1593 1610 1747 1499 362 ...
##  $ gene             : Factor w/ 13 levels "2K","3UTR","5UTR",..: 13 5 8 9 9 11 12 12 12 2 ...
##  $ coding_change    : logi  TRUE TRUE TRUE TRUE TRUE TRUE ...
##  $ amino_acid_change: logi  FALSE TRUE TRUE TRUE FALSE FALSE ...
# add a new column
virus_data$frequency <- virus_data$count / virus_data$coverage # add col with SNP frequencies
virus_data$frequency <- with(virus_data, count / coverage) # same thing a different way

head(virus_data) # view again to check that it worked
##   host host_type position reference allele count coverage gene
## 1  B10      host      903         T      C    40     1437  PrM
## 2  B10      host     2170         A      G    15     1298    E
## 3  B10      host     4460         T      C    16     1240 NS2b
## 4  B10      host     4957         A      T    33     1371  NS3
## 5  B10      host     5523         G      A    16     1330  NS3
## 6  B10      host     7050         C      T    30     1593 NS4b
##   coding_change amino_acid_change  frequency
## 1          TRUE             FALSE 0.02783577
## 2          TRUE              TRUE 0.01155624
## 3          TRUE              TRUE 0.01290323
## 4          TRUE              TRUE 0.02407002
## 5          TRUE             FALSE 0.01203008
## 6          TRUE             FALSE 0.01883239
# generate summary statistics
summary(virus_data$coverage)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     281    1319    1444    1368    1514    2031
# SCATTERPLOTS
# view SNP frequencies
ggplot(data = virus_data) +
  geom_point(mapping = aes(x = position, y = frequency))

# change x and y axis titles
ggplot(data = virus_data) +
  geom_point(mapping = aes(x = position, y = frequency)) +
  xlab("Genome Position") + ylab("SNP Frequency")

# use transparency to better visualize clusters of points
ggplot(data = virus_data) +
  geom_point(mapping = aes(x = position, y = frequency), alpha = 0.5) +
  xlab("Genome Position") + ylab("SNP Frequency")

# Facetting -- multiple plots

ggplot(data = filter(virus_data, host_type == 'host')) +
  geom_point(mapping = aes(x = position, y = frequency)) +
  xlab("Genome Position") + ylab("SNP Frequency") +
  facet_wrap(~ host)

ggmaps

Also in the tidyverse is a great packages for easy generation of maps.

library(tidyverse)
library(mapdata)
## Loading required package: maps
## 
## Attaching package: 'maps'
## The following object is masked from 'package:purrr':
## 
##     map
library(maps)
library(ggmap)
library(magrittr)
## 
## Attaching package: 'magrittr'
## The following object is masked from 'package:ggmap':
## 
##     inset
## The following object is masked from 'package:purrr':
## 
##     set_names
## The following object is masked from 'package:tidyr':
## 
##     extract

If any of these fail try and install packages.

One of the best parts of these tools is the built in access to Google maps aerial imagery.

loc = cbind(-73.973917, 40.781799)
loc = as.data.frame(loc)
colnames(loc) = c('lon', 'lat')
bkmap <- get_map(location = loc, maptype = "satellite", source = "google", zoom =14)
## Map from URL : http://maps.googleapis.com/maps/api/staticmap?center=40.781799,-73.973917&zoom=14&size=640x640&scale=2&maptype=satellite&language=en-EN&sensor=false
ggmap(bkmap) + 
  geom_point(data = loc, 
             color = "red",
             size =4)

bkmap3 <- get_map(location = loc ,  maptype = "terrain", source = "google", zoom = 12)
## Map from URL : http://maps.googleapis.com/maps/api/staticmap?center=40.781799,-73.973917&zoom=12&size=640x640&scale=2&maptype=terrain&language=en-EN&sensor=false
ggmap(bkmap3) + 
  geom_point(data = loc, 
             color = "red",
             size =4)

bkmap4 <- get_map(location = loc ,  maptype = "toner-lite", source = "google", zoom = 10)
## maptype = "toner-lite" is only available with source = "stamen".
## resetting to source = "stamen"...
## Map from URL : http://maps.googleapis.com/maps/api/staticmap?center=40.781799,-73.973917&zoom=10&size=640x640&scale=2&maptype=terrain&sensor=false
## Map from URL : http://tile.stamen.com/toner-lite/10/300/383.png
## Map from URL : http://tile.stamen.com/toner-lite/10/301/383.png
## Map from URL : http://tile.stamen.com/toner-lite/10/302/383.png
## Map from URL : http://tile.stamen.com/toner-lite/10/300/384.png
## Map from URL : http://tile.stamen.com/toner-lite/10/301/384.png
## Map from URL : http://tile.stamen.com/toner-lite/10/302/384.png
## Map from URL : http://tile.stamen.com/toner-lite/10/300/385.png
## Map from URL : http://tile.stamen.com/toner-lite/10/301/385.png
## Map from URL : http://tile.stamen.com/toner-lite/10/302/385.png
ggmap(bkmap4) + 
  geom_point(data = loc, 
             color = "red",
             size =4)