Outline:
We are going to use the Unix (or similar) terminal tab in RStudio. If you do not already have RStudio set up, you will need to do this now.
To get RStudio set up on your laptop:
Install R first, and THEN RStudio.
The shell provides an environment to interact with your computer on a more fundamental level than we are used to. It does this by implementing a simple “language” that we can learn in order to communicate with our machines through a command-line interface.
The command-line runs a Read-Evaluate-Print Loop (REPL) that is the core of everything we do there. This is essentially what the RStudio Console window does for us in the context of R.
The command-line looks something like this:
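(The exact prompt varies from machine to machine; the important part is the $ at the end.)

user@laptop:~$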
And we interact with the shell by typing a command after the $:
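ls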
What did that do?
ls – List what is in your current directory.
All command-line programs should have a help file associated with them. We can view those in one of two ways:
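For example, for ls (note: not every command supports --help on every system, so man is the safer bet):

man ls    # open the manual page for ls; press q to quit
ls --help # many commands print a quick usage summary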
pwd – print working directory
cd – change working directory
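For example:

pwd   # where am I right now?
cd .. # move up one directory level
pwd   # confirm that we moved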
Let’s make a new folder in our current directory to play in:
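We will call it backdir (any name would do):

mkdir backdir # make a new directory named backdir
cd backdir    # step inside
ls            # look around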
There is nothing there because we just created that directory. To make a new file we can use:
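(‘touch’ creates an empty file; file1.txt is just an example name:)

touch file1.txt # create an empty file named file1.txt
ls              # it should show up now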
OR we can use the text editor ‘nano’. Be sure to put some text into the nano editor and save before closing:
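nano file1.txt # open file1.txt (the example file from above) in the nano editor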
Use Ctrl+X to close nano and follow the prompts at the bottom of your terminal if needed.
We can do some simple viewing of files with the commands ‘cat’ and ‘head’. These behave much like cat() and head() in R:
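cat file1.txt  # print the whole file to the screen
head file1.txt # print just the first 10 lines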
Or maybe we want to move a file to a new place
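(‘mv’ moves or renames; these names continue the example above:)

mv file1.txt file2.txt # rename file1.txt to file2.txt
ls                     # only file2.txt should be here now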
All of that was junk so let’s clean up after ourselves by removing the files we made with ‘rm’ and then the backdir directory with ‘rmdir’
NOTE: ‘rm’ is FOREVER. Be very sure you understand how rm behaves before pressing ENTER.
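Continuing with our example files:

rm file2.txt  # delete the file; there is no trash can here
cd ..         # step back out of backdir
rmdir backdir # rmdir only removes EMPTY directories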
One dangerous error to watch out for:
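A classic slip (hypothetical, DO NOT RUN) is a stray space in a wildcard:

# rm * .txt   # intended 'rm *.txt', but the stray space deletes EVERYTHING first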
Let’s first grab an example data file using a new command ‘curl’:
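(The URL below is a placeholder; use the link given in class. The “>” saves the download as test.txt.)

curl https://example.com/mystery.txt > test.txt # placeholder URL; save the download as test.txt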
And we can do some basic exploration: (one command at a time!)
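wc test.txt   # count lines, words, and characters
head test.txt # the first 10 lines
tail test.txt # the last 10 lines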
Does this text look familiar? How can we figure out what we have? Why has your instructor made this so difficult?
A command you all NEED to know that helps search through big files is ‘grep’. Once again R has a grep() function that works on similar principles. Unix grep is one of the most used tools in bioinformatics data exploration.
grep --help
grep "the" test.txt #print all lines with the word "the"
grep -c "the" test.txt # count lines matching "the"
grep -cw "the" test.txt # why is this different?
Things can get more complicated with Regular Expressions but for now (until we start working with FASTA and fastq files) we will deal only in whole strings of characters.
It looks like this file has some old English phrases in it. Maybe it’s Shakespeare. But which one? Let’s try to grep out some key words that could help us figure it out:
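(The first guess here is arbitrary; any famous character name will do:)

grep -i "hamlet" test.txt # -i makes the search case-insensitive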
Nope….
grep -i "othello" test.txt
grep -i "henry" test.txt
grep -i "lear" test.txt
grep -i "petruchio" test.txt
hmm… What do we have?
We have a scrambled text file that is a collection of Shakespeare plays. Can we sort this file out and make it readable again? YES! Better living through the power of Unix!
The ‘sort’ command does (sometimes) exactly what it sounds like. This file that we have looks like it has line numbers at the start of every line. Let’s see if that is the case:
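First a peek, then a first attempt at sorting (sorted.txt is just an example name for the output file):

head test.txt              # do the lines start with numbers?
sort test.txt > sorted.txt # sort the lines and redirect the output into sorted.txt
head sorted.txt            # inspect the result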
NOTE: the “>” here redirects the output into a new file. This is REALLY USEFUL.
Not quite. Check sort --help and see if you can fix it!
Now let’s look at a tabular data file and why we need “awk” in addition to grep.
curl https://raw.githubusercontent.com/rsh249/bioinformatics/master/data/ebd_trim3.csv > ebird.csv
head ebird.csv
This is a small chunk of the eBird database. How big is this file?
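Two quick ways to check:

ls -lh ebird.csv # -h prints the file size in human-readable units
wc -l ebird.csv  # how many lines (records)?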
Searching this file with grep works OK ….
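For example:

grep "Osprey" ebird.csv # print every line that mentions Osprey anywhere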
Downsides: 1) grep looks anywhere in the line, not just the column where the common name resides. 2) There are too many columns to read anything easily.
This is where awk is really useful. Could you read this into R? Sure. But remember that this is a tiny example. If you had a table that was even 10% of the entire eBird database, you would not be able to read it into the RAM available on your laptop. ‘awk’ can deal with very large files in a much more efficient way and is really useful for exploring data and retrieving the parts you want.
Example: Use awk to print just specific columns. Aside: The pipe “|” symbol passes the output of one command to the next command typed. We will do more with this.
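awk '{print $1}' ebird.csv | head # print column 1; the pipe hands the output to head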
But what columns do we really want?
Let’s look at only #1 (ID), #5 (common name), and #6 (Latin name):
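awk '{print $1, $5, $6}' ebird.csv | head # columns 1, 5, and 6, we hope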
Is that what we wanted? Not quite. The default field delimiter in awk is any whitespace (space, tab, and a few other variants). The delimiter in this file is a tab character.
We can tell awk that with the -F parameter:
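awk -F '\t' '{print $1, $5, $6}' ebird.csv | head # -F '\t' sets the field delimiter to a tab

And as a check (piping our earlier grep hits through awk to see which species they really belong to; this combination is just a sketch):

grep "Osprey" ebird.csv | awk -F '\t' '{print $1, $5, $6}' | head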
Now we can see that our grep command is not actually finding only Osprey records. It is finding the word Osprey somewhere else in the line for some other bird records.
We can fix that with awk by supplying an “if” statement
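A sketch, assuming the common name is in column 5 as above:

awk -F '\t' '{if ($5 == "Osprey") print $1, $5, $6}' ebird.csv | head # keep only rows where column 5 is exactly "Osprey"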
Challenge: Can you write an awk command to grab the ID, names, and latitude and longitude and put that in its own file?
A useful pair of tools is ‘sort’ and ‘uniq’. These do what they say: sort a list and find unique elements. ‘uniq’ is fast, but it only removes duplicates that are adjacent in the input, so you should always sort and then uniq.
Example:
awk -F '\t' '{print $17}' ebird.csv #counties in our eBird slice.
awk -F '\t' '{print $17}' ebird.csv | sort
awk -F '\t' '{print $17}' ebird.csv | sort | uniq
And to see why we need sort with our uniq, just count the output:
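awk -F '\t' '{print $17}' ebird.csv | uniq | wc -l        # without sort: only adjacent duplicates are removed, so the count is inflated
awk -F '\t' '{print $17}' ebird.csv | sort | uniq | wc -l # with sort: each county counted once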
Go to: https://swcarpentry.github.io/shell-novice/
Work through the tutorial on the Unix Shell from Software Carpentry and create a notes file. Turn in your notes by email before class on Monday. Note that the recommended time to teach this entire tutorial is 4.5 hours. I don’t know how long it will take each of you to work through it independently, but plan on spending more than an hour at a minimum.