The Unix Shell: A fundamental tool for bioinformatics

Outline:

  • Setup
  • What is the Unix shell? (What’s Unix? What’s a shell?)
  • Read Evaluate Print Loop (REPL)
  • Why do we need to know this?
  • Interacting with the shell
  • Moving in and viewing your filesystem
  • Files (create, edit, delete)
  • Parsing (search and bulk edit files - read, search, sort, and rearrange files)
  • Advanced topics: Scripting, loops, functions (*Probably for later)

Setup: RStudio

We are going to use the Unix (or similar) terminal tab in RStudio. If you do not already have RStudio set up, you will need to do this now.

To get RStudio set up on your laptop:

  1. Download and install R from CRAN
  2. Download and install the Free RStudio Desktop from the RStudio Downloads Page

Install R first, and THEN RStudio.

What is the Unix Shell?

The shell provides an environment to interact with your computer on a more fundamental level than we are used to. It does this by implementing a simple “language” that we can learn to communicate with our machines through a command-line interface.

The command line runs a Read-Evaluate-Print Loop (REPL) that is the core of everything we do there: it reads the command you type, evaluates it, prints the result, and waits for the next command. This is essentially what the RStudio Console window does for us in the context of R.

The command-line looks something like this:

And we interact with the shell by typing a command after the $:

What did that do?

Finding out about your file system: ls, pwd, cd

ls – List what is in your current directory.
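As a minimal sketch of what this looks like (the scratch directory and the file names notes.txt and data.csv are made up for illustration; they are not files from this lesson):

```shell
# Work in a throwaway directory so the listing is predictable.
cd "$(mktemp -d)"
touch notes.txt data.csv   # hypothetical example files

ls       # names only
ls -l    # long format: permissions, owner, size, modification time
ls -a    # also show hidden entries, including . and ..
```

Flags such as -l and -a can be combined, for example ls -la.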

A brief note about getting help:

All command line programs/commands should** have a help file associated with them. We can view those in one of two ways:
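The two usual routes are sketched below, using ls as the example. This assumes a GNU/Linux-style system; on some systems (macOS, for instance) the built-in tools may not accept --help, in which case the manual page is the one to use.

```shell
# 1) The manual page. It opens in a pager, so run it interactively
#    and press q to quit:
#      man ls

# 2) The built-in usage summary (GNU tools; an assumption about your system):
ls --help
```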

pwd – print working directory

cd – change working directory
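A short sketch of how pwd and cd work together. The scratch directory and the name project are assumptions for illustration only.

```shell
# Build a small hypothetical directory tree to move around in.
sandbox=$(mktemp -d)
mkdir "$sandbox/project"

cd "$sandbox/project"   # move using an absolute path
pwd                     # confirm where we are
cd ..                   # .. means "the parent directory"
pwd
cd project              # move using a path relative to the current directory
cd ~                    # ~ is shorthand for your home directory
cd -                    # jump back to the previous directory (and print it)
```

After the last command we are back in the project directory, which pwd would confirm.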

Files - Create, edit, delete files and directories: mkdir, nano, touch, cp, mv, rm, rmdir, cat, head, less

Let’s make a new folder in our current directory to play in:

mkdir - make directory
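For example, a minimal sketch (the directory name playground is hypothetical, and the first line only sets up a scratch location so nothing important is touched):

```shell
cd "$(mktemp -d)"   # start from a fresh scratch location
mkdir playground    # make the new directory
cd playground       # move into it
ls                  # list its contents
```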

There is nothing there because we just created that directory. To make a new file we can use:
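A minimal sketch, assuming the intended command here is touch (it is listed in the section heading). The file name newfile matches the name used later in these notes; the scratch directory is only there to keep the example self-contained.

```shell
cd "$(mktemp -d)"   # scratch location for the example
touch newfile       # create an empty file (or update its timestamp if it exists)
ls -l newfile       # the size column shows 0 bytes
```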

nano

OR we can open a text editor, ‘nano’. Be sure to put some text into the nano editor and save before closing.

Use Ctrl+X to close nano and follow the prompts at the bottom of your terminal if needed.

cp

Let’s say we want to make a backup of newfile. We can use ‘cp’:
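One way this might look (the backup name newfile.backup is an assumption; any name will do):

```shell
cd "$(mktemp -d)"              # scratch location for the example
echo "some text" > newfile     # stand-in for the file made with nano

cp newfile newfile.backup      # cp SOURCE DESTINATION
ls                             # both files are now listed
```

Note that cp silently overwrites an existing destination file; cp -i asks first.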

cat, head

We can do some simple viewing of files with the commands ‘cat’ and ‘head’. These are loosely analogous to cat() and head() in R: ‘cat’ prints the whole file and ‘head’ prints only the first few lines.
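A small sketch using a made-up 20-line file (numbers.txt is a hypothetical name, generated here with seq so the example is self-contained):

```shell
cd "$(mktemp -d)"          # scratch location for the example
seq 1 20 > numbers.txt     # a file containing the numbers 1 to 20, one per line

cat numbers.txt            # print the entire file
head numbers.txt           # print the first 10 lines (the default)
head -n 3 numbers.txt      # print only the first 3 lines
```

For files too long to read comfortably with cat, less numbers.txt pages through them (press q to quit).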

wc

And gather basic file statistics with ‘wc’:
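For instance, with a tiny two-line file created just for this sketch (the contents are arbitrary):

```shell
cd "$(mktemp -d)"                        # scratch location for the example
printf 'one two\nthree\n' > newfile      # 2 lines, 3 words, 14 bytes

wc newfile       # reports lines, words, and bytes, followed by the file name
wc -l newfile    # lines only
wc -w newfile    # words only
```

wc -l is the one you will use most: it is a quick way to count records in a data file.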

mv, rm, rmdir

Or maybe we want to move a file to a new place:
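A minimal sketch of both uses of mv, moving and renaming. The names newfile and backdir follow these notes; renamedfile is an assumption for illustration.

```shell
cd "$(mktemp -d)"    # scratch location for the example
touch newfile
mkdir backdir

mv newfile backdir/                        # move the file into a directory
mv backdir/newfile backdir/renamedfile     # mv is also how you rename a file
ls backdir
```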

All of that was junk, so let’s clean up after ourselves by removing the files we made with ‘rm’ and then the backdir directory with ‘rmdir’:
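Sketched in a scratch directory so that nothing real is at risk (the file names are assumptions standing in for whatever you created above):

```shell
cd "$(mktemp -d)"                        # scratch location for the example
mkdir backdir
touch newfile backdir/newfile.backup     # stand-ins for the files made earlier

rm newfile backdir/newfile.backup        # delete the files (no undo, no trash can)
rmdir backdir                            # rmdir only succeeds on an EMPTY directory
ls                                       # nothing left
```

While you are learning, rm -i is a safer habit: it asks for confirmation before each deletion.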

NOTE: ‘rm’ is FOREVER. Be very sure you understand how rm behaves before pressing ENTER.

One dangerous error

Searching and manipulating text files: grep, awk, shuf, sort, uniq

Let’s first grab an example data file using a new command ‘curl’:

And we can do some basic exploration: (one command at a time!)

Does this text look familiar? How can we figure out what we have? Why has your instructor made this so difficult?

grep

A command you all NEED to know that helps search through big files is ‘grep’. Once again R has a grep() function that works on similar principles. Unix grep is one of the most used tools in bioinformatics data exploration.

Things can get more complicated with Regular Expressions, but for now (until we start working with FASTA and FASTQ files) we will deal only in whole strings of characters.
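A minimal sketch of whole-string searching, using a three-line file made up for this example (sample.txt and its contents are not the lesson's data file):

```shell
cd "$(mktemp -d)"    # scratch location for the example
printf 'The quick brown fox\njumps over the lazy dog\nTHE END\n' > sample.txt

grep "the" sample.txt       # case-sensitive: only "jumps over the lazy dog" matches
grep -i "the" sample.txt    # -i ignores case: all three lines match
grep -c "the" sample.txt    # -c counts matching lines instead of printing them
grep -n "fox" sample.txt    # -n prefixes each match with its line number
```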

It looks like this file has some old English phrases in it. Maybe it’s Shakespeare. But which one? Let’s try to grep out some key words that could help us figure it out:

Nope….

hmm… What do we have?

We have a scrambled text file that is a collection of Shakespeare plays. Can we sort this file out and make it readable again? YES! Better living through the power of Unix!

sort

The ‘sort’ command does (sometimes) exactly what it sounds like. This file that we have looks like it has line numbers at the start of every line. Let’s see if that is the case:

**the “>” here redirects the output into a new file. This is REALLY USEFUL.
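To illustrate the default behavior and the redirect, here is a sketch with a tiny made-up file (scrambled.txt and its contents are hypothetical; test.txt is the output name used in these notes):

```shell
cd "$(mktemp -d)"    # scratch location for the example
printf '20 twenty\n3 three\n100 hundred\n' > scrambled.txt

sort scrambled.txt > test.txt   # ">" writes the sorted output to test.txt
cat test.txt                    # 100 comes first, then 20, then 3
```

By default sort compares lines character by character, as text, so "100" lands before "20" and "3".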


Not quite. Check sort --help and see if you can fix it!

awk

Now let’s look at a tabular data file and why we need “awk” in addition to grep.

This is a small chunk of the eBird database. How big is this file?

Searching this file with grep works OK ….

Downsides: 1) grep looks anywhere in the line, not just the column where the common name resides. 2) There are too many columns to read anything easily.

This is where awk is really useful. Could you read this into R? Sure. But remember that this is a tiny example. If you had a table that was even 10% of the entire eBird database, you would not be able to read it into the RAM available on your laptop. ‘awk’ processes a file one line at a time, so it can deal with very large files much more efficiently, and it is really useful for exploring data and retrieving the parts you want.

Example: Use awk to print just specific columns. Aside: The pipe “|” symbol passes the output of one command to the next command typed. We will do more with this.
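A minimal sketch of the idea on a small made-up table (birds.txt, its three space-separated columns, and its values are assumptions for illustration, not the eBird file):

```shell
cd "$(mktemp -d)"    # scratch location for the example
printf 'S1 Osprey 44.5\nS2 Mallard 45.1\nS3 Osprey 46.0\n' > birds.txt

# grep keeps the Osprey lines; the pipe hands them to awk,
# which prints only columns 1 and 3 of each line.
grep "Osprey" birds.txt | awk '{print $1, $3}'
```

In awk, $1, $2, ... refer to the columns (fields) of the current line and $0 is the whole line.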

But what columns do we really want?

Let’s look at only #1 (ID), #5 (common name), and #6 (Latin name)

Is that what we wanted? Not quite. The default field delimiter in awk is any whitespace (space, tab, and a few other variants), so a name made of two words is split into two columns. The delimiter in this file is a tab character.

We can tell awk that with the -F parameter
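The difference can be sketched with a one-line tab-delimited file (birds.tsv and its column layout are hypothetical, chosen so that one field contains a space):

```shell
cd "$(mktemp -d)"    # scratch location for the example
printf 'S1\tBald Eagle\tHaliaeetus leucocephalus\n' > birds.tsv

awk '{print $2}' birds.tsv           # whitespace splitting: prints "Bald"
awk -F '\t' '{print $2}' birds.tsv   # tab splitting: prints "Bald Eagle"
```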

Now we can see that our grep command is not actually finding only Osprey records. It is also finding the word Osprey somewhere else in the line for some other bird records.

We can fix that with awk by supplying an “if” statement
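A sketch of the pattern on a made-up two-row table. Here the common name is assumed to be in column 2 and a free-text comment in column 4; in the real file you would use the column numbers identified above.

```shell
cd "$(mktemp -d)"    # scratch location for the example
printf 'S1\tOsprey\tPandion haliaetus\tnone\n' > birds.tsv
printf 'S2\tMallard\tAnas platyrhynchos\tseen near Osprey nest\n' >> birds.tsv

grep -c "Osprey" birds.tsv    # counts 2 lines: the Mallard row matches via its comment

# Only print rows whose common-name column is exactly "Osprey".
awk -F '\t' '{ if ($2 == "Osprey") print $1, $2, $3 }' birds.tsv
```

The same test can be written more compactly as awk -F '\t' '$2 == "Osprey" {print $1, $2, $3}' birds.tsv.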

Challenge: Can you write an awk command to grab the ID, names, and latitude and longitude and put that in its own file?

sort | uniq

A useful pair of tools is ‘sort’ and ‘uniq’. These do what they say: sort a list and find unique elements. ‘uniq’ is fast, but it only removes duplicates that are adjacent to each other, so you should always use sort and then uniq.

Example:

And to see that we need sort with our uniq just count the output:
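Both points can be sketched with a short made-up list in which the duplicates are never adjacent (names.txt and its contents are hypothetical):

```shell
cd "$(mktemp -d)"    # scratch location for the example
printf 'Osprey\nMallard\nOsprey\nMallard\nOsprey\n' > names.txt

sort names.txt | uniq           # two unique names: Mallard, Osprey
sort names.txt | uniq -c        # -c also reports how many times each occurs

uniq names.txt | wc -l          # 5: without sort, no duplicates are adjacent
sort names.txt | uniq | wc -l   # 2: after sort, duplicates collapse
```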

For further reading: The GNU project

https://www.gnu.org/gnu/thegnuproject.en.html

Homework:

Go to: https://swcarpentry.github.io/shell-novice/

Work through the tutorial on the Unix Shell from Software Carpentry and create a notes file. Turn in your notes by email before class on Monday. Note that the recommended time to teach this entire tutorial is 4.5 hours. I don’t know how long it will take each of you to work through it independently, but expect to spend more than 1 hour at a minimum.