Sequence data on their own are not that informative. To make sense of it all we have to start comparing data through alignment.
When we want to align two sequences together we call that pairwise alignment. Finding the similarities between many sequences is called multiple sequence alignment. Today we will focus on pairwise alignment algorithms and the ubiquitous BLAST algorithm.
First, let’s look at the simple Needleman-Wusch alignment algorithm:https://www.ebi.ac.uk/Tools/psa/emboss_needle/. Needleman-Wunsch searches for the optimal alignment of any pair of DNA (or protein) sequences. N-W alignments are complete or “global”, that is, it compares every position of the sequences.
Consider the algorithmic complexity of the Needleman-Wunsch alignment.
BLAST (Basic Local Alignment Search Tool) optimizes time complexity in pairwise alignment by searching for short, local, sequence matches and then extending the comparison out from the matching regions.
Some good resources to review later:
Most people first interact with BLAST via the web portal.
In the web portal search for this sequence:
“CATTTCAGGTGTGGGTGTTTATGATTGCGCCCTACCGTGACGTTTTCATCTATCGGAGGGAGTGGAACACCTGTCGCTTTCTATCTTCCATTGTACTTTGTGTTTACGAGCTAAAGTTTTAACACAAGAAGTTAGTATATTTACCATTCGATACAAACTCTTTTTTTTTGAGGACCGCTGTAAATAATTTCTGCATATGCATTACGGCGATCGTGATAATAGAAAAATCCGTCGGTCAGTCGACTTACTAACAAGGATGTCCTAGTACGCTTCACAAAATTTCATTTTTGTAATGATCCAATCCAGAGACTGATTAGACTAATGTAACATCAGCTTTCTTAATAGCATTATCCGTTATAAATGAATTTTCGGCGTTTGACTCCGCTTCAAATATTTTGTCGCACACTTGAAAGATAGCCAAAGCCGAAGAATGCTTGGTTGATTGTATAGATCCTTCCGGGACTGAGCCCACCCATAAAATGCCATTGCCATAAACGAACAAGGTAACGGCCTGTTTCCATTTATTCATTCAGAAGTATCTTTTGTAGCCTTCCAAAATGGATTTTCCCCGATATCTAACATAATGCATGAAAGGATCTTTGAAGACCCGTCTGCGATAGCCGGAAAATCATTAGCAAAGACTTCTTCTACCACAAGGATATTTTTATTTTCCATAGAAATATATTCAGCTCAAAAAAAAGCCCCATGAGAGGATGCGACCCAATCGTAAATGAGAAAAGTGGTTGCGGAGAAAAAGTAAGATGGATTCGTATTCTAAGCATGAGAATTATAGGAACAAGGAAGCCTAATCTTGGATTTATTTTGCAGAAAAGGAAAAGTCAAATAGATTTTTGTAGAAAATAATAAAAACATTACCTTCAATTAGAATATAGATGAAGAAAAGGCCGTAATAAATGTACAAAGAAGGAATGCCACTTGTGAAGAATTTGAACCAAGATTTCCAGATGGACTGGGTGCAATATCAGCACCAACAGGAAATCCAGCGTTTCCTCGATAGATGAAACGT”
What is it? ^
Work on the server
We will attempt to search human transcriptome data against the human genome using NCBI RefSeq data.
First, set up a working directory:
mkdir blast
cd blast
mkdir data
Then, download data into the data folder
cp /usr/share/data/BIO331/genome.fna.gz data/genome.fna.gz
cp /usr/share/data/BIO331/transcripts.fna.gz data/transcripts.fna.gz
The first step to running a blast analysis on your machine is to build your database. We will be using the genome file for that purpose.NOTE FOR MAC USERS: For all commands below leave out the “.exe” in the program file name.
gunzip data/genome.fna.gz
makeblastdb -in data/genome.fna -dbtype nucl -title human
gunzip data/transcripts.fna.gz
head -40 data/transcripts.fna
Use the Unix command line tools we talked about last time to look through the genome.fna and transcripts.fna files.
They should both be in FASTA format which looks like this:
“> NC0000001.1 Descriptive title string
ACTCGCGCTCGATCGATCGATCGCTG
ATCGTCGCTCGCTCCTCTCGTGAGAT
ATCGCTGCTAGCTAGCTGATGCGTAT”
What does this do? (Hint: ‘tr –help’)
head -100 data/transcripts.fna | tr '\n' '\0'
Then we can run a basic BLAST search:
head -100 data/transcripts.fna > min.fna
blastn -db data/genome.fna -query data/min.fna -evalue 1e-50
Spend some time parsing this output.
Looking at the alignments can be informative. Often, however, you may just want summary data: change -outfmt flag
blastn -db data/genome.fna -query data/min.fna -evalue 1e-50 -outfmt 6
The columns in this blast output format are:
qseqid query (e.g., gene) sequence id
sseqid subject (e.g., reference genome) sequence id
pident percentage of identical matches
length alignment length
mismatch number of mismatches
gapopen number of gap openings
qstart start of alignment in query
qend end of alignment in query
sstart start of alignment in subject
send end of alignment in subject
evalue expect value