Sequence data on their own are not that informative. To make sense of it all we have to start comparing data through alignment.
When we want to align two sequences together we call that pairwise alignment. Finding the similarities between many sequences is called multiple sequence alignment. Today we will focus on pairwise alignment algorithms and the ubiquitous BLAST algorithm.
First, let’s look at the simple Needleman-Wusch alignment algorithm:https://www.ebi.ac.uk/Tools/psa/emboss_needle/. Needleman-Wunsch searches for the optimal alignment of any pair of DNA (or protein) sequences. N-W alignments are complete or “global”, that is, it compares every position of the sequences.
Consider the algorithmic complexity of the Needleman-Wunsch alignment.
BLAST (Basic Local Alignment Search Tool) optimizes time complexity in pairwise alignment by searching for short, local, sequence matches and then extending the comparison out from the matching regions.
Some good resources to review later:
Most people first interact with BLAST via the web portal.
In the web portal search for this sequence:
“CATTTCAGGTGTGGGTGTTTATGATTGCGCCCTACCGTGACGTTTTCATCTATCGGAGGGAGTGGAACACCTGTCGCTTTCTATCTTCCATTGTACTTTGTGTTTACGAGCTAAAGTTTTAACACAAGAAGTTAGTATATTTACCATTCGATACAAACTCTTTTTTTTTGAGGACCGCTGTAAATAATTTCTGCATATGCATTACGGCGATCGTGATAATAGAAAAATCCGTCGGTCAGTCGACTTACTAACAAGGATGTCCTAGTACGCTTCACAAAATTTCATTTTTGTAATGATCCAATCCAGAGACTGATTAGACTAATGTAACATCAGCTTTCTTAATAGCATTATCCGTTATAAATGAATTTTCGGCGTTTGACTCCGCTTCAAATATTTTGTCGCACACTTGAAAGATAGCCAAAGCCGAAGAATGCTTGGTTGATTGTATAGATCCTTCCGGGACTGAGCCCACCCATAAAATGCCATTGCCATAAACGAACAAGGTAACGGCCTGTTTCCATTTATTCATTCAGAAGTATCTTTTGTAGCCTTCCAAAATGGATTTTCCCCGATATCTAACATAATGCATGAAAGGATCTTTGAAGACCCGTCTGCGATAGCCGGAAAATCATTAGCAAAGACTTCTTCTACCACAAGGATATTTTTATTTTCCATAGAAATATATTCAGCTCAAAAAAAAGCCCCATGAGAGGATGCGACCCAATCGTAAATGAGAAAAGTGGTTGCGGAGAAAAAGTAAGATGGATTCGTATTCTAAGCATGAGAATTATAGGAACAAGGAAGCCTAATCTTGGATTTATTTTGCAGAAAAGGAAAAGTCAAATAGATTTTTGTAGAAAATAATAAAAACATTACCTTCAATTAGAATATAGATGAAGAAAAGGCCGTAATAAATGTACAAAGAAGGAATGCCACTTGTGAAGAATTTGAACCAAGATTTCCAGATGGACTGGGTGCAATATCAGCACCAACAGGAAATCCAGCGTTTCCTCGATAGATGAAACGT”
What is it? ^
Work on the server
We will attempt to search human transcriptome data against the human genome using NCBI RefSeq data.
First, set up a working directory:
mkdir blast
cd blast
mkdir dataThen, download data into the data folder
cp /usr/share/data/BIO331/genome.fna.gz data/genome.fna.gz
cp /usr/share/data/BIO331/transcripts.fna.gz data/transcripts.fna.gzThe first step to running a blast analysis on your machine is to build your database. We will be using the genome file for that purpose.NOTE FOR MAC USERS: For all commands below leave out the “.exe” in the program file name.
gunzip data/genome.fna.gz
makeblastdb -in data/genome.fna -dbtype nucl -title humangunzip data/transcripts.fna.gz
head -40 data/transcripts.fnaUse the Unix command line tools we talked about last time to look through the genome.fna and transcripts.fna files.
They should both be in FASTA format which looks like this:
“> NC0000001.1 Descriptive title string
 ACTCGCGCTCGATCGATCGATCGCTG
 ATCGTCGCTCGCTCCTCTCGTGAGAT
 ATCGCTGCTAGCTAGCTGATGCGTAT”
What does this do? (Hint: ‘tr –help’)
head -100 data/transcripts.fna | tr '\n' '\0' Then we can run a basic BLAST search:
head -100 data/transcripts.fna > min.fna
blastn -db data/genome.fna -query data/min.fna -evalue 1e-50 Spend some time parsing this output.
Looking at the alignments can be informative. Often, however, you may just want summary data: change -outfmt flag
blastn -db data/genome.fna -query data/min.fna -evalue 1e-50 -outfmt 6The columns in this blast output format are:
 qseqid      query (e.g., gene) sequence id sseqid      subject (e.g., reference genome) sequence id pident      percentage of identical matches length      alignment length mismatch    number of mismatches gapopen     number of gap openings qstart      start of alignment in query qend    end of alignment in query sstart      start of alignment in subject send    end of alignment in subject evalue      expect value