Today:

  • Discuss sequence alignment.
  • Introduce command line BLAST

DNA Sequence Alginment

Sequence data on their own are not that informative. To make sense of it all we have to start comparing data through alignment.

When we want to align two sequences together we call that pairwise alignment. Finding the similarities between many sequences is called multiple sequence alignment. Today we will focus on pairwise alignment algorithms and the ubiquitous BLAST algorithm.

First, let’s look at the simple Needleman-Wusch alignment algorithm:https://www.ebi.ac.uk/Tools/psa/emboss_needle/. Needleman-Wunsch searches for the optimal alignment of any pair of DNA (or protein) sequences. N-W alignments are complete or “global”, that is, it compares every position of the sequences.

Consider the algorithmic complexity of the Needleman-Wunsch alignment.

BLAST

BLAST (Basic Local Alignment Search Tool) optimizes time complexity in pairwise alignment by searching for short, local, sequence matches and then extending the comparison out from the matching regions.

Some good resources to review later:

Web BLAST

Most people first interact with BLAST via the web portal.

In the web portal search for this sequence:

“CATTTCAGGTGTGGGTGTTTATGATTGCGCCCTACCGTGACGTTTTCATCTATCGGAGGGAGTGGAACACCTGTCGCTTTCTATCTTCCATTGTACTTTGTGTTTACGAGCTAAAGTTTTAACACAAGAAGTTAGTATATTTACCATTCGATACAAACTCTTTTTTTTTGAGGACCGCTGTAAATAATTTCTGCATATGCATTACGGCGATCGTGATAATAGAAAAATCCGTCGGTCAGTCGACTTACTAACAAGGATGTCCTAGTACGCTTCACAAAATTTCATTTTTGTAATGATCCAATCCAGAGACTGATTAGACTAATGTAACATCAGCTTTCTTAATAGCATTATCCGTTATAAATGAATTTTCGGCGTTTGACTCCGCTTCAAATATTTTGTCGCACACTTGAAAGATAGCCAAAGCCGAAGAATGCTTGGTTGATTGTATAGATCCTTCCGGGACTGAGCCCACCCATAAAATGCCATTGCCATAAACGAACAAGGTAACGGCCTGTTTCCATTTATTCATTCAGAAGTATCTTTTGTAGCCTTCCAAAATGGATTTTCCCCGATATCTAACATAATGCATGAAAGGATCTTTGAAGACCCGTCTGCGATAGCCGGAAAATCATTAGCAAAGACTTCTTCTACCACAAGGATATTTTTATTTTCCATAGAAATATATTCAGCTCAAAAAAAAGCCCCATGAGAGGATGCGACCCAATCGTAAATGAGAAAAGTGGTTGCGGAGAAAAAGTAAGATGGATTCGTATTCTAAGCATGAGAATTATAGGAACAAGGAAGCCTAATCTTGGATTTATTTTGCAGAAAAGGAAAAGTCAAATAGATTTTTGTAGAAAATAATAAAAACATTACCTTCAATTAGAATATAGATGAAGAAAAGGCCGTAATAAATGTACAAAGAAGGAATGCCACTTGTGAAGAATTTGAACCAAGATTTCCAGATGGACTGGGTGCAATATCAGCACCAACAGGAAATCCAGCGTTTCCTCGATAGATGAAACGT”

What is it? ^

Get Data

Work on the server

We will attempt to search human transcriptome data against the human genome using NCBI RefSeq data.

First, set up a working directory:

mkdir blast
cd blast
mkdir data

Then, download data into the data folder


cp /usr/share/data/BIO331/genome.fna.gz data/genome.fna.gz


cp /usr/share/data/BIO331/transcripts.fna.gz data/transcripts.fna.gz

BLAST: Make Database from Genome File

The first step to running a blast analysis on your machine is to build your database. We will be using the genome file for that purpose.NOTE FOR MAC USERS: For all commands below leave out the “.exe” in the program file name.

gunzip data/genome.fna.gz

makeblastdb -in data/genome.fna -dbtype nucl -title human
gunzip data/transcripts.fna.gz
head -40 data/transcripts.fna

Use the Unix command line tools we talked about last time to look through the genome.fna and transcripts.fna files.

FASTA

They should both be in FASTA format which looks like this:

“> NC0000001.1 Descriptive title string
ACTCGCGCTCGATCGATCGATCGCTG
ATCGTCGCTCGCTCCTCTCGTGAGAT
ATCGCTGCTAGCTAGCTGATGCGTAT”

What does this do? (Hint: ‘tr –help’)

head -100 data/transcripts.fna | tr '\n' '\0' 

Then we can run a basic BLAST search:

head -100 data/transcripts.fna > min.fna

blastn -db data/genome.fna -query data/min.fna -evalue 1e-50 

Spend some time parsing this output.

Looking at the alignments can be informative. Often, however, you may just want summary data: change -outfmt flag


blastn -db data/genome.fna -query data/min.fna -evalue 1e-50 -outfmt 6

The columns in this blast output format are:

  1.  qseqid      query (e.g., gene) sequence id
  2.  sseqid      subject (e.g., reference genome) sequence id
  3.  pident      percentage of identical matches
  4.  length      alignment length
  5.  mismatch    number of mismatches
  6.  gapopen     number of gap openings
  7.  qstart      start of alignment in query
  8.  qend    end of alignment in query
  9.  sstart      start of alignment in subject
  10.  send    end of alignment in subject
  11.  evalue      expect value
  12. bitscore