Assembling the Zika virus genome was a good warm-up for getting familiar with some of the tools for working with whole-genome sequence data. Now we will scale up our analyses and try some new software for building bacterial genomes.
Recall that the virus assembly we started with produced many contigs, none long enough to cover the entire genome, and lots of contamination. The likely non-viral reads probably made the de Bruijn graph in ABySS difficult to resolve, which gave us many contigs when we really want one. One way to deal with problems like this is to generate long-read data in addition to the traditional (and efficient) short-read Illumina data. Today we will look at how long-read data affects the assembly of a larger (bacterial) genome.
Raw bacterial whole genome data from Illumina and Nanopore sequencing experiments:
https://www.ncbi.nlm.nih.gov/sra/?term=SAMEA3357010
nanoraw="SRR5665597"
shortraw="ERR1023775"
nthreads=12
fastq-dump $nanoraw --split-3
fastq-dump $shortraw --split-3
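Before moving on, it is worth confirming that the downloads finished. The sketch below counts the reads in each expected file; the two paired-end files should report the same number. The helper name count_reads is hypothetical, and it assumes uncompressed FASTQ with exactly four lines per record.

```shell
nanoraw="SRR5665597"
shortraw="ERR1023775"

# Hypothetical helper: count records in an uncompressed FASTQ file,
# assuming the standard four lines per record.
count_reads() {
    awk 'END { print NR / 4 }' "$1"
}

# Report a read count for each file we expect fastq-dump to have written.
for f in "${shortraw}_1.fastq" "${shortraw}_2.fastq" "${nanoraw}.fastq"; do
    if [ -f "$f" ]; then
        printf '%s\t%s\n' "$f" "$(count_reads "$f")"
    else
        printf '%s\tmissing\n' "$f"
    fi
done
```

A fractional count, or different counts for the _1 and _2 files, would suggest a truncated download.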
The ERR1023775 data is the short-read dataset. Check the data quality with FastQC.
fastqc $shortraw* # run FastQC on both the forward (_1) and reverse (_2) read files
The SRR5665597 data come from a Nanopore sequencing experiment. To examine the quality of these data, use NanoPlot, because FastQC assumes that its input is short-read data.
NanoPlot --fastq $nanoraw.fastq --outdir Nanoplots
How do these data look?
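As a rough command-line complement to the NanoPlot report, the sketch below summarizes read lengths directly from a FASTQ file. The helper name read_length_stats is hypothetical, and it assumes uncompressed FASTQ with four lines per record. The N50 here is the read length at which the longest reads, added up in descending order, first reach half of the total bases.

```shell
nanoraw="SRR5665597"

# Hypothetical helper: print read count, total bases, mean length, and
# read-length N50 for an uncompressed four-line-per-record FASTQ file.
read_length_stats() {
    awk 'NR % 4 == 2 { print length($0) }' "$1" |
        sort -rn |
        awk '{ len[NR] = $1; total += $1 }
             END {
                 if (NR == 0) { print "reads=0"; exit }
                 half = total / 2
                 # Walk from longest to shortest until half the bases are covered.
                 for (i = 1; i <= NR; i++) {
                     running += len[i]
                     if (running >= half) { n50 = len[i]; break }
                 }
                 printf "reads=%d bases=%.0f mean=%.1f n50=%d\n", NR, total, total / NR, n50
             }'
}

if [ -f "${nanoraw}.fastq" ]; then
    read_length_stats "${nanoraw}.fastq"
fi
```

Running the same function on one of the Illumina files gives a quick sense of how different the two read-length distributions are.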
Unicycler is a program designed specifically for assembling bacterial genomes from short-read (Illumina) and/or long-read (Nanopore/PacBio) raw data.
The Unicycler GitHub README has an EXCELLENT demonstration of how it deals with de Bruijn graph complexities.
Running Unicycler: Illumina Only
#Illumina Assembly
unicycler -1 ${shortraw}_1.fastq -2 ${shortraw}_2.fastq -o unicycler_short --threads $nthreads
Running Unicycler: Nanopore long reads only
#Long read assembly
unicycler -l $nanoraw.fastq -o unicycler_long --threads $nthreads
Running Unicycler: Hybrid assembly (Short + Long to resolve tangled graph)
#Hybrid Assembly
unicycler -1 ${shortraw}_1.fastq -2 ${shortraw}_2.fastq -l $nanoraw.fastq -o unicycler_hybrid --threads $nthreads
The above assemblies will take several hours to run. Work through each line in this document, or better yet, add these lines to a bash script and run the whole analysis at once. In either case, verify that each analysis completes and check the output folders specified in the Unicycler commands for contig files. In the lab this week we will visualize these assemblies in QUAST and look for evidence that the long-read and hybrid assemblies better resolve the graph-related complexities of the assembled genome.
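One way to do that check from the command line is sketched below. It assumes that a finished Unicycler run leaves its final contigs in a file named assembly.fasta inside the output folder; confirm that name against your own output. The helper name summarize_assembly is hypothetical.

```shell
# Hypothetical helper: report whether a Unicycler output directory contains
# a non-empty assembly.fasta (assumed output name) and how many contigs it holds.
summarize_assembly() {
    local dir="$1"
    local fasta="$dir/assembly.fasta"
    if [ -s "$fasta" ]; then
        # Each FASTA record starts with a ">" header line.
        printf '%s\tcontigs=%s\n' "$dir" "$(grep -c '^>' "$fasta")"
    else
        printf '%s\tno assembly.fasta yet\n' "$dir"
    fi
}

# The three output folders named in the Unicycler commands above.
for dir in unicycler_short unicycler_long unicycler_hybrid; do
    summarize_assembly "$dir"
done
```

A smaller contig count in the long-read or hybrid folder than in the short-read folder is an early hint that the long reads helped untangle the graph, ahead of the fuller comparison in QUAST.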