Create variable pointing to the class E. coli assembly and raw data.
datapath=/usr/share/data/proj_data/ecoli_lab
reads=$datapath/oxford.fasta
assembly=$datapath/ecoli-oxford3/ecoli.contigs.fasta
Then create a directory for polish testing. We will work in here today.
mkdir polishtest
cd polishtest
Once you have contigs the challenge becomes viewing the assembly and understanding how well this characterizes the genome.
Today we will be exploring some of the tools for visualizing and improving upon our Nanopore assemblies.
One very useful tool for examining genome statistics is Quast. This helps visualize how much and which parts of the genome our contigs cover.
Example Run:
#get reference genome:
### Check a reference
wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/005/845/GCF_000005845.2_ASM584v2/GCF_000005845.2_ASM584v2_genomic.fna.gz
gunzip GCF_000005845.2_ASM584v2_genomic.fna.gz
refdata=GCF_000005845.2_ASM584v2_genomic.fna
quast.py $assembly -R $refdata --nanopore $reads -o quast_test_output_base
Check out a copy here:
After we get a look at one genome output we can attempt to improve upon this via genome polishing. There are several methods for this with Nanopore data but we will focus here on the minimap -> racon process.
Here we are using the raw nanopore reads. However, if you have Illumina data USE THOSE TO POLISH THE CONTIGS!
minimap2: https://github.com/lh3/minimap2 racon: https://github.com/isovic/racon
Map reads to assembly using minimap:
minimap -Sw5 -L100 -m0 -t2 $assembly $reads > mapIT01.paf
Where: + “-t2” specifies 2 threads (optional for this small example) + “Sw5” ????? + “-m0” specifies to never merge chains. + “-L100” requires matches to be at least 100 base pairs in length.
Then we take those mapped data and generate a consensus sequence over the assembly using racon.
racon -t 2 $reads mapIT01.paf $assembly > mapIT01.fa
minimap -Sw5 -L100 -m0 mapIT01.fa $reads > mapIT02.paf
racon -t 2 $reads mapIT02.paf mapIT01.fa > mapIT02.fa
minimap -Sw5 -L100 -m0 mapIT02.fa $reads > mapIT03.paf
racon -t 2 $reads mapIT03.paf mapIT02.fa > mapIT03.fa
*From mummer: https://github.com/mummer4/mummer
dnadiff -p ref $refdata $datapath/ecoli-oxford3/ecoli.contigs.fasta
dnadiff -p one $refdata mapIT01.fa
dnadiff -p two $refdata mapIT02.fa
dnadiff -p three $refdata mapIT03.fa
Then look at the *.report files
less ref.report
Look at %Identity between runs but also look at the number of SNPs and indels reported. How is polishing changing this (already very good) assembly?
grep "TotalSNPs" *.report
grep "TotalIndels" *.report
pilon:https://github.com/broadinstitute/pilon **Another consensus caller like racon Tutorial: https://denbi-nanopore-training-course.readthedocs.io/en/latest/polishing/pilon/pilon.html
###WARNING: This fails on pilon for java reasons.### UNDER CONSTRUCTION
cp $assembly ./
aslocal=ecoli.contigs.fasta #or name of assembly contigs
bwa index $aslocal #call local copy of assembly file by name here because bwa needs write access
bwa mem -t 4 $aslocal $reads | samtools view - -Sb | samtools sort - -@4 -o WGS.sorted.bam
samtools index WGS.sorted.bam
#grab pilon jar file
wget https://github.com/broadinstitute/pilon/releases/download/v1.23/pilon-1.23.jar
#run java pilon to control memory and thread use
java -Xmx64G -jar pilon-1.23.jar --genome $aslocal --fix all --changes --bam WGS.sorted.bam --threads 4 --outdir pilonpolish
ls -lh pilonpolish
NanoPolish:https://github.com/jts/nanopolish **Seems to require access to the original fast5 files
Working in two groups attempt to set up the polishing protocol above for completed genome assemblies. Currently, as a group we have assemblies completed for Arabidopsis and pneumonia. Work on these as groups to examine the minimap/racon polishing.
Submit as group rmarkdown blog again this week.