Based on our testing Kraken 2 was a good balance of time and space efficiency. Today we will experiment with Kraken2 further by building a custom database to search against.
Guidance on using Kraken2 can be found in the Manual
Also, the Kraken2 paper is a good resource that shows how Kraken2 performs relative to related programs.
First, we must download the RefSeq plastid genomes database.
mkdir kraken
cd kraken
#1) Download Plastid reference data:
wget ftp://ftp.ncbi.nlm.nih.gov/refseq/release/plastid/plastid.1.1.genomic.fna.gz
wget ftp://ftp.ncbi.nlm.nih.gov/refseq/release/plastid/plastid.2.1.genomic.fna.gz
wget ftp://ftp.ncbi.nlm.nih.gov/refseq/release/plastid/plastid.3.1.genomic.fna.gz
#2) Then, these files need to be unzipped (uncompressed).
gunzip *.fna.gz
#3) Name your database
DBNAME='plastid_custom'
#4) Get the NCBI taxonomy files:
kraken2-build --download-taxonomy --db $DBNAME
#5) Add custom reference data
kraken2-build --add-to-library plastid.1.1.genomic.fna --db $DBNAME
kraken2-build --add-to-library plastid.2.1.genomic.fna --db $DBNAME
kraken2-build --add-to-library plastid.3.1.genomic.fna --db $DBNAME
#6) Finalize the database
kraken2-build --build --db $DBNAME
Now we can test our custom database using our MATK eDNA sample file.
Recall that last time we queried the entire ‘nt’ database using:
kdb=/usr/share/data/krakendb/nt-custom/
targdata=/usr/share/data/BIO331/MATK-1RKIM/small/barcode11.s.fastq
kraken2 --db $kdb --threads 1 --use-names --report kreport.tab $targdata > kraken.out
All we need to do is substitute the path to our custom database.
Make sure working directory is set to the ‘kraken’ folder
kdb=plastid_custom
targdata=/usr/share/data/BIO331/MATK-1RKIM/small/barcode11.s.fastq
kraken2 --db $kdb --threads 1 --use-names --report kreport.tab $targdata > kraken.out
TIP: Check out the quick-reference guide on ggplot2 from RStudio
Use ggplot to show the # of reads mapping to genera.
library(ggplot2)
library(taxonomizr)
kreport=read.delim('kraken/kreport.tab', sep ='\t', header=F)
#preview
head(kreport)
#how many rows?
nrow(kreport)
## [1] 673
#add column names to the kreport table
colnames(kreport) = c('percent', 'readsRooted', 'reads', 'taxRank', 'taxID', 'sciName')
fam.kreport = subset(kreport, kreport$taxRank=='F')
ggplot(data=fam.kreport) +
geom_bar(aes(x=sciName, y = readsRooted), stat='identity')
How can we improve upon this graph? What other ways might we represent these data?
Try:
x <- transform(fam.kreport, variable=reorder(sciName, -readsRooted) )
x = x[order(-x$readsRooted),]
ggplot(data=x[1:5,]) +
geom_bar(aes(x=sciName, y = readsRooted), stat='identity') +
theme(axis.text.x=element_text(angle=90))
Also try:
ggplot(data=kreport, aes(x=log(readsRooted), y=log(percent))) +
geom_point() +
geom_text(aes(label=sciName),hjust=0, vjust=0, size=1)
# Homework
For next class: Using files for barcode10, barcode11, barcode12, or none.fastq, create a graph showing the read counts for the Kraken output based on the entire ‘nt’ database and one based on the plastid custom database. Post to #plots and discuss any differences.