Kraken 2

Based on our testing Kraken 2 was a good balance of time and space efficiency. Today we will experiment with Kraken2 further by building a custom database to search against.

Guidance on using Kraken2 can be found in the Manual

Also, the Kraken2 paper is a good resource that shows how Kraken2 performs relative to related programs.

Custom DB: Plastid genomes

First, we must download the RefSeq plastid genomes database.

mkdir kraken
cd kraken

#1) Download Plastid reference data:
wget ftp://ftp.ncbi.nlm.nih.gov/refseq/release/plastid/plastid.1.1.genomic.fna.gz
wget ftp://ftp.ncbi.nlm.nih.gov/refseq/release/plastid/plastid.2.1.genomic.fna.gz
wget ftp://ftp.ncbi.nlm.nih.gov/refseq/release/plastid/plastid.3.1.genomic.fna.gz


#2) Then, these files need to be unzipped (uncompressed).

gunzip *.fna.gz

#3) Name your database
DBNAME='plastid_custom'

#4) Get the NCBI taxonomy files:
kraken2-build --download-taxonomy --db $DBNAME

#5) Add custom reference data
kraken2-build --add-to-library plastid.1.1.genomic.fna --db $DBNAME
kraken2-build --add-to-library plastid.2.1.genomic.fna --db $DBNAME
kraken2-build --add-to-library plastid.3.1.genomic.fna --db $DBNAME

#6) Finalize the database
kraken2-build --build --db $DBNAME

Query the custom DB

Now we can test our custom database using our MATK eDNA sample file.

Recall that last time we queried the entire ‘nt’ database using:

kdb=/usr/share/data/krakendb/nt-custom/
targdata=/usr/share/data/BIO331/MATK-1RKIM/small/barcode11.s.fastq

kraken2 --db $kdb  --threads 1 --use-names --report kreport.tab $targdata > kraken.out

All we need to do is substitute the path to our custom database.

Make sure working directory is set to the ‘kraken’ folder

kdb=plastid_custom
targdata=/usr/share/data/BIO331/MATK-1RKIM/small/barcode11.s.fastq

kraken2 --db $kdb  --threads 1 --use-names --report kreport.tab $targdata > kraken.out

Visualizing Kraken report files

TIP: Check out the quick-reference guide on ggplot2 from RStudio

Use ggplot to show the # of reads mapping to genera.

library(ggplot2)
library(taxonomizr)
kreport=read.delim('kraken/kreport.tab', sep ='\t', header=F)

#preview
head(kreport)
#how many rows?
nrow(kreport)
## [1] 673
#add column names to the kreport table
colnames(kreport) = c('percent', 'readsRooted', 'reads', 'taxRank', 'taxID', 'sciName')

fam.kreport = subset(kreport, kreport$taxRank=='F')

ggplot(data=fam.kreport) +
  geom_bar(aes(x=sciName, y = readsRooted), stat='identity') 

How can we improve upon this graph? What other ways might we represent these data?

  • experiment with x-axis label alignment
  • order x axis??
  • theme_bw() or theme_linedraw()
  • plot ‘reads’ or ‘percent’
  • plot by family rather than genus to smooth over apparent false positives?

Try:

x <- transform(fam.kreport, variable=reorder(sciName, -readsRooted) ) 
x = x[order(-x$readsRooted),]


ggplot(data=x[1:5,]) +
  geom_bar(aes(x=sciName, y = readsRooted), stat='identity') +
  theme(axis.text.x=element_text(angle=90))

Also try:

ggplot(data=kreport, aes(x=log(readsRooted), y=log(percent))) +
  geom_point() + 
  geom_text(aes(label=sciName),hjust=0, vjust=0, size=1)

# Homework

For next class: Using files for barcode10, barcode11, barcode12, or none.fastq, create a graph showing the read counts for the Kraken output based on the entire ‘nt’ database and one based on the plastid custom database. Post to #plots and discuss any differences.