One step that is often overlooked in the analysis of high-throughput sequencing data is checking for contaminating sequences. Samples often contain trace amounts of off-target DNA because most biological samples are not purely of a single origin. Bacterial cultures may be contaminated with viruses or other bacteria. Human tissues will almost certainly include viral and bacterial DNA. With low-DNA samples from other organisms, contamination often comes from the technician doing the sequencing or from the laboratory environment. Lastly, lab reagents and water sometimes carry their own microbial contaminants.
To prevent these contaminating sequences from disrupting downstream analyses it can be useful to classify your sequence reads to identify the probable origin of each read. We will look at two programs for this process: Centrifuge and Kraken.
A common method for identifying the taxonomic source of a DNA sequence is to compare your reads against reference databases of known genomes. I recommend Centrifuge and Kraken (now Kraken2) for rapid profiling of your data.
Kraken2 is the new (late 2018) version of a popular classification tool that operates using k-mer matching. Kraken builds a reference database of short nucleotide subsequences of length ‘k’, known as k-mers (think dimers, oligomers -> ‘k-mers’). Kraken then classifies your reads by looking for exact matches between the k-mers in each read and the k-mers in the database. The specificity of this analysis increases with k-mer length, because longer sequences are more likely to be unique to one or a few taxa.
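To make the k-mer idea concrete, here is a minimal sketch in shell that lists every overlapping k-mer in a sequence. The function name `kmers` and the toy sequence are made up for illustration; this is not how Kraken itself is implemented, only a picture of the units it matches on.

```shell
# kmers SEQUENCE K
# Print every overlapping k-mer of SEQUENCE, one per line.
# (Hypothetical helper for illustration; not part of Kraken.)
kmers() {
    awk -v seq="$1" -v k="$2" 'BEGIN {
        # slide a window of width k along the sequence, one base at a time
        for (i = 1; i + k - 1 <= length(seq); i++)
            print substr(seq, i, k)
    }'
}

# Toy example with a made-up 6-base sequence
kmers ATGCGA 3
```

A sequence of length L contains L - k + 1 overlapping k-mers, so the 6-base toy sequence gives four 3-mers: ATG, TGC, GCG and CGA.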
# Bash variables for:
# Kraken database
minikraken=/usr/share/data/minikraken2
# input reads (two FASTQ files)
targdata1=/var/lib/MinKNOW/data/Ecoli_clean/1/20190315_1851_MN30146_FAK49314_0702fe4b/fastq_pass/FAK49314_d964c358d30ef21efeb534d90bcb33256e7a6827_0.fastq
targdata2=/var/lib/MinKNOW/data/Ecoli_clean2/1/20190315_2017_MN30146_FAK49314_66d15b21/fastq_pass/FAK49314_61fa75bc4882f54c466f9385d284144ffb521dcb_0.fastq
# combine both files into a single input file
cat "$targdata1" "$targdata2" > input.fq
targdata=input.fq
# classify the reads: per-read results go to kraken.out, the summary to kreport.tab
kraken2 --db "$minikraken" --threads 12 --use-names --report kreport.tab --fastq-input "$targdata" > kraken.out
# page through the report
less kreport.tab
# OR search it with grep, e.g., for eukaryotic hits:
grep "Eukar" kreport.tab
Tip: change the paths in the variables $targdata1 and $targdata2 to point to your own data files.
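The Kraken2 report is a tab-separated table, so you can pull out the taxa that matter with awk rather than scrolling through it. The sketch below assumes the standard six-column report layout (percent of reads in the clade, clade read count, reads assigned directly to the taxon, rank code, NCBI taxonomy ID, indented name). `kreport_filter` is a hypothetical helper name, not part of Kraken.

```shell
# kreport_filter REPORT RANK MIN_PERCENT
# Print percent, clade read count and name for every taxon with the given
# rank code (e.g. S = species, G = genus, D = domain) whose clade holds at
# least MIN_PERCENT of all reads.
# (Hypothetical helper; assumes the six-column Kraken2 report layout.)
kreport_filter() {
    awk -F '\t' -v rank="$2" -v min="$3" '
        $4 == rank && $1 + 0 >= min + 0 {
            name = $6
            sub(/^ +/, "", name)    # names are indented by taxonomic depth
            printf "%.2f\t%d\t%s\n", $1, $2, name
        }' "$1"
}

# Example call (assumes kreport.tab exists in the current directory):
# kreport_filter kreport.tab S 1
```

Any species other than your target organism that appears above the threshold is a candidate contaminant worth a closer look.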
Centrifuge works by building a compact index of the reference data, based on the Burrows-Wheeler transform, and aligning each read in your dataset to that index. The best match(es) are reported for each read, so you can see how many reads, and which ones, match each group of organisms. Centrifuge differs from Kraken in that it performs explicit alignments of the sequence reads. These alignments allow for both mutations and gaps between the reference and test data, which may produce somewhat higher accuracy.
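# Centrifuge index (path prefix shared by the index files)
centdb=/usr/share/data/centstandard/abv
# classify the reads: per-read results go to out.centclass, the summary to creport.tab
centrifuge -p 2 -q -x "$centdb" -U "$targdata" -S out.centclass --report-file creport.tab --out-fmt tab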
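The Centrifuge report can be summarized the same way as the Kraken one. The sketch below assumes the report file has a single header line followed by tab-separated columns in the order name, taxID, taxRank, genomeSize, numReads, numUniqueReads, abundance. `creport_top` is a hypothetical helper name, not part of Centrifuge.

```shell
# creport_top REPORT [N]
# Print name, rank, numReads and numUniqueReads for the N taxa with the
# most reads (default 10), highest first.
# (Hypothetical helper; assumes the seven-column Centrifuge report layout.)
creport_top() {
    tab=$(printf '\t')
    tail -n +2 "$1" |                 # skip the header line
        sort -t "$tab" -k5,5nr |      # numeric, descending, on numReads
        head -n "${2:-10}" |
        cut -f 1,3,5,6
}

# Example call (assumes creport.tab exists in the current directory):
# creport_top creport.tab 5
```

Comparing numReads with numUniqueReads is a quick sanity check: a taxon supported mostly by reads that also match other taxa is weaker evidence of contamination than one supported by many unique reads.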
If your data contain a significant proportion of contaminants, it may be worth filtering reads by Kraken or Centrifuge classification before taking any next steps. Look at the options for each program to see how you might do this:
centrifuge -h
kraken2 -h
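Among the options that kraken2 -h lists are --classified-out and --unclassified-out, which write classified and unclassified reads to separate files. If you instead want to drop only the reads assigned to one particular taxon, one possible approach is sketched below. It assumes kraken.out was produced with --use-names (so column 3 holds the taxon name), that read IDs are the first word of each FASTQ header, and that every FASTQ record is exactly four lines. `kraken_ids` and `fastq_drop` are hypothetical helper names, not part of either program.

```shell
# kraken_ids KRAKEN_OUT PATTERN
# Print the IDs of classified reads whose assigned taxon (column 3)
# matches the regular expression PATTERN.
kraken_ids() {
    awk -F '\t' -v pat="$2" '$1 == "C" && $3 ~ pat { print $2 }' "$1"
}

# fastq_drop IDS FASTQ
# Print every FASTQ record whose read ID is NOT listed in the file IDS.
fastq_drop() {
    awk -v ids="$1" '
        BEGIN { while ((getline line < ids) > 0) drop[line] }
        # header line of each 4-line record: strip "@" and look up the ID
        NR % 4 == 1 { keep = !(substr($1, 2) in drop) }
        keep' "$2"
}

# Example calls (assume kraken.out and input.fq from the steps above):
# kraken_ids kraken.out 'Homo sapiens' > human_ids.txt
# fastq_drop human_ids.txt input.fq > input.filtered.fq
```

Keeping the ID list as a separate file makes it easy to check how many reads you are about to remove before you actually filter anything.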
Write a blog post with details about how to run both Kraken and Centrifuge on your data. Describe and/or show the output in the report files generated. Make careful note of significant possible contamination.