Prof. Harbert
05 February, 2019
Understand the division between long and short read DNA sequencing technology.
Conceptualize the scale of the difference involved in mapping long vs. short reads.
Discuss how both are used for hybrid assembly of genomes.
“Short read” technology includes platforms such as Illumina and Ion Torrent.
“Long read” technology includes Pacific Biosciences (PacBio) and Oxford Nanopore (ONT).
*Long read technology is sometimes called “Third Generation Sequencing” and is sometimes separated into third and fourth generations.
Short read tech tends to have fixed read lengths of 50-150bp depending on the platform.
What qualifies as long? That depends on when you ask. A safe answer might be 2,000-10,000bp, but people are working to get longer and longer reads using careful laboratory and sequencing protocols.
Some features of genomes are difficult to “see” with short reads: repetitive regions, duplicated genes, and large structural rearrangements.
We can visualize general trends with simulated matching of random DNA segments of variable lengths to a genome…
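Below is a toy, scaled-down sketch of that kind of simulation in R (a 100kb random “genome” instead of 4.1Mb, and one random query per length, purely for illustration):

```r
# Toy simulation: how often does a random query of length n occur in a
# random "genome"? (Scaled down to 100kb so it runs quickly.)
set.seed(1)
bases  <- c("A", "C", "G", "T")
genome <- paste(sample(bases, 1e5, replace = TRUE), collapse = "")

matches_for_length <- function(n) {
  query <- paste(sample(bases, n, replace = TRUE), collapse = "")
  hits  <- gregexpr(query, genome, fixed = TRUE)[[1]]
  if (hits[1] == -1) 0 else length(hits)  # gregexpr returns -1 for no match
}

sapply(c(4, 8, 12, 16), matches_for_length)  # match counts fall off quickly with n
```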
The E. coli genome is ~4,100,000bp (4.1Mb) long.
If we assume this is made up of randomly generated DNA, what is the probability that any randomly selected base is an “A”? What about the segment “ATCTA”? Or “ATCTATAT”?
For random sequence (no bias in base identity) we can express the probability of a random sequence occurring as:
\[ P(seq) = 0.25^n \]
Where 'n' is the length of the nucleotide sequence string.
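As a quick check, the probabilities for the example strings above can be computed directly (a minimal sketch in R, assuming equal base frequencies):

```r
# P(seq) = 0.25^n, where n is the length of the sequence
p_seq <- function(seq) 0.25^nchar(seq)

p_seq("A")         # 0.25
p_seq("ATCTA")     # 0.25^5 ~ 9.8e-04
p_seq("ATCTATAT")  # 0.25^8 ~ 1.5e-05
```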
How often would a specific sequence be produced by drawing a random string?
But we have a lot of chances to select any given string in 4,100,000bp…
The probability of matching a random nucleotide string to a genome can be represented as:
\[ P(Match) = S \times 0.25^n \]
Where S is the size of the genome and n is the size of the nucleotide string.
*NOTE: The horizontal line represents P(Match) = 1. How should you interpret a probability of 1?
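A sketch of how a plot like this might be drawn in R (the genome size and formula are from above; the range of n shown is just a choice for readability):

```r
# P(Match) = S * 0.25^n for a ~4.1Mb genome, over a range of read lengths
S <- 4.1e6
n <- 10:30
p_match <- S * 0.25^n

plot(n, p_match, type = "b",
     xlab = "Read length n (bp)", ylab = "P(Match)")
abline(h = 1, lty = 2)  # horizontal line at P(Match) = 1
```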
But what about the probability of correctly matching a whole sequencing experiment of, let's say 5 million reads between 50-150bp?
A typical short-read experiment produces ~5 million reads of 50-150bp each.
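One rough way to think about it: if each read behaves independently under the same random-sequence model (an assumption made here for illustration, not a property of real data), the expected number of purely spurious matches in the whole experiment is roughly (number of reads) × S × 0.25^n. A minimal back-of-the-envelope sketch:

```r
# Expected number of purely random (spurious) matches across an experiment
# of 5 million reads, under the same simple random-sequence model
S     <- 4.1e6          # genome size (bp)
reads <- 5e6            # number of reads
n     <- c(50, 100, 150)

setNames(reads * S * 0.25^n, paste0(n, "bp"))
# ~1.6e-17 at 50bp, and astronomically smaller at 100-150bp
```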
Using a log scale, we can zoom out to look at order-of-magnitude differences for longer reads.
NOTE: R runs out of numbers to describe how close P(Match) gets to 0 once n passes ~500: the values fall below the smallest number double-precision floating point can represent and are reported as 0.
For datasets composed of reads longer than ~500bp, mapping against our 4.1Mb E. coli genome should place reads correctly with near-certainty.
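One way to sidestep the underflow noted above is to work on the log10 scale directly, so the plotted quantity never has to be stored as a raw probability (a minimal sketch):

```r
# 0.25^n eventually underflows to 0 in double precision
0.25^600                 # reported as 0

# ...but log10(P(Match)) = log10(S) + n * log10(0.25) stays computable
S <- 4.1e6
n <- seq(50, 2000, by = 50)
log10_p_match <- log10(S) + n * log10(0.25)

plot(n, log10_p_match, type = "l",
     xlab = "Read length n (bp)", ylab = "log10 P(Match)")
```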
Are genomes random?
How do new genes arise?
What makes up most of the “junk” (unused) DNA in a genome?
How does this scale with larger genomes? Larger samples?
Any factor that creates repeated patterns in a genome (duplication, horizontal transfer, recombination) will decrease the certainty with which we can place a sequence read.
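A tiny made-up example of that ambiguity (the sequences are invented for illustration): a read drawn from a repeated segment matches the genome in more than one place, so there is no single correct placement.

```r
# A read drawn from a repeated segment maps to more than one location
repeat_unit <- "ATCGGATCCA"                       # invented 10bp repeat
genome <- paste0("TTGAC", repeat_unit, "CCGTA", repeat_unit, "AGGTT")

hits <- gregexpr(repeat_unit, genome, fixed = TRUE)[[1]]
length(hits)       # 2 equally good placements
as.integer(hits)   # start positions 6 and 21
```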
In practice, genomes are now often sequenced with both long and short reads.
The long reads provide the genome structure and the short reads are used to correct base-identity errors, producing a more complete genome that maintains high accuracy.
Check eLearn for the ONT Plant Genome white paper. Also, read “Assembly of chloroplast genomes with long- and short-read data: a comparison of approaches using Eucalyptus pauciflora as a test case.”