ALLPATHS: de novo assembly of whole-genome shotgun microreads. Gene-boosted assembly of a novel bacterial genome from very short reads. We provide an initial, theoretical solution to the challenge of de novo assembly from whole-genome shotgun “microreads.” For 11 genomes of sizes up to 39 Mb, . An international, peer-reviewed genome sciences journal featuring outstanding original research that offers novel insights into the biology of all organisms.
|Published (Last):||3 July 2008|
New DNA sequencing technologies deliver data at dramatically lower costs but demand new analytical methods to take full advantage of the very short reads that they produce.
The bacterial genomes of Campylobacter jejuni and Escherichia coli assemble optimally, yielding single perfect contigs, and larger genomes yield assemblies that are highly connected and accurate.
Assemblies are presented in a graph form that retains intrinsic ambiguities such as those arising from polymorphism, thereby providing information that has been absent from previous genome assemblies. Larger genomes produce more complicated graphs, but the vast majority of the bases in their assemblies are present in long edges that are nearly always perfect.
We describe a general method for genome assembly that can be applied to all types of DNA sequence data, not only short read data, but also conventional sequence reads. These yield data suitable for straightforward mapping of biological features such as transcription factor binding sites and chromatin modifications Johnson et al. A much harder problem is de novo assembly of whole-genome shotgun microreads.
In this study, we present a theoretical analysis of this problem and describe an algorithm for addressing it, which we apply to simulated data based on real Solexa reads. We present results for small- to mid-size (up to 39 Mb) genomes, describing assembly completeness, continuity, and correctness. Briefly, the paper proceeds as follows: The very large number of mostly false overlaps makes this approach intractable. We determine exactly how good an assembly of such data could possibly be.
The answer is captured by a graph, allowing for alternatives in cases where the data lack power to determine the correct answer. Paired-read assembly turns out to be considerably more complicated than unpaired assembly, and although we cannot describe a simple answer as to its best possible result, we do describe an algorithm for it and a research software system, ALLPATHS, that instantiates this algorithm.
Importantly, an ALLPATHS assembly is presented as a graph that retains intrinsic ambiguities, arising from limitations of the data set and also from polymorphism in diploid genomes.
Thus, in principle, the assemblies capture exactly what can be known from the data. We have implemented this here for microreads. The same conceptual framework can apply to DNA sequence data of any type. The reference sequences used in this paper are described in the Supplemental material (Part a). The source code used in computations is provided with the paper (Supplemental material, Part b).
A direct approach to assembly would compare reads to each other, glue overlapping ones together, and thereby progressively agglomerate the genome. This is challenging because there are far too many overlaps between reads to compute, and most of these overlaps are wrong.
Table 1A shows that when the minimum allowed overlap K is small, the probability of gluing correctly along such overlaps is very low, but it can improve dramatically as K increases. The value of K is limited both by read length and by coverage. For a given K and a given genome, we show the mean number of perfect placements on the genome for a K-mer drawn at random from the genome, excluding the true placement.
This number is the expected ratio of false to true overlaps between reads overlapping by exactly K bases. Values were estimated using a sample size of 10^6. The procedure used to generate this table is in the Supplemental material (Part c).
To provide context for Table 1A, we also show the fraction of K-mers having a unique placement on the genome in Table 1B. For a given K and a given genome, we show the fraction of its K-mers that have a unique placement on the genome. Values were estimated using a sample size of 10^4. Finding all overlaps between microreads is also computationally very expensive because there are so many overlaps.
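The two statistics described for Tables 1A and 1B can be illustrated on a toy sequence. The sketch below is not the procedure from the Supplemental material: it assumes a linear genome, counts forward-strand placements only, and enumerates every K-mer exactly instead of sampling. The function name kmer_placement_stats and the example sequence are hypothetical.

```python
from collections import Counter

def kmer_placement_stats(genome, k):
    """Return (mean extra placements, fraction of K-mers placed uniquely).

    Assumptions (not from the paper): linear genome, forward strand only,
    exhaustive enumeration of all K-mer positions rather than sampling.
    """
    counts = Counter(genome[i:i + k] for i in range(len(genome) - k + 1))
    positions = sum(counts.values())
    # A K-mer drawn from a random position that occurs c times in total
    # has c - 1 placements besides its true one.
    mean_extra = sum(c * (c - 1) for c in counts.values()) / positions
    # Fraction of positions whose K-mer occurs exactly once.
    frac_unique = sum(1 for c in counts.values() if c == 1) / positions
    return mean_extra, frac_unique

# Toy sequence in which "CGT" occurs twice and every other 3-mer once.
mean_extra, frac_unique = kmer_placement_stats("AACGTCCCGTGG", 3)
print(mean_extra, frac_unique)
```

On a real genome the first value falls and the second rises as K grows, which is the trend the two tables describe.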
Reads are shorter than the long reads from Sanger-chemistry sequencing; thus, at the same level of coverage there will be more reads and hence more true overlaps. But the same level of coverage is not enough: One needs to raise coverage to get a usable minimum overlap K, and as one does this, the number of true overlaps rises further (increasing quadratically as coverage rises). To make matters worse, the true overlaps may be swamped by false overlaps (Table 1A).
In short, there are too many overlaps, and thus the standard assembly paradigm of finding all overlaps is unlikely to be the best approach for microreads. Setting aside the problem of how genomes might be assembled from microreads, we first describe how good an assembly could possibly be if it were based solely on unpaired reads. While the answer for unpaired reads is not simple, it is precisely computable from the genome. In the process of explaining how this is done, we introduce key concepts needed to assemble genomes from paired microreads.
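The quadratic growth of true overlaps can be checked with a back-of-envelope model. The sketch below assumes reads of equal length whose start positions are uniform and independent on a single strand of a repeat-free genome, and it ignores end effects. These are simplifying assumptions for illustration, not the paper's model, and the function name and all numbers in the example are hypothetical.

```python
def expected_true_overlaps(genome_len, read_len, coverage, k):
    """Expected number of read pairs that truly overlap by at least k bases.

    Assumptions: uniform independent read starts, one strand, equal read
    lengths, no repeats, genome ends ignored.
    """
    n_reads = coverage * genome_len / read_len
    # Two reads overlap by >= k bases when their starts differ by at most
    # read_len - k positions, in either direction.
    p_overlap = (2 * (read_len - k) + 1) / genome_len
    n_pairs = n_reads * (n_reads - 1) / 2
    return n_pairs * p_overlap

base = expected_true_overlaps(1_000_000, 30, 40, 20)
doubled = expected_true_overlaps(1_000_000, 30, 80, 20)
print(doubled / base)  # very close to 4: doubling coverage quadruples overlaps
```

Under this model the count is roughly coverage squared times genome_len times (read_len - k) divided by read_len squared, so for a fixed read length and K the dependence on coverage is quadratic.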
We will see in the next section that imperfect reads at high coverage will suffice, provided that the reads are longer.
Unipath graph of the 1. The genome was treated as linear to simplify computation. Each unipath is labeled with its number of copies (multiplicity) in the genome and with a letter to facilitate discussion. Formally, the graph also includes a reversed copy corresponding to the reverse-complemented sequence (data not shown). The middle horizontal edge represents a 6. This edge is present exactly because the reads are shorter than the repeat. If the reads were longer than 6.
This graph, along with edge sequences and multiplicities, represents exactly what can be known from the data. Figure 1 exhibits the unipath graph for the 1. The graph is simple, as is its relation to the genome: There are two ways to traverse the graph from beginning to end, one of which is correct and one of which is a misassembled version of the genome.
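A minimal sketch of how a unipath decomposition can be computed directly from a known sequence is given here for illustration. It treats unipaths as maximal unbranched runs of K-mers, works on the forward strand of a linear genome only, and is not the ALLPATHS implementation; the function name unipaths and the toy sequences are hypothetical.

```python
def unipaths(genome, k):
    """Return unipath sequences of a linear genome (forward strand only).

    Illustrative sketch: a unipath is taken to be a maximal run of K-mers
    in which each consecutive pair (a, b) has b as the only successor of a
    and a as the only predecessor of b.
    """
    kmers = [genome[i:i + k] for i in range(len(genome) - k + 1)]
    succ = {km: set() for km in kmers}
    pred = {km: set() for km in kmers}
    for a, b in zip(kmers, kmers[1:]):
        succ[a].add(b)
        pred[b].add(a)

    def joins(a, b):
        return succ[a] == {b} and pred[b] == {a}

    def extend(start, seen):
        seq, cur = start, start
        seen.add(start)
        while len(succ[cur]) == 1:
            nxt = next(iter(succ[cur]))
            if nxt in seen or not joins(cur, nxt):
                break
            seq += nxt[-1]  # each further K-mer adds one base
            seen.add(nxt)
            cur = nxt
        return seq

    seen, paths = set(), []
    for km in sorted(succ):
        # A K-mer starts a unipath unless it joins onto a predecessor.
        if not any(joins(p, km) for p in pred[km]):
            paths.append(extend(km, seen))
    for km in sorted(succ):
        # Any K-mers left over lie on isolated cycles.
        if km not in seen:
            paths.append(extend(km, seen))
    return paths

print(unipaths("AACGTCCCGTGG", 3))
```

In the toy sequence the 3-mer CGT occurs twice, so it becomes a short middle unipath with two ways in and two ways out, giving the same shape of ambiguity as the repeat edge described for Figure 1. A sequence with no repeated K-mer collapses to a single unipath.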
It is impossible to do better using unpaired reads unless one has reads longer than 6. The graph thus encodes exactly what can be known from the data: When the minimum overlap K is lowered to 20, we find instead that the unipath graph of C. The N50 size of the unipaths is 3.
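For readers unfamiliar with the N50 statistic used for unipath sizes, a short sketch follows. It uses the common definition (the largest length such that pieces of at least that length contain half or more of all bases); the function name and example lengths are hypothetical.

```python
def n50(lengths):
    """Largest length L such that pieces of length >= L hold at least half of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("lengths must be non-empty")

print(n50([2, 2, 2, 3, 3, 4, 8, 8]))  # 8: the two longest pieces hold 16 of 32 bases
```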
Raising K to 24 improves matters strikingly: We see that potential assembly quality is highly sensitive to the minimum overlap K, and hence to both read length and coverage. It reveals both what can be known from the data and what cannot be known. We set the same goal for assembly of reads, thus building a sequence graph that retains intrinsic ambiguities arising from polymorphism in the genome or the limited power of the data. If this is done correctly, errors should be exceedingly rare, and where there is uncertainty, the assembly will display the alternatives, rather than picking the one that is judged to be true.
We have not yet explained how unipaths may be computed from reads. Here we outline the basic ideas. At least for simulated reads (modeled on real data), these approximate unipaths closely match the true genomic unipaths. The only exception is that the first and last seven bases are missing; this is an artifact of linearizing a circular genome for this analysis. For paired reads, the assembly problem is far more complex.
But this represents an absolute limit, which is not necessarily achievable. To approach this limit, new paired-read assembly algorithms are needed. We then elaborate in subsequent sections. If the coverage is high enough, this approach is guaranteed to yield the correct path, that is, the true closure of the read pair. However, the approach will typically also yield other, incorrect paths. Our assembly method involves initially finding all of the paths, keeping track of them, and ultimately sorting out which one is right, where possible.
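The idea of finding every path that could close a read pair can be sketched as a bounded depth-first search over a small graph. This is a toy illustration under stated assumptions, not the ALLPATHS algorithm: nodes stand for unipaths, path length is approximated as the sum of node lengths (the K-1 base overlap between adjacent unipaths is ignored), and the graph, lengths, and function name are hypothetical.

```python
def find_closures(adj, length, start, end, min_len, max_len):
    """Enumerate all paths from start to end whose total length is in [min_len, max_len].

    adj maps a node to its successors; length maps a node to its size.
    Nodes may be revisited (repeats), and the search stops extending a
    path once it would exceed max_len, so it always terminates.
    """
    closures = []

    def walk(path, total):
        if path[-1] == end and min_len <= total <= max_len:
            closures.append(list(path))
        for nxt in adj.get(path[-1], []):
            if total + length[nxt] <= max_len:
                path.append(nxt)
                walk(path, total + length[nxt])
                path.pop()

    walk([start], length[start])
    return closures

# Hypothetical graph: repeat R can be followed by B (which loops back to R) or by C.
adj = {"A": ["R"], "R": ["B", "C"], "B": ["R"], "C": []}
length = {"A": 4, "R": 3, "B": 6, "C": 4}
for closure in find_closures(adj, length, "A", "C", 10, 25):
    print(closure)
```

With the wide bound of 10 to 25 two closures are reported, one correct and one not; narrowing the bound to 15 to 25 leaves a single closure. In practice the bounds would come from the fragment size distribution, for example the mean plus or minus a few SDs, which is why the number of closures depends on the library SD.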
We use pairs to group together most or all of the reads from a given region of the genome (sometimes accidentally including reads from other regions), then assemble each group separately, in an in silico analog of clone-by-clone sequencing.
The idea is to tile the genome by overlapping regions (even though we do not know the genome in advance), assemble each of these in turn, then glue all these local assemblies together to form one big assembly of the entire genome. The results of the algorithm depend on the variation in the size of the DNA fragments. Table 2 illustrates how the number of paths connecting a given read pair can vary, both across pairs and also as a function of the standard deviation (SD) in the size of the DNA fragment.
First, consider walking across read pairs using reads from the entire E. Number of read pair closures in E. The table shows the histogram of the number of closures found per read pair, for each of two choices of library SD, and for each of two strategies.
Rows give the nonoverlapping closure count ranges. In the first strategy, reads from the entire genome are used in the walk. In the second strategy, we picked kb regions and walked short fragments from them using only the reads within a given region. For strategy 2, we used randomly chosen kb regions and short fragments from each.
It is possible for a pair to be reported as having zero closures because whereas we searched for closures having no more than 3 SDs of stretch, the underlying distribution of fragments includes some that are stretched more.
It is also possible that zero closures could result from lack of coverage, although this would be a rare event. These read pairs having large numbers of closures pose a complex series of problems. First, the number of closures could easily be so large that it would be impossible to store them, let alone compute them. More importantly, even if we could compute all these closures, there would be no way to sort them out so as to ultimately yield a usable and relatively untangled final answer for the assembly.
Finally, since we cannot compute, store, or use such a gargantuan set of closures, the algorithms must be designed to terminate in the worst cases, thereby failing to return any closures at all. This will lead to holes in the final assembly, generally in the most repetitive places. Indeed, if we always found all paths, we would expect assemblies to have ambiguities rather than errors, in all cases.
One approach to this problem of too many closures is to localize the assembly, so that only reads from the correct region are used to construct closures.
Table 2 offers an optimistic preview of how well this might work: If we could localize, then the all paths problem would become much simpler. The right half of Table 2 reports the results that would be obtained if one could use only reads from a kb region containing the read pair. The unipath computation ignores the pairing of reads. Now with the unipaths and read pairs in hand, we are ready to localize.
The ideal seed unipaths are long and of low copy number (ideally one). Copy number is inferred from read coverage of the unipaths.
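Seed selection along these lines can be sketched as a simple filter. The rule below (copy number estimated as read depth divided by an assumed single-copy depth, rounded, and seeds required to have copy number one and a minimum length) is an illustrative assumption, not the criterion ALLPATHS uses; the function names, field names, and thresholds are hypothetical.

```python
def infer_copy_number(read_depth, single_copy_depth):
    """Estimate copy number from coverage, assuming depth scales with copy number."""
    return max(1, round(read_depth / single_copy_depth))

def pick_seeds(unipaths, single_copy_depth, min_length):
    """Return long, apparently single-copy unipaths, longest first."""
    seeds = [
        u for u in unipaths
        if u["length"] >= min_length
        and infer_copy_number(u["depth"], single_copy_depth) == 1
    ]
    return sorted(seeds, key=lambda u: u["length"], reverse=True)

# Hypothetical unipaths with lengths in bases and mean read depths.
candidates = [
    {"name": "a", "length": 5200, "depth": 41.0},
    {"name": "b", "length": 900, "depth": 39.0},    # too short
    {"name": "c", "length": 6100, "depth": 83.0},   # depth suggests two copies
    {"name": "d", "length": 7400, "depth": 38.5},
]
print([u["name"] for u in pick_seeds(candidates, 40.0, 2000)])  # ['d', 'a']
```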