# Graph Algorithms For Assembly Next Generation Sequencing Data Biology Essay

Published:

This essay has been submitted by a student. This is not an example of the work written by our professional essay writers.

The advent of next generation sequencing technologies, an experimental technique used for obtaining information about a genomes sequence, whereby it is broken up into many short and possibly overlapping segments, whose sequence is then determined, make it possible to extend the scope of genome studies to a point previously unimaginable. The genome of an organism offers great insight into its phylogenetic history, interaction with the environment, and internal function; thus, assembly continues to play a central role in biology for the sequencing of novel species.

What is an assembly?

An assembly is a hierarchical data structure that maps the sequence data to a putative reconstruction of the target. All sequencers produce observations of the target DNA molecule in the form of reads, which are sequences of single-letters {A,T,G,C}. One basic assumption made by assembly algorithms is that every error-free read in the input must be present in the original genome, since it was sequenced from the genome. Some methods made another, less justifiable assumption: that the original genome should be the shortest sequence that contains every read as a sub-string. These assumptions lead to the casting of the genome assembly problem as the Shortest Common Superstring (SCS) problem, which is known to be NP-hard. The problem of modeling genome assembly as the SCS problem is that most genomes have repeats (identical, or nearly identical, stretches of DNA), while the SCS solution represents each of these repeats only once in the assembled genome. This problem is known as over-collapsing the repeats.

Graph modeling: Genome Graphs

There have been several approaches to addressing this problem, but among the most successful have been those based on graphs. A genome can be represented in the form of a graph, known as a genome graph. The vertices of these graphs represent either the reads or their substrings (k-mers expressed by various combinations of the letters A,T,G and C) and the edges represent overlaps between them (the prefix of one read is the suffix of the other). By walking along the edges of the graph, one can spell a string and in such a way reconstruct the genome. However, graphs are very sensitive to repeats and errors in sequencing reads, which leads a graph to converge and have unexpected branching which increases graph complexity, leading to tangles that are difficult to resolve. In the graph context, assembly is a graph reduction problem to simplify the graph.

Analysis

Paths through the graph are the potential contigs (DNA sequence reconstructed from a set of overlapping DNA segments), and paths can be converted to sequence. Paths may have mirror images representing the reverse complement sequence. There are two ways to force paths to obey the semantics of double-stranded DNA. If the graph has separate nodes for read ends, then paths must exit the opposite end of the read they enter. If the graph has separate edges for the forward and reverse strands, then paths must exit a node on the same strand they enter. The next generation sequencing assemblers can be organized into 2 groups based on the type of graph algorithm they use, viz. The Overlap / Layout / Consensus methods, which rely on an overlap graph & The de Bruijn Graph method, which use some form of K-mer graphs.

Figure 1: Next Generation Sequencers provide sequence reads which require assembly for the digital reconstruction of genome for further analytical studies on that genome. The figure explains how the Overlap Layout Consensus and de Bruijn graphs work in the reconstruction of a genome.

Assembly is complicated by the computational complexity of processing larger volumes of data. For efficiency, all assembly software relies to some extent on the notion of a K-mer. This is a sequence of K base calls, where K is any positive integer. In most implementations, only consecutive bases are used. Intuitively, reads with high sequence similarity must share K-mers in their overlapping regions, and shared K-mers are generally easier to find than overlaps. Fast detection of shared K-mer content vastly reduces the computational cost of assembly, especially compared to all-against-all pairwise sequence alignment. A tradeoff of K-mer based algorithms is lower sensitivity, thus missing some true overlaps. The probability that a true overlap spans shared K-mers depends on the value of K, the length of the overlap, and the rate of error in the reads. An appropriate value of K should be large enough that most false overlaps don't share K-mers by chance, and small enough that most true overlaps do share K-mers. The choice should be robust to variation in read coverage and accuracy. Assembly algorithms, and their implementations, are typically complex. Assembly operation can require high-performance computing platforms for large genomes. Algorithmic success can depend on pragmatic engineering and heuristics, that is, empirically derived rules of thumb. Heuristics help overcome convoluted repeat patterns in real genomes, random and systematic error in real data, and the physical limitations of real computers.

Overlap Layout Consensus Approach:

An overlap graph represents the sequencing reads and their overlaps. The overlaps must be pre-computed by a series of (computationally expensive) pair-wise sequence alignments. Conceptually, the graph has nodes to represent the reads and edges to represent overlaps. In practice, the graph might have distinct elements or attributes to distinguish the 5' and 3' ends of reads, the forward and reverse complement sequences of reads, the lengths of reads, the lengths of overlaps, and the type of overlap (suffix-to-prefix or containment). Paths through the graph are the potential contigs, and paths can be converted to sequence. Paths may have mirror images representing the reverse complement sequence. There are two ways to force paths to obey the semantics of double-stranded DNA. If the graph has separate nodes for read ends, then paths must exit the opposite end of the read they enter. If the graph has separate edges for the forward and reverse strands, then paths must exit a node on the same strand they enter.

de Bruijn graph approach:

The de Bruijn graph was developed outside the realm of DNA sequencing to represent strings from a finite alphabet. The nodes represent all possible fixed-length strings. The edges represent suffix to- prefix perfect overlaps. A K-mer graph is a form of de Bruijn graph. Its nodes represent all the fixed-length subsequences drawn from a larger sequence. Its edges represent all the fixed-length overlaps between subsequences that were consecutive in the larger sequence. In its application to assembly, the K-mer graph represents the input reads. Each read induces a path. Reads with perfect overlaps induce a common path. Thus, perfect overlaps are detected implicitly without any pair-wise sequence alignment calculation. Compared to overlap graphs, K-mer graphs are more sensitive to repeats and sequencing errors. Paths in overlap graphs converge at repeats longer than a read, but paths in K-mer graphs converge at perfect repeats of length K or more, and K must be less than the read length. Each single-base sequencing error induces up to K false nodes in the K-mer graph. Each false node has a chance of matching some other node and thereby inducing a false convergence of paths.

Figure 2: OLC approach (L) v/s de Bruijn graph approach (R) using K-mer graphs

(L) A pair-wise overlap represented by a K-mer graph. (a) Two reads have an errorfree overlap of 4 bases. (b) One K-mer graph, with K=4, represents both reads. The pair-wise alignment is a by-product of the graph construction. (c) The simple path through the graph implies a contig whose consensus sequence is easily reconstructed from the path.

(R) A read represented by K-mer graphs. (a) The read is represented by two types of K-mer graph with K=4. Larger values of K are used for real data. (b) The graph has a node for every K-mer in the read plus a directed edge for every pair of K-mers that overlap by K-1 bases in the read. (c) An equivalent graph has an edge for every K-mer in the read and the nodes implicitly represent overlaps of K-1 bases. In these examples, the paths are simple because the value K=4 is larger than the 2 bp repeats in the read. The read sequence is easily reconstructed from the path in either graph.

Plan for finishing work

The success of both the de Bruijn and Overlap graph models in practice indicates that by defining a more restricted graph model that nevertheless covers most actual genomes, we could develop more efficient and accurate algorithms for genome assembly. The analysis of the comparative study done so far partially explain the need for heuristics for assembly algorithms in current frameworks. Another reason is that the de Bruijn and overlap graphs do not explicitly model the double-stranded nature of DNA, which is instead treated heuristically by current assemblers. For the completion of this research project I plan to show that this limitation can be overcome by extending the model without making the algorithmic tasks more difficult.