0115 966 7955 Today's Opening Times 10:00 - 20:00 (BST)

Error-correction Methods for Next Generation Sequencing

Published: Last Edited:

Disclaimer: This essay has been submitted by a student. This is not an example of the work written by our professional essay writers. You can view samples of our professional work here.

Any opinions, findings, conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of UK Essays.

Introduction

The advent of high-throughput genomics and development in next generation sequencing technologies has led to an exponential growth in the identi ca-tion and storage of high quality sequence reads. Sequenced reads of such high quality lead to much more accurate results which are especially relevant in the elds of resequencing, denovo sequencing, metagenomics and gene ex-pression analysis. However, such results requires equally well developed error correction methods to account for the di erent types of biases that occur in the reads. While contemporary error correction techniques have been con-structed alongside these next generation sequencing methods, there has been a de nitive lack of a yardstick/benchmark to compare within and between the di erent error correction measures with respect to their respective advan-tages and disadvantages. A survey conducted in 2012 attempted to provide this benchmark by compiling a common test data set for the di erent error correction measures [1] . It also compared the di erent methods based on several criteria including quality, run-time, memory usage, ability to scale and ability to deal with nuances of real data such as quality scores and am-biguous nucleotides.

1

Next Generation Data Analysis

Across the last few years, several NGS technologies have been developed targeting cheap and accurate sequencing of reads. The prominent NGS plat-forms are 454 LifeSciences, Illumina, Applied Biosystems SOLiD, HeliScope, Complete Genomics, Paci c BioSciences and Ion Torrent. The properties of these platforms (as of February 2012) has been given below in gure 1.

Figure 1: Comparision of the most popular NGS platforms

Accumulation of high quality sequencing data generated using NGS plat-forms has resulted in a demand for e ective methods for analyzing such data. The nature of these methods largely vary based on the targeted application. Most of these applications can however improve the standard of results, com-putational demands or run-time performance by detecting and rectifying the errors and biases in the NGS data using a range of error correction methods.

The di erent types of biases that cause errors in NGS data are :

  1. Systematic Bias : Systematic bias occurs when there is a deviation of results or inferences from the truth due to variations in measurement; aws in study design; deviation of inferences, interpretations, or analyses based on awed data or data collection.
  1. Coverage Bias : Coverage bias can occur when population members do not appear in the sample frame (undercoverage). Coverage bias occurs when the observed value deviates from the population parameter due to di erences

2

between covered and non-covered units.

c) Batch E ects : Batch e ects are technical sources of errors that have been added to the samples during handling due to variations caused by dif-ferent people working on the samples or due to handling of di erent samples at di erent times.

While all of these e ects contribute to errors in NGS data, the major source of errors are sequencing errors (lack of de nite identi cation, inser-tion, deletion and substitution errors). It is also only possible to target these types of errors for correction once the reads have been sequenced. The oc-curence of these errors varies depending on the NGS platforms. Insertions and deletions are prevalent in 454 and Ion Torrent while substitution errors are predominant in Illumina for example. The initial popularity of Illumina initially led to a lot of algorithmic methods being developed which aimed at correcting substituion errors. With the other methods also gaining an equal foothold now however, there is a growing requirement for error correction methods which insertion and deletion errors are also identi ed and corrected.

3

Error-Correction Methods

The primary idea behind error-correction methods designed so far are based on the concept of haplotype genome sequencing. The basic idea behind this concept is that the base in a particular genomic location can be identi ed by comparing all overlapping reads that traverse over that position. Usually, errors do not occur across all reads simultaneously. Thus, by identifying the base that occurs across a majority of reads, a consensus sequence can be identi ed and all errors that occur in reads at that particular genomic loca-tion can be corrected. Higher coverage and read depth would imply better accuracy. Di erent versions of this concept has been applied in di erent al-gorithms. One version of this is to assume that every location is covered by multiple reads of length k (called k-mers). Other methods further develop this by performing a Multiple Sequence Alignment across the di erent se-quences that share the k-mers and correct the errors based on the alignment. Based on such di erences in the concept of the error correction methods, they are classi ed into three types :

  1. K-spectrum based
  1. Su x tree/array-based iii)MSA-based methods.

k-spectrum-based

The simplest version of error correction is the k-spectrum based method. Here, all the reads are decomposed into the k-spectrum which is the com-plete set of all k-mers that are contained within the reads. The assumption here is that k-mers within a small Hamming distance from each other are likely to belong to the same genomic location which is usually the case in NGS data sets predominantly containing substitution errors, . It is then possible to align all reads directly using the k-spectrum without performing MSA (which is a computationally intensive process). A consensus sequence is derived from the alignment. Constituent reads/k-mers with errors are then corrected by comparing to this consensus sequence. If the k-spectrum con-sists of k-mers from varied locations, this method fails being accurate. The

4

value of k therefore plays an important role in the functioning of methods based on this concept. There are several di erent softwares which are based on this method. A short read assembler constructed by Chaisson et al. con-tains a dynamic programming and heuristic scaling elements which converts all k-mers to solid k-mers within a given number of edit operations during the correction process.

Another software that makes use of this concept is the SOAPdenovo as-sembler.

Quake is a software which, in addition to the solid and insolid k-mers, incorporates nucleotide speci c miscall rates. It targetted Ilumina reads and was based on the assumption that the simulation was done in such a way that the errors produced during sequencing follow the speci c probabilities mentioned in the quality values. The work ow of Quake is given as the following two step process :

(i) Identify solid and insolid k-mers :

  1. Calculate the weight of a k-mer as the weighted sum of all its instances
  1. Model the weight histogram of solid k-mers as a mixture of Gaussian and Zeta distributions, and insolid k-mers as Gamma distribution.
  1. Choose a cut-o to di erentiate them.

(ii) Eliminate insolid k-mers by converting them solid k-mers

  1. Locate erroneous region in r using insolid k-mers. This is performed heuristically. If some insolid k-mers cover the 3'-end, trimming is applied.
  1. Greedily correct bases with low quality scores until all k-mers are solid. If not, r is considered not xable.

Another tool similar to Quake is Reptile , which similarly takes quality scores into account. Relative position of k-mer is taken into context to im-prove prediction.

5

  1. Tiles, or overlapping substrings, are formed from the decomposition of each read. The errors are then detected and corrected from 5' to 3' direction.
  1. As the distribution of errors in each read are not consistent, it is nec-essary to explore alternative decomposition of a read in order to limit the maximum number of errors in a tile. Each tile consists of a sequence of 2 non-overlapping or overlapping k-mers.
  1. This captures k-mer context information and helps in resolving am-biguities when a k-mer aligns equally well to di erent genomic locations.

A Hamming Graph is created where the k-mer are represented as nodes and the Hamming distance between them is given as the edge. All tiles which occur at a higher frequency than the given tile are considered to be candidates of replacement for the tile. These candidates are recognized by searching through the Hamming graph.

Hammer is a tool which also uses the concept of Hamming graphs. It makes use of spaced seeds to identify clusters of similar k-mers from which a consensus k-mer is generated and used for correcting errors.

Su x tree/array based

A variation of k-mer based approaches are the su x tree/array-based error-correction methods which expand on the k-mer-based approach by handling multiple k values and the corresponding threshold M. They di er from the k-mer approaches in their method of deriving the threshold value and in how the su xes in the input are stored. Su x arrays are comparitively more ef-cient than su x trees as they occupy lesser space, while permitting similar operations.

SHREC : In this method, a su x trie is created for the reads. A sub-string sk for each node k is created by linking the edge labels from the root to k. Frequency of occurence of sk corresponds to the number of leaves of the subtree rooted at k. The frequency of sk is assumed to be an outcome of Bernoulli trials, and is computed analytically. The last base of sk is consid-ered as an error if the frequency is lesser than a unit of variance. Errors in

6

the subtree at k are corrected by comparing with another error free subtree rooted at sibling v. If the two subtrees are identical, then k can be merged to v and relevant changes are made for all reads that were previously leaves under k. Otherwise, there might be more than one error in a read containing sk. Then, every read r under k is individually inspected. Multiple errors can be corrected by repeating the above procedure.

Built upon SHREC, Hybrid-SHREC further captures insertion/deletion errors. In accordance with the above notation, if the last base of sk is an insertion, then in the su x trie, k should be compared with the siblings of the parent of k. However, if on the other hand, a deletion occurs, k should be compared with the children of its sibling. This extension is able to capture up to one insertion/deletion error in a read per given iteration.

HiTEC : In this method, a su x array is created. Any erroneous base can be corrected by HiTEC only if this base is preceded by an error free k-mer. The goal while using HiTEC is to therefore derive proper values of k andM so that a satisfactory error-correction result can be achieved. HiTEC chooses k to minimize the total number of bases that cannot be corrected and the bases that can be potentially wrongly corrected. And similar to SHREC, M is chosen empirically based on the expected and observed k-mer occurrences.

MSA based

Coral is a MSA based error correction software which uses k-mers as seeds to locate and ascertain the reads on the unknown reference genome. The re-lationship between the k-mers and their source reads are recorded in a hash table. Using every read as a refernce, every other read sharing a common k-mer is aligned to it globally using the NeedlemanWunsch algorithm. Using this alignment, a consensus sequence is generated, which is in turn used as the new reference to be aligned with the remaining reads. Once the complete MSA is generated, corrections are applied if the number of reads involved in the MSA is neither too large nor too small, and the largest edit distance between any constituent read and the consensus of the MSA is relatively small. These are controlled by user speci ed parameters. This approach can be applied to both 454 reads and Illumina reads by controlling alignment penalties for edit operations. Insertion and deletion are considered in the

7

former case, whereas in the latter case, only substitution is allowed.

ECHO is another MSA based method which consists of two major stages: neighbor nding and maximum a posteriori error correction. In the rst stage, substitutions as compared to the reference genome are discerned. The input reads and their reverse complimentary strands are recorded in a k-mer hash table in the form of key-value pair. Similar to Coral, k-mers that occur with a high frequency in the data set are ignored. Next, for each k-mer, pairwise alignment is performed every pair of reads that contain the k-mer. For each selected pair , neighbor nding method is applied to a subset of in-put data that are randomly sampled. Then the observed coverage histogram is generated for this partial data and is approximated by a closest Poisson distribution.

In the second stage, a position dependent substitution matrix is estimated using an expectation maximization algorithm. Then, for each alignment po-sition derived from the rst stage, the consensus base is calculated to be the maximum a posteriori estimate using the base information in the alignment column and the corresponding quality score information.

8

Results And Discussion

Two distinct sets of test data were created for analysing the methods and establishing the benchmark for future comparision - one for Illumina and one for 454 and Ion Torrent. HSHREC, Reptile, SOAPec and Coral were able to generate complete results for the Illumina datasets. ECHO produced a sig-ni cant amount of intermediate data and failed to yield any results for most of the data sets. Quake failed on a few speci c data sets indicating that the data set has insu cient coverage over the reference genome. HiTEC failed with segmentation fault error for two of the data sets . For 454 and Ion Tor-rent data sets, only HSHREC and Coral generate complete results. Reptile has the best comparitive results for Illumina while Coral has the best results for 454 and Ion Torrent.

Error-correction methods have largely focussed on substitution errors so far due to Illumina dominating the market in next generation sequencing. The long reads and low error rates of 454 has made them less relevant tar-gets for error-correction even though they have been around for a long time. However, the growing importance of Ion Torrent, which produces short reads and predominantly makes insertion and deletion errors, has led to a change in focus.

Current error-correction methods do su er from many faults:

  1. Selection of the wrong parameters by the user could result in unusable results as the data is sensitive to choice of parameters.
  1. Almost all pre-existing stand-alone error-correction programs employ various versions of the same method - haplotype genome sequencing.
  1. Polymorphisms are often mistaken for sequencing errors with these methods, especially when rate of polymorphism is comparable to the rate of error.
  1. Extending error correction to non-uniformly sampled data sets and metagenomic data sets are important.
  1. While these methods work fairly well for single read experiments, none

9

of them account for paired read information.

vi) Novel methods also have to be developed in order to address correct-ing hybrid data sets of sequences derived from multiple platforms.

Such drawbacks have to be targetted and resolved in future measures and methods developed for error-correction in next generation sequencing.

References

  1. Xiao Yang, Sriram P. Chockalingam, and Srinivas Aluru. A survey of error-correction methods for next-generation sequencing. Brief Bioin-form, 14(1):56{66, Jan 2013.

10


To export a reference to this article please select a referencing stye below:

Reference Copied to Clipboard.
Reference Copied to Clipboard.
Reference Copied to Clipboard.
Reference Copied to Clipboard.
Reference Copied to Clipboard.
Reference Copied to Clipboard.
Reference Copied to Clipboard.

Request Removal

If you are the original writer of this essay and no longer wish to have the essay published on the UK Essays website then please click on the link below to request removal:


More from UK Essays

We can help with your essay
Find out more
Build Time: 0.0079 Seconds