Ensuring All Stages Pipelining and Accuracy in PASQUAL

4th Apr 2018 Computer Science

Nachiket D. More

Abstract

A genome is the genetic material of an organism. It is encoded in DNA or, for many kinds of viruses, in RNA, and it contains both the coding and the non-coding parts of the DNA/RNA. Genomes have now been sequenced for a wide range of animals, viruses, and bacteria. This data is used mostly in medical research and to help predict diseases such as cancer, HIV, and many more.

Sequencing machines output short overlapping substrings of the genome, called reads. The number of reads is very large, which makes them hard to manipulate, store, and maintain. Sequence assembly reconstructs the long, continuous genome sequence from these reads. Assembly software for Next Generation Sequencing (NGS) must be accurate and fast, and must consume little memory.

PASQUAL is a tool for fast NGS genome assembly. To address the challenges of NGS assembly, PASQUAL uses parallel algorithms and compressed data structures. It delivers faster execution, lower memory consumption, and better solution quality.

Keywords – Parallel algorithm, parallel suffix array construction, high performance bioinformatics, de novo sequence assembly, shared memory parallelism, DNA sequence, genome assembly.

 

1. Introduction

The term “genome” refers to the instruction set of a cell, that is, its genetic material. A genome consists of one or more chromosomes. Chromosomes consist of deoxyribonucleic acid (DNA); the genomes of many viruses consist of ribonucleic acid (RNA) instead. DNA is built from simple units called nucleotides (nt), of which there are four types: A, C, G, and T. The start and end of a sequence are denoted 5’ and 3’ respectively.

Deducing the order of the nucleotides in a cell's DNA and encoding it as a string of letters is called DNA sequencing. The process cannot read a whole sequence continuously, so it breaks the DNA molecules into small parts, which serve as templates in a chemical reaction that produces short sub-sequences called reads. The major problem is to reconstruct the original genome sequence from the reads, and genome assembly algorithms are used for this purpose. A genome assembly goes through many automated rounds of improvement, but it is also inspected and edited by specialists. The long contiguous sequences assembled from reads are called contigs.

Genome sequencing is the process of reading the sequence of base pairs (bp). An organism's genome consists of base pairs formed by complementary bases on the two strands of the DNA. Sequencing is central to the study of genomes in bioinformatics. No current sequencing method can read a whole genome in one pass; Whole-Genome Shotgun (WGS) sequencing machines instead produce short overlapping reads. De novo assembly reconstructs the original sequence without the aid of a reference sequence, which is why it is used in PASQUAL.

Next Generation Sequencing (NGS) technologies generate a large number of reads in a short time, which greatly reduces the experimental cost per base. They make it possible to study organisms at the genome level and to gain a deeper understanding of biological mechanisms and genome regulation. Rapid genome sequencing also helps researchers study the evolution of viruses and bacteria, which adapt their behavior easily and can mutate at every step of reproduction.

2. Next Generation Sequencing (NGS)

Decoding DNA sequences is essential in all branches of biological research. With capillary electrophoresis (CE)-based Sanger sequencing, scientists can elucidate the genetic information of any biological system, and the technique has been adopted by many research laboratories. However, it has limitations in throughput, scalability, speed, and resolution that hold back scientists' research.

To overcome these problems, a new technology was introduced: Next-Generation Sequencing (NGS). It has boosted research in bioinformatics and genomic science and has transformed the way information is retrieved about the biological systems, genomes, and epigenomes of species. It has led to important breakthroughs in fields such as human disease and agricultural research.

The principle behind NGS is similar to that of CE. In CE sequencing, the bases of a small DNA fragment are identified sequentially as the fragment is re-synthesized from a DNA template. NGS performs the same work in parallel, across millions of reactions rather than a single or a few DNA fragments. As a result, NGS produces hundreds of gigabases of data in a single sequencing run.

NGS operates as follows. A single genomic DNA (gDNA) sample is first fragmented into a number of small segments, known as a library. These segments are sequenced uniformly and accurately in millions of parallel reactions. The resulting strings of bases are called reads. The reads are then reassembled by one of two techniques: alignment against a known reference genome used as a scaffold (re-sequencing), or assembly without any reference genome (de novo sequencing). The output is a set of aligned reads that represents the entire sequence of each chromosome in the gDNA.

Fig. Conceptual Overview of Whole-Genome Sequencing

  1. Extracted gDNA.
  2. gDNA is fragmented into a library of small segments that are each sequenced in parallel.
  3. Individual sequence reads are reassembled by aligning to a reference genome.
  4. The whole-genome sequence is derived from the consensus of aligned reads.

NGS output has increased at a rate that outpaces Moore's law. In 2007, when the technology was introduced, a single sequencing run could produce up to one gigabase (Gb) of data. By 2011 a single run reached up to a terabase (Tb), an almost 1000× increase in four years. With this capacity, researchers can move from an idea to full data sets in a few hours or days. Sequencing the human genome with CE technology took around 10 years, whereas NGS can sequence five human genomes in a single run, which reduces the cost of genome projects.

NGS also lets researchers tune the resolution of a genome experiment. An experiment can be set to produce more or less data, either zooming in on particular regions of the genome at high resolution or taking a more expansive view at lower resolution. Researchers do this by tuning the coverage generated in the experiment. This flexibility offers a number of experimental design advantages.

Because of these advantages, NGS has permeated many areas of study. Using NGS, researchers have developed a broad range of applications that have transformed study designs and revealed information never before imaginable.
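The growth figures quoted above can be checked with a little arithmetic. The sketch below uses only the two data points from the text (about 1 Gb per run in 2007, about 1 Tb per run in 2011) and assumes, for comparison, the common reading of Moore's law as one doubling every two years; that doubling period is an assumption, not a figure from this essay.

```python
import math

# Figures quoted in the text (approximate upper bounds per sequencing run).
BASES_PER_RUN_2007 = 1e9    # about one gigabase
BASES_PER_RUN_2011 = 1e12   # about one terabase
YEARS = 2011 - 2007

def fold_increase(start, end):
    """How many times larger `end` is than `start`."""
    return end / start

def doubling_time_years(start, end, years):
    """Average time for output to double, assuming steady exponential growth."""
    return years / math.log2(end / start)

ngs_fold = fold_increase(BASES_PER_RUN_2007, BASES_PER_RUN_2011)
ngs_doubling = doubling_time_years(BASES_PER_RUN_2007, BASES_PER_RUN_2011, YEARS)

# Assumption: Moore's law read as one doubling every two years.
MOORE_DOUBLING_YEARS = 2
moore_fold = 2 ** (YEARS / MOORE_DOUBLING_YEARS)

print(f"NGS output grew {ngs_fold:.0f}x, doubling about every {ngs_doubling:.2f} years")
print(f"A two-year doubling would give only {moore_fold:.0f}x over the same period")
```

Under these assumptions the per-run output doubled roughly every five months, against a fourfold increase for a two-year doubling period.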

3. PASQUAL

PASQUAL addresses the memory consumption and running time of the assembly process on large data sets. PASQUAL stands for PArallel SeQUence AssembLer. It uses OpenMP for shared-memory parallelism because OpenMP offers a good trade-off between programmer productivity and performance. PASQUAL follows the OLC approach and obtains high-quality solutions through a combination of tailored algorithms.

PASQUAL can handle billions of bases. It uses de novo assembly because this needs no reference to reconstruct the original sequence. Its algorithm constructs a suffix array of the biological sequences in parallel, which is key to both parallel performance and memory optimization. The indexing stage and the string graph construction are used to find overlaps. PASQUAL's assemblies contain significantly fewer misassemblies than those of other assemblers.

PASQUAL can handle billions of bases in little time because it uses pipelined stages and compressed data. Among the other tools, only SOAPdenovo offers comparable speed, and k-mer-based approaches are restricted to k-mer lengths below 128. PASQUAL also produces fewer errors than the other tools.
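To make the role of the suffix array concrete, here is a minimal sketch of what such an index is and how it answers substring queries. It is a naive, sequential construction written for readability; it is not PASQUAL's parallel construction algorithm, and the function names are chosen for this illustration only.

```python
def build_suffix_array(text):
    """Return the start positions of all suffixes of `text`, in sorted order.

    Naive construction (sorting whole suffixes); fine for short examples,
    far too slow and memory-hungry for real read sets.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])

def find_occurrences(text, suffix_array, pattern):
    """Return every position where `pattern` occurs, using binary search."""
    m = len(pattern)

    # Lower bound: first suffix whose first m characters are >= pattern.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[suffix_array[mid]:suffix_array[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Upper bound: first suffix whose first m characters are > pattern.
    hi = len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[suffix_array[mid]:suffix_array[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    return sorted(suffix_array[start:lo])

sequence = "GATTACAGATTA"
index = build_suffix_array(sequence)
print(find_occurrences(sequence, index, "ATTA"))
```

Because all suffixes that share a prefix sit next to each other in the sorted order, a prefix of one read can be looked up among the suffixes of all reads with two binary searches, which is what makes the structure useful for overlap detection.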

4. Literature Survey

4.1 De Novo Genome Sequence Assembly

Between 2008 and 2012 many sequencing techniques were developed, and as a result the cost dropped sharply, to about 1/100000th of the earlier price. The de novo algorithm is inherited from the SOAPdenovo2 framework. De novo sequencing deals with a novel genome and requires a specialized assembly of the sequencing reads. It calls for a particular combination of read length and read depth, as well as flexible paired-end insert sizes. High-quality raw reads make the production of long contig assemblies confident and efficient. De novo sequence assembly is preferred for the study of non-model organisms because it is a cheaper and easier way to construct a genome.

Reference-based assembly maps reads onto a reference genome, so it cannot account for structural alterations of mRNA transcripts. De novo assembly provides a means of discovering new and unknown sequences in biological research. Because no method can read a whole sequence at once, de novo methods are irreplaceable. They are used mostly to discover new and unknown sequences, which is important for the study of the world's biodiversity.

4.2 Overlap/Layout/Consensus (OLC) Approach

The Overlap/Layout/Consensus (OLC) method is used in de novo assembly. It has three stages: overlap, layout, and consensus. In the overlap stage an assembly graph is constructed from the reads. In the layout stage this graph is compressed. In the consensus stage the genome sequence is determined from the graph data generated in the previous two stages.

  1. Overlap:-

In the overlap stage, every read is compared with every other read, in both the forward and the reverse-complement orientation. This is very time-consuming, especially for large read sets.
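The all-against-all comparison can be sketched as follows. This is a brute-force illustration of the idea, not PASQUAL's index-based overlap search; the minimum overlap length `min_len` and the function names are assumptions made for the example.

```python
# Translation table for complementing DNA bases.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(read):
    """Return the read as seen on the opposite strand."""
    return read.translate(COMPLEMENT)[::-1]

def suffix_prefix_overlap(a, b, min_len):
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 if no overlap of at least `min_len` bases exists.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def all_overlaps(reads, min_len):
    """Compare every read with every other read in both orientations.

    Returns tuples (i, j, strand, length): the suffix of read i overlaps the
    prefix of read j ('+') or of its reverse complement ('-').
    """
    overlaps = []
    for i, a in enumerate(reads):
        for j, b in enumerate(reads):
            if i == j:
                continue
            for strand, seq in (("+", b), ("-", reverse_complement(b))):
                k = suffix_prefix_overlap(a, seq, min_len)
                if k:
                    overlaps.append((i, j, strand, k))
    return overlaps

reads = ["ACGTAC", "TACGGA"]
print(all_overlaps(reads, min_len=3))
```

The nested loops make the quadratic cost visible: doubling the number of reads quadruples the number of comparisons, which is why index structures are used in practice.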

  2. Layout:-

Finding a path in the OLC graph is not easy: the graph has millions of nodes and edges, and finding a path that visits each node exactly once is a very hard task. In this stage the OLC assembly graph is simplified, and its segments are compressed into contigs.
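One simple way to picture the layout stage is a greedy heuristic that repeatedly merges the two sequences with the longest overlap. The sketch below is that heuristic only, chosen because it is short; it is not the string-graph simplification used by PASQUAL, it ignores reverse complements and sequencing errors, and it can mis-join repeats.

```python
def overlap(a, b, min_len):
    """Length of the longest suffix of `a` equal to a prefix of `b` (0 if < min_len)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_layout(reads, min_len):
    """Merge reads into contigs by always joining the best-overlapping pair."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_len:
                        best_len, best_i, best_j = k, i, j
        if best_len == 0:
            return contigs  # nothing left to merge
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for idx, c in enumerate(contigs)
                   if idx not in (best_i, best_j)] + [merged]

reads = ["TACCTGA", "ATGGCGT", "GCGTACC"]
print(greedy_layout(reads, min_len=3))
```

Reads that share no sufficiently long overlap stay separate, which is how multiple contigs arise.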

  3. Consensus:-

This is the final stage of the OLC approach. Here the assembly graph is reduced to large scaffolds, ideally a single scaffold. Starting from the leftmost read of each scaffold, the OLC algorithm computes the consensus of all the reads that make up the scaffold. Gaps in the genome may still be present if the consensus step had insufficient mate-pair or repeat contig information. An assembly with gaps yields a fragmented genome composed of multiple scaffolds, because the gaps between the scaffolds could not be joined.
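A minimal way to illustrate the consensus computation is a per-position majority vote over reads that have already been placed on a common coordinate system. The sketch assumes the offsets are known from the layout stage and marks uncovered positions with "N"; both choices are assumptions for this example, and real consensus callers also weigh base quality.

```python
from collections import Counter

def consensus(placed_reads):
    """Majority-vote consensus of reads laid out on one coordinate system.

    `placed_reads` is a list of (offset, read) pairs. Positions covered by
    no read are reported as 'N', which shows up as a gap in the result.
    """
    length = max(offset + len(read) for offset, read in placed_reads)
    columns = [Counter() for _ in range(length)]
    for offset, read in placed_reads:
        for pos, base in enumerate(read):
            columns[offset + pos][base] += 1
    return "".join(
        col.most_common(1)[0][0] if col else "N" for col in columns
    )

# The read at offset 2 carries one sequencing error (A instead of G).
layout = [
    (0, "ATGGCGT"),
    (3, "GCGTACC"),
    (6, "TACCTGA"),
    (2, "GGCATAC"),
]
print(consensus(layout))
```

The erroneous base is outvoted by the two reads that agree at that position, while a region with no reads at all remains a gap.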

4.3 Shotgun Sequencing

The Sanger DNA sequencing technique works only over a limited distance from the sequencing primer, with read lengths of 30 to 350 nt. Because the method relies on chain termination, very few products extend into long chains. At best it can sequence perhaps 500 bases a day, which is infeasible for the human genome with its billions of bases.

An alternative approach first divides the DNA into smaller fragments, which are sequenced individually. The fragments are then reassembled into the original sequence based on their overlaps. This strategy is known as shotgun sequencing or shotgun cloning.

In shotgun sequencing, the DNA is randomly sheared into small pieces (usually about 1 kb) and subcloned into a universal cloning vector. The library of subfragments is sampled at random, and sequence reads are generated. These reads are assembled into contigs, which yields the complete sequence of the clone. The shotgun technique can identify gaps (where no sequence is available) and single-stranded regions (where there is sequence for only one strand). These are targeted for additional sequencing to produce the fully sequenced molecule.
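The random sampling and the resulting gaps can be imitated in a few lines. This is a toy simulation under simplifying assumptions (fixed read length, uniform random start positions, no errors, one strand only); the genome string and parameter values are made up for the example.

```python
import random

def shotgun_reads(genome, read_len, n_reads, seed=0):
    """Sample `n_reads` fixed-length reads at random positions of `genome`.

    Returns (start, read) pairs; the start is kept only so that coverage
    can be inspected afterwards. A real sequencer does not report it.
    """
    rng = random.Random(seed)
    reads = []
    for _ in range(n_reads):
        start = rng.randrange(len(genome) - read_len + 1)
        reads.append((start, genome[start:start + read_len]))
    return reads

def uncovered_positions(genome_len, reads):
    """Positions of the genome that no read covers (the gaps)."""
    covered = [False] * genome_len
    for start, read in reads:
        for pos in range(start, start + len(read)):
            covered[pos] = True
    return [pos for pos, is_covered in enumerate(covered) if not is_covered]

genome = "ATGGCGTACCTGATTGCAAGCTTAGGCATCGA"   # made-up sequence
sample = shotgun_reads(genome, read_len=8, n_reads=10)
gaps = uncovered_positions(len(genome), sample)
print(f"{len(sample)} reads, {len(gaps)} uncovered positions")
```

Raising `n_reads` shrinks the gaps, which mirrors the targeted additional sequencing described above.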

5. Full Stage Pipelining and accuracy in PASQUAL

5.1 Motivation for this topic

With the explosive growth of genome research and of genome sequencing data, there is a huge demand for tools and systems that let researchers work more efficiently and effectively. NGS technology produces shorter reads than earlier sequencing methods but delivers higher coverage. Coverage is the ratio of the total length of the reads to the length of the genome. NGS typically generates from millions to a few billion reads, depending on genome size and coverage. As the technology improves, data sets grow larger and assembly becomes more demanding in time and memory.
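The definition of coverage translates directly into a formula: coverage = (total length of all reads) / (genome length). The sketch below applies it in both directions. The example values (100-base reads, 30x coverage, a genome of three billion bases) are illustrative assumptions, not figures from the PASQUAL work.

```python
import math

def coverage(read_lengths, genome_length):
    """Coverage = total length of all reads divided by genome length."""
    return sum(read_lengths) / genome_length

def reads_needed(target_coverage, read_length, genome_length):
    """Number of fixed-length reads required to reach a target coverage."""
    return math.ceil(target_coverage * genome_length / read_length)

# 300 reads of 100 bases over a 1000-base genome.
print(coverage([100] * 300, 1000))

# Illustrative assumption: 30x coverage of a 3-billion-base genome
# with 100-base reads.
print(reads_needed(30, 100, 3_000_000_000))
```

With these example values the required read count lands in the hundreds of millions, consistent with the range of millions to a few billion reads mentioned above.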

5.2 Selected area

NGS mainly covers DNA and RNA sequencing. I studied research papers on genome sequencing techniques. These techniques change rapidly and become more advanced over time. Nowadays genome sequencing is used not only in research but also in the treatment of many diseases.

I chose full-stage pipelining and higher accuracy in PASQUAL because many bioinformatics research topics today use genome sequencing, and it is also used in biodiversity research. I have studied many papers in which NGS is suggested for genome sequencing. I apply full-stage pipelining and higher accuracy to PASQUAL NGS genome sequencing.

6. Problem statement

The purpose of this research work is to achieve full-stage pipelining and higher accuracy in PASQUAL genome sequencing.

7. Proposed Solution

This system is completely new and uses different techniques to make genome sequencing efficient. Currently PASQUAL does not offer pipelining across all stages. Scaffolding and support for paired-end reads also rely on third-party tools. Error correction has to be improved, the assembly process accelerated, and memory consumption reduced.
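What pipelining across all stages means structurally can be sketched with a chain of worker threads connected by bounded queues: while one batch of reads is in a later stage, the next batch is already being processed by an earlier one. This is only an illustration of the pattern, not PASQUAL's design. PASQUAL uses OpenMP, Python threads do not give true CPU parallelism, and the three stage functions below are hypothetical stand-ins for real assembly stages.

```python
import queue
import threading

_DONE = object()  # sentinel that tells a stage there is no more input

def start_stage(func, inbox, outbox):
    """Run `func` on every item from `inbox`, forwarding results to `outbox`."""
    def worker():
        while True:
            item = inbox.get()
            if item is _DONE:
                outbox.put(_DONE)
                return
            outbox.put(func(item))
    thread = threading.Thread(target=worker)
    thread.start()
    return thread

def run_pipeline(batches, stage_funcs, max_pending=2):
    """Push batches through all stages; stages work on different batches at once."""
    queues = [queue.Queue(maxsize=max_pending) for _ in range(len(stage_funcs) + 1)]
    threads = [start_stage(f, queues[i], queues[i + 1])
               for i, f in enumerate(stage_funcs)]

    def feed():
        for batch in batches:
            queues[0].put(batch)
        queues[0].put(_DONE)

    feeder = threading.Thread(target=feed)
    feeder.start()

    results = []
    while True:
        item = queues[-1].get()
        if item is _DONE:
            break
        results.append(item)

    feeder.join()
    for thread in threads:
        thread.join()
    return results

# Hypothetical stand-in stages, for illustration only.
def normalize(batch):
    return [read.upper() for read in batch]

def drop_ambiguous(batch):
    return [read for read in batch if "N" not in read]

def count_reads(batch):
    return len(batch)

batches = [["acgt", "acnt"], ["ttga"]]
print(run_pipeline(batches, [normalize, drop_ambiguous, count_reads]))
```

The bounded queues keep memory use in check: a fast early stage blocks instead of piling up unprocessed batches, which relates to the memory-consumption goal stated above.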

8. Work done till Today

  1. Study of the different features of PASQUAL.
  2. Code for different sequence assembler techniques.
  3. Study of different sequencing and assembly algorithms.

9. Objectives

  1. Applying full stage pipelining in all stages of PASQUAL.
  2. Improving error correction
  3. Accelerate the assembly process.
  4. Reduce memory consumption.
