Analyzing Next Generation Sequencing Data Computer Science Essay


Next generation sequencing (NGS) technology is changing the landscape of genetic analysis at a fast pace. Its ability to produce large amounts of data quickly and inexpensively has made it very popular among researchers. However, the growing volume of data imposes challenges in terms of management and analysis: researchers find themselves planning the resources to store the data, building and running data centers, rather than pursuing scientific discovery. Moreover, because NGS has a wide application portfolio, it is used extensively by different research groups across Novartis, and different groups use different software for processing and analyzing NGS reads, which results in inefficient use of the available resources and poor storage management.

The aim of the project was to develop a single, scalable, and flexible pipeline for processing NGS data that can be optimized to perform in a parallel environment. The pipeline should provide a framework that different groups can use for NGS processing based on experiment type, should be able to store experiment-related metadata alongside the analysis runs, and should allow researchers to handle data analysis for large data sets in one run. We worked with different scientific user groups across Novartis to develop a generalized framework for NGS analysis with a broader scope than existing pipelines, optimizing performance without compromising functionality.

Introduction:

The term "sequencing" refers to methods of deciphering genomic content, and next generation sequencing refers to high-throughput techniques for doing so. From 1977 to 2005, for roughly 30 years, Sanger sequencing was practically the only way of sequencing a genome. In 2005 a sequencing revolution started, with new technologies that could sequence much faster, produce far more data, and cost much less than Sanger-type instruments, accelerating scientific studies across applications involving both RNA and DNA [1]. NGS experiments are used in de novo and targeted sequencing; to identify microRNA and mRNA; in methylation studies; and in ChIP-seq experiments to study protein interaction with DNA. Some of the clinical applications of NGS are detecting mutations in genes, cancer studies, drug response studies, pathogen detection, discovery of new drug targets and novel genes, and comparative genomic studies.

The primary advantage is an inexpensive and fast method of generating large amounts of sequence data: a single NGS instrument run can produce data ranging from gigabytes to terabytes, and a group of analyses can involve processing anywhere from millions of reads to an entire genome. This data complexity has made researchers rethink traditional data management techniques. With its exponential growth, NGS data appears to outpace Moore's law; if we keep waiting for computer hardware to catch up with NGS data, we will be waiting a long time. Instead, intelligent data processing should be applied to match the data and analysis complexity. Figure 1 provides a timeline of computer storage prices versus DNA sequencing costs [2].

Figure 1: Timeline of computer storage prices versus DNA sequencing costs. All plots are logarithmic and do not take inflation into consideration.

Blue line: disk storage per dollar.

Yellow line: pre-NGS sequencing, with a doubling time shorter than that of disk storage until 2004.

Red line: NGS technologies, resulting in an exponential increase in data.

There is a need for a mechanism to handle the tsunami of data coming our way, so that we can better serve the scientific community. These experiments demand specific laboratory pipelines and distinct bioinformatics methodologies to process, filter, assemble, target, collect, and visualize NGS data so that it delivers meaningful biological information.

Design and Workflow:

NGS data undergoes the same basic preprocessing steps irrespective of the type of experiment carried out. An instrument run can contain multiple samples that need to be demultiplexed, and the reads may contain sequences used by the sequencing technology itself, as well as contaminants, which need to be removed before the reads are subjected to analysis. YAP provides a general purpose pipeline for NGS raw data processing. Some of its features are:

Architecture and alignment algorithm independence:

The pipeline is developed in the Python programming language, which enables it to run on any operating system, and its modular design provides extensive flexibility. Different algorithm switches and different analysis packages or codes can be included per user requirements as they become available.

Maximal amount of work done while a sequence resides in memory: from the reading of the raw read to obtaining the analysis results at the end, the sequence is not written out, which avoids having multiple copies of the same data everywhere.

Minimal number of temporary files generated:

Temporary file generation is avoided, which makes the pipeline efficient from a storage point of view.

Minimal reads and writes through file-based I/O:

Data flows from one step of the pipeline to the next with minimal disk input/output operations, to speed up the run. The existing pipelines show inefficiency in terms of disk I/O and temporary file generation, which is avoided in the YAP pipeline; this reduces the substantial cost associated with disk read and write operations. Figure 2 is a pictorial representation of the optimized data flow implemented throughout the YAP pipeline, compared with the data flow of the existing pipeline; a minimal sketch of the piping approach follows the figure.

Figure 2: Comparison of disk I/O costs between the existing pipeline and the YAP pipeline.
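To make the piping approach concrete, here is a minimal sketch (not YAP's actual code) of streaming reads from one step directly into an aligner with Python's subprocess module, so that no intermediate file is written between the two steps. The commands, flags, and file names are illustrative assumptions; bowtie's "-" argument reads queries from stdin.

```python
import subprocess

# Stream reads from a (placeholder) preprocessing step straight into an
# aligner via an OS pipe; the reads never touch the disk in between.
with open("alignments.sam", "w") as out:
    filter_proc = subprocess.Popen(
        ["cat", "reads.fastq"],                       # stand-in for a real filter step
        stdout=subprocess.PIPE,
    )
    align_proc = subprocess.Popen(
        ["bowtie", "-q", "-S", "genome_index", "-"],  # '-' = read queries from stdin
        stdin=filter_proc.stdout,
        stdout=out,
    )
    filter_proc.stdout.close()   # let the aligner see EOF when the filter exits
    align_proc.wait()
```

Because the two processes run concurrently and exchange data through the pipe, disk I/O is limited to the initial read and the final result file.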

Maximum flexibility and customization based on configuration files:

NGS data has wide scope for analysis. Different experiments may have to be analyzed using different parameters, and sometimes the same experiment needs to be tested with different parameters according to the aim of the research. YAP provides parameter customization using configuration files. Three configuration files are used, described below; a sketch of how such a file might be parsed follows the list.

a) Main configuration file: takes all the parameters for data preprocessing, the main switch values, and the names of the configuration files for alignment and postprocessing, respectively.

b) Alignment configuration file: takes detailed parameters for the alignment algorithm. The file format supports parameter input for multiple commands; the parser generates multiple commands from the parameters, and all the steps are performed sequentially.

c) Postprocess configuration file: contains parameters for the specific packages used to analyze the results from the alignment files. Like the alignment configuration file, it can also hold parameters for multiple commands.
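YAP's actual configuration syntax is not reproduced here, so as a hedged sketch, the snippet below parses a hypothetical INI-style main configuration file using Python's standard configparser module; all section and option names are made up for illustration.

```python
import configparser

# Hypothetical INI-style main configuration file (YAP's real format may differ):
#
#   [preprocess]
#   barcode_mismatch = 1
#   min_length = 18
#
#   [files]
#   alignment_config = bowtie_params.cfg
#   postprocess_config = picard_params.cfg

config = configparser.ConfigParser()
config.read("yap_main.cfg")

# Preprocessing switches and thresholds.
barcode_mismatch = config.getint("preprocess", "barcode_mismatch")
min_length = config.getint("preprocess", "min_length")

# Names of the alignment and postprocess configuration files.
alignment_cfg = config.get("files", "alignment_config")
postprocess_cfg = config.get("files", "postprocess_config")
```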

Support for common file formats:

There are various types of instruments used for NGS data generation, and the data can be represented in different forms. YAP supports some of the common NGS data formats, such as qseq, FASTA, FASTQ, tab-separated, SAM, and BAM, for input and output.

Centralized reference database: YAP also provides a mechanism to store the reference database indexes at a centralized location, so that the different user groups across the organization can keep track of the various versions; it also helps avoid multiple copies of the same index being created.

Parallelization using the Message Passing Interface (MPI):

Analysis runs are sped up by parallelizing the code using the Message Passing Interface, so that multiple files, each representing one lane of an NGS run, can be analyzed at the same time. If a single file is provided, its data can be split and processed in parallel to minimize the run time, making efficient use of the available cluster nodes. A minimal sketch of this scatter/gather pattern is shown below.
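The sketch below uses mpi4py (an assumption; the source does not name the MPI binding YAP uses): the root rank splits the reads into chunks, every rank processes its own chunk, and the results are gathered back at the root for sequential processing. load_reads and process_read are hypothetical helpers.

```python
from mpi4py import MPI

def load_reads(path):
    # Hypothetical helper: one read per line, for illustration only.
    with open(path) as f:
        return [line.strip() for line in f]

def process_read(read):
    # Hypothetical stand-in for the per-read filtering/analysis work.
    return len(read)

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

if rank == 0:
    reads = load_reads("lane1.txt")
    # Split the input into one chunk per MPI process.
    chunks = [reads[i::size] for i in range(size)]
else:
    chunks = None

my_chunk = comm.scatter(chunks, root=0)          # each rank receives its chunk
my_result = [process_read(r) for r in my_chunk]  # independent parallel work
results = comm.gather(my_result, root=0)         # collect results at the root

if rank == 0:
    merged = [r for part in results for r in part]
    print(len(merged), "reads processed")
```

Run under an MPI launcher, e.g. mpirun -np 8 python sketch.py, so that one process per cluster slot works on its own slice of the data.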

Integration with Sun Grid Engine:

YAP can be run through an SGE script, and the user can automate multiple job runs by submitting them to the various SGE job queues, making the best use of resources as and when they become available, as sketched below.
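As a rough sketch, a YAP run could be submitted to SGE from Python via qsub; the queue name, parallel environment name, and slot count below are site-specific assumptions, not taken from the source, and run_yap.sh is a hypothetical wrapper script.

```python
import subprocess

# Site-specific assumptions: a queue named "all.q" and an MPI parallel
# environment named "mpi" with 8 slots; adjust to the local SGE setup.
cmd = [
    "qsub",
    "-N", "yap_run",      # job name
    "-q", "all.q",        # target queue
    "-pe", "mpi", "8",    # parallel environment and slot count
    "-cwd",               # run from the current working directory
    "run_yap.sh",         # wrapper that sets the env vars and calls yap.py
]
subprocess.run(cmd, check=True)
```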

Data Processing Steps:

Data is analyzed in three main steps: preprocessing, alignment, and postprocessing. Figure 3 shows the steps involved in the data flow, from the preprocessing steps that start with the raw reads from the instrument, through to the alignment and postprocessing steps, which together are termed the analysis.

Figure 3: A general representation of the NGS data processing workflow. The dotted box represents the steps involved in preprocessing, and the analysis box represents the data flow through the subsequent alignment and postprocess steps.

Preprocessing Step:

The raw reads obtained from the instrument are converted into an appropriate data format; if the input file is qseq, it is first converted to FASTQ. The data is then demultiplexed using user-supplied barcodes to separate the different samples. The algorithm used is independent of the position of the barcode in the sequence and also allows a user-specified number of mismatches in barcode matching; each sample is then processed separately, in parallel. After barcode removal, the adaptor sequences are removed. The user has to specify the type of adaptors, such as 5' and 3', and the adaptor sequences used. The adaptor removal algorithm takes into consideration the 3' tailing effect and also the possibility of mismatches in adaptor matching. A minimal sketch of mismatch-tolerant barcode matching follows.
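Since Table 1 notes that yap_remove_barcodes.py removes barcodes on the basis of Hamming distance, here is a minimal sketch of that idea (not YAP's actual code). The barcodes and mismatch limit are illustrative, and for simplicity this sketch only checks the 5' end of the read, whereas the text above notes that YAP's algorithm is position-independent.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def assign_sample(read, barcodes, max_mismatch=1):
    """Match the start of a read against each barcode, tolerating mismatches.

    Returns (sample_name, trimmed_read), or (None, read) if nothing matches.
    """
    for sample, bc in barcodes.items():
        if hamming(read[:len(bc)], bc) <= max_mismatch:
            return sample, read[len(bc):]   # strip the barcode off the read
    return None, read

barcodes = {"sampleA": "ACGT", "sampleB": "TGCA"}   # illustrative barcodes
print(assign_sample("ACGTTTTTGGGG", barcodes))      # ('sampleA', 'TTTTGGGG')
print(assign_sample("AAGTTTTTGGGG", barcodes))      # one mismatch: still sampleA
```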

The adaptor-trimmed data is then subjected to a filtering process in which insignificant data can be strained out. The three criteria used for filtering the data are listed below; a sketch of the entropy criterion follows the list.

a) Length of sequence: the sequence is filtered out if its length is smaller than the specified length.

b) Total number of non-base characters: the sequence is filtered out if it contains more non-base characters than the allowed number. A non-base character is anything other than the four nucleotides A, T, C, and G. Sometimes the instrument can produce unknown information, which can be sequencing error or junk; such reads are considered insignificant and filtered out.

c) Low entropy: the sequence is filtered out if its complexity drops below the specified value. The algorithm calculates the Shannon entropy, which is a standard measure of the order state of symbol sequences [3]. The higher the entropy, the more evenly the sequences or blocks of sequences are distributed in frequency. A higher entropy threshold will filter out more sequences that consist of similar strings of characters, i.e. n-mers, consecutive runs of n symbols.
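Here is a minimal sketch of a Shannon entropy filter computed over single bases; per [3], YAP may well compute entropy over n-mers instead, and the threshold below is illustrative.

```python
import math
from collections import Counter

def shannon_entropy(seq):
    """Shannon entropy (in bits) of the base composition of a read."""
    counts = Counter(seq)
    total = len(seq)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def passes_entropy_filter(seq, min_entropy=1.0):
    """Keep the read only if its complexity is above the threshold."""
    return shannon_entropy(seq) >= min_entropy

print(shannon_entropy("AAAAAAAA"))  # 0.0: no complexity, filtered out
print(shannon_entropy("ACGTACGT"))  # 2.0: maximal for a 4-letter alphabet
```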

The data after preprocessing can be written out, but by default the pipeline avoids writing the preprocessed data out and pipes it directly to the next processing step, unless the user requests otherwise. All of the above steps can be switched on and off and altered with different criteria values according to the user's requirements.

Apart from the data preprocessing steps, YAP also provides an option to perform a quality check on the data before any filtering is carried out, in order to evaluate the instrument's performance. The algorithm counts the occurrences of every nucleotide at each position across all the sequences, as sketched below.
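A minimal sketch of this per-position base count (Table 1 lists yap_qc_basecount.py for this purpose; the code here is illustrative, not YAP's):

```python
from collections import Counter

def base_counts_per_position(reads):
    """Count occurrences of each base at every read position."""
    counts = []  # counts[i] is a Counter of the bases seen at position i
    for read in reads:
        for i, base in enumerate(read):
            if i == len(counts):
                counts.append(Counter())
            counts[i][base] += 1
    return counts

reads = ["ACGT", "ACGA", "TCGT"]
for pos, c in enumerate(base_counts_per_position(reads)):
    print(pos, dict(c))
# 0 {'A': 2, 'T': 1}
# 1 {'C': 3}
# 2 {'G': 3}
# 3 {'T': 2, 'A': 1}
```

A skew in these counts at particular positions, or an excess of non-base characters, points to an instrument problem rather than a biological signal.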

Alignment Step:

Data is aligned to the reference database using the specified alignment algorithm. This step is also carried out in parallel. The parameters for the alignment algorithm can be given through the alignment configuration file, which is itself specified in the main configuration file. Once the alignment has been processed in parallel, the results are gathered at the root node and then processed sequentially. The alignment result file can then be split according to target, which makes it possible to evaluate the number of hits against each target; a sketch of this per-target hit count follows. The postprocess step is then carried out using the alignment results. Figure 4 shows the data flow from alignment to the different postprocessing steps.
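Per Table 1, yap_create_ral.py builds a .ral format from the bowtie output by splitting according to chromosome; the sketch below is a simplified stand-in that assumes plain SAM text input and just counts hits per reference name.

```python
from collections import defaultdict

def hits_per_target(sam_path):
    """Count aligned reads per reference sequence in a SAM file."""
    hits = defaultdict(int)
    with open(sam_path) as sam:
        for line in sam:
            if line.startswith("@"):          # skip SAM header lines
                continue
            fields = line.rstrip("\n").split("\t")
            rname = fields[2]                 # column 3: reference name
            if rname != "*":                  # '*' marks an unmapped read
                hits[rname] += 1
    return dict(hits)

print(hits_per_target("alignments.sam"))      # e.g. {'chr1': 1204, 'chr2': 873}
```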

Postprocess Step:

The postprocess steps involve assessment of the alignment results. YAP provides a quality control step to validate the alignment results, which takes a standard mapping location coordinate file as input. The alignment results can be analyzed using packages and scripts. An example is expression count analysis: to get the expression count for each target, the sequence alignment is analyzed for the number of valid hits, since one sequence can have multiple target hits; each target then receives an expression count equal to the sequence copy number divided by the number of valid hits (a sketch of this calculation follows Figure 4). Alternatively, the user can choose to perform other postprocessing, such as further quality control checks using the Picard package [4]. The user has the flexibility to add more packages to this framework.

Figure 4: Framework design for the alignment and postprocess steps, and the packages involved in NGS data processing. Indicated in red are the packages that are yet to be fully implemented.
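A minimal sketch of the expression count rule described above: each sequence's copy number is divided evenly among its valid target hits. The data structures here are illustrative, not YAP's.

```python
from collections import defaultdict

def expression_counts(alignments, copy_numbers):
    """Distribute each sequence's copy number evenly over its valid hits.

    alignments:   {sequence_id: [target, target, ...]}  valid hits per sequence
    copy_numbers: {sequence_id: int}  how many times each sequence was observed
    """
    counts = defaultdict(float)
    for seq_id, targets in alignments.items():
        share = copy_numbers[seq_id] / len(targets)   # copy number / valid hits
        for target in targets:
            counts[target] += share
    return dict(counts)

alignments = {"read1": ["geneA", "geneB"], "read2": ["geneA"]}
copy_numbers = {"read1": 4, "read2": 3}
print(expression_counts(alignments, copy_numbers))
# {'geneA': 5.0, 'geneB': 2.0}
```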

Some of the scripts used in the pipeline are listed in Table 1 in the section below.

To run YAP, the user needs to set environment variables for YAP, Python, Perl, MPI, and the Intel compilers, and must have the three configuration files in place. SGE jobs can be submitted after careful insertion of the parameters into the configuration files.

Conclusion:

Next Generation Sequencing (NGS) data is growing at a very fast pace, and traditional data management approaches are proving insufficient for handling the humongous data flow. The YAP pipeline was designed after analyzing the different pipelines in use and their inefficiencies, as well as the fact that many of these pipelines overlapped in functionality. The YAP project set out to develop a generic NGS processing framework with optimized performance. YAP pipeline version 1.0 alpha was released on November 15, 2010. The pipeline is being tested by scientific groups across various Novartis sites and is flexible enough to accommodate more features as the user community requests them. The pipeline is more efficient in terms of cluster utilization and storage management than the existing pipelines in use. A benchmark analysis of the pipeline still needs to be carried out, which we could not complete because of the co-op time limit.

My 6-month co-op at the NIBR Scientific Computing Group was a great learning experience. I got exposure to the corporate environment and to the strategies and protocols applied in business development to cope with ever-growing scientific technologies such as NGS data processing.

Table 1. Scripts and binaries:

Executable | Usage | Description
yap_command_parser.py | Parser | Generic parser for the aligner and postprocess configuration files
yap_config_dictionary.py | Parser | Creates a dictionary from the YAP main configuration file
yap_create_ral.py | File formatting | Creates the read alignment (.ral) format from the bowtie output by splitting according to chromosome type
yap_exe.py | Multiple command execution, flow control | Executes multiple commands in memory using the subprocess module; also holds the flow control switches from the main configuration file; returns a multi-dimensional array based on the aligner output
yap_format_sequences.py | File formatting | Prints the sequence array in the corresponding format given in the YAP main configuration file
yap_postprocess_exe.py | Multiple command execution | Similar to yap_exe.py; processes multiple postprocessing commands
yap_preprocess.py | Creates data structure, flow control | Creates the data structure, incorporates the line number, and calls the various other preprocessing modules
yap.py | MPI commands, flow control, directory and file creation | Main executable for the YAP pipeline; consists of all MPI commands for the splitting and gathering of data; also generates all the directories and files in the pipeline
yap_read_file.py | File reader | Checks the file extension and reads the file into an array
yap_split_file.py | File splitter | Splits a file based on the number of processors and the file type
yap_remove_barcodes.py | Preprocessing | Removes barcodes on the basis of Hamming distance values
yap_qc_basecount.py | Preprocessing | Counts the number of occurrences of a particular base in the sequence
yap_expression_count.pl | Postprocessing | Calculates expression values
yap_get_duplicates_and_hitcounts.py | Postprocessing | Multiple sorting and formatting commands
DeepSeqTools.pm | Preprocessing | Perl module for data filtering
yap_filter_sequences.pm | Preprocessing | Filters sequences on the basis of adaptors, entropy, non-base characters, read length, etc.
