Analyzing Next Generation Sequencing Data Computer Science Essay



Next generation sequencing (NGS) technology is rapidly changing the landscape of genetic analysis. Its ability to produce large amounts of data quickly and inexpensively has made it very popular among researchers. However, the growing volume of data poses challenges for management and analysis. Researchers find themselves planning resources to store the data, building and running data centers, rather than pursuing scientific discovery. In addition, because NGS has a wide range of applications, it is used extensively by different research groups across Novartis, and different groups use different software for processing and analyzing NGS reads. This results in inefficient use of the available resources and poor storage management.

The aim of the project was to develop a single, scalable and flexible pipeline for processing NGS data that can be optimized to run in a parallel environment. The pipeline should provide a framework that different groups can use for NGS processing based on experiment type, and it should be able to store experiment-related metadata with the analysis runs. The platform should also allow researchers to analyze large data sets in a single run. We worked with different scientific user groups across Novartis to develop a generalized framework for NGS analysis that has a broader scope than the existing pipelines and optimizes performance without compromising functionality.


The term "sequencing" refers to the method of deciphering genomic content, and next generation sequencing refers to high-throughput techniques for performing it. From 1977 to 2005, for roughly 30 years, Sanger sequencing was practically the only way of sequencing genomes. In 2005 a sequencing revolution started with new technologies that could sequence much faster, produce large amounts of data and cost far less than Sanger-type instruments, accelerating scientific studies across various applications involving both RNA and DNA [1]. NGS experiments are used in de novo and targeted sequencing, in the identification of microRNA and mRNA, in methylation studies, and in ChIP-seq experiments that study the interaction of proteins with DNA. Clinical applications of NGS include detecting mutations in genes, cancer studies, drug response studies, pathogen detection, discovery of new drug targets and novel genes, and comparative genomic studies.

The primary advantage of NGS is that it is an inexpensive and fast method of generating large amounts of sequence data. A single run of an NGS instrument can produce gigabytes to terabytes of data, and a set of analyses can involve processing anything from millions of reads to an entire genome. This data complexity has made researchers rethink traditional data management techniques. With exponentially growing volumes, NGS data appear to outpace Moore's law. If we keep waiting for computer hardware to catch up with NGS data, we will be waiting for a long time. Instead, intelligent data processing methods should be applied to cope with the data and analysis complexity. Figure 1 provides a timeline of computer storage prices versus DNA sequencing costs [2].

Figure 1: Timeline of computer storage prices versus DNA sequencing costs.

All plots are logarithmic and do not take inflation into consideration.

Blue line: Storage price per dollar

Yellow line: Pre-NGS with a doubling time less than disk storage until 2004.

Red line: NGS technologies result in an exponential increase in data.

There is a need for a mechanism to handle the tsunami of data coming our way, so that we can better serve the scientific community. The experiments demand specific laboratory pipelines and distinct bioinformatics methodologies to process, filter, assemble, target, collect and visualize NGS data in order to deliver meaningful biological information.

Design and Workflow:

NGS data undergoes basic preprocessing steps irrespective of the type of experiment carried out. An instrument run can contain multiple samples, which need to be demultiplexed. The reads may also contain sequences introduced by the sequencing technology, as well as contaminants, which need to be removed before the reads are analyzed. YAP provides a general-purpose pipeline for processing raw NGS data. Some of its features are:

Architecture and alignment algorithm independence:

The pipeline is developed in the Python programming language, which enables it to run on any operating system that supports Python, and its modular design provides extensive flexibility. Different algorithm switches, analysis packages or code can be included according to user requirements as they become available.

Maximal amount of work is done while a sequence resides in memory: From the reading of a raw read to obtaining the final analysis results, the sequence is not written out, which avoids having multiple copies of the same data.

Minimal number of temporary files generated:

Temporary file generation is avoided, which makes the pipeline efficient from a storage point of view.

Minimal reads and writes through file-based I/O:

Data flows from one step of the pipeline to the next with minimal disk input/output operations to speed up the run. The existing pipelines are inefficient in terms of disk I/O and temporary file generation, both of which the YAP pipeline avoids. This approach reduces the considerable cost associated with disk read and write operations. Figure 2 gives a pictorial representation of the data flow implemented in YAP to optimize the passage of data through the pipeline, compared with the data flow of the existing pipeline.

Figure 2 : Comparison of Disk I/O costs between the existing pipeline and the YAP pipeline.

Maximum flexibility and customization based upon configuration files :

NGS data has wide scope for analysis. Different experiments may have to be analyzed using different parameters, and sometimes the same experiment needs to be tested with different parameters according to the aim of the research. YAP provides parameter customization through configuration files. Three configuration files are used.

a) Main configuration file: takes all the parameters for data preprocessing, the main switch values, and the names of the configuration files for alignment and postprocessing.

b) Alignment configuration file: takes detailed parameters for the alignment algorithm. The file format supports parameter input for multiple commands. The parser generates multiple commands based on the parameters, and all the steps are performed sequentially.

c) Postprocess configuration file: contains parameters for the specific packages used to analyze results from alignment files. Like the alignment configuration file, it can hold parameters for multiple commands.

Support for common file formats:

Various types of instruments are used for NGS data generation, and the data can be represented in different forms. YAP supports some of the common NGS data formats, such as qseq, FASTA, FASTQ, tab-separated, SAM and BAM, for input and output.

Centralized reference database: YAP also provides a mechanism to store reference database indexes at a centralized location, so that different user groups across the organization can keep track of the various versions. It also helps avoid multiple copies of the same index being created.

Parallelized using Message Passing Interface (MPI):

The analysis run time is reduced by parallelizing the code using the Message Passing Interface, so that multiple files representing one lane of an NGS run can be analyzed at the same time. If a single file is provided, its data can be split and processed in parallel to minimize the run time, making efficient use of the available cluster nodes.
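The single-file case can be illustrated with a minimal sketch. It assumes FASTQ input with exactly four lines per record and simply deals the records into one contiguous chunk per worker; the function name `split_fastq_records` is hypothetical and is not taken from YAP. In an MPI program each chunk would then be sent to one rank, a step that is left out here so the example runs without an MPI installation.

```python
def split_fastq_records(lines, n_workers):
    """Split FASTQ lines into n_workers contiguous chunks of whole records.

    Assumes each record is exactly 4 lines (header, sequence, '+', quality).
    Returns a list of n_workers lists; each inner list holds 4-line records.
    """
    if n_workers < 1:
        raise ValueError("n_workers must be at least 1")
    if len(lines) % 4 != 0:
        raise ValueError("FASTQ input must contain a multiple of 4 lines")

    # Group the flat list of lines into 4-line records so that no record
    # is ever cut in half by a chunk boundary.
    records = [lines[i:i + 4] for i in range(0, len(lines), 4)]

    # Give every worker `base` records; the first `extra` workers get one more.
    base, extra = divmod(len(records), n_workers)
    chunks = []
    start = 0
    for rank in range(n_workers):
        size = base + (1 if rank < extra else 0)
        chunks.append(records[start:start + size])
        start += size
    return chunks
```

Keeping the chunks contiguous means the original read order can be restored by concatenating the per-worker results in rank order.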

Integration with Sun Grid Engine:

YAP can be run through an SGE script. Users can run multiple automated jobs by submitting them to the various SGE job queues, making the best use of resources as they become available.

Data Processing Steps:

Data is analyzed in three main steps: preprocessing, alignment and postprocessing. Figure 3 shows the steps involved in the data flow, from the raw reads produced by the instrument through preprocessing to the alignment and postprocessing steps, which are together termed analysis.

Figure 3: A general representation of the NGS data processing workflow. The dotted box represents the steps involved in preprocessing, and the Analysis box represents the data flow to the subsequent alignment and postprocessing steps.

Preprocessing Step:

The raw reads obtained from the instruments are converted into appropriate data formats. If the input file is in qseq format, it is first converted to FASTQ. The data is then demultiplexed using user-supplied barcodes to separate the different samples. The algorithm used is independent of the position of the barcode in the sequence and also allows a user-specified number of mismatches in barcode matching. Each sample is then processed separately in parallel. After barcode removal, the adaptor sequences are removed. The user has to specify the type of adaptor (5' or 3') and the adaptor sequence used. The adaptor removal algorithm takes into account the 3' tailing effect and the possibility of mismatches in adaptor matching.
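The demultiplexing step can be sketched as follows. This is an illustration under stated assumptions, not YAP's actual code: the barcode is searched for at every position in the read, a match is accepted when its Hamming distance to the barcode is within the allowed number of mismatches, and the matched stretch is cut out of the read. The function names and the `barcodes` mapping (sample name to barcode sequence) are hypothetical.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def find_barcode(read, barcodes, max_mismatches=1):
    """Return (sample, start) for the best barcode match in the read, or None.

    Every window of the read is compared with every barcode, so the match
    does not depend on where the barcode sits in the sequence. Fewer
    mismatches win; ties go to the leftmost position, then to the barcode
    listed first.
    """
    best = None  # (mismatches, start, sample)
    for sample, barcode in barcodes.items():
        k = len(barcode)
        for start in range(len(read) - k + 1):
            d = hamming(read[start:start + k], barcode)
            if d <= max_mismatches and (best is None or (d, start) < best[:2]):
                best = (d, start, sample)
    if best is None:
        return None
    return best[2], best[1]


def demultiplex(reads, barcodes, max_mismatches=1):
    """Assign each read to a sample and remove the matched barcode."""
    samples = {name: [] for name in barcodes}
    unassigned = []
    for read in reads:
        hit = find_barcode(read, barcodes, max_mismatches)
        if hit is None:
            unassigned.append(read)
            continue
        sample, start = hit
        k = len(barcodes[sample])
        samples[sample].append(read[:start] + read[start + k:])
    return samples, unassigned
```

Searching the whole read is the simplest way to be position independent, but it also raises the chance of a spurious match inside the biological sequence, so in practice the allowed mismatch count would be kept small relative to the barcode length.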

The adaptor-trimmed data is then filtered to strain out insignificant data. The three criteria used for filtering the data are:

a) Length of sequence: the sequence is filtered out if its length is smaller than the specified length.

b) Total number of non-base characters: the sequence is filtered out if it contains more non-base characters than the allowed number. A non-base character is anything other than the four nucleotides A, T, C and G. Sometimes the instrument produces unknown information, which can be sequencing error or junk; such reads are considered insignificant and are filtered out.

c) Low entropy: the sequence is filtered out if its complexity drops below the specified value. The algorithm calculates the Shannon entropy, which is a standard measure of the order state of symbol sequences [3]. The higher the entropy, the closer the symbols or blocks of symbols in the sequence are to occurring at equal frequencies. A higher entropy threshold will filter out more sequences that contain repeated strings of characters, i.e. n-mers, or consecutive strands of n symbols.
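The three criteria can be combined in a short sketch. The entropy is computed here as H = -sum(p_i * log2(p_i)) over the frequencies p_i of the overlapping k-mers in the read; the choice of k, the threshold values and the function names are illustrative assumptions, not YAP defaults.

```python
import math
from collections import Counter

BASES = frozenset("ACGT")


def shannon_entropy(seq, k=1):
    """Shannon entropy (in bits) of the overlapping k-mers in seq."""
    if len(seq) < k:
        return 0.0
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    total = len(kmers)
    counts = Counter(kmers)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def passes_filters(seq, min_length=20, max_non_base=2, min_entropy=1.0, k=1):
    """Return True if the read survives all three filters.

    The default thresholds are placeholders chosen for illustration only.
    """
    seq = seq.upper()
    # a) length filter
    if len(seq) < min_length:
        return False
    # b) non-base character filter (anything other than A, C, G, T)
    non_base = sum(1 for ch in seq if ch not in BASES)
    if non_base > max_non_base:
        return False
    # c) low-entropy (low-complexity) filter
    return shannon_entropy(seq, k) >= min_entropy
```

With k = 1 a homopolymer such as AAAA has entropy 0 bits, while a read in which all four bases are equally frequent reaches the maximum of 2 bits, so raising `min_entropy` towards 2 removes progressively more repetitive reads.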

The preprocessed data can be written out, but by default the pipeline does not write it out; the data is piped directly to the next processing step unless the user chooses otherwise. All of the above steps can be switched on and off, and their criteria values altered, according to the user's requirements.

Apart from the data preprocessing steps, YAP also provides an option to perform a quality check on the data before any filtering is carried out, in order to evaluate instrument performance. The algorithm counts the occurrences of each nucleotide at each position across all the sequences.
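A per-position nucleotide count of this kind can be sketched in a few lines. The function name `base_counts_by_position` is hypothetical, and reads of different lengths are assumed to be allowed, so later positions may be covered by fewer reads.

```python
from collections import Counter


def base_counts_by_position(reads):
    """Count each character at each read position across all reads.

    Returns a list with one Counter per position; position i counts the
    characters found at index i of every read that is long enough.
    """
    counts = []
    for read in reads:
        for pos, base in enumerate(read.upper()):
            # Grow the table the first time a position is seen.
            if pos == len(counts):
                counts.append(Counter())
            counts[pos][base] += 1
    return counts
```

A strong skew towards one base, or a rise in N calls at particular positions, is the kind of signal such a table makes visible before any filtering is applied.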

Alignment Step:

Data is aligned to the reference database using the specified alignment algorithm. This step is also carried out in parallel. Parameters for the alignment algorithm are given through the alignment configuration file, which is itself specified in the main configuration file. Once the alignment has been processed in parallel, the results are gathered on the root node and then processed sequentially. The alignment result file can then be split according to target, which makes it possible to evaluate the number of hits against each target. The postprocess step is then carried out using the alignment results. Figure 4 shows the data flow from alignment to the different postprocessing steps.

Postprocess Step:

The postprocess steps involve assessment of the alignment results. YAP provides a quality control step to validate the alignment results, which takes a standard mapping location coordinate file as input. The alignment results can be analyzed using packages and scripts. One example is expression count analysis: to obtain the expression count for each target, the sequence alignment is analyzed for the number of valid hits, as one sequence can hit multiple targets; each target then receives an expression count equal to the sequence copy number divided by the number of valid hits. Alternatively, the user can choose to perform other postprocessing, such as additional quality control checks using the Picard package [4]. The user has the flexibility to add more packages to this framework.

Figure 4: Framework design for the alignment and postprocess steps and the packages involved in NGS data processing. Packages indicated in red are yet to be implemented completely.

Some of the scripts used in the pipeline are listed in Table 1 in the section below.

To run YAP, the user needs to set environment variables for YAP, Python, Perl, MPI and the Intel compilers, and to have the three configuration files. SGE jobs can be submitted by carefully setting the parameters in the configuration files.
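The expression count rule can be written out as a small sketch. It assumes that the alignment results have already been reduced to (sequence id, target) pairs, one per valid hit, and that the copy number of each distinct sequence is known; both input shapes and the function name are assumptions made for illustration.

```python
from collections import defaultdict


def expression_counts(valid_hits, copy_numbers):
    """Distribute each sequence's copy number evenly over its valid hits.

    valid_hits   -- iterable of (sequence_id, target) pairs, one per hit
    copy_numbers -- dict mapping sequence_id to its copy number
                    (a missing id is treated as a copy number of 1)
    """
    hits_by_sequence = defaultdict(list)
    for seq_id, target in valid_hits:
        hits_by_sequence[seq_id].append(target)

    counts = defaultdict(float)
    for seq_id, targets in hits_by_sequence.items():
        # Each hit receives copy number / number of valid hits.
        share = copy_numbers.get(seq_id, 1) / len(targets)
        for target in targets:
            counts[target] += share
    return dict(counts)
```

For example, a sequence with copy number 4 that hits two targets contributes 2 to each, so the per-target counts always sum to the total copy number of the aligned sequences.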


Next generation sequencing (NGS) data is growing at a very fast pace, and traditional data management approaches are proving insufficient to handle the enormous data flow. The YAP pipeline was designed by analyzing the different pipelines in use, their inefficiencies, and the fact that many of them overlapped in functionality. The YAP project was an attempt to develop a generic NGS processing framework with optimized performance. YAP pipeline version 1.0 alpha was released on November 15, 2010. The pipeline is being tested by scientific groups across various Novartis sites and is flexible enough to accommodate more features as requested by the user community. The pipeline is designed to be more efficient in terms of cluster utilization and storage management than the existing pipelines. A benchmark analysis of the pipeline still needs to be carried out; we could not complete it within the co-op time limit.

My six-month co-op at the NIBR Scientific Computing Group was a great learning experience. I gained exposure to the corporate environment and to the strategies and protocols applied in business development to keep up with ever-growing scientific technologies such as NGS data processing.

Table 1. Scripts and binaries:





Generic parser for aligner and postprocess configuration files


Creates a dictionary for the YAP main configuration file

File formatting

Creates the read alignment (.ral) format from the bowtie output by splitting according to chromosome type.

Multiple command execution, flow control

Executes multiple commands in memory using the subprocess module.

Also contains flow control switches from the main configuration file.

Returns a multidimensional array based on aligner output.

File formatting

Prints the sequence array in the corresponding format given in the YAP main configuration file

Multiple command execution

Similar to

Processes multiple postprocessing commands.

Creates data structure, flow control

Creates the data structure, incorporates the line number, and calls various other preprocessing modules.

MPI commands, flow control, directory and file creation

Main executable for the YAP pipeline. Contains all MPI commands for the splitting and gathering of data.

Also generates all the directories and files in the pipeline

File reader

Checks the file extension and reads the file into an array

File splitter

Splits the file based on the number of processors and the file type


Removes barcodes on the basis of Hamming distance values.


Counts the number of occurrences of a particular base in the sequence


Calculates expression values



Multiple sorting and formatting commands


Perl module for data filtering


Filters sequences on the basis of adaptors, entropy, non-base characters, read length, etc.