1 Introduction

1.1 Motivation

The gene is the basic unit of heredity in deoxyribonucleic acid (DNA); it contains the information required to carry heredity. Gene expression is the process by which a functional gene product is synthesized from the information in a gene. Transcription factors play a very important role in gene expression: they are proteins that bind to DNA and control the synthesis of the gene product. Identifying transcription factor binding sites is crucial to understanding gene expression, and it is a challenging task due to the complexity of DNA sequences. DNA sequences contain recurring patterns which can be potential binding sites for transcription factors. These recurring patterns are known as motifs [1]. Discovering these motifs has gained a lot of importance since it is very useful in the study of gene expression [2]. The task of discovering motifs could be considered equivalent to pattern matching in a set of sequences, but it is far more complex for DNA sequences, which can contain insertions, deletions or mutations of nucleotides.

Several software approaches have been developed for motif matching over the past few years. Their computation times are very high for large datasets because of the complexity of DNA sequences and the probabilistic methods they employ. A hardware alternative can decrease the computation time through concurrency, pipelining and parallelization.

1.2 Aim of the Thesis

Motif matching is very important since it helps in identifying binding sites, which are critical to understanding the biological content of genome sequences. Several software approaches have been proposed for motif matching, and many of them are based on the Expectation Maximization (EM) algorithm, which has proved to be one of the most effective algorithms for the problem. Though the algorithm is effective, it has high execution times for large datasets. A few approaches parallelize the algorithm across multiple processors, but this requires many processors and hence increases the cost incurred. A hardware implementation of the algorithm is a viable alternative.

This thesis deals with the implementation of the EM algorithm in hardware, where the whole algorithm is divided into steps and implemented as sub-modules. The modules are described in Verilog HDL [3] and, to ensure that they are synthesizable, they are synthesized using Altera Quartus II.

1.3 Introduction to Verilog HDL

Verilog HDL is a Hardware Description Language (HDL). Gateway Design Automation, which is now a part of Cadence Design Systems, introduced Verilog in 1985 [3]. Verilog enables the designer to describe a digital system at various levels of abstraction such as behavioral, RTL and gate level. Most computer-aided design tools can process Verilog code, so the language integrates readily into computer-aided design flows.

A design in Verilog contains many inter-connected modules forming a hierarchy. Modules are the basic elements of a Verilog design. Each module implements a particular piece of functionality and can be instantiated wherever that functionality is needed. Modules have input, output and bidirectional ports which are used for communication with other modules [4]. Each module can be either a leaf element or a collection of sub-modules.
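
As a small illustration of modules, ports and hierarchical instantiation, consider the following sketch. The multiplexer modules here are hypothetical examples for illustration only, not part of the thesis design:

    // A 2-to-1 multiplexer described at the behavioral level.
    module mux2 (
        input  wire a,
        input  wire b,
        input  wire sel,
        output wire y
    );
        assign y = sel ? b : a;
    endmodule

    // A parent module that instantiates mux2 three times to build
    // a 4-to-1 multiplexer, forming a two-level hierarchy.
    module mux4 (
        input  wire [3:0] d,
        input  wire [1:0] sel,
        output wire       y
    );
        wire y0, y1;
        mux2 u0 (.a(d[0]), .b(d[1]), .sel(sel[0]), .y(y0));
        mux2 u1 (.a(d[2]), .b(d[3]), .sel(sel[0]), .y(y1));
        mux2 u2 (.a(y0),   .b(y1),   .sel(sel[1]), .y(y));
    endmodule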

Verilog has various advantages over other languages and design practices. A few of them are [5]:

  • It offers different levels of abstraction in the same model, so gates, switches, RTL or behavioral code can be used to describe a hardware module.
  • All synthesis tools support Verilog HDL, so the modules can later be ported onto FPGAs or ASICs.
  • Verilog HDL libraries for post-synthesis simulation are provided by all the fabrication vendors, so a wide range of vendors can be chosen from.
  • Verilog designs are easier to develop and debug than gate-level schematics, which become incomprehensible for very complex circuits.

1.4 Introduction to EM algorithm

The EM algorithm was introduced by Lawrence and Reilly [6] for solving the motif matching problem. The input to the algorithm is a set of unaligned sequences and the motif length (W), and it returns a probabilistic model of the shared motif. The shared motif is the common subsequence present in all the input sequences. The starting point of the shared motif is unknown and differs among sequences; finding this starting point is the major challenge.

Each motif can be considered as a sequence of independent, discrete random variables. Hence, the observed frequencies of the letters in the columns are the maximum likelihood estimates of the distributions of the random variables. But since the sequences are unaligned, the offsets must also be estimated. Given the input data and an initial guess of the motif, EM first estimates the probability (zij) that the shared motif starts at position 'j' in dataset sequence 'i'. Then zij is used to re-estimate the probability θlc of letter 'l' in column 'c' of the motif, for each letter in the alphabet and 1 ≤ c ≤ W, where W is the width of the motif. The algorithm alternately estimates z and θ until the change in θ from iteration to iteration is less than a threshold ε. A detailed description of the algorithm is presented in Chapter 3.

1.5 Organization of the Thesis

The thesis is organized as follows. Chapter 2 presents an overview of the available motif matching algorithms and compares them. It also discusses the statistical parameters and datasets used for comparative analysis. Chapter 3 describes the Expectation Maximization algorithm and the various steps involved in it. Chapter 4 deals with the hardware implementation of the Expectation Maximization algorithm; the whole architecture is explained in this chapter. Chapter 5 analyzes the simulation results and compares them with their software counterpart. Chapter 6 concludes with a summary of the work and discusses possible avenues that could be explored in the future.

2. Background

2.1 DNA

Deoxyribonucleic acid (DNA) contains the genetic information which makes each species unique. Each cell in an organism has the same DNA [13]. A strand of DNA is formed by nucleotides linked into a chain, with phosphate and sugar groups alternating along the backbone [14]. Two such strands wind around each other in a spiral called the double helix, as shown in Figure 2.1.

Figure 2.1 Structure of part of DNA [13]

Each nucleotide has a sugar group, a phosphate group and one of the bases [13]. Base pairs are formed from the chemical bases adenine (A), guanine (G), cytosine (C), and thymine (T) [13]: 'A' always pairs with 'T', and 'C' with 'G' [13]. Figure 2.2 shows an unwound version of DNA. In the figure, 'S' represents the five-carbon sugar called deoxyribose and 'P' the phosphate group. The structure of DNA can be viewed as a ladder in which the sides are formed by alternating sugar and phosphate units, and the steps are formed by nucleotide base pairs. These base pairs are held together by hydrogen bonds. As Figure 2.2 shows, the two strands of DNA run in opposite directions, that is, they are anti-parallel [14]. The left strand begins with a phosphate group at the top and ends with a sugar unit, while the right strand begins with a sugar unit at the top and ends with a phosphate group at the bottom. The sugar end of a DNA strand is called the 3' end, and the phosphate end the 5' end.

Figure 2.2 Flattened version of DNA molecule [A]

The two strands of DNA are complementary to each other. For example, if one strand is AGTACGC, the other strand would be TCATGCG. This complementarity is very important for the storage and propagation of gene information. The order of bases determines the biological instructions contained in a strand of DNA [14]. Human DNA contains approximately three billion bases, and at least 99 percent of these bases are the same in all human beings [13].

2.2 RNA

        Ribonucleic acid (RNA) is a polymer which consists of a long chain of nucleotides. Each nucleotide consists of a sugar group, a phosphate group and a base. RNA is similar to DNA but has three main differences, which Figure 2.3 illustrates [15]:

  1. RNA contains the sugar ribose while DNA contains deoxyribose. The difference between the two sugars is that ribose has a hydroxyl group (OH) on its 2' carbon atom, whereas deoxyribose does not.
  2. RNA is mostly single stranded while DNA is double-stranded.
  3. The bases in RNA are Adenine (A), Guanine (G), Cytosine (C), and Uracil (U). It does not have Thymine (T) which is present in DNA. In RNA, the complementary base to Adenine is Uracil and not Thymine.

Figure 2.3 Comparison of DNA and RNA structures [B]

There are three types of RNA, namely mRNA, tRNA and rRNA [C]. mRNA, or messenger RNA, carries a copy of the gene information; it is complementary to one strand of DNA and identical to the other. tRNA, or transfer RNA, binds to an amino acid at one end and to mRNA at the other. rRNA, or ribosomal RNA, is a key component of the ribosome. A detailed description of tRNA and rRNA is not provided here, as it is outside the scope of the current work.

2.3 Transcription, Transcription factors, Motifs

Transcription

        Transcription is the process by which RNA is transcribed from DNA with the help of an enzyme called RNA polymerase. For the transcription process to start, RNA polymerase has to bind to a promoter region [D]. A promoter region contains base sequences which indicate the beginning of a gene region. When RNA polymerase first recognizes and binds to the promoter region, the DNA is still double-stranded, as shown in Figure 2.4; this is called the closed complex. In the vicinity of the initiation site, the hydrogen bonds between bases in the DNA are broken, and the DNA is unwound.

Figure 2.4 Gene transcription [D]

In transcription, nucleotides are added in the 5' to 3' direction. Only one of the DNA strands is transcribed to form mRNA. Each template base pairs with its complement: adenine pairs with uracil (not thymine), thymine with adenine, guanine with cytosine, and cytosine with guanine. Figure 2.5 shows an example of the transcription process, in which the DNA sequence GATCAT is transcribed into CUAGUA.

Figure 2.5 Transcription process [E]

Transcription factors

RNA participates in the synthesis of proteins. RNA is transcribed from DNA by enzymes known as RNA polymerases [15]. To help RNA polymerase bind to DNA, protein complexes known as transcription factors are needed [16]. Transcription factors can be described as proteins which bind to DNA sequences and control the transcription from DNA to RNA. Transcription factors contain DNA binding domains (DBDs), which attach to either enhancer or promoter regions of DNA sequences. These regions are adjacent to the genes that they regulate [16]. Figure 2.6 shows a transcription factor binding to DNA to control gene regulation.

Figure 2.6 Transcription factor binding to DNA

        The number of transcription factors within an organism increases with genome size; organisms with larger genomes tend to have more transcription factors [16]. The human genome encodes approximately 2600 proteins that contain DNA binding domains, and most of these are presumed to function as transcription factors [16]. This is approximately 10% of all human genes, making transcription factors the single largest family of human proteins [16].

Motifs

DNA sequences contain recurring patterns which can be potential DNA binding sites for transcription factors. These recurring patterns are known as motifs [9]. Motifs are short, typically 5 to 20 base pairs long [9]. A single sequence can contain zero, one, or multiple copies of the same motif.

A motif is considered conserved if it is expressed with the same abundance in all the sequences. A subsequence is considered over-represented if its frequency of occurrence is significantly higher than its frequency in the background model, where the background model refers to all subsequences present in the DNA dataset other than the motifs. Constrained motifs do not allow mutations, insertions or deletions. There are two other types of motifs, namely palindromic motifs and spaced-dyad motifs [9]. If a motif is the same as its own reverse complement, it is known as a palindromic motif. One example is CACGTG: its reverse is GTGCAC, whose complement is CACGTG, the original motif. If a motif consists of two conserved sites separated by a gap, it is known as a spaced-dyad motif [9]. The gap is normally fixed but can vary in a few cases.

2.4 Overview of motif finding algorithms

In molecular biology, a major challenge is to identify the mechanisms that regulate gene expression. A very important step in this challenge is to find the transcription factor binding sites in deoxyribonucleic acid (DNA) which control gene regulation [8]. Motifs are these potential binding sites in DNA segments [9]. The identification of these binding sites is cumbersome due to the complexity of DNA sequences and insufficient knowledge of gene regulation. A number of algorithms using different techniques have been developed for motif finding over the years. The problem of motif finding can be defined as detecting motifs in a set of DNA sequences which are good candidates for binding sites [9]. This can be compared to pattern matching, where we try to identify an unknown pattern which occurs frequently in a set of strings.

This section presents a brief description of the motif finding algorithms developed to date. The algorithms are compared in the later sections, along with a description of the statistical parameters and datasets used for comparison.

2.4.1 Previous work in motif finding algorithms

A number of algorithms have been developed for motif finding. Most of them can be categorized into two groups, namely word-based approaches and probabilistic algorithms [9]. The word-based approach works by exhaustive enumeration: nucleotide frequencies are counted and compared, and hence it ensures global optimality. This approach is suitable for the short motifs that are normally present in eukaryotes [9]. By using good data structures like suffix trees [17], the word-based approach can be made computationally fast, and it is very useful for finding constrained motifs, where there are no mutations, insertions or deletions [9]. But actual transcription factor motifs are not constrained, and the word-based approach has the drawback of finding spurious motifs [9]. In probabilistic algorithms, parameters are estimated using maximum likelihood or Bayesian inference [9]. Motif models use techniques like Expectation Maximization, Gibbs sampling and other greedy approaches for finding solutions; these techniques are explained in detail in the later sections. The probabilistic approach requires few input parameters, but it relies on probabilistic models of the gene regulatory regions [9]. Most of these algorithms are useful for finding long motifs, and hence this approach is mostly used for motif finding in prokaryotes.

In this section, we present previous work on motif finding algorithms by different researchers over the last few years. We have categorized them into word-based algorithms, probabilistic algorithms and other approaches.

2.4.1.1 Word-based algorithms

The motif finding algorithm Oligo-analysis was developed by van Helden et al. [18] based on the word-based approach. This algorithm compares the occurrences of words with their expected values and detects statistically significant motifs [8]. Though the algorithm is simple, it worked efficiently in extracting motifs from most yeast regulatory families [9]. But it is rigorous and exhaustive and can be used only for short conserved motifs. van Helden et al. [19] later extended the algorithm to find spaced-dyad motifs. The important parameter in this algorithm is the choice of probabilistic model for estimating occurrence significance [8]. One of the drawbacks of this method is that it does not allow for variation within motif occurrences [9].

Tompa [20] developed an algorithm around the same time for finding short motifs in DNA sequences using an exact word-based approach [9]. This algorithm was primarily applied to ribosome binding sites. Yeast Motif Finder (YMF) was developed by Tompa and Sinha [21] using a similar approach. The model was derived from a study of known transcription factors in the yeast Saccharomyces cerevisiae [21]. The algorithm takes as input the set of sequences, the number of motifs to be reported, and a transition matrix constructed from the yeast model, and produces motifs with z-scores [9]. It ensures global optimality and was tested on 23 well-known regulons of yeast, for 18 of which YMF was able to recover the principal transcription factor motif [9].

Another algorithm was introduced by Sagot [17] which uses a suffix tree for the representation of sequences. Vanet et al. [22] developed another algorithm, also based on suffix trees, to search for single motifs in bacteria [9]. The use of suffix trees helped in reducing the computation time of word-based algorithms even though the search remains exhaustive. Later, Marsan and Sagot [23] extended the algorithm developed by Vanet et al. to search for combinations of motifs [9].

Weeder [24], developed by Pavesi et al., also uses a suffix tree, but the input pattern length is not fixed. It automatically searches the whole space of motif lengths and compares the top-scoring motifs of each run using a clustering method to identify the best candidates for transcription factor binding sites [8]. It is a consensus method: all oligos up to a maximum length are enumerated and their occurrences in the input sequences are computed.

Eskin and Pevzner [25] developed the Mismatch Tree Algorithm (MITRA), which uses a suffix tree and is capable of handling insertions and deletions along with mismatches in the selected sequences. This algorithm was able to discover motifs with a combined length greater than 30 bp [25].

Pevzner and Sze [26] developed an algorithm named WINNOWER which combines graph-theoretic methods with the word-based approach. It represents motif instances as vertices, deletes spurious edges, and finally recovers the remaining vertices. CWINNOWER [27] extends this algorithm with a stronger constraint function which helps in decreasing the running time [28].

2.4.1.2 Probabilistic algorithms

        Probabilistic algorithms mostly use statistical methods like Expectation Maximization (EM) and Gibbs sampling, along with their extensions, for motif finding.

        Lawrence and Reilly [6] developed EM for motif finding in 1990. It is an extension of the greedy algorithm proposed by Hertz et al. [29] and finds a common motif which occurs once in every sequence [9]. The main advantage of EM is its computational speed, but it needs proper starting points to yield good results. A detailed description of the algorithm is presented in Chapter 3. Multiple EM for Motif Elicitation (MEME), developed by Bailey and Elkan [7], is an extension of the EM algorithm. MEME can find zero or more motifs in a set of sequences, and it operates on starting points which actually occur in the sequences; this increases the probability of finding globally optimal motifs. MEME is computationally demanding but has been shown to produce good results [7]. Gaudet et al. [30] proposed an algorithm named Improbizer which uses EM to find motifs that occur with improbable frequency. It compares the chance occurrence of a nucleotide with its actual frequency of occurrence by constructing a Gaussian model of motif placement [28].

        Lawrence et al. [31] developed the Gibbs sampler for motif finding. The Gibbs sampler is based on the Markov Chain Monte Carlo (MCMC) approach [9]: since the result of the current step depends only on that of the previous step, it is a Markov chain, and since the next step is selected by random sampling, it is also a Monte Carlo method [9]. The Gibbs sampler works on the assumption that there is at least one motif in every sequence.

Many algorithms have later been developed as extensions of the Gibbs sampling approach. One of them is AlignACE, developed by Roth et al. [32], which returns the over-represented motifs in the input sequences as weight matrices [8]. To measure the degree of over-representation, AlignACE uses the maximum a priori likelihood (MAP) score [9]. Later, Hughes et al. [33] used this algorithm to find motifs in yeast-related organisms, but applied group specificity to score motifs instead of the MAP score. Group specificity favors motifs that are preferentially associated with the genes of interest, unlike the MAP score, where a few ubiquitously occurring motifs gain high scores [9].

        BioProspector is another variant of the Gibbs sampling algorithm for motif finding. Liu et al. [34] tested this algorithm on the RAP1 protein in yeast, the CRP protein in Escherichia coli and the TATA-box motif in Bacillus subtilis, and were successful in finding motifs [9]. A Markov model is added to estimate the promoter sequences of genes, which helps in modeling adjacent-nucleotide dependency [8]. This algorithm can also model gapped motifs and the palindromic patterns that are common in prokaryotes [28].

        Thijs et al. [35] developed MotifSampler, a modification of the Gibbs sampling algorithm. It uses a higher-order Markov model for the background, and a Bayesian mechanism to estimate the number of motif occurrences in each sequence [10]. This algorithm was tested on several datasets, namely plant sequences containing the G-box motif and bacterial genes regulated by the FNR protein, and MotifSampler found the expected motifs [9].

        GLAM was developed by Frith et al. [36] based on the Gibbs sampling algorithm. It automatically optimizes the motif width and computes the statistical significance of the generated output [8], but it cannot find multiple motifs in a set of sequences. Favorov et al. [37] developed SeSiMCMC, also based on the Gibbs sampling approach, for finding multiple motifs in an input set of DNA sequences [28].

        Gibbs-ssT, developed by Shida [38], combines simulated tempering with Gibbs sampling to overcome the problem of local optima; simulated tempering improves the search through the solution space [9]. The author tested Gibbs-ssT on synthetic data and on sequences extracted from yeast, and observed good resistance to local optima [38]. Hertz and Stormo [39] developed an algorithm using a greedy approach for motif finding, an extension of the earlier work by Hertz et al. [29]. It searches a pair of sequences for the motif with the greatest information content, and then tries to find a third sequence whose addition again yields the greatest information content [8]. ANN-Spec, developed by Workman and Stormo [40], finds patterns that distinguish a positive dataset from the background. It uses a weight matrix to represent the motif model, and finds motifs with higher correlation coefficients and higher specificity compared to the Gibbs sampler [31].

        QuickScore was proposed by Régnier and Denise [41]; it uses an exhaustive search to predict the probabilities of frequent or rare patterns in a set of sequences, using mathematical computations to efficiently calculate z-scores and probability values [8]. Liu et al. [42] developed MDScan, an enumerative, deterministic greedy algorithm [10]. It uses the MAP score to evaluate the best motif candidates: motif models are constructed from several top candidates and then refined with a greedy approach [10].

2.4.1.3 Other approaches

There are various other approaches to motif finding. The FMGA algorithm was developed by Liu et al. [43] based on genetic algorithms; position weight matrices are used for mutation in the genetic algorithm [9]. A self-organizing neural network structure has also been used for motif finding [44]. Kaplan et al. [45] developed an algorithm which uses a structure-based approach, requiring no prior knowledge of transcription factor binding sites [9]. A deterministic motif finding algorithm was developed by Jain and Hon [46] for the human genome. An ensemble algorithm was developed by Hu et al. [10], which combines predictions from multiple runs of different algorithms with the aim of improving prediction accuracy [9]. Geir et al. [2] implemented a variant of MEME on an FPGA to accelerate motif matching.

A number of algorithms have also been developed using the technique of phylogenetic footprinting. We have skipped discussion of these algorithms since they are out of the scope of this work. In the next section, we discuss the statistics and datasets to be considered for comparison of motif finding algorithms; the results of previous comparison studies are discussed in Section 2.4.2.

2.4.2 Previous work on comparison of motif finding algorithms

Various algorithms have been developed for finding the binding sites of transcription factors which control gene regulation. To give users a clear idea of the limitations and potential of the available motif matching algorithms, several comparison studies have been conducted. These studies also provide a benchmark for newly developed algorithms. Accurate analysis of these algorithms is very difficult, since motifs are short sequences occurring in large amounts of sequence data, and there can also be mutations, insertions or deletions. Moreover, the tools are based on different approaches and motif models, so an algorithm may compute good results on one dataset but fail on another. In addition, complete information about regulatory mechanisms is not available, which makes the evaluation of algorithms even harder [9]. In this section, we discuss the common statistical parameters used to assess algorithm performance, the datasets used by researchers, and the results of the comparison studies.

2.4.2.1 Statistics used for comparison

        To analyze the performance of different tools, a few statistical parameters are needed as the basis of comparison. Based on the various studies conducted by researchers, we summarize the statistical parameters commonly used for comparative evaluation of tools in terms of prediction accuracy and reliability. The performance criteria can be grouped into three types, namely nucleotide level, transcription factor binding site level, and sequence level accuracy [10].

Nucleotide level prediction accuracy

At the nucleotide level, comparing each predicted binding site with the known binding sites it overlaps, we can define the following parameters for prediction accuracy [8, 10]:

nTP - True positives, the number of common nucleotide positions in predicted and known binding sites

nTN - True negatives, the number of nucleotide positions neither in predicted nor known binding sites

nFP - False positives, the number of nucleotide positions only in predicted binding sites and not in known binding sites

nFN - False negatives, the number of nucleotide positions only in known binding sites and not in predicted binding sites

These parameters can be represented in the form of a Venn diagram, as shown in Figure 2.7. Let set A contain the known binding sites and set B the predicted binding sites.

Figure 2.7 Venn diagram showing nTP, nTN, nFP and nFN

The sensitivity at the nucleotide level can be defined as follows [8, 10]:

    nSn = nTP / (nTP + nFN)

The positive predictive value can be defined as follows [8]:

    nPPV = nTP / (nTP + nFP)

The specificity can be defined as [8]:

    nSp = nTN / (nTN + nFP)

The performance coefficient combines sensitivity and positive predictive value into a single measurement. It can be defined as [8, 10]:

    nPC = nTP / (nTP + nFN + nFP)

The value of nPC ranges between 0 and 1, where 1 is the value for perfect prediction.

The nucleotide correlation coefficient is defined by Burset and Guigó [47] as [8]:

    nCC = (nTP·nTN − nFN·nFP) / √((nTP + nFN)(nTN + nFP)(nTP + nFP)(nTN + nFN))

The value of nCC ranges from -1 to +1, where -1 indicates perfect anti-correlation and +1 indicates perfect correlation. So, for accurate prediction the value of nCC is +1 and for random prediction of motifs independently, the value of nCC is 0 indicating no correlation [8]. At the nucleotide level, the average score is calculated over all binding sites in a sequence, followed by average score of all the sequences in a dataset, and later average score over all the datasets is calculated [10].
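
As a worked illustration with hypothetical counts: suppose a sequence yields nTP = 8, nFN = 2, nFP = 4 and nTN = 86. Then

    nSn  = 8 / (8 + 2) = 0.80
    nPPV = 8 / (8 + 4) ≈ 0.67
    nSp  = 86 / (86 + 4) ≈ 0.96
    nPC  = 8 / (8 + 2 + 4) ≈ 0.57
    nCC  = (8·86 − 2·4) / √(10 · 90 · 12 · 88) ≈ 680 / 974.9 ≈ 0.70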

Binding site level accuracy

Binding site level accuracy indicates whether the predicted binding sites overlap with the known binding sites by at least a specified cutoff, which can be set depending on the accuracy required for the analysis. We can define the following parameters for binding site level accuracy [8, 10]:

sTP - True positives, the number of predicted binding sites overlapping with known binding sites.

sFP - False positives, the number of predicted binding sites not overlapping with known binding sites

sFN - False negatives, the number of known binding sites not overlapping with predicted binding sites

As we defined nucleotide sensitivity, positive predictive value and performance coefficient, we can also define the accuracy metrics for each sequence at the binding site level as shown below.

Sensitivity at the binding site level can be defined as follows [8, 10]:

    sSn = sTP / (sTP + sFN)

The positive predictive value at the binding site level can be defined as [8]:

    sPPV = sTP / (sTP + sFP)

The performance coefficient at the binding site level can be defined as [10]:

    sPC = sTP / (sTP + sFN + sFP)

The average performance at the binding site level can be computed as [8]:

    sASP = (sSn + sPPV) / 2

At the binding site level, the average score of binding site level accuracy over all the sequences in a dataset is calculated, followed by average over all the datasets in a benchmark.

Sequence level accuracy

To evaluate sequence level accuracy, we calculate the sequence level success rate, the fraction of sequences with at least one correctly predicted motif. Let N be the total number of sequences and Ns the number of sequences with at least one correctly predicted motif. The sequence success rate can be defined as [10]:

    sSr = Ns / N

For an algorithm, the sequence level accuracy is the average of sSr over all the input datasets.

2.4.2.2 Datasets

There are many web-based resources available for regulatory motif datasets. Researchers can download the datasets and modify them for their own analyses. Wei and Yu [28] compiled information on the web-based resources available for regulatory motif datasets, which is tabulated in Table 2.1.

Table 2.1 Selected web-based resources for regulatory motifs or TFBSs [28]

Creation of good datasets is very difficult, since tools have been developed using different algorithms and target different motif models; creating a dataset which is unbiased toward any of the algorithms or motif models is therefore very challenging. Tompa et al. [8] created such datasets from the TRANSFAC database for the comparison of algorithms: 52 datasets, 26 from human, 12 from mouse, 6 from fly and 8 from yeast [8]. The datasets were of three different types: real, generic and Markov. More details of the datasets are provided in Chapter 5.

Hu et al. [10] created large datasets from Escherichia coli K-12 in the RegulonDB database. These datasets were used for evaluation on prokaryotes, whereas the earlier datasets generated by Tompa et al. [8] targeted eukaryotes. The datasets generated by Hu et al. [10], ECRDB70, ECRDB62A and ECRDB70B-X, are available at http://dragon.bio.purdue.edu/pmotif/.

2.4.2.3 Results of comparison studies

        A comparative analysis of 13 algorithms, namely AlignACE, GLAM, QuickScore, SeSiMCMC, Improbizer, MEME, MITRA, MotifSampler, Oligo/dyad analysis, ANN-Spec, Consensus, Weeder and YMF, was conducted by Tompa et al. [8] on eukaryotic datasets drawn from the TRANSFAC database. The study found that the absolute measure of correctness of these algorithms was very low: the maximum value of the binding site correlation coefficient was 0.20 and the maximum sensitivity was 0.22 [8]. The authors cautioned that the assessment might not be completely accurate, since the datasets themselves may be imperfect, as knowledge of the underlying regulatory mechanisms is still incomplete.

Figure 2.8 shows the results for the 13 tools, with the statistics explained in the previous section calculated for each tool. In general, Weeder produced good results compared to the other algorithms, and the authors attributed its success partly to its judicious decision, on some datasets, to predict no motif at all. SeSiMCMC did better than the other algorithms on the fly datasets, while MEME and YMF did better on the mouse datasets. All the algorithms performed well on the yeast datasets. The correlation coefficient increased by 39% when the real datasets were removed, although MotifSampler performed better on real datasets. The authors suggested that using a combination of tools would yield better results than any single one.

Figure 2.8 Statistics showing comparison of all the 13 tools [8]

        Hu et al. [10] conducted another comparison study of motif finding algorithms. They compared AlignACE, MotifSampler, MEME, BioProspector and MDScan on prokaryotic datasets from the RegulonDB database, allowing minimal parameter tuning during evaluation. They found that the nucleotide level accuracy was very low for all the algorithms, although the scores were better than those on the eukaryotic datasets [8] because the sequence lengths in the prokaryotic datasets are shorter. MEME had the best sensitivity and sequence success rate, while BioProspector had the highest performance coefficient [10]. The prediction accuracy was better at the binding site level than at the nucleotide level; the performance coefficient of AlignACE was lowest at the binding site level. MEME was found to handle diverse sequences very well [10]. The performance of all the algorithms decreased with increasing sequence length, and the Gibbs sampling strategy in particular becomes very inefficient for longer sequences. The authors also developed an ensemble algorithm [10] which combines the predictions of multiple runs of multiple algorithms; it was found to perform better than the other algorithms by 52%. Combining the predictions of various runs of different tools was thus found to yield better results [10].

Geir et al. [2] implemented a variant of the MEME algorithm in hardware and observed a significant speed-up compared to software implementations, testing the algorithm on 5 datasets of human promoter genes [2]. This work demonstrates that hardware approaches can substantially reduce the computation time of such algorithms.

3. Expectation Maximization algorithm

The Expectation Maximization (EM) algorithm is widely used in biological applications that rely on probabilistic models, and one of its major applications is the motif finding problem. EM facilitates parameter estimation in models with incomplete data [48]. In the motif finding problem, we have a dataset of unaligned DNA sequences and we need to find a common motif present in all the sequences, possibly with a few mutations. The information available is the set of unaligned sequences and the length W of the motif; the problem therefore boils down to finding the starting position of the motif in each sequence. EM uses a probabilistic approach: it estimates the parameters of the probabilistic model that might have generated the dataset. The model has two components: the first is the set of motifs of fixed width, and the second covers all the other positions in the dataset, described as the background [49].

This chapter explains the EM algorithm and the various steps involved in it. An example is also presented to provide a clear picture of the algorithm.

3.1 Algorithm

        Do et al. [48] provide a very good analogy for the EM algorithm using a coin flipping experiment. Consider two coins, A and B, whose probabilities of landing heads (θA, θB) have to be estimated. To estimate these probabilities, we randomly choose one of the two coins and toss it 10 times, and repeat this 5 times, so the experiment involves 50 coin tosses in total.

Let z = (z1, z2, z3, z4, z5) record which coin (A or B) was selected in each of the 5 rounds, and let x = (x1, x2, x3, x4, x5) record the outcomes, where xi is the number of heads observed in the ith round of 10 tosses.

If we note down both which coin was selected and the outcome of each toss, the calculation of θA, θB is straightforward using the following formulae:

    θA = (number of heads obtained using coin A) / (total number of tosses using coin A)
    θB = (number of heads obtained using coin B) / (total number of tosses using coin B)

However, if we note down only the outcomes of the tosses and not the selected coins, the calculation of θA, θB becomes complex. We can solve the problem using maximum likelihood, where an intuitive guess is made about the incomplete data to make it complete. We start with some initial values θA0, θB0 and calculate how likely it is that A or B generated the observed x values. That is, we make probabilistic assumptions about z (whether each round used A or B) and apply the maximum likelihood procedure to estimate θA1, θB1. We repeat this procedure until the values of θA, θB are approximately the same in two consecutive iterations; at that point the algorithm is said to have converged. The speed of convergence depends heavily on the initial guess.
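
As a small illustration of one such update (the numbers here are assumed for illustration, in the spirit of the example in [48]): suppose θA0 = 0.6, θB0 = 0.5 and one round of 10 tosses yields 8 heads. The likelihood of that outcome is 0.6^8 × 0.4^2 ≈ 0.0027 under coin A and 0.5^10 ≈ 0.0010 under coin B, so the posterior probability that the round used coin A is 0.0027 / (0.0027 + 0.0010) ≈ 0.73. The round's 8 heads are then credited fractionally, about 0.73 × 8 ≈ 5.9 to A and 0.27 × 8 ≈ 2.2 to B, and likewise for the tails. Doing this for all 5 rounds and re-estimating θA, θB from the fractional counts completes one EM iteration.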

        The EM algorithm for motif finding is similar to the coin flipping analysis. In the coin flipping experiment, we estimate which coin, A or B, was used; in the EM algorithm, we estimate the probability distribution of the missing information using the parameters available [48]. The algorithm iterates over two steps. In the first, the expectation step, we estimate the probability distribution of the missing data using the current model. In the second, the maximization step, we re-estimate the parameters of the model using these estimates. Both steps are repeated until the algorithm converges.

3.2 Application of EM to motif finding problem

In motif finding, we are given a dataset of sequences, S, and we need to find motifs, which are subsequences occurring in them. The sequences can be probabilistically divided into two parts, namely motif and background; the background model contains all the subsequences that are not part of the motif. Figure 3.1 illustrates the model.

Figure 3.1 Probabilistic model for motif finding problem

The algorithm assumes that there is exactly one motif occurrence in each sequence; this is referred to as the one-occurrence-per-sequence model [7]. Let the length of the motif be W. The location of the motif in each sequence is unknown, and we need to estimate it. The algorithm starts with an initial guess of the offset locations of the shared motif in the dataset. Assuming the guess is correct, the algorithm estimates the probability zij that the shared motif begins at location 'j' in sequence 'i' [7]. In the maximization step, zij is used to calculate the probability θlc of letter 'l' occurring in column 'c' of the shared motif, for each letter of the alphabet and 1 ≤ c ≤ W. EM then re-estimates zij from θlc, and the loop repeats until the change in θ between two consecutive iterations is very small. The calculation of z and θ is explained in detail in the next section. The flowchart below shows the steps and flow of the EM algorithm.

Figure 3.2 Flowchart of EM algorithm

3.2.1 Expectation step

        To explain the different steps of the EM algorithm, we first define a few variables. Let S be the dataset of sequences and Si the ith sequence. Let the length of the motif be W and the length of each sequence be L; we assume all sequences are of equal length. As defined earlier, let zijx be the estimated probability after 'x' iterations that the motif starts at position 'j' in sequence 'i', and let θlcx be the probability after 'x' iterations that letter 'l' occurs in position 'c' of the motif [7]. We define another variable Xij which equals 1 if the motif occurs at position 'j' in sequence 'i', and 0 otherwise.

        To calculate zijx, we first calculate the probability of sequence Si, assuming the motif starts at position j, using the following equation [7]:

    P(Si | Xij = 1, θ) = θ(l1, 1) · θ(l2, 2) · … · θ(lW, W),   where lc = si,j+c−1

that is, the product over the W motif columns of the probability of the letter observed in each column.

For a more accurate probability, the background probabilities of the positions outside the candidate motif may also be included in the above calculation. Bayes' rule is then used to estimate zx from θx. According to Bayes' rule [7],

    P(A | B) = P(B | A) · P(A) / P(B)

Applying Bayes' rule to estimate zijx, we get the following equation [7]:

    zijx = P(Si | Xij = 1, θx) · P0(Xij = 1) / Σk [ P(Si | Xik = 1, θx) · P0(Xik = 1) ]

where the sum over k runs over all possible start positions, 1 ≤ k ≤ L − W + 1.

Here P0(Xij = 1) is the prior probability, fixed before any data is seen, that the shared motif begins at location 'j' in sequence 'i'. It is assumed to be equal and uniform over all possible start positions:

    P0(Xij = 1) = 1 / (L − W + 1)

After substituting this value of P0, the common factor cancels and the equation for zijx simplifies to the following [7]:

    zijx = P(Si | Xij = 1, θx) / Σk P(Si | Xik = 1, θx)

After estimating zijx using the above equations, we normalize the values of zij within each sequence such that

    Σj zij = 1

3.2.2 Maximization step

        In the maximization step, we calculate θlc, the probability of letter 'l' occurring in position 'c' of the motif. Let nl be the total count of letter 'l' in the dataset, and nlc the expected count of letter 'l' in position 'c' of the motif. The expected counts are calculated using the following formulae [50]:

    nlc = Σi Σ{j : si,j+c−1 = l} zij        for c > 0

    nl0 = nl − Σc=1..W nlc                 for c = 0

The value c = 0 represents the background. Using the values of nlc, we calculate θlc, the probability of letter 'l' occurring in position 'c' of the motif, as shown in the equation below [50]:

    θlc = (nlc + dlc) / Σl' (nl'c + dl'c)

The pseudo-counts dlc are added to prevent any probability from becoming exactly zero.

3.3 An example showing application of EM algorithm to motif finding

In the previous section, we explained the various steps involved in the EM algorithm. In this section we present an example to give a clear picture of the flow of the algorithm. Let us take a set of 4 DNA sequences with motif width 3. The dataset of sequences is shown below:

ACTATAGC

TGCGAAAT

AAAGGATC

GTCTGGTG

Table 3.1 Dataset of sequences

Let initial starting positions of motifs be:

For sequence 1: 4

For sequence 2: 2

For sequence 3: 6

For sequence 4: 3

The motif matrix formed using the above random starting positions is:

ATA

GCG

ATC

CTG

Table 3.2 Motif matrix

And the non-motif matrix formed is:

ACTGC

TAAAT

AAAGG

GTGTG

Table 3.3 Non-motif matrix

Next, we form the probability matrix, which captures the frequency of the bases occurring at each motif position.

Probability matrix

The probability matrix is formed by counting the frequency of each base at each position. For example, the number of A's occurring in the first position of the motif matrix above is 2. Similarly, we find the frequencies of C, T and G at each position of the motif matrix. The values are tabulated below:

    Base    Position 1    Position 2    Position 3
    A       2             0             1
    C       1             1             1
    T       0             3             0
    G       1             0             2

Table 3.4 Frequency of bases

To convert these counts to probabilities, we divide each value by 4, the number of sequences, since each column of the motif matrix contains one base from each of the four sequences.

The probability matrix formed is:

    Base    Position 1    Position 2    Position 3
    A       0.50          0.00          0.25
    C       0.25          0.25          0.25
    T       0.00          0.75          0.00
    G       0.25          0.00          0.50

Table 3.5 Probability of bases in motif

We also find the background probabilities by counting the number of A, C, T and G in the non-motif matrix and dividing by the total number of elements in the non-motif matrix. The background probability is represented as θ0. For the non-motif matrix above this gives θ0(A) = 7/20 = 0.35, θ0(C) = 2/20 = 0.10, θ0(T) = 5/20 = 0.25 and θ0(G) = 6/20 = 0.30. After adding the background probabilities, the full probability matrix is shown below.

    Base    Position 1    Position 2    Position 3    Background
    A       0.50          0.00          0.25          0.35
    C       0.25          0.25          0.25          0.10
    T       0.00          0.75          0.00          0.25
    G       0.25          0.00          0.50          0.30

Table 3.6 Probability matrix

Expectation step

After computing the probability matrix, EM estimates the probability zij that the shared motif starts at position 'j' in dataset sequence 'i', using the probability matrix just formed. To find z11 (the probability that the motif occurs at position 1 of sequence 1), the motif probabilities of the bases covered by the candidate motif are multiplied by the background probabilities of the remaining bases.

Table 3.7 Probability of motif occurring at a particular position
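
For instance, for sequence 1, ACTATAGC, a motif start at position 1 places A, C, T in motif columns 1, 2, 3 and the remaining bases A, T, A, G, C in the background, so (up to normalization)

    z11 ∝ θA1 · θC2 · θT3 · θ0(A) · θ0(T) · θ0(A) · θ0(G) · θ0(C)
        = 0.50 × 0.25 × 0.00 × 0.35 × 0.25 × 0.35 × 0.30 × 0.10 = 0

The zero arises because no row of the motif matrix has T in its third column; in practice pseudo-counts keep every θ strictly positive, so no candidate position is ruled out entirely.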

After finding ‘z' values for all the sequences, we normalize z values such that each row adds up to 1 (assuming one motif per sequence).

Once all the z values are calculated, we proceed to the maximization step.

Maximization step

In the maximization step, zij is used to re-estimate θlc, the probability of letter 'l' in column 'c' of the motif, for each letter in the alphabet and 1 ≤ c ≤ W, where W is the width of the motif.

In the above example, to calculate the probability of 'A' occurring in the first column of the motif, we look at all the positions where a motif could start with 'A'. In sequence 1, ACTATAGC, the motif can start with 'A' at locations 1, 4 and 6. Similarly, for the other sequences, we find the locations where the motif can start with 'A' and add the corresponding 'z' values. The sum of the 'z' values where 'A' is in the first column of the motif is divided by the total sum of all the 'z' values to get the probability of 'A' occurring in column 1:

    θA1 = (Σ zij over all (i, j) with an 'A' at position j of sequence i) / (Σ zij over all i, j)

Similarly, we find θA2, θA3, θA0, θC1, θC2, θC3, θC0, θT1, θT2, θT3, θT0, θG1, θG2, θG3 and θG0. To prevent probabilities from going to zero, pseudo-counts may be added.

The new values of θ are compared with the previous values in the probability matrix. If every change is less than ε, the algorithm stops; otherwise, the expectation and maximization steps are repeated.

4 Hardware Implementation of Expectation Maximization Algorithm

This chapter describes the hardware implementation of the Expectation Maximization algorithm. The various modules of the implementation are explained here; all of them are implemented in Verilog HDL. The chapter also details the architecture and explains the benefits of the chosen approach.

4.1. Details of Design

The inputs to the algorithm are the dataset of sequences and the motif width W. The dataset of sequences is stored in a RAM named 'sequences'. As shown in the figure below, the inputs to the RAM are din_seq, raddr_seq, waddr_seq, we_seq, re_seq, clk_seq and rst_seq; the output is dout_seq.

Figure 4.1 Sequences RAM

din_seq - contains the input data sequence

raddr_seq - contains address of location to be read

waddr_seq - contains address of location to be written to

we_seq - asserted high for write operation

re_seq - asserted high for read operation

clk_seq - clock for RAM module, all operations occur synchronous to clock

rst_seq - all signals in sequences RAM module are set to initial values when rst_seq is low
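
A minimal sketch of such a RAM in Verilog is shown below. The data and address widths are illustrative assumptions; the actual design sizes them from the dataset dimensions:

    // Simple synchronous single-port RAM for the packed sequence data.
    module sequences_ram #(
        parameter DATA_W = 16,           // assumed word width
        parameter ADDR_W = 8             // assumed address width
    ) (
        input  wire              clk_seq,
        input  wire              rst_seq,    // active low
        input  wire [DATA_W-1:0] din_seq,
        input  wire [ADDR_W-1:0] raddr_seq,
        input  wire [ADDR_W-1:0] waddr_seq,
        input  wire              we_seq,
        input  wire              re_seq,
        output reg  [DATA_W-1:0] dout_seq
    );
        reg [DATA_W-1:0] mem [0:(1<<ADDR_W)-1];

        always @(posedge clk_seq) begin
            if (!rst_seq)
                dout_seq <= {DATA_W{1'b0}};     // return output to initial value
            else begin
                if (we_seq)
                    mem[waddr_seq] <= din_seq;  // synchronous write
                if (re_seq)
                    dout_seq <= mem[raddr_seq]; // synchronous read
            end
        end
    endmodule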

        DNA base pairs are formed from the chemical bases adenine (A), guanine (G), cytosine (C), and thymine (T), so the dataset of sequences consists of the four bases A, G, C and T. To represent four bases in binary, two bits are required. The bases could also be represented in integer format, but that would occupy more memory, so the binary representation is used. The bases are encoded as follows:

A - 00

C - 01

T - 10

G - 11

For example, the sequence ATGCAA is represented in RAM as 001011010000. A Perl script has been developed to convert the sequences over A, C, T and G into this binary format.
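
Inside the Verilog modules, this encoding can be captured with local parameters, for example (an illustrative fragment, to be placed inside a module):

    // Two-bit encoding of the four DNA bases.
    localparam [1:0] BASE_A = 2'b00,
                     BASE_C = 2'b01,
                     BASE_T = 2'b10,
                     BASE_G = 2'b11;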

The key to the hardware implementation of the Expectation Maximization algorithm is dividing the whole algorithm into steps and realizing the states within each step. The design is divided into hardware sub-modules, and each module is realized as a finite state machine: a design with a finite number of states, transitions between those states, and actions to perform in each state. The next section explains the sub-modules of the design, together with the state machine developed for each module.

4.2. Modules

        The hardware implementation of the EM algorithm consists of six main sub-modules, namely random number generation, formation of the motif and non-motif matrices, formation of the probability matrix, the expectation step, the maximization step, and the comparison between iterations. For each module, the state machine developed to realize the required logic is presented. The algorithm begins with the generation of random starting positions; a linear feedback shift register is used to generate the random numbers, as explained in the random number generation section. Using these random numbers, two matrices known as the motif and non-motif matrices are created: the motif matrix contains the subsequences that are candidate motifs, and the non-motif matrix contains the background sequences. After their formation, the probability of each base at each location in the motif is calculated, along with the background probabilities. With these model probabilities, the expectation and maximization steps are executed until the difference in the probability values between two successive iterations is very small.

4.2.1. Random number implementation

This module is used for random number generation; the random numbers serve as the guessed start locations of the motif in the sequences. For generating random numbers in hardware, linear feedback shift registers (LFSRs) are used. An LFSR is a chain of shift registers whose input is a linear function of its previous state [51]; the linear function is formed by XORing a few bits of the state. The figure below shows a 5-bit LFSR.

Figure 4.2. 5-bit LFSR [52]

        The 5-bit LFSR is constructed using five D flip-flops and an XOR gate whose inputs are tap 1 and tap 4. Tap 1 is the output of the second flip-flop from the left and tap 4 is the output of the last flip-flop; the bit positions fed to the XOR gate are referred to as taps. The output of the XOR gate is fed into the least significant bit. The seed of the LFSR is its initial value. The generated values are predictable, since the current state depends only on the previous state and the LFSR is a finite state machine [51], but a good feedback function produces a very long sequence of bits. For an n-stage LFSR, the maximum number of random values generated is 2^n − 1; the all-zero value is never produced. Moreover, an LFSR seeded with all zeroes never leaves that state, so we must make sure that the seed is not 0. To obtain a maximum-length sequence, the following rules should be followed [51]:

  1. The number of taps should always be even to achieve maximum length.
  2. The taps should not share a common divisor.
  3. If we get a maximum length sequence using taps as [n, X, Y, Z, 0], then its mirror sequence [n, n-Z, n-Y, n-X, 0] would also generate maximum length.

A lot of research has been done over the years to determine which tap combinations yield maximum-length sequences, and published tables of such taps have been used to choose the taps for our design.
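
The following sketch shows one way to describe a 5-bit maximal-length LFSR in Verilog. The tap positions here correspond to the polynomial x^5 + x^3 + 1, one known maximal-length choice; the actual design takes its taps from the published tables:

    module lfsr5 (
        input  wire       clk,
        input  wire       rst_n,
        output reg  [4:0] q
    );
        // Feedback is the XOR of two taps; bits 4 and 2 give a
        // maximal-length (2^5 - 1 = 31 state) sequence.
        wire feedback = q[4] ^ q[2];

        always @(posedge clk or negedge rst_n) begin
            if (!rst_n)
                q <= 5'b00001;           // non-zero seed; the all-zero state locks up
            else
                q <= {q[3:0], feedback}; // shift left, feedback into the LSB
        end
    endmodule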

4.2.2. Formation of motif and non-motif matrix

        After the random numbers are generated, the sequences are divided into two different sets based on them. The random numbers are used as guesses for the starting location of the motif in each sequence; if there are 50 sequences in the dataset, 50 random numbers are generated. Depending on the length of the motif, two matrices, motif and non-motif, are formed: the motif matrix holds the candidate motifs and the non-motif matrix holds the background sequences. Consider the following example, where the motif width is 4 and the dataset of sequences is shown below.

ACTGGGTC

GTCGACTA

GACTATAT

TTTCCCAT

CTCAGCTA

Table 4.1. Dataset of sequences

Let the random numbers generated be 3, 1, 3, 2, 4. The motif and non-motif matrices formed from these starting positions are shown below in Table 4.2.

    Motif matrix    Non-motif matrix
    TGGG            ACTC
    GTCG            ACTA
    CTAT            GAAT
    TTCC            TCAT
    AGCT            CTCA

Table 4.2. Motif and non-motif matrix

        A finite state machine has been developed for the formation of the motif and non-motif matrices. Conceptually, a 'for' loop is needed to iterate through all the sequences, but a 'for' loop in Verilog is unrolled at synthesis and would create redundant logic in hardware. To avoid this, a state machine is used. The state machine has four states, S0, S1, S2, and S3, and two flags, f1 and f2. Flag f1 determines whether the current base belongs to the motif or the non-motif matrix, and flag f2 indicates whether the loop must repeat for further sequences in the dataset. These flags are the inputs to the state machine. In the hardware implementation details, bases are referred to as bits, since they are stored in binary format.

f1 = 0 (if bit is not in motif, implies the bit has to be stored in non-motif RAM)

   = 1 (if bit has to be stored in motif RAM)

f2 = 0 (if all sequences are covered)

   = 1 (if there are more sequences to be stored in motif and non-motif RAM's)

Depending on the flags, the state machine moves from one state to another. S0 is the initial state, in which all flags and registers are reset. S1 stores bits in the motif matrix, and S2 stores bits in the non-motif matrix. S3 is the final state, indicating that all the sequences have been covered and the motif and non-motif matrices are formed. The state diagram below shows the transitions between states depending on the flags.

Figure 4.3. State diagram for formation of motif and non-motif matrix
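
The skeleton below sketches how this state machine can be written in Verilog. The transition conditions follow the description above and the state diagram in Figure 4.3; the datapath actions in each state are elided. The state machines of the later modules follow the same pattern with different states and flags:

    module motif_split_fsm (
        input  wire clk,
        input  wire rst_n,
        input  wire f1,    // 1: current bit belongs in the motif RAM
        input  wire f2,    // 1: more sequences remain to be processed
        output reg  done   // asserted once both matrices are formed
    );
        localparam [1:0] S0 = 2'd0,   // reset state
                         S1 = 2'd1,   // store bit in motif RAM
                         S2 = 2'd2,   // store bit in non-motif RAM
                         S3 = 2'd3;   // all sequences covered
        reg [1:0] state;

        always @(posedge clk or negedge rst_n) begin
            if (!rst_n) begin
                state <= S0;
                done  <= 1'b0;
            end else begin
                case (state)
                    S0:      state <= f1 ? S1 : S2;
                    S1, S2:  state <= !f2 ? S3 : (f1 ? S1 : S2);
                    S3:      done  <= 1'b1;
                    default: state <= S0;
                endcase
            end
        end
    endmodule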

4.2.3. Formation of probability matrix

        This module calculates the probability of occurrence of each base at every location in the motif, as well as the background probability. Let us consider the motif and non-motif matrices formed in the previous example:

    Motif matrix    Non-motif matrix
    TGGG            ACTC
    GTCG            ACTA
    CTAT            GAAT
    TTCC            TCAT
    AGCT            CTCA

Table 4.3. Motif and non-motif matrices

The probabilities of occurrence of A, C, T and G at positions 1, 2, 3 and 4 are calculated in this module, along with the background probability. The probability of 'A' occurring in position 1 is 0.2, since A occurs only once among the 5 entries in that column. Similarly, the probabilities of C, T and G in position 1 are 0.2, 0.4, and 0.2. After position 1, the probabilities for positions 2, 3 and 4 are calculated in the same fashion. The table below shows the probability matrix formed for the dataset.

    Base    Position 1    Position 2    Position 3    Position 4    Background
    A       0.2           0.0           0.2           0.0           0.35
    C       0.2           0.0           0.6           0.2           0.30
    T       0.4           0.6           0.0           0.4           0.30
    G       0.2           0.4           0.2           0.4           0.05

Table 4.4. Probability matrix

        A state machine has been developed to realize the logic explained above. Three flags namely f1, f2 and f3 are used to decide transition between states. Let N be the number of sequences, L be the length of each sequence and M be length of motif. Let i, j and k be three different counters.

Flag f1 = 1 if i < N

   f1 = 0 if i >= N

Flag f2 = 1 if j < M

   f2 = 0 if j >= M

Flag f3 = 1 if k < L-M

   f3 = 0 if k >= L-M

The purpose of the flags is to keep the state machine in a given state until the corresponding flag changes its value; this essentially implements a 'for' loop. The state machine built for this module has 7 states. S0 is the initial state. States S1 to S3 calculate the probability values at the motif locations, and states S4 to S6 calculate the background probability. The diagram below shows the transitions between states depending on the flag values.

Figure 4.4. State machine for formation of probability matrix

4.2.4. Expectation step

     The expectation step calculates the expected log likelihood based on the currently estimated probability model. After the formation of the probability matrix, the probability zij that the shared motif occurs at a particular location is calculated, where 'i' is the current sequence and 'j' is the location within it. The range of 'i' is 1 to N, where N is the number of sequences, and the range of 'j' is 1 to L-M, where L is the length of each sequence and M is the length of the shared motif. The calculation of zij is explained in Chapter 3. After the calculation, the zij values for each sequence are normalized so that they sum to 1, since the number of motifs per sequence is assumed to be one.

     For implementation of expectation step in Verilog HDL, a state machine has been developed with five states. The inputs to the state machine are three flags, namely f1, f2 and f3 as shown in Figure 4.5. Let N be the number of sequences, L be the length of each sequence and M be length of motif. Let i1, j1 and k1 be three different counters.

Flag f1 = 1 if i1 < N

   f1 = 0 if i1 >= N

Flag f2 = 1 if j1 < L-M

   f2 = 0 if j1 >= L-M

Flag f3 = 1 if k1 < L

   f3 = 0 if k1 >= L

S0 is the initial state; all flags and variables are reset in this state. S1, S2 and S3 are used for the calculation of zij. The flags ensure that the probability values are calculated for every sequence at every location: flag f1 keeps the machine iterating until all the sequences are covered, and flag f2 ensures that the computations are done for every valid location in a sequence. The state machine reaches S4 after all the probability values have been calculated, and the normalization is performed in this state.

Figure 4.5. State machine for expectation step

4.2.5. Maximization step

     In the maximization step, we compute the parameters that maximize the log likelihood calculated in the expectation step. For motif matching, the probability of occurrence of each base at each location in the motif is recalculated using the zij values computed in the expectation step; essentially, the probability matrix is rebuilt from the expected parameters. The calculation of the probability θlc (where 'l' is the location in the motif and 'c' is the current base) is explained in Chapter 3. The background probability is also recalculated in this step.
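
Again, the precise definition of θlc is given in Chapter 3; in the standard EM update it is the z-weighted count of base c at motif position l, normalized over the four bases:

$$ \theta_{l,c} = \frac{\sum_{i=1}^{N} \; \sum_{j \,:\, X_{i,\,j+l-1} = c} z_{ij}}{\sum_{c'} \sum_{i=1}^{N} \; \sum_{j \,:\, X_{i,\,j+l-1} = c'} z_{ij}}, $$

where the inner sums run over those start positions j at which base c (respectively c') occupies offset l of the candidate motif.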

     To perform these computations, a state machine has been developed and implemented in Verilog HDL. It is similar to the state machine developed for the expectation step, but the flag conditions and the operations performed in each state differ. This state machine has five states, and its inputs are three flags, namely f1, f2 and f3. Let N be the number of sequences, L the length of each sequence and M the length of the motif, and let i2, j2 and k2 be three counters.

Flag f1 = 1 if i2 < N

   f1 = 0 if i2 >= N

Flag f2 = 1 if j2 < M

   f2 = 0 if j2 >= M

Flag f3 = 1 if k2 < L-M

   f3 = 0 if k2 >= L-M

S0 is the initial state; all the flags and registers are reset in this state. The state machine stays in state S1 for L-M iterations; in this state, the base at each location is matched and the corresponding counter is incremented. After matching all the bases of the current sequence, the state machine transitions to state S2 and stays there for M iterations. This entire loop is repeated for all the sequences: counter i2 is incremented once the computations for a particular sequence are complete, and in state S3 the counters j2 and k2 are reset so that the operations of states S1 and S2 can be repeated for the next sequence. This continues until counter i2 reaches N, at which point the state machine transitions to the final state, where the background probability is calculated. The state machine for the maximization step is shown below in Figure 4.6.

Figure 4.6. State machine for maximization step

4.2.6. Comparison between iterations

     In this module, the values of θlc after the maximization step are compared with the initial probability values in the probability matrix. If the difference is less than ε, the algorithm stops; if it is greater than ε, the required variables are reset and the expectation and maximization steps are repeated. Both the comparison and the resetting of the variables to their required values are performed in this module. The values of the probability matrix are stored in two different RAMs: the first RAM contains the values at the beginning of the iteration, and the second contains the probability values at the end of the maximization step. When the algorithm ends, the starting locations of the motif are derived from the probability values and stored in a separate RAM.
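
A minimal sketch of the per-entry comparison is given below, assuming a fixed-point representation for the probability values; the 16-bit width, the encoding of ε and the port names are all assumptions, since the text fixes only the test |θnew - θold| < ε.

    // Per-entry convergence check between the two probability RAMs.
    module converge_check (
        input  wire [15:0] theta_old,  // entry read from the first RAM
        input  wire [15:0] theta_new,  // entry read from the second RAM
        output wire        within_eps
    );
        // ~0.001 in an assumed Q0.16 fixed-point format (66/65536)
        localparam [15:0] EPSILON = 16'd66;

        wire [15:0] diff = (theta_new > theta_old) ? (theta_new - theta_old)
                                                   : (theta_old - theta_new);
        assign within_eps = (diff < EPSILON);
    endmodule

The surrounding state machine would read the two RAMs entry by entry and AND the within_eps results together before declaring convergence.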

4.3. Architecture

     This section explains the architecture of the hardware design. The connections between the different modules are shown below in Figure 4.7.

     The architecture consists of six main modules, namely the LFSR module, matrix formation module, probability matrix module, expectation module, maximization module and comparison module, all of which have been implemented in Verilog HDL. The LFSR module generates random numbers, which are used as the start positions of the motif: if there are N sequences, N random numbers are generated, each corresponding to the start location of the motif in one sequence. The random numbers generated are stored in a RAM. Depending on the start locations and the width of the motif, two matrices, the motif matrix and the non-motif matrix, are formed in the matrix formation module and stored in two different RAMs.
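
As an illustration of the random number generator, a minimal maximal-length LFSR is sketched below. The text fixes only that an LFSR is used; the 8-bit width, the tap positions and the seed handling are assumptions.

    // 8-bit Fibonacci LFSR for the polynomial x^8 + x^6 + x^5 + x^4 + 1,
    // which cycles through all 255 nonzero states.
    module lfsr8 (
        input  wire       clk, rst,
        input  wire [7:0] seed,   // per-run seed; must be nonzero
        output reg  [7:0] rnd
    );
        wire feedback = rnd[7] ^ rnd[5] ^ rnd[4] ^ rnd[3];

        always @(posedge clk) begin
            if (rst) rnd <= (seed != 8'd0) ? seed : 8'd1;  // avoid all-zero lockup
            else     rnd <= {rnd[6:0], feedback};
        end
    endmodule

The raw LFSR output would still have to be mapped into the valid range of motif start positions (1 to L-M), for example by a modulo reduction; that step is omitted here.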

     The probability matrix module forms the probability matrix using the motif and non-motif matrices; the probability of occurrence of each base at each location is calculated in this module. After formation of the probability matrix, the expectation module calculates the 'z' values, and the maximization module rebuilds the probability matrix from them. The values of the probability matrix at the end of an iteration are compared with the values at the beginning of that iteration in the comparison module. If the difference is sufficiently small, the algorithm stops execution and the motif start locations are stored in a RAM; if the difference is large, the algorithm repeats.

     The design is divided into six modules to facilitate pipelining. The LFSR module generates the random numbers for the next iteration as soon as the probability matrix has been formed, since the current random numbers are no longer needed at that point. The whole design could be organized as a five-stage pipeline, but this would increase the hardware resources required; hence, we have chosen to pipeline only the random number generation with the remaining modules.

5 Performance Analysis

In this chapter, we evaluate our design by testing it on various data sets. Information about the data sets used for evaluation, and the reasons for choosing them, is presented here. The design is tested on 12 different data sets. The results section presents the performance of the design on all the data sets in terms of the statistical parameters explained in Chapter 2. We also compare the design with its software counterpart, MEME, in terms of computation time.

The performance of the design is evaluated on two main aspects: quality and computation time. Quality is essential to ensure that the results obtained are valid; in order to test it, we have computed all the statistical parameters for every data set. A low execution time is equally important, so we have compared the running time of our design with that of the most popular software tool, MEME.

5.1 Experimental Setup

     The experimental setup for our design consists of several modules, namely the random number generation module, matrix formation module, probability matrix module, expectation module, maximization module, comparison module and the RAM modules. All the modules are connected as shown in Figure 5.1.

     The data set is stored in a RAM named 'sequences'. The data set supplied consists of the bases A, C, T and G; to convert it to binary format, we have developed a Perl script. The random number generation module uses an LFSR to generate random numbers, which serve as the starting positions of the motif. The convergence of the EM algorithm depends strongly on good starting positions, so random number generation is a very important step in this design. After the random numbers are generated, the sequences are divided into two matrices: all the predicted motifs are stored in the motif matrix and the background sequences in the non-motif matrix.
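
For concreteness, the sketch below shows one way the 2-bit base codes stored in the 'sequences' RAM could be consumed on the hardware side. The particular code assignment (A = 00, C = 01, G = 10, T = 11) is an assumption made for illustration; the text states only that the Perl script converts the bases to binary.

    // Decode a 2-bit base code into one-hot indicator signals.
    module base_decode (
        input  wire [1:0] code,   // 2-bit base read from the sequences RAM
        output wire       is_a, is_c, is_g, is_t
    );
        assign is_a = (code == 2'b00);
        assign is_c = (code == 2'b01);
        assign is_g = (code == 2'b10);
        assign is_t = (code == 2'b11);
    endmodule

The one-hot outputs can then drive the per-base counters used when the probability matrix is formed.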

     The probability matrix module generates the probability matrix, which contains the probability of each nucleotide at each position in the motif. This matrix is used to generate the 'z' values in the expectation module, and the 'z' values are in turn used to recalculate the probability values in the maximization module. The initial values of the probability matrix are compared with the new probability values in the comparison module; if the difference is less than 0.001, the algorithm stops, and if it is greater than 0.001, the algorithm iterates again. The maximum number of iterations is set to 50.

     All the modules have been implemented in Verilog HDL. The simulator used was ModelSim PE 6.5, and the simulations were carried out on a Cygwin platform running on a Windows workstation. The Perl script was executed using the Perl interpreter available in Cygwin.

5.2 Functional Verification

     The Verilog HDL model of the design serves as a description of the logic that can later be synthesized. The behavioral description offered by Verilog HDL is used here to emulate the algorithm without worrying about the details of the hardware implementation, thus saving time and money. The functional verification of the design consists of two main steps. First, the functionality of each module was verified separately, using test-benches developed for this purpose. After all the modules were verified individually, the whole architecture was verified: a test-bench was developed to check the basic functionality, and the design was then tested on 12 different data sets. The results obtained are presented in later sections. The simulations were repeated 10 times on each data set, with a different seed for each run.
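
As a flavor of the per-module test-benches described here, a minimal self-checking test-bench for the small base_decode sketch shown earlier is given below; the actual test-benches for the design are larger and parameterized by seed.

    // Exhaustive self-checking test-bench for base_decode: applies all
    // four codes and checks that exactly the expected output is high.
    module tb_base_decode;
        reg  [1:0] code;
        wire       is_a, is_c, is_g, is_t;
        integer    i;
        integer    errors;

        base_decode dut (.code(code), .is_a(is_a), .is_c(is_c),
                         .is_g(is_g), .is_t(is_t));

        initial begin
            errors = 0;
            for (i = 0; i < 4; i = i + 1) begin
                code = i;
                #1;
                if ({is_a, is_c, is_g, is_t} !== (4'b1000 >> i)) begin
                    errors = errors + 1;
                    $display("FAIL: code=%b", code);
                end
            end
            if (errors == 0) $display("PASS");
            $finish;
        end
    endmodule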

5.3 Data sets used for performance analysis

     To evaluate our design, 12 data sets were selected from the data sets created by Tompa et al. [8]. These data sets were prepared by the authors specifically for assessing the performance of computational motif matching tools. They are unbiased towards any specific algorithm and hence provide a uniform ground for comparison. Moreover, the data sets are very challenging, and the prediction of potential transcription factor binding sites on them is very complex. The data sets were prepared from the TRANSFAC database [8]. Tompa et al. created 52 data sets in total, of which 6 are from fly, 26 from human, 12 from mouse and 8 from yeast. The data sets are of three types: real, generic and Markov. Real data sets contain real promoter sequences, generic data sets were randomly chosen from the same genome, and Markov data sets were generated using a Markov chain of order 3 [8]. Figure 5.2 shows the distribution of data sets based on the number of sequences in each data set.

Figure 5.2 Distribution of data sets

     As shown in the graph above, the data sets contain different numbers of sequences, varying from 1 to 35 per data set. Good data sets should have different numbers of sequences and varying sequence lengths in order to cover the corner cases. Figure 5.3 shows the distribution of sequence lengths across the data sets; all the sequences within a data set are of equal length. The figure shows that the sequence lengths range from 50 bp to 3000 bp, which ensures that the algorithms are tested on varied sequence lengths. Some algorithms perform well on short sequences but poorly on long ones, and data sets with varied sequence lengths help in evaluating an algorithm for these cases.

Figure 5.3 Distribution of sequence length in data sets

The data sets are in FASTA format, a text-based format for representing nucleotide sequences in which each nucleotide is represented by a letter. Since the data is text-based, scripting languages like Perl and Python can easily be used to parse it and extract the necessary information.

Out of the 52 available data sets, 12 have been picked for the analysis of our design. Data sets of the 'Real' type were not used, as none of the software tools performed well on them; Tompa et al. have advised against using 'Real' data sets for performance analysis [8]. Data sets containing no motifs were removed, as were data sets that did not have motifs in at least 80% of their sequences. Of the 12 data sets picked for our analysis, 6 are of type 'Generic' and the other 6 of type 'Markov'. Within each type, 2 data sets are from human, 2 from yeast and 2 from mouse. The data sets were chosen so that they have varying sequence lengths and different numbers of sequences. The table below shows the data sets with their type, number of sequences and sequence lengths.

Table 5.1 Data sets with their type, number of sequences and sequence lengths

5.4 Performance Analysis

The following factors are taken into consideration for performance analysis.

  • Prediction accuracy and reliability
  • Speedup

A good design should be accurate and reliable, and should also produce results quickly. To verify that our design is accurate and reliable, we tested it on 12 data sets; the source of these data sets and the reasons for choosing them were explained in the previous section. On each data set, the design was simulated 10 times using different seeds, for a total of 120 experiments. All the simulations were carried out using ModelSim PE 6.5b on a Cygwin platform running on Windows. The statistical parameters explained in Chapter 2 were calculated for each of these 120 runs. The parameters calculated for performance analysis are listed below; their standard definitions are recalled after the list.

  • Nucleotide level sensitivity (nSn)
  • Nucleotide level positive predictive value (nPPV)
  • Nucleotide level performance coefficient (nPC)
  • Nucleotide level correlation coefficient (nCC)
  • Binding site level sensitivity (sSn)
  • Binding site level positive predictive value (sPPV)
  • Binding site level average site performance (sASP)
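
For convenience, the standard definitions of these statistics, as used by Tompa et al. [8] and explained in Chapter 2, are recalled here in terms of the true/false positive and negative counts at the nucleotide (n) and binding site (s) level:

$$ nSn = \frac{nTP}{nTP + nFN}, \qquad nPPV = \frac{nTP}{nTP + nFP}, \qquad nPC = \frac{nTP}{nTP + nFN + nFP}, $$

$$ nCC = \frac{nTP \cdot nTN - nFN \cdot nFP}{\sqrt{(nTP + nFN)(nTN + nFP)(nTP + nFP)(nTN + nFN)}}, $$

$$ sSn = \frac{sTP}{sTP + sFN}, \qquad sPPV = \frac{sTP}{sTP + sFP}, \qquad sASP = \frac{sSn + sPPV}{2}. $$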

After calculating the above parameters for each run, we also computed the mean, minimum value, maximum value, standard deviation and 95% confidence interval for each of them. The results are tabulated in Tables 5.2 - 5.9.

     The main aim of this work is to reduce the computation time of the EM algorithm. We have therefore compared the computation time of our design with that of MEME, a widely used motif matching tool built on the same basic EM principle. The speedup achieved by our design over MEME is summarized in later sections.

5.4.1 Prediction accuracy and reliability

     In this section, the results for all seven statistics are summarized. We calculated the statistics for each run, and then the minimum value, maximum value, mean, standard deviation and 95% confidence interval of each parameter for each data set. A Perl script was developed for these calculations; it uses the Perl modules Statistics::PointEstimation and Statistics::Descriptive to compute the mean, standard deviation and 95% confidence interval.

Nucleotide level sensitivity (nSn)

The nucleotide level sensitivity is calculated for all data sets, with each data set run 10 times. We calculated the minimum value, maximum value, mean, standard deviation and 95% confidence interval of nSn. The values are presented below in Table 5.2.

Table 5.2 Results of nucleotide level sensitivity on the data sets

     From the table, it can be inferred that the nucleotide level sensitivity lies approximately between 0.10 and 0.13. These values of nSn are similar to those obtained by Tompa et al. [8] in their analysis of motif matching tools. The nucleotide level sensitivity is low because knowledge of the underlying biology of regulatory mechanisms is still limited, which makes it very hard to develop a universal tool with high nucleotide level sensitivity.

Nucleotide level positive predictive value (nPPV)

The nucleotide level positive predictive value is calculated for all data sets, with each data set run 10 times. We calculated the minimum value, maximum value, mean, standard deviation and 95% confidence interval of nPPV. The values are presented below in Table 5.3.

Table 5.3 Results of nucleotide level positive predictive value on the data sets

     The values of nPPV range from 0.18 to 0.25. Among the tools tested by Tompa et al., the maximum value of nPPV is 0.30 and the average value is close to 0.11 [8]. This shows that the prediction of positive nucleotides by our design is comparable to the available software tools.

Nucleotide level performance coefficient (nPC)

The nucleotide level performance coefficient is calculated for all data sets, with each data set run 10 times. We calculated the minimum value, maximum value, mean, standard deviation and 95% confidence interval of nPC. The values are presented below in Table 5.4.

Table 5.4 Results of nucleotide level performance coefficient on the data sets

     From Table 5.4, it can be inferred that the values of nPC vary from 0.04 to 0.07. The average value of nPC for the software tools is 0.05 [8]. Hence, our design can be considered comparable to the other tools for predictions at the nucleotide level.

Nucleotide level correlation coefficient (nCC)

The nucleotide level correlation coefficient is calculated for all data sets, with each data set run 10 times. We calculated the minimum value, maximum value, mean, standard deviation and 95% confidence interval of nCC. The values are presented below in Table 5.5.

Table 5.5 Results of nucleotide level correlation coefficient value on the data sets

     We can observe from Table 5.5 that nCC falls in the range of 0.07 to 0.11. nCC can range from -1 to +1, where -1 indicates perfect anti-correlation and +1 indicates perfect correlation. As our values are only slightly above 0, there is very little correlation; the same behavior was observed for the other tools by Tompa et al. [8].

Binding site level sensitivity (sSn)

The binding site level sensitivity is calculated for all data sets, with each data set run 10 times. We calculated the minimum value, maximum value, mean, standard deviation and 95% confidence interval of sSn. The values are presented below in Table 5.6.

Table 5.6 Results of binding site level sensitivity on the data sets

     sSn estimates the fraction of known binding sites that are predicted. From Table 5.6, it can be observed that the site level sensitivity varies from 0.08 to 0.13. The low value of sSn is due to the incomplete knowledge of regulatory mechanisms and the complexity of DNA binding sites.

Binding site level positive predictive value (sPPV)

The binding site level positive predictive value is calculated for all data sets, with each data set run 10 times. We calculated the minimum value, maximum value, mean, standard deviation and 95% confidence interval of sPPV. The values are presented below in Table 5.7.

Table 5.7 Results of binding site level positive predictive value on the data sets

     From Table 5.7, we can observe that sPPV varies from 0.14 to 0.25. sPPV gives the fraction of predicted sites that are known; these values are better than the sSn values in the previous table. The average value of sPPV for the tools assessed by Tompa et al. is approximately 0.16 [8], so our design is comparable to the other tools in terms of sPPV.

Binding site level average site performance (sASP)

The binding site level average site performance was calculated for all data sets. The average site performance is the average of the site level sensitivity (sSn) and the site level positive predictive value (sPPV). The minimum value of sASP was computed as the average of the minimum sSn and the minimum sPPV, and the maximum values were calculated in the same way. The values are presented below in Table 5.8.

Table 5.8 Results of binding site level average site performance on the data sets

     From Table 5.8, we can observe that the values of sASP are in the range of 0.13 to 0.16.

The minimum and maximum values of all the statistical parameters are shown in the graph below in Figure 5.5.

Figure 5.5 Graph showing minimum and maximum values of statistical parameters

After calculating the statistical parameters, it can be observed that the performance of our design is comparable to that of the other tools in terms of prediction accuracy and reliability. The performance is similar across all the data sets, which shows that the design is not very sensitive to variations in sequence length or in the number of sequences per data set. The values of the statistical parameters are not very high, due to the complexity of biological systems; moreover, knowledge of transcription factors is still incomplete. We rely on the true binding sites recorded in the TRANSFAC database, but Tompa et al. [8] have noted that any database is prone to errors and hence cannot be treated as a gold standard for the evaluation of motif matching tools. In addition, the binding sites recorded in the TRANSFAC database are, on average, longer than real binding sites, which are normally short.

5.4.2 Speedup

     Speedup, in the context of our analysis, is defined as the ratio of the execution time of the software tool MEME to the execution time of our design. To determine the hardware execution time, the Verilog RTL code was synthesized and ported onto the Cyclone and Stratix II device families. Altera Quartus II was used for synthesis and for porting the design onto these devices. Quartus II is a software tool provided by Altera for the analysis and synthesis of HDL designs; it enables developers to compile their designs, perform timing analysis and examine RTL diagrams. For our design, we compiled the code in Quartus II and obtained the RTL view of the design. The summary of the synthesis is shown in the table below:

     The hierarchy tab displays the top-level design entity of the current project. After analysis and elaboration, it shows the top-level design entity and the names of any sub-entities of that entity. The Logic Cells (LCs), LUT-Only LCs, Register-Only LCs, LUT/Register LCs and Carry Chain LCs columns list the logic cells used by each design entity (including its sub-entities), with the number of logic cells instantiated by the entity at that level of the hierarchy given in parentheses.

Because of the limited number of pins on the FPGA and restrictions on the number of available devices, very large designs cannot be ported onto it.

Explain about frequency of operation and insert speedup table

6 Conclusions and Future Work

6.1 Conclusions

     This thesis presents a hardware design of the Expectation Maximization algorithm for motif finding. The design is divided into sub-modules, all of which are developed in Verilog HDL; as a result, the design can be used with any of the HDL simulators and synthesis tools available on the market. An architecture connecting all the modules has also been presented.

     Each module was first verified for correct functionality by subjecting it to a number of test cases, after which the entire design was functionally verified. Performance analysis was then carried out by testing the design on 12 data sets, which are unique and have varying sequence lengths and different numbers of sequences. The performance of the design was measured in terms of reliability, accuracy and speedup. The design was observed to perform well on all the data sets, with results almost equal to those of the software tools in terms of accuracy and reliability. Our design was observed to perform, on average, x times faster than the software tools.

6.2 Future work

     This work deals only with motif matching in DNA sequences. It can be extended in the future to motif matching in protein sequences; the current algorithms for that problem are computationally intensive and have high execution times, so an efficient hardware design would be very useful.

     The design is currently implemented on an FPGA. During synthesis and placement of the design, the default set of optimizations in Quartus II was used. The computation time could be reduced further with appropriate logic optimizations: if the critical paths are optimized using techniques such as factorization and equation flattening, the design will operate at a higher frequency.

     In this work, we have used the Expectation Maximization algorithm for motif matching. Analyses carried out by many researchers indicate that more efficient motif matching can be achieved by combining two or more algorithms. This work can therefore be extended to use other algorithms alongside Expectation Maximization to achieve better results.