Identification Of Transcription Factor Binding Sites Biology Essay


The question of what constitutes and causes a given gene's particular expression pattern has long been a subject of research. Many expression differences between genes can be explained by cis-acting regulatory elements that modulate the transcription of a gene through the binding of a protein. A host of other mechanisms can also regulate expression, so a major challenge in computational biology is to understand the mechanisms of all the functional elements involved in genome expression, including genes and their regulatory sequences. Methods for predicting genes from sequence data are now quite well developed. However, the equally important problem of identifying the regulatory elements of DNA remains a challenge (Stone J. R. and Wray G.A., 2001). Because regulatory elements are frequently short and variable, their identification and discovery using in silico algorithms is very laborious. Although the prediction of regulatory sites has been addressed for over 25 years, the problem is still far from solved. One reason is that the learning sample rarely contains more than 45-55 sites. However, even for large samples, it has proved extremely difficult to develop a good recognition rule. The chemistry of protein-DNA interaction is poorly understood, making it virtually impossible to derive a proper set of features for statistical or pattern recognition algorithms. Nevertheless, significant advances have been made in in silico methods for modeling and detecting DNA regulatory elements in bacterial genomes. The availability of >250 completely sequenced bacterial genomes and of high-throughput experimental methods for mapping protein-binding sites in DNA has contributed to the development of methods for detecting transcriptional regulatory elements. In some cases, even simple profile methods perform well, in the sense that they can correctly identify true sites if the number of alternatives is not too large.

In this study, we start with a comparison of in silico methods for the identification of transcription factor binding sites, which are expected to be useful for deciphering the genetic regulatory networks of an organism. Despite the availability of a large number of web tools, their strengths and weaknesses are not sufficiently understood. In Chapter 3, we designed a comprehensive set of performance measures and benchmarked sequence-based motif discovery tools using large datasets generated from the Escherichia coli genome and retrieved from the RegulonDB database. Key factors that affect the prediction accuracy were studied in detail. The nucleotide level and binding site level accuracies were very low, while the activator binding site level accuracy was relatively high, which indicates that the tools can usually capture at least one correct DNA motif in an input sequence. Our study illustrates the benefits and limitations of existing motif discovery web tools.

Continuing this study, Chapter 4 describes an evaluation of the prediction performance of various DNA motif discovery tools that use weight matrices as input query data to scan a whole genome for conserved motifs. For this benchmarking study, we retrieved the known PurR transcription factor regulated promoters of E. coli and derived different weight matrices through publicly available web tools such as RegulonDB, AlignACE, MEME, D-Matrix and Consensus (RSAT). The PurR weight matrix of RegulonDB was used as the reference matrix. The prediction performance of each matrix was evaluated statistically and later validated through ROC analysis.

Lastly, after evaluating the existing web tools, and to overcome their limitations, we developed a biological rule based tool for DNA motif discovery and weight matrix construction, called DNA-MATRIX, for searching potential transcription factor binding sites in whole genome promoter sequences (Chapter 5). This tool can produce different types of weight matrices according to user requirements, using a simple statistical approach for weight matrix construction. Matrices can be converted into different file formats as required. The tool makes it possible to identify conserved motifs in a co-regulated set of genes or in whole genome non-coding sequences.

All of these methods have been valuable in expanding our limited knowledge of regulatory elements in the genome.




Understanding the mechanisms governing basic biological processes requires the characterization of the regulatory elements controlling gene expression at the transcriptional level (Qiu P., 2003; Studholme et al., 2004). The constantly growing amount of genomic data, complemented by other sources of information such as expression data derived from microarray experiments and other approaches, has opened new opportunities to researchers (Grifantini et al., 2003). It appears that as long as a gene coding for a transcription regulator is conserved in the compared bacterial genomes, the regulation of the respective group of genes (the regulon) also tends to be maintained (Mironov et al., 1999). In silico molecular biology tools are becoming the method of choice for high throughput screening of newly determined DNA sequences and offer invaluable support for the analysis of novel genomic sequences. Effective DNA sequence analysis demands not only the faithful identification of gene elements and boundaries but also reliable information on the potential function and regulation of the identified genes. Consequently, powerful programs increasingly rely on the coupling and integration of various prediction algorithms. Such integrated systems include different approaches for the recognition of DNA sequences that act as binding sites for regulatory proteins called 'Transcription Factors' (TFs). Examples include TFs from the MerR family (Permina et al., 2006), zinc-finger proteins (Hyde-DeRuyscher et al., 1995; Liu et al., 2005), FNR proteins as global transcription regulators (Scott et al., 2003), the at least 27 proteins belonging to the XylS/AraC family of prokaryotic transcriptional regulators (Gallegos et al., 1993), and Helix-Turn-Helix and LysR-type transcriptional regulators (Zaim J. et al., 2003). LysR-type transcriptional regulators comprise the largest family of prokaryotic TFs.
These LysR-type proteins are composed of an N-terminal DNA binding domain (DBD) (Nikolskaya A. N. et al., 2002; Perez-Rueda et al., 1998) and a C-terminal cofactor binding domain. To date, no structure of the DBD has been solved. According to the SUPERFAMILY and MODBASE databases, a reliable homology model was built using the structure of the E. coli ModE TF, which contains a winged helix-turn-helix (HTH) motif, as a template (Zaim J. et al., 2003). The 26046 bacterial regulatory proteins included in the 'ExtraTrain' database belong to the families AraC/XylS, ArsR, AsnC, Cold shock domain, CRP-FNR, DeoR, GntR, IclR, LacI, LuxR, LysR, MarR, MerR, NtrC/Fis, OmpR and TetR (Pareja et al., 2006). More recently, a special type of regulator, RcsB, was shown to first interact with coactivators such as RcsA to form a heterodimeric complex that controls the expression of biosynthetic operons in enterobacteria through the RcsAB box consensus (Pristovsek et al., 2003). The identification of such sites is not only relevant for locating the promoter as the 5' boundary of a gene, but may also allow the prediction of a specific gene-expression pattern and of responsiveness to known biological signaling pathways (Benitez-Bellon et al., 2002; Meng et al., 2006). However, DNA binding sites for TFs are typically short and degenerate, and their efficient prediction requires sophisticated in silico tools (Hu et al., 2005). Different databases of promoters and TFs have been established (Bucher et al., 1990; Ghosh et al., 1993; Lewis et al., 1994; Wingender et al., 1997), and these compiled data were in turn used for the development of algorithms and software for the identification of TFBS on genomic DNA (Frech et al., 1997a). In addition, different methods have been proposed for the identification of TFBS in the regulatory regions of genes (Weidenhaupt et al., 1993; Schneider T.D., 1996), but this remains a very challenging problem from both the computational and the biological viewpoint.
Here we review the in silico methods that are commonly used for the identification and discovery of bacterial regulatory elements.


2.2.1. Consensus based method

(Enumerative algorithms)

The consensus for a set of TFBS can be seen as a 'perfect' form recognized by a TF. Thus, the idea is to consider all the oligos that differ from a given consensus in no more than e positions as belonging to the same group, i.e. to be binding sites for the same TF. The number of substitutions allowed should in turn depend on the length of the consensus. The algorithmic strategies for consensus based motifs are mainly based on the following steps. Suppose we know in advance the length m of the motif to be found and are given as input a set of regulatory sequences.

• Enumerate all the possible oligos of length m. Each one represents a candidate motif consensus. For each one, count how many times it appears in the sequences (and/or in how many sequences it appears) with no more than e substitutions.

• Save all the motifs that appear in all (or most of) the sequences of the set.

• Rank the motifs found according to some statistical measure and report the highest ranking motifs.
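
The three steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation of any published tool: the function names, the `quorum` parameter and the ranking by number of supporting sequences (a stand-in for a proper statistical measure) are all hypothetical choices made for the example.

```python
from itertools import product

def hamming(a, b):
    """Number of positions at which two equal-length oligos differ."""
    return sum(x != y for x, y in zip(a, b))

def occurs(consensus, seq, e):
    """True if seq contains an oligo within e substitutions of consensus."""
    m = len(consensus)
    return any(hamming(consensus, seq[i:i + m]) <= e
               for i in range(len(seq) - m + 1))

def enumerate_consensus(seqs, m, e, quorum=None):
    """Enumerate all 4^m candidates and keep those found in >= quorum sequences."""
    quorum = len(seqs) if quorum is None else quorum
    hits = []
    for oligo in product("ACGT", repeat=m):      # step 1: every possible oligo
        cand = "".join(oligo)
        support = sum(occurs(cand, s, e) for s in seqs)
        if support >= quorum:                     # step 2: keep well-supported motifs
            hits.append((cand, support))
    # step 3: rank (here simply by support; an assumption for illustration)
    hits.sort(key=lambda t: (-t[1], t[0]))
    return hits

toy = ["TTACGTAA", "GGACGTCC", "CAACGTTG"]
print(enumerate_consensus(toy, m=4, e=0))
```

On the toy input, the only 4-mer present without substitutions in all three sequences is ACGT. The loop over `product("ACGT", repeat=m)` makes the exponential cost in m explicit.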

Quite naturally, if the length m is not known in advance, different values have to be tried. This is essentially the first approach introduced to the problem, starting from the mid-1980s (Galas et al., 1985; Sadler et al., 1983; Waterman et al., 1984).

However, methods of this kind have long been considered 'too slow'. This bad reputation derives mainly from the fact that, for a given length m, there are 4^m candidate consensuses to evaluate, so the execution time grows exponentially with the motif length. On the other hand, three considerations apply when working on TFBS. First, the length is never too large (it seldom exceeds 12 or 14 nt). Second, the exhaustive search can be significantly accelerated by organizing the input sequences in a suitable indexing structure, such as the suffix tree (Apostolico et al., 2000; Marsan L. et al., 2000; Pavesi et al., 2001; Gusfield D. et al., 1997), which yields an execution time that is exponential only in the number of substitutions allowed (which in turn seldom exceeds four or five). Third, the initial set of candidates of exponential size can be downsized in different ways. All these considerations have led to a rediscovery of this kind of approach in recent years, both in genome-wide scans and in set-specific algorithms.

Clearly, if only exact oligos are considered, that is, no substitutions are allowed in the instances of the same motif, the problem becomes much simpler and its complexity is just linear in the length of the input. Given its computational efficiency, this strategy can be employed in genome-wide analyses of over-represented motifs (Van Helden J. et al., 2003).
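
As a rough illustration of the exact-oligo case, the sketch below counts every m-mer in a single pass over the input and ranks oligos by an observed/expected ratio. The uniform background used for the expected count and the function names are assumptions made for this example only; published tools use more careful background models.

```python
from collections import Counter

def count_oligos(seqs, m):
    """Count every exact m-mer; one pass over the input for fixed m."""
    counts = Counter()
    for s in seqs:
        for i in range(len(s) - m + 1):
            counts[s[i:i + m]] += 1
    return counts

def over_representation(counts, m):
    """Rank oligos by observed/expected, assuming a uniform background."""
    total = sum(counts.values())
    expected = total / 4 ** m          # same expectation for every oligo
    ranked = ((oligo, c / expected) for oligo, c in counts.items())
    return sorted(ranked, key=lambda t: (-t[1], t[0]))

counts = count_oligos(["ACGTACGT"], 4)
print(over_representation(counts, 4)[0][0])
```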

2.2.2. Alignment based method

As briefly mentioned before, consensus based methods have been considered, for quite a long time, unsuitable for the problem. This fact, together with the opinion that consensuses were not flexible enough to describe motifs (Frech et al., 1997a, 1997b; Stormo G. D. et al., 2000; Berg O. G. et al., 1988) has led to the introduction of a completely different approach. The idea is to build solutions by picking some oligos from the sequences and aligning them in the corresponding profile. Alignments usually do not allow gaps, that is, the oligos must be of the same size. The motifs reported will be those described by the best (highest-scoring) alignments and the oligos building (or better fitting) each alignment will be considered possible binding sites for the same TF. In this way, the number of parameters needed is reduced mainly to just the motif length, with no need to specify the degree of approximation allowed or a quorum value.

On the other hand, given k input sequences of length n, there are about n^k possible combinations of oligos to be evaluated, regardless of the motif length chosen. From a theoretical point of view, it has been proven that finding the best profile is an NP-hard problem (Akutsu et al., 2000): in practice this means that, whatever the score used, evaluating all the possible profiles is not computationally feasible.
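
To make the two search-space sizes concrete, the short calculation below compares the number of candidate consensuses of length m with the number of oligo combinations for k sequences of length n. The helper names and the example values (m = 12, n = 100, k = 10) are illustrative assumptions; the exact count of start positions per sequence is n - m + 1, which is where the "about n^k" figure comes from.

```python
def consensus_candidates(m):
    """Number of possible consensus oligos of length m over A, C, G, T."""
    return 4 ** m

def alignment_candidates(n, k, m):
    """Ways to pick one m-mer from each of k sequences of length n."""
    return (n - m + 1) ** k

print(consensus_candidates(12))            # 4^12 candidates
print(alignment_candidates(100, 10, 12))   # 89^10 combinations
```

For these values the enumerative search has 16,777,216 candidates, while the alignment search space is many orders of magnitude larger, which is why heuristics are needed.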

Thus, methods that look for the best alignment have to rely on some heuristic, that is, some way to prune the search space, avoiding the enumeration of all the possible oligo combinations and building only those alignments that, according to some principle (the heuristic), seem to be more likely to be the best ones. While in this way a significant amount of time can be saved, the obvious downside is that the solutions reported cannot be guaranteed to be optimal, but are just the best ones among those considered by the algorithm. In the following we will introduce the algorithms, assuming that they look for exactly one site instance per input sequence. All of them, however, can be run also in the so-called 'zoops' mode, meaning that each sequence can contain either zero or one motif instance, or in 'zero, one or more than one' mode.

In the present work we used five different sequence based programs for the TFBS identification problem in different groups of bacteria. Their prediction performances were also studied (Chapter 4). The principles and algorithms of the programs are explained below.

CONSENSUS program at RSAT webserver

(Heuristic algorithms)

Even if the name might be a little deceptive, Consensus is an alignment based method that employs a greedy heuristic (Hertz et al., 1990; Hertz G. Z. et al., 1999). Given as input a set of sequences S1 . . . Sk the basic version of the algorithm requires as input the length m of the motif to be found and assumes that the latter occurs once in each sequence. The steps performed by the first version of the algorithm can be summed up as follows:

• All the length m oligos of S1 are compared with the oligos of length m of S2. Each comparison produces a 4 x m profile M. Each profile is scored according to its IC and the highest scoring matrices are saved.

• Each oligo of length m of sequence S3 is aligned with the matrices saved at the first step, generating a new set of three-sequence profiles; each one is scored as in the previous step and the highest scoring ones are saved.

• The second step is repeated for each sequence of the set; the final profiles, output by the program, will contain one oligo for each input sequence.
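
The greedy procedure described above can be sketched as follows. This is a simplified illustration, not the Consensus program itself: the information content (IC) is computed against a uniform background without pseudocounts, the number of partial alignments kept at each step (`keep`) is an arbitrary assumption, and at least two input sequences are assumed.

```python
import math

BASES = "ACGT"

def information_content(oligos, bg=0.25):
    """IC of the profile built from aligned oligos (uniform background assumed)."""
    n = len(oligos)
    ic = 0.0
    for col in zip(*oligos):                 # iterate over alignment columns
        for b in BASES:
            f = col.count(b) / n
            if f > 0:
                ic += f * math.log2(f / bg)
    return ic

def kmers(seq, m):
    return [seq[i:i + m] for i in range(len(seq) - m + 1)]

def greedy_consensus(seqs, m, keep=50):
    """Extend the best partial alignments one sequence at a time."""
    # Step 1: compare all oligos of S1 with all oligos of S2.
    partial = [[a, b] for a in kmers(seqs[0], m) for b in kmers(seqs[1], m)]
    partial = sorted(partial, key=information_content, reverse=True)[:keep]
    # Steps 2 and 3: add one oligo from each remaining sequence, keep the best.
    for s in seqs[2:]:
        extended = [aln + [o] for aln in partial for o in kmers(s, m)]
        partial = sorted(extended, key=information_content, reverse=True)[:keep]
    best = partial[0]
    return best, information_content(best)

print(greedy_consensus(["TTACGTAA", "GGACGTCC", "CAACGTTG"], 4))
```

Because only the `keep` best partial alignments survive each step, a small value can discard the alignment that would eventually have scored highest.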

The algorithm is greedy, that is, at each step it saves only the best partial alignments, hoping that they will eventually lead to the optimal one. Obviously, the more conserved the motif is, the more likely the algorithm is to find it. Otherwise, the risk is to store in the first steps matrices corresponding to random (but similar enough) oligos and to discard the one that would have led to the highest-scoring solution. Further improvements are presented in the WConsensus algorithm (Hertz G. Z. et al., 1999). They include the possibility of finding motifs that do not occur, or that appear more than once, in each sequence, and they avoid explicitly requiring the length parameter from the user. Moreover, profiles are built by directly comparing all pairs of sequences, so the problem of the result depending on the order of the sequences is avoided. The calculation of a p-value for an alignment is also introduced. The p-value gives an estimate of the probability of finding a profile with the same IC score by chance, which is especially useful when comparing alignments with different lengths and different numbers of sites, cases where comparisons based on IC alone are not sufficient.

MEME program

(Multiple Expectation Maximization for Motif Elicitation)

(Expectation Maximization algorithm)

Another way of looking at the problem of finding the best alignment profile is to 'guess' the positions in the input sequences of the regions forming it. Given a profile M, the MEME (Multiple Expectation Maximization for Motif Elicitation) algorithm (Bailey T. L. et al., 1994) evaluates the likelihood that each sequence region of length m fits the profile rather than the background of the sequences, while the rest of the sequence should fit the background better than the profile. According to this principle, a likelihood value z_{i,j} (normalized such that the sum over all the z_{i,j} values of sequence j equals 1) is computed for each position i of each input sequence j. This is the E (Expectation) step. Then, the algorithm builds a new alignment profile by putting together all the sequence regions of length m, weighting each one by the corresponding z_{i,j} value. This is the M (Maximization) step. The algorithm starts by building a different profile from each m-mer in the input sequences, using a frequency value of ½ for the nucleotides of the oligo and 1/6 for the others. Then, for each profile (each m-mer in the input) it performs a single E and a single M step.
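
The E and M steps described above can be sketched as follows. This is a simplified illustration and not the MEME implementation: it assumes one site per sequence, a uniform background, sequences containing only A, C, G and T, and a small pseudocount in the M step (the `pseudo` parameter and all function names are assumptions for the example). The seed profile uses the 1/2 and 1/6 frequencies mentioned in the text.

```python
BASES = "ACGT"

def seed_profile(oligo):
    """Initial profile: 1/2 for the oligo's base in each column, 1/6 for the others."""
    return [{b: (0.5 if b == c else 1 / 6) for b in BASES} for c in oligo]

def e_step(seqs, profile, bg=0.25):
    """z[j][i]: normalized likelihood that the site of sequence j starts at i."""
    m = len(profile)
    z = []
    for s in seqs:
        ratios = []
        for i in range(len(s) - m + 1):
            r = 1.0
            for col, c in zip(profile, s[i:i + m]):
                r *= col[c] / bg           # profile versus background
            ratios.append(r)
        total = sum(ratios)
        z.append([r / total for r in ratios])   # each row sums to 1
    return z

def m_step(seqs, z, m, pseudo=0.01):
    """New profile from all m-mers, each weighted by its z value."""
    counts = [{b: pseudo for b in BASES} for _ in range(m)]
    for s, zs in zip(seqs, z):
        for i, w in enumerate(zs):
            for k, c in enumerate(s[i:i + m]):
                counts[k][c] += w
    return [{b: col[b] / sum(col.values()) for b in BASES} for col in counts]

def em(seqs, seed, iterations=20):
    profile = seed_profile(seed)
    z = e_step(seqs, profile)
    for _ in range(iterations):
        profile = m_step(seqs, z, len(seed))
        z = e_step(seqs, profile)
    return profile, z

profile, z = em(["TTACGTAA", "GGACGTCC", "CAACGTTG"], "ACGT")
print([row.index(max(row)) for row in z])
```

On this toy input the highest z value in each sequence falls at the position where ACGT starts.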

The highest MAP scoring profile obtained (after the single iteration) is further optimized with additional EM steps, until no further increase in the score is obtained. Finally, the profile is reported and its oligos are removed from the input sequences. Then the algorithm is restarted, until the number of profiles specified as input has been generated. Thus, MEME can detect multiple motifs in the same set of sequences within a single run.

GIBBS SAMPLERS program

(Gibbs sampling algorithm)

One of the most successful approaches to the problem, as far as the heuristic used to find the highest-scoring profiles is concerned, has been the Gibbs sampling strategy, first introduced for motif discovery in protein sequences (Lawrence et al., 1993; Neuwald et al., 1995) but equally suitable for nucleotide sequences (and recently further fine-tuned to TFBS) (Qin et al., 2003). The best measure of its success is perhaps the number of times it has been used as the algorithmic part of different methods, which vary in the statistical measures used to generate and evaluate the results. The main motivation was to improve an EM local search strategy (Lawrence C. E. et al., 1990), similar to the one employed by MEME, so as to avoid the problem of premature convergence to local maxima of the IC and MAP functions. The basic idea, designed for sequence sets with exactly one site instance per sequence, can be summarized as follows:

• An oligo of length m is chosen at random in each of the k input sequences (at the beginning, with uniform probability).

• One of the k sequences is chosen at random: let S be this sequence.

• A 4 x m profile M is built with the oligos that had been selected in the other k-1 sequences.

• For each position i in S, let p_i be the m-mer of S starting at position i. For each p_i a likelihood value L(p_i) is computed, representing how well p_i fits the model induced by the matrix M with respect to the background nucleotide distribution.

• A new probability value, proportional to L(p_i), is assigned to each position i of S. Thus, the oligos that fit well in the alignment described by M are more likely to be chosen at the next cycle.

• Go to the first step: now the probability with which the m-mers of sequence S can be picked are those computed at the previous step.
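
A compact sketch of the site sampler outlined above is given below. It is an illustration under stated assumptions rather than any published program: the background is uniform, a pseudocount of 1 is added to every profile cell, the new site in the held-out sequence is drawn immediately in proportion to its likelihood, and the run stops after a fixed number of iterations. All names and default values are hypothetical.

```python
import random

BASES = "ACGT"

def build_profile(oligos, m, pseudo=1.0):
    """4 x m frequency profile from the selected oligos (with pseudocounts)."""
    prof = []
    for k in range(m):
        col = {b: pseudo for b in BASES}
        for o in oligos:
            col[o[k]] += 1
        total = sum(col.values())
        prof.append({b: col[b] / total for b in BASES})
    return prof

def likelihood(oligo, prof, bg=0.25):
    """How well an oligo fits the profile relative to the background."""
    value = 1.0
    for col, c in zip(prof, oligo):
        value *= col[c] / bg
    return value

def gibbs_site_sampler(seqs, m, iterations=500, seed=0):
    rng = random.Random(seed)
    # Start with one oligo chosen uniformly at random in each sequence.
    pos = [rng.randrange(len(s) - m + 1) for s in seqs]
    for _ in range(iterations):
        j = rng.randrange(len(seqs))              # hold out one sequence S
        others = [s[p:p + m]
                  for idx, (s, p) in enumerate(zip(seqs, pos)) if idx != j]
        prof = build_profile(others, m)           # profile from the other k-1
        s = seqs[j]
        weights = [likelihood(s[i:i + m], prof)
                   for i in range(len(s) - m + 1)]
        # Sample the new site of S with probability proportional to L(p_i).
        pos[j] = rng.choices(range(len(weights)), weights=weights)[0]
    return pos

print(gibbs_site_sampler(["TTACGTAA", "GGACGTCC", "CAACGTTG"], 4))
```

Because the choice is stochastic, different seeds can give different results; in practice several runs are compared.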

These steps are iterated a fixed number of times or until convergence is reached. This variant of the algorithm is also known as the site sampler. The main difference from MEME is in the first step: while the local search always picks oligos deterministically according to how well they fit a profile, the Gibbs sampler chooses the fragment to be added to the profile in a stochastic way. At the beginning all the oligos have the same probability of being chosen; in successive iterations, those that better fit the profile are more likely (but not certain) to be selected. The algorithm is thus less likely to get stuck in local optima; on the other hand, given its probabilistic nature, it often has to be run several times, and the final results are obtained by comparing the outputs of each run. Additions to the basic algorithm were presented subsequently (Neuwald et al., 1995), allowing multiple occurrences of a motif within the same sequence or, conversely, sequences in which the motif does not occur (an algorithm known as the motif sampler). This variant, however, needs an estimate of the overall number of times a motif is expected to appear in the input sequences. Modifications of the basic Gibbs sampling technique especially devised for DNA sequences are described in (Hughes et al., 2000) and (Workman C. T. et al., 2000). AlignACE (Hughes et al., 2000) is a program in which the basic Gibbs sampling algorithm is fine-tuned to work on DNA regulatory sequences, for example by including both strands of each input sequence and by introducing a different sampling technique that also considers similarity in the position, relative to the TSS, of each of the oligos of a group. That is, a functional motif should correspond to similar regions appearing at similar distances from the TSS. In the ANN-Spec algorithm (Workman C. T. et al., 2000), a Gibbs sampling method is combined with an artificial neural network that replaces the frequency matrix. Instead of aligning the selected oligos and scoring the matrix, the algorithm trains a neural network to recognize the selected oligos against the rest of the sequences.

POSSUMSEARCH program

(Non-heuristic algorithm)

To efficiently find matches of PSSMs in large databases, a new non-heuristic algorithm called ESAsearch was developed (Beckstette et al., 2006). POSSUMSEARCH includes fast index based algorithms and software for matching position specific scoring matrices. The approach preprocesses the search space, e.g. a complete genome or a set of protein sequences, and builds an enhanced suffix array that is stored on file. This allows a database to be searched with a PSSM in sublinear expected time. Since ESAsearch benefits from small alphabets, the authors present a variant operating on sequences recoded according to a reduced alphabet. The authors address the problem of non-comparable PSSM scores by developing a method that allows the efficient computation of a matrix similarity threshold for a PSSM, given an E-value or a p-value. The method is based on dynamic programming and, in contrast to other methods, employs lazy evaluation of the dynamic programming matrix. The authors evaluated ESAsearch with nucleotide PSSMs. Compared to the best previous methods, ESAsearch showed speedups of a factor between 17 and 275 for nucleotide PSSMs. Comparisons with the most widely used programs even show speedups by a factor of at least 3.8. The lazy evaluation method is also much faster than previous methods, with speedups of a factor between 3 and 330.

Analysis of ESAsearch reveals sublinear runtime in the expected case, and linear runtime in the worst case for sequences not shorter than |A|^m + m - 1, where m is the length of the PSSM and A is a finite alphabet. In practice, ESAsearch showed superior performance over the most widely used programs, especially for DNA sequences. The new algorithm for accurate on-the-fly calculation of thresholds has the potential to replace formerly used approximation approaches.
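
The idea of deriving a score threshold from a p-value by dynamic programming can be illustrated with the simplified sketch below. It is not ESAsearch and uses neither an enhanced suffix array nor lazy evaluation: it assumes an integer-valued PSSM and an independent background distribution, computes the full score distribution column by column, and then scans a sequence naively. All function names and the toy matrix are assumptions made for the example.

```python
from collections import defaultdict

def score_distribution(pssm, bg):
    """Exact distribution of PSSM scores under an i.i.d. background."""
    dist = {0: 1.0}
    for col in pssm:                       # one DP layer per PSSM column
        nxt = defaultdict(float)
        for score, prob in dist.items():
            for base, weight in col.items():
                nxt[score + weight] += prob * bg[base]
        dist = dict(nxt)
    return dist

def threshold_for_pvalue(pssm, bg, p):
    """Lowest threshold t with P(score >= t) <= p, or None if none exists."""
    dist = score_distribution(pssm, bg)
    tail, best = 0.0, None
    for score in sorted(dist, reverse=True):
        tail += dist[score]
        if tail > p:
            break
        best = score
    return best

def scan(seq, pssm, t):
    """Naive scan: report (position, score) for every window scoring >= t."""
    m = len(pssm)
    hits = []
    for i in range(len(seq) - m + 1):
        score = sum(col[c] for col, c in zip(pssm, seq[i:i + m]))
        if score >= t:
            hits.append((i, score))
    return hits

bg = {b: 0.25 for b in "ACGT"}
pssm = [{"A": 2, "C": -1, "G": -1, "T": -1},
        {"A": -1, "C": 2, "G": -1, "T": -1}]
t = threshold_for_pvalue(pssm, bg, 0.1)
print(t, scan("TTACGG", pssm, t))
```

Working with thresholds derived from p-values makes scores from matrices of different lengths comparable, which is the problem the text refers to as non-comparable PSSM scores.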

2.2.3. Improvements in Alignment based methods

Historically speaking, the algorithms we have described were the first alignment based methods to be introduced and as we have seen they are still widely used today with good results. In any case, their heuristic (in how solutions are generated) and probabilistic (in how solutions are evaluated) nature lends itself to different improvements. Current directions of research are mainly the following:

Improving the heuristics

The algorithms may miss a motif altogether because the optimal matrix (corresponding to the motif) is not included among the candidate solutions. The reason can also lie in the choice of the initial profiles that are optimized.

Improving the scoring function

Even when the heuristics work well, the algorithms may miss a motif because, under the traditional IC and MAP scores, the corresponding frequency matrix is not the highest scoring one; alternatively, the real motif is 'lost' among many random motifs with higher or similar scores, and further work is needed to discriminate it.

Looking for a single motif is not enough

A motif may be made of two or more TFBS, located within a short distance of each other, whose biological function (and statistical relevance) results from their simultaneous appearance within a promoter and from their relative distance (Chiang et al., 2003). Moreover, each of the cooperating TFBS may not be overrepresented enough to be detected by itself. Such motifs have also been called composite, structured or (in the case of pairs) dyad motifs.

In addition, a number of web resources have been developed especially for the identification of prokaryotic TFBS; these are summarized in Table 2.1. Recent developments in TFBS identification based on statistical methods and machine learning algorithms are summarized in Table 2.2.

Table 2.1. List of web-resources available on the web for the identification of bacterial TFBS.





| Resource | Description | Year | Reference |
|---|---|---|---|
| Bioinformatics Links Directory, Bioinformatics Web Server & Databases list | A public online resource that lists the servers published in the Nucleic Acids Research journal, 2010. | 2010 | Joanne et al. (2010) |
| | Prokaryotic genome sequences were screened for SmtB/ArsR DNA binding sites. | 2006 | Bose et al. (2006) |
| MEME (Multiple EM for Motif Elicitation) | For searching novel 'signals' in sets of biological sequences. Applications include the discovery of new TFBS and protein domains. | 2006 | Bailey et al. (2006) |
| | A new algorithm for identifying cis-regulatory modules in genomic sequences. | 2006 | Carvalho et al. (2006) |
| | Searches for locally overrepresented TFBS in a set of coregulated genes via PWM. | 2006 | Defrance M. and Touzet H. (2006) |
| Galaxy, the UCSC Table Browser and GALA; blastZ and multiZ | Describes the use of publicly available servers to find genomic sequences whose alignments show properties associated with cis-regulatory modules, such as high conservation score, high regulatory potential score and conserved TFBS. | 2006 | Elnitski et al. (2006) |
| HTPSELEX database | Information about the protein material used, details of the wet lab protocol, an archive of sequencing trace files, assembled clone sequences (concatemers) and complete sets of in vitro selected protein-binding tags. It also offers reasonably large SELEX libraries obtained with conventional low-throughput protocols. | 2006 | Jagannathan et al. (2006) |
| Onto-Tools suite | An annotation database and eight complementary, web-accessible data mining tools. | 2006 | Khatri et al. (2006) |
| Open Regulatory Annotation (ORegAnno) database | A dynamic collection of literature-curated regulatory regions, TFBS and regulatory mutations (polymorphisms and haplotypes). | 2006 | Montgomery et al. (2006) |
| ExtraTrain database | A database which covers all extragenic regions of available genomes and regulatory proteins from bacteria and archaea included in the UniProt database. | 2006 | Pareja et al. (2006) |
| | A novel gapped-alignment algorithm to compare Position Frequency Matrices (PFM) for TFBS. | 2006 | Su et al. (2006) |
| | A database containing information on regulation of tunicate genes collected from literature. It includes information on about 184 promoters, 73 identified binding sites and >2000 newly predicted binding sites. | 2006 | Sierro et al. (2006) |
| | TFBS identification method that utilizes several data sources, including DNA sequences, phylogenetic information, microarray data and ChIP-chip data. | 2006 | Tsai et al. (2006) |
| | Batch extraction and analysis of cis-regulatory regions; facilitates identification, extraction and analysis of regulatory regions from large amounts of data. | 2006 | Vega V. B. (2006) |
| | For the comparison of discovered motifs from different programs, e.g. MEME, BioProspector and BioOptimizer. | 2006 | Wei Z. and Jensen S. T. (2006) |
| | Predicted binding sites of TFs in prokaryotic genomes. | 2005 | Gonzalez et al. (2005) |
| EcoCyc database | A comprehensive source of information on the biology of the prototypical model organism E. coli K12. | 2005 | Keseler et al. (2005) |
| PRODORIC+ package | A new online framework for the accurate and integrative prediction of TFBS in prokaryotes. | 2005 | Munch et al. (2005) |
| | Describes a PFM similarity quantification method based on product multinomial distributions. | 2005 | Schones et al. (2005) |
| | Multiple TFs coordinately control transcriptional regulation of genes in eukaryotes. | 2004 | Keles et al. (2004) |
| DBTBS database | Originally released in 1999 as a reference database of published transcriptional regulation events in B. subtilis. The update contains information on 114 TFs, including sigma factors, and 633 promoters of 525 genes. The number of references cited in the database has increased from 291 to 378. | 2004 | Makita et al. (2004) |
| MDscan tool with Motif Regressor | Integrates the MDscan tool with Motif Regressor, which performs a linear regression of microarray expression values based on the motif profiles reported by MDscan instead of single oligos. | 2003 | Conlon et al. (2003) |
| | A software system for identification, visualization and analysis of protein binding sites in complete genome sequences. | 2003 | Gadiraju et al. (2003) |
| | Proposes a new likelihood based method for identifying structural motifs in DNA sequences. | 2003 | Keles et al. (2003) |
| | Database that aims to systematically organize information on prokaryotic gene expression and to integrate this information into regulatory networks. | 2003 | Munch et al. (2003) |
| | A Gibbs sampling-based Bayesian motif clustering (BMC) algorithm to address the TFBS identification problem. | 2003 | Qin et al. (2003) |
| | A method developed to identify over-represented cis-elements with PWM-based similarity scores. | 2003 | Zheng et al. (2003) |
| TFBS software | Includes a set of integrated, object-oriented Perl modules for TFBS detection and analysis. | 2002 | Lenhard B. and Wasserman W. W. (2002) |
| Co-Bind (for COperative BINDing) | An algorithm for discovering DNA target sites for cooperatively acting TFs. The method utilizes a Gibbs sampling strategy to model the cooperativity between two TFs and defines position weight matrices for the binding sites. | 2001 | Guha Thakurta D. and Stormo G. D. (2001) |
| | Develops a complex approach to recognize TFBS based on four methods: (i) weight matrix, (ii) information content, (iii) multidimensional alignment and (iv) pairwise alignment with the most similar representative of known sites. | 2001 | Pozdniakov et al. (2001) |
| ooTFD (object-oriented TFs Database) | An object-oriented successor to TFD. This database is aimed at capturing information regarding the polypeptide interactions which comprise and define the properties of TFs. ooTFD contains information about TFBS, as well as composite relationships within TFs. | 2000 | Ghosh D. (2000) |
| | Develops a TFBS prediction tool and reviews a log-likelihood scoring scheme called information content. Programs were developed under the UNIX operating system. | 1999 | Hertz G.Z. and Stormo G.D. (1999) |
| | A tool to search a database of annotated sequences for TFBS located in context with other important transcription regulatory signals and regions, like the TATA element, the promoter etc. | 1999 | Lavorgna et al. (1999) |
| ACTIVITY database | Commonly accepted statistical mechanical theory is now multiply confirmed by using the weight matrix methods successfully recognizing DNA sites binding regulatory proteins in prokaryotes. | 1999 | Ponomarenko et al. (1999) |
| TRANSFAC, TRRD (Transcription Regulatory Region Database) and COMPEL databases | Databases related to transcriptional regulation in eukaryotes. | 1998 | Heinemeyer et al. (1998) |
| | TRANSFAC is a database on TFs and their DNA binding sites. TRRD (Transcription Regulatory Region Database) collects information about complete regulatory regions, their regulation properties and architecture. COMPEL comprises specific information on composite regulatory elements. | 1997 | Wingender et al. (1997) |

Table 2.2. Details of research work done in the area of computational TFBS discovery using statistical & machine learning algorithms.