Prediction of enzyme functions


Abstract. The purpose of this research is to search for motifs directly at binding and catalytic sites, called reactive motifs, and then to predict enzyme functions from the discovered reactive motifs. The main challenge is that binding or catalytic site data are available for only about 3.34% of all enzymes, and in many cases the data provide only one sequence record. The other challenge is the complexity of the motif combinations needed to predict enzyme functions.

In this paper, to search for reactive motifs, we propose a new model that combines a statistical method with biochemistry background knowledge, in a way similar to the working process of an expert. We develop a procedure called block scan filter that uses a similarity score to expand a single sequence record of a binding or catalytic site into a block. These blocks are input to mutation control, where the amino acids at each position of the sequences are analyzed and extended to cover other possibly related amino acids, resulting in a motif. Binding or catalytic sites in the same specific subgroup, having the same mechanism, yield motifs that represent the same mechanism. These motifs are therefore grouped, resulting in reactive motifs. The reactive motifs, together with a dataset of known enzyme sequences, are input to the C4.5 learning algorithm to obtain an enzyme prediction model. The accuracy of this model is checked against a testing dataset. With 235 enzyme function classes, the reactive motifs yield the best prediction result with C4.5, 72.58%, better than PROSITE motifs.

Keywords: mutation control, sequence motif, reactive motif, enzyme function prediction, binding site, catalytic site

Introduction


An enzyme function, or enzyme reaction mechanism, is the combination of two main sub-functions: binding and catalyzing. The parts of an enzyme sequence that perform them are called binding sites and catalytic sites. A site is a short amino acid sequence. One type of binding or catalysis may be achieved by any of several short amino acid sequences, and these sequences can be represented by one pattern (motif).

One of the best-known collections of sequence motifs is PROSITE [1]. PROSITE contains only 152 motifs of binding and catalytic sites, covering 396 out of 3,845 classes of enzyme functions. The insufficiency of data is therefore one of the main challenges. In addition, one motif can be part of 46 enzyme functions, while 139 enzyme functions can have more than one of the motifs. This creates complexity. As a result, many methods [2,3,4,5] avoid the direct use of motifs generated from binding and catalytic sites to predict enzyme functions. Those methods use other resources and need data in the form of blocks [6] or multiple sequence alignments [7], which contain very few sequences of binding and catalytic sites.

In this paper, we develop a method to predict enzyme functions based on the direct use of these binding and catalytic sites. The principal motivation is that information about the enzyme reaction mechanism is very important for applied science, especially bioinformatics. We introduce a process to determine reactive motifs using block scan filter, mutation control, and reactive site-group define. The main step, mutation control, is based on the pattern elements of PROSITE motifs, which involve amino acid substitution, insertion-deletion, and conserved regions, and it generates amino acid substitution groups. For example, in the PROSITE motif [RK]-x(2,3)-[DE]-x(2,3)-Y, the first element is a substitution [8,9] of amino acid R or K that maintains the same function. The second element is an insertion-deletion (indel, or gap) of 2 or 3 residues of any amino acid, and the last element is a conserved region, Y, required in most mutated sequences. In our work, only conserved regions and substitutions are used in the mutation control operation.
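To make the PROSITE notation concrete, the following is a minimal sketch of how a pattern such as [RK]-x(2,3)-[DE]-x(2,3)-Y could be turned into a regular expression and matched against a sequence. The function name prosite_to_regex and the test sequences are illustrative assumptions, and only the pattern elements mentioned in the text are handled.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a simple PROSITE pattern into a Python regular expression.

    Handles only single residues, [..] groups, {..} exclusions, the
    wildcard x, and repeat counts such as x(2,3). Other PROSITE syntax
    (anchors, for instance) is deliberately not covered in this sketch.
    """
    parts = []
    for element in pattern.strip(".").split("-"):
        # Separate an optional repeat suffix: "x(2,3)" -> ("x", "2,3").
        m = re.fullmatch(r"(.+?)(?:\((\d+(?:,\d+)?)\))?", element)
        core, repeat = m.group(1), m.group(2)
        if core == "x":
            regex = "."                       # any amino acid
        elif core.startswith("{"):
            regex = "[^" + core[1:-1] + "]"   # prohibited amino acids
        else:
            regex = core                      # "[RK]" or "Y" is valid as is
        if repeat:
            regex += "{" + repeat + "}"
        parts.append(regex)
    return "".join(parts)


regex = prosite_to_regex("[RK]-x(2,3)-[DE]-x(2,3)-Y")
print(regex)                                   # [RK].{2,3}[DE].{2,3}Y
print(bool(re.search(regex, "AAKGGDTTTYAA")))  # True
```

The substitution element maps to a character class, the gap to a bounded repeat of the wildcard, and the conserved residue to a literal character.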

Amino acid substitution has been described in two paradigms: expert-based and statistic-based motifs. In expert-based motifs such as PROSITE, the substitutions are derived manually from expert knowledge and biochemistry background. The main principle is that different enzymes with the same reaction mechanism at their binding and catalytic sites perform the same enzyme function [8]. Because expertise is required, the discovery of motifs by experts progresses slowly. In statistic-based motifs such as EMOTIF [3], motifs are discovered using statistical methods, so enzyme functions can be predicted quickly. Almost all statistic-based motifs are discovered not directly from the binding or catalytic sites but from the surrounding sites. Statistic-based motifs yield high enzyme function prediction accuracy. However, in certain applications it is necessary to understand how the motifs of these sites combine to perform an enzyme function. This is why statistic-based motifs cannot completely replace expert-based motifs.

In this paper, we propose a method for searching for motifs directly at binding and catalytic sites, called reactive motifs. The proposed method combines statistics with biochemistry background knowledge, in a way similar to the working process of an expert. We develop a procedure called block scan filter to expand a single sequence record of a binding or catalytic site into a block, which is then input to the mutation control step. As a result, a single sequence record can produce one motif. The motifs generated from a set of input sequences are then grouped using the reactive site-group define procedure to produce reactive motifs. These reactive motifs, together with a dataset of known enzyme sequences, are used as the input to the C4.5 learning algorithm to obtain an enzyme prediction model.

The following sections give the details of reactive motif discovery (phase I) and reactive motif-based prediction of enzyme class (phase II). The overall process is described in Figure 1, and the details are described in sections 2 and 3. Experimental results are given in section 4, and the conclusion and discussion in section 5.

Reactive Motifs Discovery with Mutation Control (Phase I)

This phase consists of four steps: data preparation, block scan filter, mutation control, and reactive site-group define. The result is a reactive motif representing each enzyme mechanism. The details are explained below.

Data Preparation

We use protein sequence data from the SWISSPROT part [10] of the UNIPROT database, release 9.2, and the enzyme function classes from ENZYME NOMENCLATURE [11] in the ENZYME NOMENCLATURE database, release 37.0, of the Swiss Institute of Bioinformatics (SIB). The enzyme protein sequences are grouped into a database called the Enzyme Sequence Dataset. Some of the enzyme proteins in the Enzyme Sequence Dataset provide the position of an amino acid that is part of a binding or catalytic site. Taking this position as the center, a sequence of 15 amino acids forming a binding or catalytic site is retrieved from the enzyme protein sequence (see Figure 2). These binding and catalytic sites are grouped into another database called the Binding and Catalytic Site Database.
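The site retrieval described above can be sketched as a simple window extraction. This is a minimal illustration only: the function name extract_site is hypothetical, positions are assumed to be 1-based, and returning None for windows that run past either end of the sequence is an assumption, since the text does not say how such cases are handled.

```python
from typing import Optional


def extract_site(sequence: str, position: int, length: int = 15) -> Optional[str]:
    """Return the window of `length` residues centred on a 1-based position.

    Assumption: windows that would extend beyond the sequence are
    discarded (None) rather than padded.
    """
    half = length // 2
    start = position - 1 - half      # convert to a 0-based window start
    end = start + length
    if start < 0 or end > len(sequence):
        return None
    return sequence[start:end]


# Toy sequence (the alphabet) so the centring is easy to check by eye.
toy = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
print(extract_site(toy, 10))   # CDEFGHIJKLMNOPQ, centred on 'J'
print(extract_site(toy, 3))    # None, too close to the start
```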

In the Binding and Catalytic Site Database, the sites are divided into subgroups. Binding sites are in the same subgroup when they have the same reaction description, that is, the same substrate, the same binding method (e.g. via amide nitrogen), and the same type of amino acid(s). Catalytic sites are in the same subgroup when they have the same mechanism (e.g. the same proton acceptor) and the same type of amino acid(s). In total there are 291 subgroups in the Binding and Catalytic Site Database. The sites in each subgroup are used to scan all related enzyme protein sequences in the block scan filter step in order to obtain quality blocks.

In a function class, if only one type of binding or catalytic site is found, the function class is neglected, because an enzyme class with only one motif cannot represent the complexity of the sub-function combination. In addition, only function classes with between 10 and 1000 enzyme members are used; the remaining classes are neglected. As a result, the Enzyme Sequence Dataset covers 19,258 enzymes in 235 function classes, and the Binding and Catalytic Site Database covers 3,084 records of binding or catalytic sites with 291 enzyme reaction descriptions.

Block Scan Filter

The objective of the block scan filter step is to expand one record of binding or catalytic site data into a block. This step is divided into two subtasks: similarity block scanning and the constraint filter. The first subtask uses a single record of a binding or catalytic site to find related binding and catalytic sites and so create a block. The record is scanned over the related protein sequences, that is, all sequences of the enzyme functions that contain at least one sequence having a site with the same description as the scanning site. As the record scans over a sequence, similarity scores are computed using a scoring table such as BLOSUM62 [12]. The 15-amino-acid segment of the protein that gives the highest score is stored in the block. The scanning is repeated for the other related protein sequences. The result is a block containing the highest-scoring binding or catalytic site from each sequence.
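The similarity block scanning can be pictured as a sliding-window search. The sketch below is an illustration under stated assumptions: identity_score is a toy stand-in for a real substitution table such as BLOSUM62, the function names are hypothetical, and a 5-residue site is used instead of 15 to keep the example readable.

```python
from typing import Callable, List, Optional, Tuple

Hit = Tuple[int, str, int]  # (score, window, start index)


def identity_score(a: str, b: str) -> int:
    """Toy stand-in for a substitution table: 1 for a match, 0 otherwise."""
    return 1 if a == b else 0


def best_window(site: str, sequence: str,
                score: Callable[[str, str], int] = identity_score) -> Optional[Hit]:
    """Slide `site` over `sequence` and return the highest-scoring window."""
    best: Optional[Hit] = None
    for start in range(len(sequence) - len(site) + 1):
        window = sequence[start:start + len(site)]
        total = sum(score(a, b) for a, b in zip(site, window))
        if best is None or total > best[0]:
            best = (total, window, start)
    return best  # None when the sequence is shorter than the site


def scan_block(site: str, sequences: List[str],
               score: Callable[[str, str], int] = identity_score) -> List[Hit]:
    """Collect the best window from every related sequence into a block."""
    block = []
    for seq in sequences:
        hit = best_window(site, seq, score)
        if hit is not None:
            block.append(hit)
    return block


print(scan_block("HTSPH", ["AAHTSPHAA", "GGHTAPHGG"]))
# [(5, 'HTSPH', 2), (4, 'HTAPH', 2)]
```

Passing a different scoring function is enough to switch the background knowledge used for the scan.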

From this block, some of the sites are filtered out by the second subtask, the constraint filter. The sites in the block are first sorted from the highest score to the lowest. Following the work of Smith et al. [13], a block has "high quality" when each site in the block has at least 3 positions presenting the same type of amino acid. This is the criterion used to filter the block.
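One possible reading of the constraint filter is sketched below. It assumes that "at least 3 positions presenting the same type of amino acid" means at least 3 positions identical to the scanning site; the text leaves room for other readings, and the function name and the short toy sites are illustrative.

```python
from typing import List, Tuple


def constraint_filter(query: str, block: List[Tuple[int, str]],
                      min_identical: int = 3) -> List[Tuple[int, str]]:
    """Sort (score, site) pairs by score and keep the sites that share at
    least `min_identical` identical positions with the query site."""
    ranked = sorted(block, key=lambda hit: hit[0], reverse=True)
    return [
        (score, site)
        for score, site in ranked
        if sum(a == b for a, b in zip(query, site)) >= min_identical
    ]


block = [(2, "HAAPA"), (5, "HTSPH"), (4, "HTAPH")]
print(constraint_filter("HTSPH", block))
# [(5, 'HTSPH'), (4, 'HTAPH')]
```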

Mutation Control

An enzyme mechanism can be represented by several amino acid sequences of binding or catalytic sites. The specific positions in the sequences that are necessary for controlling the properties of the enzyme mechanism should therefore have common or similar properties. At some positions, all sequences have the same type of amino acid; these are called conserved regions. At other positions, there are several types of amino acids, which nevertheless have similar properties. All amino acids at the same position are grouped with respect to mutation in biological evolution, and the resulting group is called a substitution group. The characteristics of a substitution group can be of two types, derived from the patterns of PROSITE motifs:

  1. The substitution group shall have some common properties, represented by [], for example [ACSG].
  2. Amino acids having prohibited properties shall not be included in the group. This prohibition is represented by {}, for example {P}, meaning any amino acid but P.

We speak of mutation control when the substitution group at each position of the binding or catalytic sites is controlled by the above characteristics. Such control of mutation in biological evolution is important for the enzyme mechanism to function. The objective of mutation control is to determine the complete substitution group at each position in the sequences. In the following, we give an example to illustrate the mutation control process.

For the first pattern, [ACSG], the common properties, taken from the background knowledge of the physico-chemistry table shown in Table 1, are {small, tiny}, which are necessary for the enzyme mechanism. For the second pattern, {P}, the prohibited property is {proline}, a property that the other amino acids do not have and that obstructs the enzyme mechanism. The complement of {proline} gives the "boundary properties", which do not obstruct the enzyme mechanism.

For example (see Figure 3), at position 1 of the block, the original substitution group is {H, T}. The common properties, shared by H and T, are hydrophobic and polar. The group of amino acids that have these common properties is {H, T, W, Y, K, C}.

The "boundary properties" are all the properties present in any amino acid of {H, T}: polar, charged, positive, hydrophobic, aromatic, and aliphatic. An amino acid with any other property may carry a prohibited property that obstructs the enzyme mechanism. The group of amino acids whose properties all lie within the boundary properties is therefore {H, T, F, K, M, N, Q, R, W, Y}.

The complete substitution group of {H, T} is controlled by both the common properties and the boundary properties, and is represented by the intersection of the amino acid groups generated from the common properties and from the boundary properties. In this example, we obtain the complete substitution group {H, T, W, Y, K}. This process is repeated for all other positions of the quality block. The result is one motif representing the quality block derived from one binding or catalytic site.
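The set operations in the example above can be traced in a few lines. The property table below is a hypothetical excerpt covering only nine amino acids, chosen so that the {H, T} example can be followed; it is not the paper's Table 1, and the function name substitution_group is an assumption.

```python
from typing import Dict, Set

# Hypothetical excerpt of a physico-chemical property table (NOT Table 1).
TOY_PROPERTIES: Dict[str, Set[str]] = {
    "H": {"polar", "hydrophobic", "aromatic", "positive", "charged"},
    "T": {"polar", "hydrophobic", "aliphatic"},
    "K": {"polar", "hydrophobic", "positive", "charged"},
    "W": {"polar", "hydrophobic", "aromatic"},
    "Y": {"polar", "hydrophobic", "aromatic"},
    "C": {"polar", "hydrophobic", "small", "tiny"},
    "A": {"hydrophobic", "small", "tiny"},
    "F": {"hydrophobic", "aromatic"},
    "R": {"polar", "positive", "charged"},
}


def substitution_group(observed: Set[str],
                       table: Dict[str, Set[str]]) -> Set[str]:
    """Extend the amino acids observed at one block position.

    common   = properties shared by every observed amino acid
    boundary = properties present in at least one observed amino acid
    The result keeps amino acids that have all common properties and
    no property outside the boundary properties.
    """
    common = set.intersection(*(table[a] for a in observed))
    boundary = set.union(*(table[a] for a in observed))
    common_group = {a for a, props in table.items() if common <= props}
    boundary_group = {a for a, props in table.items() if props <= boundary}
    return common_group & boundary_group


print(sorted(substitution_group({"H", "T"}, TOY_PROPERTIES)))
# ['H', 'K', 'T', 'W', 'Y']
```

In this toy table, C shares the common properties but is excluded because small and tiny fall outside the boundary properties, while F and R stay within the boundary but lack one of the common properties.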

The block scan filter and mutation control steps should, however, use the same background knowledge when discovering a motif. When the physico-chemistry table is used in the mutation control step, the similarity scores used to create the quality block in the block scan filter step should be derived from the same physico-chemistry table. Similarly, when the BLOSUM62 table is used as the similarity score table in the block scan filter step, the amino acid properties table used in mutation control should be derived from the BLOSUM62 table.

The similarity score table derived from the physico-chemistry table is given in Table 2. The score is the number of shared properties; for example, if two amino acids share 3 properties, the similarity score is 3. Amino acids A and C have properties {small, tiny, hydrophobic} and {small, tiny, polar, hydrophobic}, so their similarity score, with shared properties weighted by 1, is |{small, tiny, hydrophobic} ∩ {small, tiny, polar, hydrophobic}| = 3. When an amino acid is paired with itself, however, the score is given a weight greater than one, in our case 4, in order to give a higher score to conserved regions. For example, the similarity score of amino acid A with itself is 4 × |{small, tiny, hydrophobic} ∩ {small, tiny, hydrophobic}| = 12.
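The scoring rule just described is small enough to write out directly. The sketch below uses only the two property sets quoted in the text for A and C; the function name similarity and the parameter name identity_weight are illustrative.

```python
from typing import Dict, Set

# Property sets for A and C as quoted in the text.
PROPS: Dict[str, Set[str]] = {
    "A": {"small", "tiny", "hydrophobic"},
    "C": {"small", "tiny", "polar", "hydrophobic"},
}


def similarity(a: str, b: str, props: Dict[str, Set[str]] = PROPS,
               identity_weight: int = 4) -> int:
    """Count shared properties; weight the count by 4 for identical residues."""
    shared = len(props[a] & props[b])
    return identity_weight * shared if a == b else shared


print(similarity("A", "C"))  # 3
print(similarity("A", "A"))  # 12
```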

The amino acid properties table, on the other hand, is derived from the BLOSUM62 table in 3 steps. First, BLOSUM62 is transformed into a binary table using a threshold of zero: scores greater than or equal to zero are replaced by 1, and scores less than zero are set to 0. Then, each property is defined by the largest group of amino acids that share binary 1. Finally, all the new properties are put in a table against the amino acids. All steps are described in Figure 4.
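The first two steps can be sketched as follows. This rests on several assumptions: the four-residue score table is a toy excerpt in the style of BLOSUM62 and its numbers should be treated as illustrative, and "the largest group of amino acids sharing binary 1" is read here as a maximal group in which every pair has a binary 1, which is one possible interpretation only. The brute-force search is suitable only for a handful of residues.

```python
from itertools import combinations
from typing import Dict, List, Set, Tuple

# Toy four-residue excerpt in the style of a substitution matrix.
TOY_SCORES: Dict[Tuple[str, str], int] = {
    ("A", "A"): 4, ("A", "R"): -1, ("A", "N"): -2, ("A", "D"): -2,
    ("R", "R"): 5, ("R", "N"): 0, ("R", "D"): -2,
    ("N", "N"): 6, ("N", "D"): 1,
    ("D", "D"): 6,
}


def binarize(scores: Dict[Tuple[str, str], int],
             threshold: int = 0) -> Dict[Tuple[str, str], int]:
    """Step 1: scores >= threshold become 1, others 0 (stored symmetrically)."""
    binary = {}
    for (a, b), s in scores.items():
        bit = 1 if s >= threshold else 0
        binary[(a, b)] = bit
        binary[(b, a)] = bit
    return binary


def maximal_groups(binary: Dict[Tuple[str, str], int]) -> List[Set[str]]:
    """Step 2 (assumed reading): maximal groups whose pairs all share a 1."""
    residues = sorted({a for a, _ in binary})
    groups: List[Set[str]] = []
    # Visit larger candidate groups first so that subsets can be skipped.
    for size in range(len(residues), 0, -1):
        for cand in combinations(residues, size):
            if all(binary[(a, b)] for a, b in combinations(cand, 2)):
                if not any(set(cand) < g for g in groups):
                    groups.append(set(cand))
    return groups


groups = maximal_groups(binarize(TOY_SCORES))
print(groups)  # three groups: {D, N}, {N, R} and {A}
```

Under this reading, each resulting group would act as one derived property, and an amino acid has that property if it belongs to the group.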

These two types of background knowledge have different strengths. The background knowledge based on BLOSUM62 is, in general, better statistically, while the background knowledge based on physico-chemistry yields motifs closer to PROSITE.

As Figure 3 shows, different motifs can be discovered from one binding or catalytic site with different background knowledge. Using the physico-chemistry table, we obtain [HTWYK] [CDENQST] [CNST] P H [KNQRT] [DNP] R [FILMV] [DENQS] . [ACDGNST] . . . as the motif. Using BLOSUM62, we obtain . . [ST] P H . . R . [ENS] . . . . . as the motif.

Reactive Site - Group Define

The motifs produced in the previous step from different records of the same binding or catalytic function are, by definition, redundant; they should be grouped together and represented as one motif, namely a reactive motif. This means that the 291 subgroups in the Binding and Catalytic Site Database would yield 291 reactive motifs.

Although motifs are derived from binding or catalytic sites in the same subgroup of the Binding and Catalytic Site Database, they can have different binding structures for the same substrate. In other words, there are many ways to "fit and function". In some cases, therefore, these motifs can be regrouped into several reactive motifs. The separation method, called "conserved region group define", is based on the conserved regions: motifs with the same amino acids at the same conserved positions are grouped together.
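The grouping by conserved regions can be sketched as a dictionary keyed on the conserved positions. The sketch assumes motifs are written as space-separated elements, as in the examples in the text, and treats a single-letter element as a conserved position; the function names and the short toy motifs are illustrative.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Signature = Tuple[Tuple[int, str], ...]


def conserved_signature(motif: str) -> Signature:
    """Return (position, residue) pairs for the single-residue elements."""
    elements = motif.split()
    return tuple(
        (i, e) for i, e in enumerate(elements)
        if len(e) == 1 and e.isalpha()   # "." and "[..]" are not conserved
    )


def group_by_conserved(motifs: List[str]) -> Dict[Signature, List[str]]:
    """Group motifs that have the same residues at the same conserved positions."""
    groups: Dict[Signature, List[str]] = defaultdict(list)
    for motif in motifs:
        groups[conserved_signature(motif)].append(motif)
    return dict(groups)


motifs = [
    ". . [ST] P H . . R",
    "[HT] . [CNST] P H [KN] . R",
    ". . [ST] P D . . R",
]
for signature, members in group_by_conserved(motifs).items():
    print(signature, len(members))
# ((3, 'P'), (4, 'H'), (7, 'R')) 2
# ((3, 'P'), (4, 'D'), (7, 'R')) 1
```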

As a result, 1,328 reactive motifs are obtained with the BLOSUM62 tool and 1,390 with the physico-chemistry table tool. This grouping process is called "reactive site-group define".

Reactive Motif-Based Prediction of Enzyme Class (Phase II)

In this phase, the problem is to construct an enzyme prediction model using the reactive motifs together with the known Enzyme Sequence Dataset (training dataset). The efficiency of our prediction model is compared with that of PROSITE, whose patterns are the original model that mutation control follows in order to create reactive motifs automatically.

For data preparation in phase II, the motifs and enzyme classes in PROSITE are selected for comparison. In PROSITE, there are 152 motifs of binding and catalytic sites. To be comparable, the same conditions used for the reactive motifs are applied. However, the condition that function classes have between 10 and 1000 members covers a very small number of motifs (36 motifs in 42 functions, 2,579 sequences) and yields very low accuracy. We therefore use the function classes having between 5 and 1000 members instead, which covers 65 motifs in 76 enzyme classes (2,815 sequences).

To construct the prediction model, given a set of motifs (reactive motifs or PROSITE motifs), we aim to induce classifiers that associate the motifs with enzyme functions. As suggested in [14], any protein chain can be mapped into an attribute-based representation. Such a representation supports the efficient functioning of data-driven algorithms, which represent instances as classified vectors over a fixed set of attributes. In our case, an enzyme sequence is represented as a set of reactive motifs (or PROSITE motifs, for comparison).

Suppose that N reactive motifs have been obtained from phase I. Each sequence is encoded as an N-bit binary pattern in which the ith bit is 1 if the corresponding reactive motif is present in the sequence and 0 otherwise. Each N-bit pattern is associated with an EC number (Enzyme Commission number). A training set is simply a collection of N-bit binary patterns, each with an associated EC number. This training set can be used to train a classifier, which can then be used to assign novel sequences to one of the EC numbers represented in the training set. The reactive motif-based representation procedure is given in Figure 5.
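The N-bit encoding can be illustrated with regular expressions. In this sketch the motifs are written as space-separated elements in the style used in the text, the sequence and the class labels "class-1" and "class-2" are made-up placeholders rather than real EC numbers, and the function names are hypothetical.

```python
import re
from typing import List, Tuple


def motif_to_regex(motif: str) -> str:
    """Space-separated motif elements are already regex atoms; drop the spaces."""
    return motif.replace(" ", "")


def encode(sequence: str, motifs: List[str]) -> List[int]:
    """Bit i is 1 if motif i occurs anywhere in the sequence, else 0."""
    return [1 if re.search(motif_to_regex(m), sequence) else 0 for m in motifs]


motifs = [
    ". . [ST] P H . . R . [ENS] . . . . .",  # BLOSUM62 example from the text
    "[RK] . . [DE] . . Y",                   # made-up second motif
]

# Placeholder training data: (sequence, class label).
labelled = [("AASPHAARAEAAAAA", "class-1"), ("GGKAADGGYGG", "class-2")]
training_set: List[Tuple[List[int], str]] = [
    (encode(seq, motifs), label) for seq, label in labelled
]
print(training_set)
# [([1, 0], 'class-1'), ([0, 1], 'class-2')]
```

Each row of training_set is one N-bit pattern with its label, which is the form a decision-tree learner expects.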

In this paper, we use Weka [15], a machine learning suite, to compare the efficiency of different enzyme function prediction models. The C4.5 decision tree (J4.8, Weka's implementation) is used as the learner in order to assess the efficiency of the reactive motifs in predicting enzyme functions.
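C4.5 chooses the attribute to split on by gain ratio, so for binary motif attributes the criterion can be illustrated in a few lines. This is a sketch of the split criterion only, not of Weka's J4.8 implementation (no tree building or pruning), and the labels "a" and "b" are placeholders for EC numbers.

```python
from collections import Counter
from math import log2
from typing import List, Sequence


def entropy(labels: Sequence[str]) -> float:
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def gain_ratio(bits: List[int], labels: List[str]) -> float:
    """Gain ratio of splitting `labels` on one binary motif attribute."""
    total = len(labels)
    remainder = 0.0    # weighted entropy after the split
    split_info = 0.0   # entropy of the split itself
    for value in (0, 1):
        subset = [lab for bit, lab in zip(bits, labels) if bit == value]
        if not subset:
            continue
        weight = len(subset) / total
        remainder += weight * entropy(subset)
        split_info -= weight * log2(weight)
    gain = entropy(labels) - remainder
    return gain / split_info if split_info > 0 else 0.0


labels = ["a", "a", "b", "b"]
print(gain_ratio([1, 1, 0, 0], labels))  # 1.0, the motif separates the classes
print(gain_ratio([1, 0, 1, 0], labels))  # 0.0, the motif is uninformative
```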

Experimental Results

In this part, we present results on the efficiency and quality of the reactive motifs in predicting enzyme functions. The results are divided into 2 sections: 1) the comparison of prediction accuracy between reactive motifs generated from different background knowledge (BLOSUM62 or the physico-chemistry table), and 2) the quality of each reactive motif.

Prediction Accuracy Comparison between Reactive Motifs Resulted from Different Background Knowledge

This section compares the prediction accuracy of different enzyme function prediction models built from reactive motifs that are generated from different background knowledge. The reactive motif generated from BLOSUM62 is called the BLOSUM-reactive motif. The reactive motif generated from Taylor's physico-chemistry table is called the physicochemistry-reactive motif. As a reference, reactive motifs without substitution group elements, retrieved from the conserved regions of the reactive motifs generated from BLOSUM62, are also used; they are called conserved amino acid-reactive motifs. In addition, the prediction accuracy of the enzyme function prediction model built from PROSITE motifs is presented. The dataset we used covers 235 enzyme function classes and 19,258 protein sequences. The enzyme function prediction models are created by the C4.5 learning algorithm with 5-fold cross-validation. The results are presented in Tables 3 and 4.

When the conserved region group define step is not applied, the prediction model using BLOSUM-reactive motifs gives the best result, 68.69% accuracy. When conserved region group define is applied, the prediction model using physicochemistry-reactive motifs gives the best result, 72.58% accuracy. The accuracies of all models are, however, very close.

Quality of Discovered Reactive Motifs

Without the learning algorithm, the quality of the motifs and reactive motifs in representing the sub-functions of binding or catalytic sites is measured and compared. The quality is represented by 2 values: the coverage value and the motifs found per enzyme sequence. The coverage value is the percentage of enzyme sequences in all related enzyme classes that the motifs match. For the sequences that the motifs or reactive motifs cover, the number of matching motifs is counted for each sequence, and the average over these sequences is called the motifs found per enzyme sequence.
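Both quality values can be computed from per-sequence motif-presence bit vectors. The sketch below assumes that coverage is the percentage of sequences matched by at least one motif and that the average is taken over covered sequences only; the function name and the toy vectors are illustrative, not results from the paper.

```python
from typing import List, Tuple


def coverage_and_motifs_per_sequence(encoded: List[List[int]]) -> Tuple[float, float]:
    """Return (coverage in percent, average motifs matched per covered sequence).

    `encoded` holds one list of 0/1 flags per sequence, one flag per motif.
    """
    if not encoded:
        return 0.0, 0.0
    covered = [bits for bits in encoded if any(bits)]
    coverage = 100.0 * len(covered) / len(encoded)
    per_sequence = (
        sum(sum(bits) for bits in covered) / len(covered) if covered else 0.0
    )
    return coverage, per_sequence


toy = [[1, 1, 0], [0, 0, 0], [1, 0, 0], [0, 1, 1]]
coverage, per_sequence = coverage_and_motifs_per_sequence(toy)
print(coverage)                 # 75.0
print(round(per_sequence, 2))   # 1.67
```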

For the coverage value in Table 5, higher is better. The motifs found per sequence, however, should theoretically be close to 2, because an enzyme has at least one type of binding site and one type of catalytic site. The reactive motifs using physico-chemistry background knowledge give the results closest to PROSITE, in both coverage value and motifs found per sequence.

Conclusion and Discussion

The process introduced here yields good results (~70% accuracy of enzyme function prediction) and can address the main problem, the insufficiency of binding and catalytic site data (~5.8% in our dataset). The reactive motifs using physico-chemistry background knowledge give the best results; although the coverage value is not satisfactory, the number of reactive motifs found per enzyme sequence is very good, which means that the motifs are very specific. The improvement in accuracy obtained with conserved region group define shows that the details in the mechanism descriptions are not complete. The quality of the descriptions of binding and catalytic sites should be improved.

Future Work

Besides regular-expression motifs, HMMs are another popular tool for identifying protein types, especially the "domain" and "family" types, as in PFAM and TIGRFAM. PROSITE has used HMM tools to identify 33 domains and families related to 154 enzyme functions, but has not used HMM profiles to identify any binding or catalytic site.

However, since we can produce quality blocks from several kinds of background knowledge, there is an opportunity to apply HMM profiles to identify binding and catalytic sites as well. We will take up this task in future work.

References


  1. Bairoch, A.: PROSITE: a dictionary of sites and patterns in proteins. Nucleic Acids Res. (1991) 19: 2241-2245.
  2. Sander, C. and Schneider, R.: Database of homology-derived protein structures and the structural meaning of sequence alignment. Proteins Struct. Funct. Genet. (1991) 9: 56-68.
  3. Huang, J.Y. and Brutlag, D.L.: The EMOTIF database. Nucleic Acids Res. (2001) 29: 202-204.
  4. Eidhammer, I., Jonassen, I. and Taylor, W.R.: Protein structure comparison and structure patterns. Journal of Computational Biology (2000) 7(5): 685-716.
  5. Bennett, S.P., Lu, L. and Brutlag, D.L.: 3MATRIX and 3MOTIF: a protein structure visualization system for conserved sequence. Nucleic Acids Res. (2003) 31: 3328-3332.
  6. Henikoff, S. and Henikoff, J.G.: Automated assembly of protein blocks for database searching. Nucleic Acids Res. (1991) 19: 6565-6572.
  7. Barton, G.J.: Protein multiple sequence alignment and flexible pattern matching. Methods Enzymol. (1990) 183: 403-428.
  8. Taylor, W.R.: The classification of amino acid conservation. J. Theor. Biol. (1986) 119(2): 205-218.
  9. Wu, T.D. and Brutlag, D.L.: Discovering Empirically Conserved Amino Acid Substitution Groups in Databases of Protein Families. Proc. Int. Conf. Intell. Syst. Mol. Biol. (1996) 4: 230-240.
  10. Bairoch, A. and Apweiler, R.: The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000. Nucleic Acids Res. (2000) 28: 45-48.
  11. Nomenclature Committee of the International Union of Biochemistry and Molecular Biology (NC-IUBMB): Enzyme Nomenclature. Recommendations 1992. Academic Press, 1992.
  12. Henikoff, S. and Henikoff, J.G.: Amino acid substitution matrices from protein blocks. Proc. Natl. Acad. Sci. USA (1992) 89: 10915-10919.
  13. Smith, H.O., Annau, T.M. and Chandrasegaran, S.: Finding sequence motifs in groups of functionally related proteins. Proc. Natl. Acad. Sci. USA (1990) 87(2): 826-830.
  14. Diplaris, S., Tsoumakas, G., Mitkas, P.A. and Vlahavas, I.: Protein Classification with Multiple Algorithms. In Proc. of 10th Panhellenic Conference in Informatics, Volos, Greece, November 21-23, Springer-Verlag, LNCS, 2005.
  15. Frank, E., Hall, M., Trigg, L., Holmes, G. and Witten, I.H.: Data mining in bioinformatics using Weka. Bioinformatics (2004) 20(15): 2479-2481. doi:10.1093/bioinformatics/bth261.