Nearest Neighbor Algorithm Based Predictor Biology Essay


Identifying and characterizing the interactions between enzymes and compounds is of great value for understanding the molecular and cellular functions of organisms. In this study, we developed a novel method for predicting enzyme-compound interactions based on a machine learning approach. Enzyme-compound pairs are represented by the functional group composition of the compounds and the biochemical and physicochemical description of the proteins. Tested by Jackknife cross-validation, our predictor achieved an overall accuracy of 87.47%, showing an acceptable efficiency. An online predictor is available at

Keywords: machine learning, feature selection, Minimum Redundancy Maximum Relevance, biochemical and physicochemical description of proteins, chemical functional group

1 Introduction

The environment organisms live in is constantly changing. To adapt to these changes, organisms regulate the functions of their functional biomolecules, especially macromolecules such as proteins and nucleic acids. However, much smaller organic molecules, referred to as small molecules, also play very important roles in these regulatory processes. They are involved in most essential biological processes, such as energy production and signal transduction. By interacting with biological macromolecules, small molecules can influence the macromolecules' functions. Accurately predicting the interactions between small molecules and macromolecules, and uncovering their mechanisms, would greatly benefit molecular biology and medicine, since many drugs are themselves small molecules that interact with macromolecules in organisms.

In this study, we focus on the interactions between enzymes and small molecules in metabolism. Metabolism is the overall process by which living systems produce the energy and products needed to perform biological functions. Enzymes, which play important roles in organisms by catalyzing different biological reactions, are thus essential parts of metabolism. The network of interactions between enzymes and small molecules forms a vital part of metabolism.

It is difficult and costly to study such a network because of its complexity. Some published resources provide access to experimentally validated interactions between enzymes and small molecules [1, 6, 20, 32-33]. However, many interactions remain unknown. There is thus great demand for a reliable and effective approach to predicting interactions between enzymes and small molecules.

Here we report an enzyme-small molecule interaction predictor based on the Nearest Neighbor Algorithm, one of the best-known machine learning approaches. 505 small molecules and 787 enzymes obtained from KEGG [16, 19] were used, forming a data set with 3551 interacting pairs and 31959 non-interacting ones. For prediction, each sample is encoded as a feature vector. Generally, compounds with similar structures and functional groups interact with enzymes in similar ways, and the count of each chemical functional group [33] is a common representation of compounds. As biochemical and physicochemical features [13, 22, 30] are important for a protein's function, they can be used to represent different proteins. Because most enzymes are proteins, and we focus only on protein enzymes, the biochemical and physicochemical feature representation is used for the protein in each sample. To improve the predictor's performance, the minimum Redundancy Maximum Relevance (mRMR) algorithm [25] is used to analyze features. Incremental Feature Selection and Forward Feature Selection are also used for feature selection. Cross-validation is used to test the accuracy of our predictor, which finally obtained an overall accuracy of 87.47%.

2 Materials and methods

2.1 Data set

In this paper, we used the same method as Niu et al. [23] to obtain the data set for analysis. Molecules available in KEGG [16, 19] were downloaded, and those that do not participate in any metabolic pathway were removed. Finally, 505 small molecules and 787 enzymes were obtained. The positive sample set contains pairs of an enzyme and a small molecule that interact with each other in metabolic pathways, while the negative sample set contains pairs that do not interact. We cannot rule out the chance that some pairs in the negative sample set are in fact positive ones; however, the chance is small and we ignored it in this study. 3647 interacting pairs were obtained in this way. After removing those whose small molecules cannot be represented by functional groups, we finally got 3551 positive pairs. To improve the reliability of our predictor, we randomly chose 31959 negative pairs from the negative sample set of 393768 non-interacting pairs in total. This data set of 35510 (3551+31959) enzyme-small molecule pairs was used to train and test our predictor. Each pair was coded, as described below, into a vector with 160 dimensions. Finally, we obtained an input matrix with 35510 rows and 161 columns (the additional column is the class label; Supplemental Material 1 shows the whole matrix).

2.2 Coding small molecules into fixed-length vectors with their chemical functional groups

The number of organic compounds is tremendously large. However, many of them are composed of certain fixed structures, and these structures are usually responsible for the characteristic chemical reactions of the molecules. There are over one hundred types of functional groups, of which we selected 28 for our study: alcohol, aldehyde, amide, amine, hydroxamic acid, phosphorus, carboxylate, methyl, ester, ether, imine, ketone, nitro, halogen, thiol, sulfonic acid, sulfone, sulfonamide, sulfoxide, sulfide, ar_5c_ring, ar_6c_ring, non_ar_5c_ring, non_ar_6c_ring, hetero_ar_6_ring, hetero_non_ar_5_ring, hetero_non_ar_6_ring, and hetero_ar_5_ring. For each organic compound, the number of occurrences of each of the 28 functional groups is counted and used as one feature of the small molecule. With this method, we can code each small molecule into a vector with 28 dimensions. This vector records the occurrence of the 28 important functional groups in the molecule and therefore represents its characteristic chemical reactions.
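As a sketch of this encoding, the 28 counts can be assembled into a fixed-order vector. The group names below follow the list above; the per-compound counts are assumed to come from a chemistry toolkit or a lookup table, and the ethanol counts in the usage note are illustrative:

```python
# The 28 functional groups used in this study, in a fixed order.
FUNCTIONAL_GROUPS = [
    "alcohol", "aldehyde", "amide", "amine", "hydroxamic_acid", "phosphorus",
    "carboxylate", "methyl", "ester", "ether", "imine", "ketone", "nitro",
    "halogen", "thiol", "sulfonic_acid", "sulfone", "sulfonamide", "sulfoxide",
    "sulfide", "ar_5c_ring", "ar_6c_ring", "non_ar_5c_ring", "non_ar_6c_ring",
    "hetero_ar_6_ring", "hetero_non_ar_5_ring", "hetero_non_ar_6_ring",
    "hetero_ar_5_ring",
]

def encode_compound(group_counts):
    """Code a small molecule as a 28-dimensional count vector.

    `group_counts` maps functional group names to their number of
    occurrences in the molecule; absent groups count as zero.
    """
    return [group_counts.get(g, 0) for g in FUNCTIONAL_GROUPS]
```

For ethanol, which contains one hydroxyl and one methyl group, `encode_compound({"alcohol": 1, "methyl": 1})` yields a 28-dimensional vector whose only non-zero entries are at the alcohol and methyl positions.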

2.3 Biochemical and physicochemical description of proteins

To code a protein into a vector, the biochemical and physicochemical description of proteins is used. [14, 23] Biochemical properties include the amino acid composition and secondary structure of a protein sequence, while physicochemical properties, on the other hand, include 5 types of amino acid characteristics: hydrophobicity, polarizability, solvent accessibility, normalized van der Waals volume, and polarity. [14]

For the secondary structure of a protein sequence and its physicochemical properties, its amino acids can be classified into different groups. For secondary structure, amino acids in a given sequence are represented by their secondary structure state (helix, sheet, or coil) as predicted by Predator. [15] For hydrophobicity, each amino acid is classified as polar, neutral, or hydrophobic. [7] For solvent accessibility, each amino acid residue is marked as buried or exposed to solvent, as predicted by PredAcc. [22] For the other three types of properties, polarizability, normalized van der Waals volume, and polarity, amino acids are divided into three groups according to their values.

For each type of property, a protein sequence can be transformed into a property sequence based on the group of each amino acid. For instance, in our study, P, N and H are used to represent the three hydrophobicity groups: polar, neutral, and hydrophobic. The protein sequence "DMAEIEEKFSKTIEQEKQAGMLLPKNATDKSKESTQEPLTEPLKKMSDKP" can thus be transformed into the letter sequence "PHNPHPPPHNPNHPPPPPNNHHHNPPNNPPNPPNNPPNHNPNHPPHNPPN".

For each letter sequence, three important properties are computed: composition (C), transition (T), and distribution (D). C describes the global percent composition of each group in the letter sequence; T describes the percent frequency with which one letter changes to another along the sequence; and D represents the distribution pattern of the letters along the sequence, measuring the percentage of the sequence length within which the first occurrence and the first 25%, 50%, 75%, and 100% of the amino acids of each letter are located. Because the lengths of the letter sequences of different proteins vary, these three properties are used to represent the protein.

Take the 50-letter sequence above as an instance. It is composed of 10 Hs, 16 Ns, and 24 Ps. The feature C is 10/50=20%, 16/50=32%, and 24/50=48% for H, N and P respectively. For the feature T, there are 32 transitions in total in the sequence, with 9 between H and N, 16 between N and P, and 7 between H and P, so the feature T is 9/32=28.12%, 16/32=50.00%, and 7/32=21.88% respectively. The first occurrence and the first 25%, 50%, 75%, and 100% of the Hs are located at positions 2, 5, 21, 39, and 46 of the letter sequence. Thus the feature D of H is 2/50=4%, 5/50=10%, 21/50=42%, 39/50=78%, and 46/50=92%. The feature D of N and P can be calculated in the same way: for N it is 6%, 38%, 56%, 76%, and 100%, and for P it is 2%, 22%, 50%, 72%, and 98%. Altogether, the three properties of this letter sequence are C=(20%, 32%, 48%), T=(28.12%, 50.00%, 21.88%), and D=(4%, 10%, 42%, 78%, 92%, 6%, 38%, 56%, 76%, 100%, 2%, 22%, 50%, 72%, 98%), 21 features in total.
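The C, T, D computation can be sketched as follows. The quantile positions are taken by rounding the occurrence index, which reproduces the worked example above; this is a sketch, and the exact rounding convention of the original descriptors may differ:

```python
from itertools import combinations

def compute_ctd(seq, alphabet=("H", "N", "P")):
    """Composition, transition, and distribution features of a letter sequence."""
    n = len(seq)
    # C: global fraction of each letter.
    C = {a: seq.count(a) / n for a in alphabet}
    # T: fraction of adjacent positions where one letter changes to another,
    # counted per unordered letter pair.
    changes = [tuple(sorted(p)) for p in zip(seq, seq[1:]) if p[0] != p[1]]
    total = len(changes)
    T = {pair: changes.count(pair) / total
         for pair in combinations(sorted(alphabet), 2)}
    # D: positions (as a fraction of sequence length) of the first occurrence
    # and of the 25%, 50%, 75%, and 100% quantile occurrences of each letter.
    D = {}
    for a in alphabet:
        pos = [i + 1 for i, ch in enumerate(seq) if ch == a]
        m = len(pos)
        ranks = [1] + [max(1, round(q * m)) for q in (0.25, 0.5, 0.75, 1.0)]
        D[a] = [pos[r - 1] / n for r in ranks]
    return C, T, D
```

On the 50-letter hydrophobicity sequence above, this returns C(H)=20%, T(H,N)=9/32≈28.12%, and D(H)=(4%, 10%, 42%, 78%, 92%).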

For solvent accessibility, there are only two groups rather than three, resulting in only seven features from its letter sequence. The amino acid composition contributes 20 features, each representing the percentage composition of the corresponding amino acid. For each of the other five types of protein properties, 21 features are obtained as described above. In total, 132 features (5 × 21 + 20 + 7 = 132) are obtained to represent a protein sequence.

2.4 Minimum Redundancy Maximum Relevance (mRMR)

The Minimum Redundancy Maximum Relevance (mRMR) method, developed by Peng et al. in 2005, was first used to deal with microarray data. [25] Here we use it for feature analysis and selection. It ranks each feature according to its relevance to the target and its redundancy with other features. As the name indicates, a good feature in mRMR has maximum relevance to the target and minimum redundancy with other features. To calculate relevance and redundancy, mutual information (MI), which estimates how much one vector is related to another, is used:

I(x, y) = ∫∫ p(x, y) log [ p(x, y) / (p(x) p(y)) ] dx dy        (1)


where x and y are two vectors, p(x,y) is the joint probabilistic density, p(x) and p(y) are the marginal probabilistic densities.

Suppose Ω is the whole feature set, Ωs denotes the already-selected feature set containing m features, and Ωt denotes the to-be-selected feature set containing n features. The relevance D of a feature f in Ωt to the target c can be calculated as:

D = I(f, c)        (2)


while the redundancy R of the feature f in Ωt with respect to the features in Ωs can be calculated as:

R = (1/m) Σ_{fi∈Ωs} I(f, fi)        (3)


To select the feature with maximum relevance and minimum redundancy, Eq(2) and Eq(3) are combined into the mRMR function:

max_{f∈Ωt} [ I(f, c) − (1/m) Σ_{fi∈Ωs} I(f, fi) ]        (4)


The feature selected by the mRMR function is taken to be the best feature in Ωt.

For a whole feature set with N features, this feature evaluation procedure runs N rounds. In each round, the feature with maximum relevance and minimum redundancy is selected and added to Ωs. After these N rounds of feature evaluation with the mRMR function, we obtain an ordered feature set:

S = { f′1, f′2, …, f′h, …, f′N }        (5)


In this feature set, each feature has an index h, which indicates the round in which the feature was selected. The better a feature is, the earlier it satisfies Eq(4), the earlier it is selected, and the smaller its index h.

Because the calculation of MI requires the joint probabilistic density and the marginal probabilistic densities of the two vectors, each continuous variable should be discretized into several groups according to its value. In mRMR, a parameter t is used to discretize each feature in our data into three categorical states according to mean ± (t·std): values smaller than mean−(t·std), values between mean−(t·std) and mean+(t·std), and values larger than mean+(t·std), where mean is the mean value of the feature over all samples and std is its standard deviation. In our study, t was set to 1.
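A minimal sketch of this procedure, with the three-state discretization at mean ± t·std and a frequency-based MI estimate; this is an illustration of the greedy mRMR idea (features and target assumed to be NumPy arrays), not the downloaded mRMR program itself:

```python
import numpy as np

def discretize(x, t=1.0):
    """Map a continuous feature to 3 states split at mean - t*std and mean + t*std."""
    lo, hi = x.mean() - t * x.std(), x.mean() + t * x.std()
    return np.digitize(x, [lo, hi])

def mutual_info(a, b):
    """MI between two discrete vectors, estimated from empirical frequencies."""
    mi = 0.0
    for va in np.unique(a):
        for vb in np.unique(b):
            p_ab = np.mean((a == va) & (b == vb))
            if p_ab > 0:
                mi += p_ab * np.log(p_ab / (np.mean(a == va) * np.mean(b == vb)))
    return mi

def mrmr_order(X, y, t=1.0):
    """Greedy mRMR: each round pick the feature maximizing relevance minus
    mean redundancy with the already-selected set, as in Eq(4)."""
    Xd = np.column_stack([discretize(X[:, j], t) for j in range(X.shape[1])])
    relevance = [mutual_info(Xd[:, j], y) for j in range(X.shape[1])]
    selected, remaining = [], list(range(X.shape[1]))
    while remaining:
        def score(j):
            if not selected:
                return relevance[j]
            redundancy = np.mean([mutual_info(Xd[:, j], Xd[:, s]) for s in selected])
            return relevance[j] - redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```

The returned list is the ordered feature set S: the position of a feature in the list is its index h.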

2.5 Nearest Neighbor Algorithm (NNA)

The Nearest Neighbor Algorithm (NNA) is a simple but useful algorithm for vector classification. Its classification is based on the distances between the vector to be tested and all the vectors in the training set. In NNA, the vector to be predicted, pt, is assigned the same class as its nearest neighbor pn in the training set {p1, p2, …, pN}, that is, the training vector with the smallest distance to pt.

In this study, the distance between two vectors, px and py, is defined as: [26]

D(px, py) = 1 − (px · py) / (||px|| · ||py||)        (6)


where px·py is the inner product of px and py, and ||p|| is the modulus of vector p.
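A sketch of NNA with this cosine-based distance, assuming NumPy arrays:

```python
import numpy as np

def distance(px, py):
    """One minus the cosine of the angle between px and py."""
    return 1.0 - float(np.dot(px, py)) / (np.linalg.norm(px) * np.linalg.norm(py))

def nna_classify(train_X, train_y, query):
    """Assign the query vector the class of its nearest training vector."""
    dists = [distance(query, p) for p in train_X]
    return train_y[int(np.argmin(dists))]
```

`nna_classify` returns the label of the training vector with the smallest distance to the query; ties go to the first such vector.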

2.6 Jackknife cross-validation

Many predictors use jackknife cross-validation [3-4, 8-11, 18, 28-29, 34] to examine prediction quality. As demonstrated by Eq.50 in [12], the jackknife test is deemed the most objective and effective of the three cross-validation methods, i.e., the independent dataset test, the subsampling test, and the jackknife test. Here we use it to test the performance of our classifiers. [2, 21] In jackknife cross-validation, each sample in turn is used as the test sample while all the other samples form the training set. To evaluate the performance, the following three accuracies are used:

accuracy of positive samples = TP / (TP + FN)
accuracy of negative samples = TN / (TN + FP)
overall accuracy = (TP + TN) / (TP + TN + FP + FN)

where TP, TN, FP and FN denote the numbers of true positives, true negatives, false positives and false negatives, respectively.
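The jackknife procedure can be sketched generically over any classifier; here `classify` is a placeholder for, e.g., the NNA predictor:

```python
import numpy as np

def jackknife(X, y, classify):
    """Leave-one-out test: each sample is predicted from all the others.

    `classify(train_X, train_y, query)` returns a predicted label; the
    function returns the three accuracies used in this study.
    """
    idx = np.arange(len(y))
    preds = np.array([classify(X[idx != i], y[idx != i], X[i]) for i in idx])
    return {
        "positive": float(np.mean(preds[y == 1] == 1)),
        "negative": float(np.mean(preds[y == 0] == 0)),
        "overall": float(np.mean(preds == y)),
    }
```

Each sample is thus tested exactly once, on a model that never saw it during training.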

2.7 Incremental Feature Selection (IFS)

After the mRMR step, the ordered feature set S tells us which features are better. The next step of feature selection is to determine which features should be selected. First, the Incremental Feature Selection method is used. Based on the ordered feature set S, we can build N feature subsets, the i-th of which is defined as:

Si = { f′1, f′2, …, f′i },  i = 1, 2, …, N        (7)


For each feature subset, NNA is used to construct a classifier for the data set, and jackknife cross-validation is used to evaluate the classifier's performance. The results are plotted as an IFS curve, with the index i on the x-axis and the accuracy on the y-axis.
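The IFS loop can be sketched as follows; `classify` and the leave-one-out accuracy are simplified stand-ins for the NNA predictor and the jackknife evaluation described above:

```python
import numpy as np

def loo_accuracy(X, y, classify):
    """Overall jackknife (leave-one-out) accuracy of a classifier."""
    idx = np.arange(len(y))
    preds = [classify(X[idx != i], y[idx != i], X[i]) for i in idx]
    return float(np.mean(np.array(preds) == y))

def ifs_curve(X, y, order, classify):
    """Accuracy of a model built on the first i features of the mRMR
    ordering, for i = 1 .. N; the apex of this curve picks the subset."""
    return [loo_accuracy(X[:, order[:i]], y, classify)
            for i in range(1, len(order) + 1)]
```

Plotting the returned list against i gives the IFS curve; the i at its maximum is the selected subset size.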

2.8 Forward Feature Selection (FFS)

To obtain an even better classifier and feature selection, Forward Feature Selection (FFS) based on the result of IFS is also used in our study. We initialize the FFS-selected feature set Ss with the first k features of S:

Ss = { f′1, f′2, …, f′k }        (8)


Generally, k can be set to the x-axis value of the first maximum of the IFS curve. More features are then added to form larger FFS-selected feature sets until the appointed final size is reached. Given the last FFS-selected feature set Ss, the next FFS-selected feature set is obtained by the following steps:

1) Let the candidate feature set be Sc = S − Ss. Each feature in Sc is tentatively added to Ss; an NNA predictor based on each resulting Ss is built and evaluated by jackknife cross-validation.

2) The feature f which gets the highest accuracy when added to Ss is selected.

3) The feature f is removed from Sc and added to Ss, forming the next FFS-selected feature set.
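Steps 1)-3) can be sketched as a greedy loop; here `evaluate` stands in for the jackknife accuracy of an NNA predictor on a candidate feature subset (the names are illustrative):

```python
import numpy as np

def ffs(X, y, order, k, final_size, evaluate):
    """Forward Feature Selection seeded with the first k mRMR features.

    In each round, every candidate feature is tentatively added, the one
    giving the highest `evaluate(X_subset, y)` score is kept, and the loop
    stops when the selected set reaches `final_size`.
    """
    selected = list(order[:k])
    candidates = [f for f in order if f not in selected]
    while len(selected) < final_size and candidates:
        best = max(candidates, key=lambda f: evaluate(X[:, selected + [f]], y))
        selected.append(best)
        candidates.remove(best)
    return selected
```

Unlike IFS, which always follows the mRMR ordering, FFS re-evaluates every remaining feature at each step, so it can pull in a feature ranked lower by mRMR if it raises the chosen accuracy more.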

3 Results and discussions

3.1 The results of mRMR

The mRMR program used in our study was downloaded from its authors' website. Its output consists of two lists. The mRMR list gives the index of each feature in the ordered feature set S mentioned in Materials and methods. Besides the mRMR list, there is another list, the MaxRel list, which contains the relevance of each feature to the target, as defined in Eq(2). However, only the mRMR list was used in the feature analysis and selection process. The output of mRMR can be seen in Supplemental Material 2.

3.2 The result of IFS

With the mRMR list, we obtained the ordered feature set S. 160 feature subsets were built based on S, and 160 corresponding models were constructed. Using jackknife cross-validation, the performances of these models were plotted as the IFS curve (see Figure 1) described in Materials and methods. The overall accuracy at the apex of the IFS curve is 87.47%, with 53 features. Figure 1 also shows the accuracy of positive samples and the accuracy of negative samples at each point. Because of the imbalance between the numbers of positive and negative samples, the classifiers tended to perform better on the negative samples.

3.3 The result of FFS

The initial number of features in FFS, the parameter k mentioned above, was set to 30, and the final number of features was set to 60 in this study. Because of the relatively low accuracy on positive samples, that accuracy was used in FFS as the criterion for feature selection. Figure 2 shows the FFS curves with the overall accuracy, accuracy of positive samples, and accuracy of negative samples on the y-axis, respectively. FFS performed better than IFS in predicting positive samples; Figure 3 compares the accuracy-of-positive-samples curves of the two methods. With the overall accuracy kept above 0.75, the highest accuracy of positive samples was 0.6956, at index 38. Table 1 shows the 39 features selected by FFS.

3.4 Discussions

To find out which features are more important for interaction prediction, we counted the features with different biological meanings in the FFS-selected feature subset. There are 16 features from the biochemical and physicochemical description of proteins, while the other 22 are functional groups of small molecules. This reflects the apparent fact that both the enzymes and the compounds are critical for the interaction process.

We also grouped the 16 selected features of the biochemical and physicochemical description of enzymes into their corresponding categories. Figure 4 shows the frequencies of features in the 7 groups mentioned in the Methods section. It suggests that secondary structure and amino acid composition are essential to enzyme-substrate interactions in our 132-dimensional prediction model. Many enzyme-substrate interaction structures have been examined by X-ray crystallography, NMR, and electron and single-molecule microscopy. The configuration of a stereocenter in the 17β side chain influences the inhibitory activity of synthesized steroidal compounds on the enzyme 5α-reductase (5AR) [17]. The key side chain of FGly69 determines the active center of arylsulfatase A and its interaction with substrate during catalysis [31]. These data support the crucial role that secondary structure plays in enzyme-substrate interactions. Förster resonance energy transfer (FRET) has been used to image the enzyme-substrate interaction of PTP1B in RTK signaling in the cytoplasm, finding that D181 of PTP1B may be important for generating the distinct cellular environments that permit RTK signal transduction and mediate its eventual termination [35]. By X-ray crystallography, a glycine in the P5 position and the hydrophobic residues P4 (Val34) and P9 (Val29) of the serine protease thrombin play a central role in the blood coagulation process [27]. Tyr265 is important for the interaction of the enzyme alanine racemase with its natural substrate (L-alanine) and cofactor (pyridoxal 5-phosphate) in a theoretical model [24]. These data support the crucial role that amino acid composition plays in enzyme-substrate interactions.

In total, 28 functional group features were considered in our study, 22 of which were selected. This suggests that there is no strong bias in which functional groups participate in enzymatic reactions; that is, all of the functional groups may participate in some reaction with enzymes. Because nearly all of the functional group features were selected by IFS and FFS, both based on the NNA classifier, it also indicates that compounds with similar functional groups may work with enzymes with similar biochemical and physicochemical features, and act in similar classes of pathways with correlated functions. [5] The classical glycolysis pathway is a significant instance: 11 compounds and 10 enzymes participate in this pathway, and in its 10 reactions, 8 of the 28 functional groups are present. Table 2 shows the functional group composition of the 11 compounds. From the table, we can easily follow the transformations of different functional groups across reactions; for example, an additional alcohol, phosphorus, ketone, and hetero_non_ar_6c_ring are formed when D-glucose is transformed into alpha-D-glucose 6-phosphate by the enzyme hexokinase. This shows that we should not discard any functional group feature lightly, for most of them are important in some interactions with particular enzymes.

4 Conclusions

In this study, we constructed a novel predictor of enzyme-compound interactions by analyzing the functional group composition of compounds and the biochemical and physicochemical features of enzymes, achieving an acceptable performance with an overall accuracy of 87.47% in the jackknife cross-validation test. Our predictor may therefore be useful for predicting new enzyme-compound interaction pairs in different pathways. An online predictor has also been developed, available at