This essay has been submitted by a student. This is not an example of the work written by our professional essay writers.
Given a compounds-forming system, i.e., a system consisting of some compounds and their relationship, can it form a biologically meaningful pathway? It is a fundamental problem in system biology. Nowadays, there are a lot of information on different organisms, at both genetic and metabolic levels and many specific databases, such as KEGG/LIGAND, ENZYME, BRENDA, EcoCyc and MetaCyc, have been established to stored these information. Based on these data, it is feasible to address such an essential problem. Metabolic pathway is one kind of compounds-forming systems and we analyzed them in yeast by extracting different (biological and graphic) features from each of the 13,736 compounds-forming systems, of which 136 are positive pathways, i.e., known metabolic pathway from KEGG; while 13,600 were negative, i.e., not formed as a biologically meaningful pathway according to the current information. Each of these compounds-forming systems was represented by 144 features, of which 88 are graph features and 56 biological features. "Minimum Redundancy Maximum Relevance" and "Incremental Feature Selection" were utilized to analyze these features and 16 optimal features were selected as being able to predict a query compounds-forming system most successfully. It was found through Jackknife cross-validation that the overall success rate of identifying the positive pathways was 74.26%. It is anticipated that this novel approach and encouraging result may give meaningful illumination to investigate this important topic.
Keywords: Compounds-forming system, Metabolic pathway, Minimum redundancy maximum relevance, Nearest neighbor algorithm, Jackknife cross-validation, Compound similarity, Chemical functional group
A compounds-forming system, i.e., a system consisting of some compounds and their relationship, is an important research area in system biology. It is still a great challenge to predict that a given compounds-forming system can form a meaningful biologically pathway or not.
During the past decade, bioinformatics has seen an explosion in the amount of biological data and undergone rapid development. The information on different organisms has been accumulated a lot both on genetic and metabolic levels. Thus some specific databases, such as KEGG/LIGAND [1, 2], ENZYME , BRENDA , EcoCyc and MetaCyc [5, 6], has been established to store these information. KEGG (Kyoto Encyclopedia of Genes and Genomes) [1, 2, 7] is a widely used database for systematic analysis of gene functions in terms of the interactions between genes and molecules, which consists of graphical diagrams of biochemical pathways. It is known that metabolic pathways, one kind of compounds-forming systems and one of the most important components in KEGG, have been developed and accumulated quickly, which make it possible to study metabolic pathways systematically . Analysis of metabolic pathways is very useful to understand the relationship between genotype to phenotype. It involves the following important problems: biological evolution processes interpretation, recognition of metabolites common to a set of functionally-related metabolic pathways and alternative metabolic pathways determination. Moreover, it can give help in metabolism modeling and function prediction. Metabolic pathway network is very important to study pharmacological targets, biology evolution and other biotechnological applications [9, 10].
For most pathways in KEGG, it is barely possible to acquire their graph characteristics by manual query execution. The current study developed a new approach to address this problem that may give meaningful illumination to in-depth studying various pathway network systems.
MATERIALS AND METHODS
The data of metabolic pathways was collected from the public available database KEGG/LIGAND (ftp://ftp.genome.jp/pub/kegg/pathway/map/). Since there are compounds without information of compound similarity or biological properties, we remove these compounds in each pathway. Also, pathways involving less than three compounds were also excluded. As a result, 136 metabolic pathways, or compounds-forming systems, in yeast were obtained and they are termed as "positive pathways". The 136 positive pathways as well as the compound codes contained in each of such pathways are given in Online Supporting Information A1.
The data of negative pathways was generated by the following two ways. Firstly, randomly select compounds as the vertices of a graph, followed by creating some arcs between these compounds in random. Note that the number of arcs in each negative pathway was assigned according to the size distribution of the arcs in the positive pathways. Secondly, replace about half of compounds in each positive pathway by other compounds, and the arcs between the compounds, including both the original and the replaced ones, remain unchanged. Since positive pathways are very rare comparing to the vast majority of negative pathways, the number of negative pathways thus generated was 100 times as many as that of the positive ones in the current study. The 13,600 negative pathways are given in Online Supporting Information A2.
As graphic approaches can give useful intuitive insights, it is widely utilized to study biological systems, such as drug metabolism systems , protein folding kinetics , inhibition of HIV-1 reverse transcriptase [13-15], and biological and medical related problems [16-19].
Both graph features and biological properties were used to code each pathway in the current study. Since reactions between two compounds in each metabolic pathway are directional, i.e. one compound can be transformed into another compound with the participation of certain enzyme while the reverse direction does not always hold, each metabolic pathway can be seen as a directed graph where the vertices indicate compounds and the arcs reactions. In this study, 88 graph features are extracted from each directed graph that represents a pathway, and 56 features about biological properties were derived from chemical functional groups. Therefore, there are totally 88+56=144 features. In this case, we can define each of the 13,736 pathways in a 144-D (dimensional) space, see Online Supporting Information B1 and Online Supporting Information B2 for the codes of 136 positive pathways and 13,600 negative pathways, respectively.
Many graph features were derived in [20-22], where the features were extracted from an undirected graph and these features were successfully used to identify protein complex [23, 24]. Since each pathway can be deem as a directed graph, we made some modification to these features. The arcs are weighted by the compound similarity of the corresponding two compounds, which will be given detail explanation in Section "Compound Similarity". The 144 features are divided into the following groups.
Graph size and graph density: Let be a pathway graph, with vertices and arcs. The graph size is the number of compounds in the pathway. Suppose is the theoretical maximum number of possible arcs in. The graph density is defined as . 
Degree statistics: The in-degree (out-degree) is defined as the number of in-neighbors (out-neighbors) of a vertex. Mean in-degree, variance of in-degrees, median in-degree, maximum in-degree, mean out-degree, variance of out-degrees, median out-degree and maximum out-degree are taken as features. 
Edge weight statistics: Let be a weighted pathway graph where each arc is weighted by a weight in the range of . It is possible that for some arc , we extracted features from two cases: (a) all arcs in graph are considered including those with zero weights, then take mean and variance of these weights as features; (b) arcs with non-zero weights are considered so as to take mean and variance of the non-zero weights as features. 
Topological change: Let be a weighted pathway graph. This group of features is obtained by measuring the topological changes when different cutoffs of the weights are applied to the graph. The weight cutoffs included 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7 and 0.8. Let be the graph that only includes the arcs with weights higher than remained, i.e.. Topology changes are measured as for ( if ).
Degree correlation: Let be a pathway graph with . For each vertex , denote its in-neighbors as and out-neighbors as . Suppose and are two induced subgraphs of . Define ( if ) and ( if ). Take the mean, variance and maximum of and , respectively. 
Clustering: Let be a pathway graph with . For each vertex , denote its in-neighbors as and out-neighbors as . Let and be two induced subgraphs of . Define ( if ) and ( if ). Take the mean, variance and maximum of and , respectively, as features. 
Topological: Let be a pathway graph with . For each pair of vertices , denote as the number of both in-neighbor of and in-neighbor of , as the number of both in-neighbor of and out-neighbor of , as the number of both out-neighbor of and in-neighbor of and as the number of both out-neighbor of and out-neighbor of . For each vertex , denote and as the number of in-neighbors and out-neighbors of . Let ( if ), ( if ), ( if ), ( if ). For each vertex , let be the mean of for . Topological features are defined as the mean, variance and maximum of for , respectively. 
Singular values: Let be a pathway graph and be its adjacent matrix. The first three largest singular values are taken as the features. 
Local density change: Let be a pathway graph with . For each vertex , let and be the in-neighbors and out-neighbors of . We only show how to gain features from out-neighbors of each vertex under different cutoffs, which included 0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8 and 0.9. Construct a weighted undirected complete graph with vertices and the weight of each pair of vertices is the compound similarity of the corresponding compounds (see Section "Compound Similarity"). Suppose the cutoff is , which may be 0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8 or 0.9. Extract a spanning subgraph of with edges whose weights are greater than . Compute ( if ). Take the mean and maximum of as features under cutoff .
The above features are for graph representation, while the following features are for biological properties. Encouraged by the successes of using chemical functional groups in tackling some biological problems [25, 26], they were taken as the biological properties in the current study. Suppose a pathway consists of compounds, the mean and maximum values of biological properties of the compounds are taken as features.
Chemical functional groups: Organic compounds consisted of some fixed structures, which are called functional groups of atoms combined in different ways. Compounds with the same functional group generally react in a similar way. Currently, there are more than 100 functional groups. In the current study, we selected 28 common ones. These groups are: "alcohol", "aldehyde", "amide", "amine", "hydroxamic acid", "phosphorus", "carboxylate", "methyl", "ester", "ether", "imine", "ketone", "nitro", "halogen", "thiol", "sulfonic acid", "sulfone", "sulfonamide", "sulfoxide", "sulfide", "a_5c_ring", "ar_6c_ring", "non_ar_5c_ring", "non_ar_6c_ring", "hetero ar_6_ring", "hetero non_ar_5_ring", "hetero non_ar_6_ring", and "hetero ar_5_ring". In this study, chemical functional group is calculated by software "fc_analyzer" which can be downloaded at http://pcal.biosino.org/fc_analyzer.html. 
In conclusion, the total number of features is
As for the detail distribution of the 144 features, see Table 1.
Using graph representations to measure the similarity of two compounds was first proposed by Hattori et al. , and it has been used to tackle some problems in biological system, such as the prediction of interactiveness between small molecules and enzymes , the prediction of network of substrate-enzyme-product triads , and drug-target interaction prediction . Since each chemical structure can be represented by a two-dimensional (2D) graph where vertices denote atoms and edges denote bonds between them, the similarity of two compounds, according to their method, can be estimated based on the size of the maximum common subgraph between two corresponding graphs using a graph alignment algorithm. The similarity score between two compounds by this method can be calculated by an online website at http://www.genome.jp/ligand-bin/search_compound.
Minimum Redundancy Maximum Relevance
Feature selection can reduce feature dimensions and improve computational efficiency. In the current study, Minimum Redundancy Maximum Relevance (mRMR), first proposed by Peng et al. , is utilized as it aims to balance minimum redundancy and maximum relevance and has been widely used to tackle various biological problems [23, 25, 33-36]. The maximum relevance guarantees that features that contribute most to the classification will be selected, while the minimum redundancy guarantees that features whose prediction ability has already been covered by selected features will be excluded. mRMR tries to add each feature in order into the feature list. In each round, a feature with maximum relevance and minimum redundancy is selected. As a result, a feature list with the selection order can be obtained. Both redundancy and relevance can be computed through mutual information (MI), which is defined as follows
where and are two random variables, is the joint probabilistic distribution of and ; and are the marginal probabilities of and , respectively.
Let denote the whole feature set. Suppose is the selected feature set with features, while is the to-be-selected feature set with features. The relevance of a feature and the target variable can be computed as , and the redundancy between a featureand the selected feature set can be computed as
(, if ) (3)
For each feature in , compute the following equation
To maximize relevance and minimize redundancy, select a feature such that
Then remove from and take it into . For the rest features, each time the most relevant and least redundant feature is selected from and taken into , until all features are in . Thus, for a feature pool with features, mRMR program will execute rounds and provide an ordered feature list:
where denotes the round at which the feature is selected.
Nearest Neighbor Algorithm
In this study, Nearest Neighbor Algorithm (NNA) [37, 38] was adopted to predict the class of pathway (positive or negative). The "nearness" is defined by the Euclidian distance as below
where is dot product of two vectors and , and are the modulus of vector and , respectively. The smaller the , the nearer the two variables are .
In NNA, suppose there are training pathways, each of them is either positive or negative, and a new pathway needs to be determined to be either positive or negative. The distances between each of the training pathways and the new pathway are calculated, and the nearest neighbor of the new pathway is found. If the nearest neighbor is positive or negative, then the new pathway is assigned to be positive or negative, respectively.
The prediction model was tested by jackknife test . There are three cross-validation methods in statistical prediction: jackknife test, K-fold cross-validation test, and independent dataset test . However, jackknife test is deemed more objective and effective than other two methods [39, 41]. So it has been widely used to evaluate the accuracy of various prediction models [23, 25, 42-47]. In such a test, each sample in the dataset is singled out in turn as the testing data and the rest samples are used to train the prediction model. Thus every sample is tested exactly once.
Incremental Feature Selection (IFS)
From mRMR, an ordered feature list was obtained. Define the i-th feature set as , i.e. contains the first features of . For every , we perform NNA with the features in and an accuracy of correctly predicting the positive pathways, evaluated by jackknife cross-validation, was obtained. As a result, we can plot a curve named IFS curve, with identification accuracy as its y-axis and the index of as its x-axis.
RESULTS AND DISCUSSION
Results of mRMR
The mRMR program can be downloaded from http://research.janelia.org/peng/proj/mRMR/. Also, it was run with default parameters. There are two feature lists in the result of mRMR program: (I) MaxRel features list; (II) mRMR features list (see Online Supporting Information C).
For the MaxRel features list, the most relevant 10% of the features (totally 14) was investigated and the distribution of them was shown in Fig. 1. Among these 14 features, 12 (85.7%) features were extracted from the corresponding graph of a pathway, indicating that among the adopted features, graph features contribute most to the forming of metabolic pathways. Of the 14 features, 9 (64.29%) features come from the 9th feature group "local density change", which quantifies the similarity between some compounds where these compounds can be transformed into or out of a particular compound. It indicates that compounds linked by an arc, regardless of its direction, are often very similar. From which it is easy to deduce that compounds that can be transformed into the same compound are often very similar in structure.
Results of IFS
Shown in Fig. 2 is the IFS curve. The highest accuracy of IFS for positive pathways is 74.26% using 16 features (see Online Supporting Information C). Moreover, for the readers' interest, the accuracy of identifying the negative pathways and overall accuracy using these optimized 16 features are 99.64% and 99.39%, respectively. The detail data of IFS can be found in Online Supporting Information D.
Shown in Fig. 3 is the distribution of the optimized 16 features. It is straightforward to see that 10 (62.5%) features come from pathway graph, among which 6 (37.5%) features come from the 9th feature group "local density change" reaching the same results as that in Section "Results of mRMR". In addition, 6 (37.5%) features come from the chemical functional groups, indicating that chemical function groups also contribute towards the forming of metabolic pathways.
Analysis of the Important Features
In this study, we present a novel metabolic pathway network analysis method based on hybrid properties, the graph properties and biological properties. The most contributed individual feature is the "weight_edge_mean (with_missing_edges)" which is the mean of weighted edges in a metabolic pathway including the zero weights. If fewer zero-weighted edges or more heavily weighted edges present in the graph, the feature tends to give a greater value. Fewer zero-weighted edges mean that the graph is more densely connected, and more heavily weighted edges mean that the compounds in the metabolic pathway are linked more strongly with higher confidence. The "out_local_density" and "in_local_density" mean the downstream and upstream chemical similarity, respectively. The features such as "topological_change_0.7_0.8", "topological_change_0.6_0.7" mirror the chemical structure changes in the metabolic pathways. Several algorithms for general topological properties of metabolic networks are well characterized [48, 49]. Several methods of minimization metabolic adjustment have shown that knockout metabolic screens undergo a minimal redistribution with respect to the configuration of the wild type [50, 51]. The importance of these features shows that the correlation exists between the structural similarity and the pathway connectivity of chemical compounds. The tendency that structurally similar compounds are closely positioned on the pathway can be confirmed by the distribution of compound similarity scores along the KEGG pathways .
Some other contributed features such as "non_ar_6c_ring_mean", "ar_6c_ring_max", "methyl_max", "sulfonamide_max", "carboxylate_mean", "halogen_max" are highly related to the specific chemical structures of the metabolic pathways. The largest clusters of similar compounds were related to carbohydrates, 10 and features derived from chemical functional groups reveal the intensive carbohydrate metabolism. In fact, enzymatic reactions corresponding to the connector between two sub-pathways are lyases acting on carbons, such as a decarboxylase for reducing or raising the number of carbon atoms .
In this study, we tried to analyze 144 features extracted from each of the positive and negative pathways. Of the 144 features, 88 were graph features, as each pathway can be deem as a directed graph; and 56 were derived from chemical functional group of compounds. The "Minimum Redundancy Maximum Relevance" and "Incremental Feature Selection" techniques were employed to analyze these features. Nearest neighbor algorithm and jackknife test were used to evaluate the accuracy of our model in prediction of the positive pathways. As a result, 16 features among the adopted features were found as the important features for the classification. This contribution might be of use for stimulating in-depth studies on such an important and challenging topic and might be helpful for improving the understanding of metabolic pathways.