Classification and analysis of regulatory pathway

Published: Last Edited:

This essay has been submitted by a student. This is not an example of the work written by our professional essay writers.

Abstract: Given a regulatory pathway, a system consisting of some proteins and their relations, can we predict its biological function, i.e. classify it into correct pathway class? This is a fundamental problem in systems biology and proteomics. It is also a great challenge although much information of regulatory pathway has been collected and stored in some public database during the past decade. The main reason is that regulatory pathway is so complex that we still poorly understand it and do not know which factors or features are more important to determine its biological function. In this study, we analyzed known regulatory pathways in humans, which were collected from KEGG, by extracting different features from each pathway. All features consisted of three categories: graph property, biochemical and physicochemical property, and functional property. Each pathway was encoded into a 5570-D vector. Some feature selection methods, including minimum redundancy maximum relevance and incremental feature selection, were utilized to analyze these features. As a result, 36 features were extracted as optimal features and the overall accuracy using these features through jackknife cross-validation and nearest neighbor algorithm was 75%. Since this is the first work to try to predict the biological function of regulatory pathway, it is still preliminary. However, we believe that it can stimulate extensive investigations into this important topic.

Keywords: Regulatory pathway; Minimum redundancy maximum relevance; Graphic feature; Gene ontology; Biochemical and physicochemical property; Gene ontology enrichment score

1 Introduction

During the past decade, much information on different organisms has been accumulated both at genetic and metabolic levels and many specific databases, such as KEGG/LIGAND [1,2,3,4], ENZYME [5], BRENDA [6], EcoCyc and MetaCyc [7,8], have been developed to provide service. However, biological meaningful pathway, such as regulatory pathway and metabolic pathway, is still poorly understood. As one of the most important pathways in systems biology, regulatory pathway includes two kinds of interactions: protein¿½Cprotein interactions, such as physical binding and phosphorylation, and indirect protein¿½Cprotein interactions, such as the relations between transcription factors and downstream gene products [2].

KEGG (Kyoto Encyclopedia of Genes and Genomes) [1,2,3,4] is a collection of online databases dealing with genomes, enzymatic pathways, and biological chemicals. KEGG provides five main database [4], KEGG Atlas, KEGG Pathway, KEGG Genes, KEGG Ligand, and KEGG BRITE. KEGG BRITE (, which includes some known regulatory pathways, is an ontology database representing functional hierarchies of various biological objects. It also includes molecules, cells, organisms, diseases and drugs, as well as relationships among them [9,10]. In this database, experimental knowledge is collected and diagramed as pathways, i.e. smaller networks of specific function. Several visualization tools have been developed to view and analyze the global networks through web interfaces [11,12,13].

According to the data in KEGG BRITE, regulatory pathways are classified into six pathway classes. Developing a successful classifier to predict the biological function of each regulatory pathway, i.e. classify it into correct pathway class, is very useful to understand regulatory pathway and extend the existing system of regulatory pathway. However, it is a great challenge in both systems biology and proteomics, because the most important information of them is very hard to recover and transform into the data which can be processed by computers. The purpose of this study is not to achieve high accuracy. We want to analyze some features, which were deemed to provide possible contribution to characterize a meaningful regulatory pathway. Some feature selection methods, including minimum redundancy maximum relevance [14] and incremental feature selection, were employed to analyze the involving features, while nearest neighbor algorithm [15,16] and jackknife cross-validation [17] were used to evaluate accuracy. As a result, 36 features were selected as optimal features and the overall accuracy using these features was 75%. The analysis of optimized features suggested that biochemical and physicochemical property and functional property are important to determine the biological function of each regulatory pathway. Although it is the first work to investiagte the classification problem of regulatory pathway and is still preliminary, we believe that it can stimulate extensive investigations into this important topic.

2 Materials and Methods

2.1 Benchmark dataset

We downloaded the human KGML (KEGG XML) files from KEGG FTP site ( We reduced the original data by the following two steps: 1. Remove proteins without GO information or biochemical and physicochemical properties in each pathway; 2. Exclude pathways with less than three proteins. As a result, 146 regulatory pathways were obtained. According to the data in KEGG BRITE (, these pathways belong to six pathway classes: Metabolism, Genetic Information Processing, Environmental Information Processing, Cellular Processes, Organismal Systems, and Human Diseases. Shown in Table 1 is the distribution of regulatory pathways in this study.

2.2 Features construction

2.2.1 Graph property

Graphic approaches are deemed as useful tools to study complex networks as they can provide intuitive insights and the whole structure property, as indicated by various studies on some important biological topics [18,19,20,21,22,23,24,25]. In order to use these approaches, each regulatory pathway was deemed as a graph, where vertices represent proteins and arcs represent the relations between the corresponding proteins. In fact, it is a directed graph or digraph, because the relation between two proteins is directional, i.e. one protein P1 can regulate another protein P2 while P2 cannot always regulate P1. In this paper, we extracted 88 graph features from each directed graph which represents a regulatory pathway. Most of the graph features were derived in [25,26,27,28,29], where they request the graph is an undirected graph. Thus we extended them into directed graphs in this study. The brief introduction of these features is described as follows.

(1) Graph size and graph density. Let G = (V, E) be a pathway graph, where V denotes vertex set and E arcs set. The graph size is the number of vertices in the graph. |E|max=|V|2 is the theoretical maximum number of arcs in G with |V| vertices. The graph density is calculated by |E|/|E|max. [26]

(2) Degree statistics. The in-degree (out-degree) of a vertex is the number of its in-neighbors (out-neighbors). Mean, variance, median, and maximum of in-degree and out-degree, respectively, were taken as features in this feature group. [27]

(3) Edge weight statistics. Let G = (V, w(E)) be a weighted pathway graph where each arc is weighted by a weight w in the range of [0,1]. e is called a missing edge if w(e) = 0. Considered in this study were mean and variance of arc weights including two different cases (with and without missing edges) as features. [26]

(4) Topological change. Let G = (V, w(E)) be a weighted pathway graph. This group of features is to measure the topological changes when different cutoffs of the weights are applied to the graph. The weight cutoffs included 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7 and 0.8. Topology changes were defined as the change rate of the number of arcs in subgraphs under two consecutive cutoffs.

(5) Degree correlation. Let G = (V, E) be a pathway graph with V = {u1,u2,¿½¿½,un}. For each vertex ui, calculate the average number of arcs of its in-neighbors and out-neighbors, respectively. Considered in this study were mean, variance and maximum of the two kinds of property, respectively, as features. [28]

(6) Clustering. Let G = (V, E) be a pathway graph with V = {u1,u2,¿½¿½,un}. For each vertex ui, calculate the graph density of the subgraph induced by its in-neighbors and out-neighbors, respectively. Take mean, variance and maximum of the two kinds of property, respectively, as features. [27]

(7) Topological. Let G = (V, E) be a pathway graph with V = {u1,u2,¿½¿½,un}. Define four function as follows: in-in(ui, uj) denoted the number of both in-neighbors of ui and in-neighbors of uj; in-out(ui, uj) denoted the number of both in-neighbors of ui and out-neighbors of uj; out-in(ui, uj) denoted the number of both out-neighbors of ui and in-neighbors of uj; out-out(ui, uj) denoted the number of both out-neighbors of ui and out-neighbors of uj. For each vertex ui, calculate four values Ti1, Ti2, Ti3, and Ti4 as follows: Ti1 is the mean of in-in(ui, uj)/ni1; Ti2 is the mean of in-out(ui, uj)/ni1; Ti3 is the mean of out-in(ui, uj)/ni2; Ti4 is the mean of out-out(ui, uj)/ni1; where ni1 and ni2 are the number of in-neighbors and out-neighbors of ui, respectively. Take the mean, variance and maximum of Ti1, Ti2, Ti3, and Ti4, respectively, as features in this group. [28]

(8) Singular values. Let A be the adjacent matrix of the pathway graph. Take the first three largest singular values as features in this group. [26]

(9) Local density change. Let G = (V, E) be a pathway graph with V = {u1,u2,¿½¿½,un}. For each vertex ui, let Vi1 = {ui1,ui2,¿½¿½,uik} and Vi2 = {uj1,uj2,¿½¿½,ujl} be its in-neighbors and out-neighbors. We only introduce how to extract features from out-neighbors of each vertex under cutoffs w, which may be 0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8 and 0.9. Construct a weighted undirected complete graph Ki with vertices uj1 ,uj2 ,¿½¿½,ujl and the weights of each edge can be calculated by Eq. 2 in Section 2.2.2. Extract a spanning subgraph Gi(w) of Ki with edges whose weights are greater than w. Calculate Li(w) = 2|E(Gi(w))|/(l(l-1)) (Li(w) = 0 if l¿½¿½1). Take the mean and maximum of L1(w), L2(w),¿½¿½, Ln(w) under cutoff w as features.

2.2.2 Gene ontology

As mentioned in Section 2.2.1, some features need the arc weight to evaluate the relation between two proteins. Thus we used gene ontology consortium (GO) [30] to represent each protein and evaluate the relation between two proteins. ¿½¿½Ontology¿½¿½ is a specification of a conceptualization and refers to the subject of existence. GO is established according to the following three criteria: molecular function, biological process, and cellular component. Since these three criteria can reflect the attribute of gene, gene product, gene-product groups and core features reflecting the subcellular localization [31,32], GO consortium is considered to be a very powerful and useful tool to study protein-protein interactions [25,33]. Using the similar method in [25], each protein sample can be formulated as a 5,218-D vector:

P = [p1, p2,¿½¿½, p5218]T (1)

where pi = 1 if the sample hit the GO item; otherwise, pi = 0. The interaction between Pi and Pj, i.e. the weight of arc between the two corresponding proteins, is defined as follows:


where is dot product of Pi and Pj, || Pi || and || Pj || are their modulus.

2.2.3 Biochemical and physicochemical property

Beside the graph property, biological property of each pathway is also indispensable to characterize meaningful regulatory pathways. In this study, biochemical and physicochemical properties, which have been widely used in the tackling many biological problems [34,35,36], were employed to represent the biological property of each pathway, and they included hydrophobicity, normalized van der Waals volume, polarity, polarizability, secondary structure, solvent accessibility, and amino acid compositions. Suppose there were n proteins in a regulatory pathway, the mean and maximum values of biological properties of the n proteins were taken as the features.

(1) Hydrophobicity, normalized van der Waals volume, polarity and polarizability: 42 features can be extracted from each of these properties [37,38], respectively. Here we only introduce how to obtain features from the hydrophobicity property, while features from other properties can be achieved in a similar way. Each amino acid is substituted by one of the three letters, polar (P), neutral (N) and hydrophobic (H). Given a protein sequence, use P, N or H to substitute each amino acid in the sequence, and the resulting sequence is called a protein pseudo-sequence. Composition (C) is the percentage of P, N and H in the whole pseudo-sequence. Transition (T) is the changing frequency between any two characters. Distribution (D) is the sequence segment (in percentage) of the pseudo-sequence which is needed to contain the first, 25%, 50%, 75% and the last of the Ps, Ns and Hs, respectively. In conclusion, there are three, three, and fifteen properties for (C), (T) and (D), respectively. (3+3+15)¿½¿½2 = 42 features were extracted for this property and totally 42 ¿½¿½4 = 168 features were obtained in this group.

(2) Secondary structure: each protein sequence is coded with three letters like hydrophobicity property. For details, please see [39,40]. 21¿½¿½2 = 42 features can be derived from this property.

(3) Solvent accessibility: ACCpro [41] can predict each amino acid as hidden (H) or exposed (E) to solvent. Then the protein sequence is coded with letters H and E. Use composition (C) for H, transition (T) between H and E, and five distributions (D) for H in this property. Thus there were (1+1+5)¿½¿½2 = 14 features about this peoperty.

(4) Amino acid compositions: the percentage of each amino acid in the sequence. Totally, there were 20¿½¿½2 = 40 features about amino acid composition.

Shown in Table 2 are the numbers of each property in the above feature groups. It is clear that there were 132¿½¿½2 = 264 features about biological property. Before taking the mean and maximum values of each property in these groups, the following equations were used to adjust values according to a standard scale [35]:


where Tj is the standard deviation of the j-th feature and uj the mean value of the j-th feature.

2.2.4 Functional property

The last category of features was about the functional property of each regulatory pathway. The gene ontology enrichment score of pathway i on gene ontology item j was defined as the ¿½Clog10 of the hypergeometric test p value [35,36,42,43] of proteins in pathway i and can be computed by the following equation:


where N is the number of overall proteins in KEGG of human, M is the number of proteins annotated to gene ontology item j, is the number of proteins in pathway i, is the number of proteins in pathway i that are annotated to gene ontology item j. The larger the enrichment score of one gene ontology item, the more overrepresented this item is. There were 5218 gene ontology (GO) enrichment score features.

2.2.5 Representation of each pathway

As described in Section 2.2.1, 2.2.3 and 2.2.4, the total number of features was Q = 88+264+5218 = 5570. For the detail distribution of 5570 features, please see Table 3.

Each feature in Section 2.2.1, 2.2.3 and 2.2.4 can be seen as a dimension in a space. Thus, each of the 146 pathways can be represented by a vector in the 5570-D (dimensional) space, see Online Supporting Information S1 for the codes of 146 pathways.

2.3 mRMR method

Minimum Redundancy Maximum Relevance (mRMR), first proposed by Peng et al. [14], was employed in this study, as it is established according to two excellent criteria: Max-Relevance and Min-Redundancy. Max-Relevance guarantees that features giving most contribution to the classification will be selected, while Min-Redundancy guarantees that features whose classification ability has already been covered by selected features will be excluded. By mRMR program, we can obtain two feature lists: MaxRel features list and mRMR features list. MaxRel features list sort features only according to the Max-Relevance criteria, while mRMR features list is obtained in terms of both Max-Relevance and Min-Redundancy. Thus, for a feature set ¿½¿½ with N features, mRMR program will execute N rounds and a feature with maximum relevance and minimum redundancy is selected in each round. Finally, we can obtain an ordered feature list, i.e., mRMR features list:


For detail description of the mRMR method, please refer to Peng et al.¿½¿½s paper [14]. Now, mRMR method has been widely utilized to tackle various biological problems [25,44,45,46,47,48,49] and deemed as a powerful and useful tool to extract important information in complex systems.

2.4 Nearest neighbor algorithm

In this study, Nearest Neighbor Algorithm (NNA) [15,16], which has been widely used in bioinformatics and computational biology [25,44,50,51,52,53], was adopted to predict the pathway class of each required pathway. The ¿½¿½nearness¿½¿½ is calculated as below


where and are two vectors, is their dot product, and are the modulus of vector and . The smaller the , the nearer the two variables are [54].

In NNA, suppose there are m training pathways, each of them belongs to exact one pathway class, and a required pathway needs to be classified into one pathway class. The distances between each of the m training pathways and the required pathway can be calculated, and the nearest neighbor of the required pathway is found. If the nearest neighbor belongs to the i-th pathway class, the required pathway is classified into the i-th pathway class.

2.5 Jackknife cross-validation

In this study, Jackknife test was utilized to examine the prediction model. In statistical prediction, three cross-validation methods are often utilized to evaluate a predictor for its accuracy: independent test, K-fold cross-validation, and jackknife test [17]. However, among the three methods, the jackknife test is regarded as the most objective as it can always give a unique result with a given benchmark dataset. For detail explanation, please refer to the related text in [54,55,56]. Hence it is widely used by investigators to evaluate the accuracy of several predictors [25,31,35,36,43,46,51,57,58,59,60,61,62,63]. Therefore, here we also utilized jackknife test to evaluate the accuracy of the classifier. In such a test, each sample in the dataset is singled out in turn as the testing sample and the rest samples are used to train the prediction model and predict the testing sample. Thus each sample is tested exactly once.

2.6 Incremental feature selection (IFS)

As described in Section 2.3, mRMR features list F = [f0, f1,¿½¿½,fN-1] can be obtained by mRMR program. Denote the i-th feature set by Fi = { f0, f1,¿½¿½,fi} (0¿½¿½i¿½¿½N-1). For each i (0¿½¿½i¿½¿½N-1), execute NNA with the features in Fi, then the overall accuracy of the classification (ACC), defined by ¿½¿½the number of correctly predicted pathways¿½¿½/¿½¿½the total number of pathways¿½¿½, evaluated by jackknife test, was obtained. As a result, we can plot a curve named IFS curve with ACC as its y-axis and the index i of Fi as its x-axis.

3 Results and Discussion

3.1 Results of mRMR

The mRMR program was achieved from It was run with default parameters and two feature lists were obtained by executing mRMR program: (i) MaxRel features list; (ii) mRMR features list (see Online Supporting Information S2).

MaxRel features list was obtained by sorting features according to their contribution to the classification. We investigated the most relevant 1% of the features (totally 55) and Table 4 shows the distribution of these features. It is clear that 32 (32/55, 58.18%) features come from biochemical and physicochemical property and 23 (23/55, 41.82%) features come from functional property. All of these indicate that among the adopted features the biochemical and physicochemical property of each pathway provide the most contribution to classification and functional property also gives important contribution. It is startling that none of the features about graph property was the most relevant 1% feature, while they were considered as important factors to form some biological meaningful systems, such as protein complex [25,29]. In this study, we only take care of classifying a regulatory pathway into correct pathway class but not to analyze which feature is more important to form a regulatory pathway. In this stage, graph property maybe is not very important while biological and functional properties are more important to determine the biological function of each pathway.

3.2 Results of IFS

Shown in Figure 1 is the IFS curve. The highest ACC value of IFS is 75% using 36 features (See Table 5 for the detail 36 features). The detailed IFS data can be found in Online Supporting Information S3.

Figure 2 shows the distribution of the optimized 36 features. It is straightforward to see that 23 (23/36, 63.89%) features were from the biochemical and physicochemical property and 13 (13/36, 36.11%) features were from the functional property, while none of features in graph property was selected into the optimized feature set. All of these indicate the same conclusion as described in Section 3.1.

3.3 Analysis of optimal features for pathway classification

It was seen from Table 5 and Figure 2 that the biochemical and physicochemical properties and Gene Ontology functional properties were important for pathway classification.

Within the selected 23 biochemical and physicochemical properties, there were 6 secondary structure features, 6 amino acid composition features, 3 solvent accessibility features, 3 polarity features, 2 VanDerWaal features, 2 polarizability features and 1 hydrophobicity feature. Obviously, secondary structure features and amino acid composition features were more important than other biochemical and physicochemical properties. The correct secondary structure of protein is essential to its function. Structural incorrect proteins are associated with many different kinds of disease such as Alzheimer's disease, Huntington's and Parkinson's disease [64]. In KEGG pathway classification, there are 28 disease pathways. Some of the disease pathways, such as neurodegenerative disease pathways and cancer pathways, are caused by or associated with protein misfolding [64]. Amino acid composition has been used to explain a lot of biological phenomenon, such as translation rate [36] and metabolic stability of proteins [35]. Amino acid composition has a close relationship with protein synthesis and degradation [35,36]. In KEGG pathway classification, there are 73 metabolism pathways. The amino acid composition features may affect these metabolism pathways.

Within the selected 13 Gene Ontology functional properties, there were 4 gene ontology features (GO:0048519 negative regulation of biological process, GO:0002687 positive regulation of leukocyte migration, GO:0090068 positive regulation of cell cycle process and GO:0031330 negative regulation of cellular catabolic process) strongly associated with KEGG ¿½¿½Cellular Processes¿½¿½ pathways and 3 gene ontology features (GO:0043627 response to estrogen stimulus, GO:0043330 response to exogenous dsRNA and GO:0042088 T-helper 1 type immune response) strongly associated with KEGG ¿½¿½Environmental Information Processing¿½¿½ pathways.

Combining the 23 biochemical and physicochemical properties and 13 Gene Ontology functional properties together, most KEGG pathways can correctly classified with reasonable biological meanings. The prediction model can be used to classify new pathway into existing pathway function groups. This means predicting the function of new pathways which is one of the ultimate goals of biology research.

4 Conclusion

In brief, we analyzed 5570 features extracted from each of known regulatory pathway in humans and extract important features which were deemed to be more important to determine the biological function of each regulatory pathway. Of the 5570 features, 88 were graph property, 264 were derived from the biochemical and physicochemical property of proteins, and 5218 were about functional property. The mRMR method and IFS techniques were employed to analyze these features. Nearest neighbor algorithm and jackknife test were utilized to evaluate the accuracy of the classifier. As a result, 36 features were found to be the important features for the classification. These findings might be of use for stimulating in-deep studies on such an important and challenging problem.