Composition For Prediction Of Drug Target Interaction Biology Essay

Drug-target interaction networks are central to drug discovery research [1]. The function of many classes of target proteins, including enzymes, ion channels, G protein-coupled receptors (GPCRs), and nuclear receptors, can be modulated by interactions with ligands [2]. Fortunately, with the emergence of molecular medicine and the completion of the Human Genome Project, discovering previously unknown target proteins of drugs has become possible. Many researchers have worked on discovering new drugs in the past few years. However, because the toxicity of many drug candidates is unacceptable, the efficiency of drug discovery remains very low. It is therefore necessary to develop computational methods that assist drug discovery. Since drug effects arise from interactions with target proteins, identifying the interactions between drugs and target proteins is very helpful for drug discovery.

Since it is still challenging to determine compound-protein or drug-target interactions by experiment alone [3, 4], effective prediction models are needed. Many computational methods have been developed to predict drug-protein interactions, of which docking simulation [5, 6] and literature text mining [7] are the most commonly used. Recently, Yamanishi et al. developed a prediction model that combines chemical structure, genomic sequence, and 3D structure information [2], and He et al. employed feature selection methods to analyze drug-target interactions [8], encoding drugs by functional groups and target proteins by biochemical and physicochemical properties.

Encouraged by the success of machine learning and data mining methods in various biological areas, such as protein structural class prediction [9-13] and protein subcellular location prediction [14-18], we developed a prediction model to identify interactions between drugs and target proteins based on the Nearest Neighbor Algorithm [19, 20] and a novel metric that combines compound similarity [21] and functional domain composition [17, 22]. In this paper, the target proteins of drugs are divided into four groups: enzymes [2], ion channels [23-26], G protein-coupled receptors (GPCRs) [27, 28], and nuclear receptors [2]. Accordingly, four independent prediction models, each with its own optimal parameter, were developed. We hope that our contribution will be useful for drug discovery.

Materials and Methods

Benchmark Datasets

The information about drug-target interactions was obtained from the KEGG BRITE [29], BRENDA [30], SuperTarget [31], and DrugBank [32] databases. These drug-target interactions were also used in two previous studies [2, 8]. We removed interactions that satisfied either of the following conditions: (1) the drug has no information from which its similarity to other drugs can be calculated; (2) the functional domain composition of the target protein is not available. This left a total of 4,729 drug-target pairs: 2,686 for enzymes, 1,359 for ion channels, 598 for GPCRs, and 86 for nuclear receptors. These pairs form the four groups of the positive dataset in the current study. In total, 763 drug compounds and 936 target proteins are involved in this study.

To train the prediction model, we constructed four corresponding groups of negative datasets by randomly picking one drug from the 763 drug compounds and one target from the 936 target proteins, requiring that the resulting pair does not occur in the positive dataset. To reflect the real-world situation in which positive pairs are far fewer than negative ones, 50 times as many negative pairs as positive pairs were generated in each group. The numbers of positive and negative pairs in the final benchmark dataset for each group are shown in Table 1.
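The sampling step described above can be sketched as follows. This is a minimal illustration, not the authors' procedure: the function name, the seeded random generator, and the choice to keep negatives unique are assumptions, and the text does not say whether targets were restricted to the group in question.

```python
import random


def sample_negative_pairs(positive_pairs, drugs, targets, ratio=50, seed=0):
    """Randomly draw (drug, target) pairs that are not in the positive set.

    Assumptions (not stated in the text): negatives are unique, and the
    random generator is seeded for reproducibility.
    """
    rng = random.Random(seed)
    positives = set(positive_pairs)
    drugs = sorted(set(drugs))
    targets = sorted(set(targets))
    wanted = ratio * len(positives)

    # Count candidate pairs that are not known positives.
    drug_set, target_set = set(drugs), set(targets)
    blocked = sum(1 for d, t in positives if d in drug_set and t in target_set)
    available = len(drugs) * len(targets) - blocked
    if wanted > available:
        raise ValueError("not enough non-positive pairs to sample from")

    # Rejection sampling: keep drawing until enough distinct negatives exist.
    negatives = set()
    while len(negatives) < wanted:
        pair = (rng.choice(drugs), rng.choice(targets))
        if pair not in positives:
            negatives.add(pair)
    return sorted(negatives)
```

With 763 drugs and 936 targets there are 714,168 candidate pairs, so even the enzyme group (134,300 negatives) leaves ample room for rejection sampling.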

Table 1: Distribution of the benchmark dataset

| Group | Positive pairs | Negative pairs |
|---|---|---|
| enzymes | 2,686 | 134,300 |
| ion channels | 1,359 | 67,950 |
| G protein-coupled receptors | 598 | 29,900 |
| nuclear receptors | 86 | 4,300 |

Detailed information on the benchmark datasets for enzymes, ion channels, GPCRs, and nuclear receptors can be found in Online Supporting Information A1, A2, A3, and A4, respectively.

Encoding Methods

An important step toward successful prediction is to encode and compare the two components, drug compounds and target proteins, effectively. For drug compounds, established representations such as SMILES [33, 34] and MACCS keys [35, 36] can be used to estimate the similarity of two given compounds. However, these representations cannot reflect the two-dimensional structure of a compound very well. Hattori et al. [21] used a graph representation to measure the similarity of two compounds, which is deemed more effective and more accurate in capturing important aspects of compound similarity. For target proteins, established encoding schemes such as functional domain composition [17, 22] and gene ontology [37] can be used to encode a protein as a vector, and the similarity of two proteins can then be measured from the corresponding vectors. In this study, graph representation and functional domain composition are used to estimate the similarity of two drug compounds and of two target proteins, respectively. The detailed definitions are as follows.

The similarity of drug compounds obtained from graph representations. Hattori et al. [21] first used graph representations to measure the similarity of two compounds. Since a chemical structure is a two-dimensional (2D) object, it can be represented by an undirected graph in which vertices correspond to atoms and edges correspond to the bonds between them. In their method, the similarity of two compounds is estimated from the size of the maximum common subgraph of the two corresponding graphs, which is found with a graph alignment algorithm. They also provided a program, SIMCOMP [21] (http://www.genome.jp/ligand-bin/search_compound), to compute the chemical structure similarity of compounds. For drug compounds $c_1$ and $c_2$, we denote their similarity obtained from graph representations by $S_c(c_1, c_2)$.

The similarity of target proteins obtained from functional domain compositions. Functional domain composition is a very useful encoding scheme that represents each protein by a vector, and it has been widely applied to many biological problems involving proteins [11, 17, 18, 38-42]. The concept was originally proposed by Chou and Cai to predict protein subcellular location [17]. It was first defined on the SBASE-A database [43], which contains 2,005 functional domain entries. A more complete database is now available, the InterPro database (release 23.1, December 2009) [22], which includes 21,144 functional domain entries. Following a procedure similar to that in [17], a protein can be represented by the following 21,144-D vector

$$P = \left(p_1, p_2, \ldots, p_i, \ldots, p_{21144}\right)^{\mathrm{T}} \tag{1}$$

where $p_i = 1$ if and only if the protein has a hit on the $i$-th functional domain entry in the InterPro database, and $p_i = 0$ otherwise. Following previous studies [11, 17, 41, 42], the similarity between two proteins $P_1$ and $P_2$ is defined by

$$S_p(P_1, P_2) = \frac{P_1 \cdot P_2}{\|P_1\| \, \|P_2\|} \tag{2}$$

where $P_1 \cdot P_2$ is the dot product of the vectors $P_1$ and $P_2$, and $\|P_1\|$ and $\|P_2\|$ are the moduli of $P_1$ and $P_2$, respectively.
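The encoding and the similarity described above (the dot product of two binary domain vectors divided by the product of their moduli) can be sketched in a few lines. The domain identifiers below are hypothetical placeholders rather than real InterPro entries, and returning 0.0 for a protein with no domain hit is a convention chosen here; the study excluded such proteins.

```python
import math


def domain_vector(hit_domains, all_domains):
    """Binary vector: 1 if the protein hits the i-th domain entry, else 0."""
    hits = set(hit_domains)
    return [1 if dom in hits else 0 for dom in all_domains]


def protein_similarity(vec_a, vec_b):
    """Cosine similarity: dot product over the product of the vector moduli."""
    dot = sum(a * b for a, b in zip(vec_a, vec_b))
    norm_a = math.sqrt(sum(a * a for a in vec_a))
    norm_b = math.sqrt(sum(b * b for b in vec_b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for proteins with no domain hit
    return dot / (norm_a * norm_b)


# Hypothetical four-entry domain database instead of the 21,144 InterPro entries.
all_domains = ["DOM1", "DOM2", "DOM3", "DOM4"]
p1 = domain_vector({"DOM1", "DOM2"}, all_domains)
p2 = domain_vector({"DOM2"}, all_domains)
similarity = protein_similarity(p1, p2)  # one shared domain out of 2 and 1
```

For binary vectors the dot product is simply the number of shared domains, so two proteins with identical domain sets have similarity 1 and proteins with no shared domain have similarity 0.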

Thus, the similarity of two drug-target pairs can be estimated from the compound similarity $S_c$ and the protein similarity $S_p$. However, to apply the Nearest Neighbor Algorithm, we have to define the distance between two drug-target pairs rather than their similarity. The detailed definition is given below.

The distance between two drug-target pairs. Let $DT_1 = (c_1, P_1)$ and $DT_2 = (c_2, P_2)$ be two drug-target pairs, where $c_1$ and $P_1$ denote the drug compound and target protein in the first pair, and $c_2$ and $P_2$ those in the second pair. Each pair has two members, and we do not know which one plays the more important role in determining a real drug-target interaction. We therefore define the following metric with a weight parameter $w$ to measure the distance between two pairs

(3)

where the weight factor $w$ can take any value in the interval from 0 to 1.
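A minimal sketch of such a weighted metric is given below. It assumes a convex combination of the two dissimilarities; the exact functional form, and the choice of which term carries the weight, are assumptions made for illustration and are not taken from Eq. 3.

```python
def pair_distance(compound_similarity, protein_similarity, w):
    """Hypothetical weighted distance between two drug-target pairs.

    ASSUMPTION: distance = 1 - [w * S_c + (1 - w) * S_p], where S_c and S_p
    are similarities in [0, 1]. The source's own formula may differ, in
    particular in which similarity is multiplied by w.
    """
    if not 0.0 <= w <= 1.0:
        raise ValueError("w must lie in the interval [0, 1]")
    return 1.0 - (w * compound_similarity + (1.0 - w) * protein_similarity)
```

Under this assumed form, two pairs with identical drugs and identical targets are at distance 0, and the distance grows toward 1 as both similarities fall toward 0.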

Nearest Neighbor Algorithm

In this research, the Nearest Neighbor Algorithm (NNA) [19, 20], which has been widely used for many biological problems [44-46], was applied to predict the interaction of any query drug-target pair. The distance between the query pair and every training pair is calculated according to Eq. 3, and the nearest neighbor is identified. If the nearest neighbor is a positive sample, the query is predicted to be a positive drug-target pair; otherwise, it is predicted to be a negative one.
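The decision rule above can be sketched as follows. The function name and the generic `distance` callable are illustrative; any pairwise distance can be plugged in, and ties are broken here by taking the first nearest training sample, which is a choice made for this sketch.

```python
def predict_nearest_neighbor(query, training, distance):
    """Return the label of the training sample closest to the query.

    `training` is a list of (sample, label) tuples and `distance` is any
    callable taking two samples. Ties go to the earliest training sample.
    """
    if not training:
        raise ValueError("training set is empty")
    _, nearest_label = min(training, key=lambda item: distance(query, item[0]))
    return nearest_label


# Toy illustration on 1-D points with absolute difference as the distance.
toy_training = [(0.0, "positive"), (0.2, "positive"), (1.0, "negative")]
toy_prediction = predict_nearest_neighbor(
    0.3, toy_training, lambda a, b: abs(a - b)
)
```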

Jackknife Cross-validation Test

In this study, the jackknife cross-validation test [47] was employed to evaluate the prediction model, because it is deemed more objective and effective than the two other cross-validation methods, the independent dataset test and K-fold cross-validation [48, 49]. In such a test, every sample in the dataset is singled out in turn as the testing sample, and the remaining samples are used as training samples. Thus every sample is predicted exactly once.
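The jackknife (leave-one-out) procedure combined with a nearest-neighbor rule can be sketched as below. The names are illustrative, and the toy data use 1-D points with absolute difference as a stand-in for the pair distance.

```python
def jackknife_predictions(samples, labels, distance):
    """Leave-one-out test: predict each sample from all the other samples."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    predictions = []
    for i, query in enumerate(samples):
        # The query itself is excluded from its own training set.
        others = (j for j in range(len(samples)) if j != i)
        nearest = min(others, key=lambda j: distance(query, samples[j]))
        predictions.append(labels[nearest])
    return predictions


toy_samples = [0.0, 0.1, 1.0, 1.1]
toy_labels = [1, 1, 0, 0]
toy_predictions = jackknife_predictions(
    toy_samples, toy_labels, lambda a, b: abs(a - b)
)
```

Each query requires a pass over all other samples, so the cost grows quadratically with the dataset size; for datasets of the size in Table 1 a precomputed similarity matrix would likely be needed in practice.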

Accuracy Measure

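True positives (TP), true negatives (TN), false positives (FP), and false negatives (FN) [50-52] are commonly used to evaluate prediction accuracy. Based on these quantities, the overall accuracy of prediction is defined by

$$\mathrm{ACC} = \frac{TP + TN}{TP + TN + FP + FN} \tag{4}$$

The sensitivity (SN) and specificity (SP) are defined as

$$\mathrm{SN} = \frac{TP}{TP + FN} \tag{5}$$

$$\mathrm{SP} = \frac{TN}{TN + FP} \tag{6}$$

To evaluate the overall performance of each prediction model, the Matthews correlation coefficient (MCC) [53] was employed, which is defined by

$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \tag{7}$$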
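The four measures follow directly from the confusion-matrix counts, as the sketch below shows. The example counts are not reported in the source: they are approximate values back-calculated from the rounded percentages for the enzyme group at w=0.4 (2,686 positive and 134,300 negative pairs), so they serve only as a plausibility check.

```python
import math


def evaluate(tp, tn, fp, fn):
    """Compute ACC, SN, SP, and MCC from confusion-matrix counts."""
    acc = (tp + tn) / (tp + tn + fp + fn)
    sn = tp / (tp + fn)
    sp = tn / (tn + fp)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is undefined when a row or column of the matrix is empty;
    # returning 0.0 in that case is a convention chosen here.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"ACC": acc, "SN": sn, "SP": sp, "MCC": mcc}


# ASSUMED counts, back-calculated from rounded percentages (not source data).
enzyme_scores = evaluate(tp=2405, tn=121192, fp=13108, fn=281)
```

With these counts the accuracy is about 90% while the MCC is only about 0.35, which illustrates why MCC is the more informative measure when negatives outnumber positives 50 to 1: even a small false positive rate produces several times more false positives than true positives.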

Results and Discussion

Prediction Results

The prediction accuracies obtained with w=0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, and 0.9 for enzymes, ion channels, GPCRs, and nuclear receptors are given in Table 2, Table 3, Table 4, and Table 5, respectively. The detailed prediction results are provided in Online Supporting Information A5.

Table 2: Prediction accuracies for enzyme group

| w | Prediction accuracy for positive pairs (SN) (%) | Prediction accuracy for negative pairs (SP) (%) | Overall prediction accuracy (ACC) (%) | Matthews correlation coefficient (MCC) (%) |
|---|---|---|---|---|
| 0.1 | 88.94 | 88.47 | 88.48 | 31.86 |
| 0.2 | 89.43 | 89.04 | 89.05 | 32.89 |
| 0.3 | 89.54 | 89.54 | 89.54 | 33.72 |
| 0.4 | 89.54 | 90.24 | 90.23 | 34.91 |
| 0.5 | 90.80 | 84.45 | 84.58 | 27.76 |
| 0.6 | 90.13 | 81.87 | 82.03 | 25.18 |
| 0.7 | 90.25 | 77.96 | 78.20 | 22.35 |
| 0.8 | 90.73 | 70.54 | 70.94 | 18.42 |
| 0.9 | 89.99 | 63.86 | 64.37 | 15.45 |

Table 3: Prediction accuracies for ion channel group

| w | Prediction accuracy for positive pairs (SN) (%) | Prediction accuracy for negative pairs (SP) (%) | Overall prediction accuracy (ACC) (%) | Matthews correlation coefficient (MCC) (%) |
|---|---|---|---|---|
| 0.1 | 88.45 | 91.96 | 91.89 | 37.81 |
| 0.2 | 89.26 | 92.30 | 92.24 | 38.94 |
| 0.3 | 89.26 | 92.27 | 92.60 | 39.81 |
| 0.4 | 89.11 | 92.99 | 92.92 | 40.57 |
| 0.5 | 89.11 | 91.17 | 91.13 | 36.45 |
| 0.6 | 89.77 | 94.84 | 94.74 | 46.54 |
| 0.7 | 90.29 | 93.44 | 93.38 | 42.30 |
| 0.8 | 91.98 | 90.41 | 90.44 | 36.22 |
| 0.9 | 94.04 | 85.55 | 85.72 | 30.09 |

Table 4: Prediction accuracies for GPCR group

| w | Prediction accuracy for positive pairs (SN) (%) | Prediction accuracy for negative pairs (SP) (%) | Overall prediction accuracy (ACC) (%) | Matthews correlation coefficient (MCC) (%) |
|---|---|---|---|---|
| 0.1 | 79.26 | 86.91 | 86.76 | 26.15 |
| 0.2 | 80.60 | 87.32 | 87.19 | 27.13 |
| 0.3 | 80.77 | 87.58 | 87.45 | 27.51 |
| 0.4 | 80.77 | 87.96 | 87.82 | 27.98 |
| 0.5 | 82.27 | 88.12 | 88.01 | 28.78 |
| 0.6 | 82.78 | 98.10 | 97.80 | 61.11 |
| 0.7 | 86.45 | 95.74 | 95.55 | 48.46 |
| 0.8 | 95.48 | 91.50 | 91.58 | 39.85 |
| 0.9 | 93.98 | 90.94 | 91.00 | 38.05 |

Table 5: Prediction accuracies for nuclear receptor group

| w | Prediction accuracy for positive pairs (SN) (%) | Prediction accuracy for negative pairs (SP) (%) | Overall prediction accuracy (ACC) (%) | Matthews correlation coefficient (MCC) (%) |
|---|---|---|---|---|
| 0.1 | 55.81 | 94.35 | 93.59 | 27.94 |
| 0.2 | 66.28 | 94.51 | 93.96 | 33.76 |
| 0.3 | 66.28 | 94.67 | 94.12 | 34.23 |
| 0.4 | 72.09 | 94.72 | 94.28 | 37.34 |
| 0.5 | 96.51 | 92.40 | 92.48 | 42.35 |
| 0.6 | 97.67 | 97.51 | 97.51 | 64.67 |
| 0.7 | 97.67 | 97.44 | 97.45 | 64.14 |
| 0.8 | 96.51 | 97.33 | 97.31 | 62.66 |
| 0.9 | 97.67 | 97.02 | 97.04 | 61.22 |

Table 2 shows that the best prediction results for enzymes were obtained at w=0.4, where the Matthews correlation coefficient reaches its maximum value of 34.91%. Under this model, SN=89.54%, SP=90.24%, and the overall success rate is ACC=90.23%.

By the same criterion, Table 3, Table 4, and Table 5 show that the best prediction results for ion channels, GPCRs, and nuclear receptors all occur at w=0.6. For ion channels, the overall success rate is ACC=94.74% with SN=89.77% and SP=94.84%; for GPCRs, ACC=97.80% with SN=82.78% and SP=98.10%; for nuclear receptors, ACC=97.51% with SN=97.67% and SP=97.51%.
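The selection rule used here, choosing for each group the weight with the highest MCC, can be written down directly. The MCC values below are copied from Tables 2 to 5; the variable and function names are illustrative.

```python
WEIGHTS = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]

# MCC (%) for each weight, copied from Tables 2-5.
MCC_BY_GROUP = {
    "enzymes": [31.86, 32.89, 33.72, 34.91, 27.76, 25.18, 22.35, 18.42, 15.45],
    "ion channels": [37.81, 38.94, 39.81, 40.57, 36.45, 46.54, 42.30, 36.22, 30.09],
    "GPCRs": [26.15, 27.13, 27.51, 27.98, 28.78, 61.11, 48.46, 39.85, 38.05],
    "nuclear receptors": [27.94, 33.76, 34.23, 37.34, 42.35, 64.67, 64.14, 62.66, 61.22],
}


def best_weight(group):
    """Return (w, MCC) for the weight that maximizes MCC in the given group."""
    values = MCC_BY_GROUP[group]
    index = max(range(len(values)), key=values.__getitem__)
    return WEIGHTS[index], values[index]


best = {group: best_weight(group) for group in MCC_BY_GROUP}
```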

Discussion

Our results show that representing drug compounds by graph representations and target proteins by functional domain compositions is very effective for identifying drug-target interaction networks. In another study [8], the overall success rates for enzymes, ion channels, GPCRs, and nuclear receptors were 85.48%, 80.78%, 78.49%, and 85.66%, respectively. Our best results for the four groups are 4.75, 13.96, 19.31, and 11.85 percentage points higher, respectively. Moreover, the ratio of positive to negative pairs was 1:2 in He et al.'s paper [8], whereas it is 1:50 in this paper; our results were therefore obtained on a considerably more imbalanced dataset, which indicates that they are rather trustworthy.

As indicated in Table 1, the numbers of positive pairs in the enzyme, ion channel, GPCR, and nuclear receptor groups are 2,686, 1,359, 598, and 86, respectively. For each of these positive pairs, we used the distance of Eq. 3 (with w=0.4 for the enzyme group and w=0.6 for the ion channel, GPCR, and nuclear receptor groups) to calculate the distance from that pair to its nearest positive pair and to its nearest negative pair, denoted $D_{pos}$ and $D_{neg}$, respectively. The distribution of $D_{pos}$ and $D_{neg}$ for each group is given in Table 6. For the enzyme group, 2,233 pairs (83.13%) have $D_{pos}$ below 0.15, while only 888 (33.06%) have $D_{neg}$ below 0.15; the interval containing the largest number of $D_{neg}$ values (1,078, 40.13%) is 0.35 to 0.40. For the ion channel group, 1,091 pairs (80.28%) have $D_{pos}$ below 0.15, while only 431 (31.71%) have $D_{neg}$ below 0.15; the interval containing the largest number of $D_{neg}$ values (534, 39.29%) is 0.35 to 0.40. For the GPCR group, 448 pairs (74.92%) have $D_{pos}$ below 0.15, while only 58 (9.70%) have $D_{neg}$ below 0.15; the interval containing the largest number of $D_{neg}$ values (345, 57.69%) is 0.25 to 0.30. For the nuclear receptor group, 58 pairs (67.44%) have $D_{pos}$ below 0.15, while only 6 (6.98%) have $D_{neg}$ below 0.15; the interval containing the largest number of $D_{neg}$ values (71, 82.56%) is 0.35 to 0.40. In all four groups, therefore, the distance defined by Eq. 3 with the corresponding weight (w=0.4 for enzymes, w=0.6 for the other groups) allows the NNA to separate positive pairs from negative pairs well when identifying drug-target interaction networks.

These statistics imply that, with the distance of Eq. 3 and an optimal parameter w, the NNA predictor can separate positive drug-target pairs from negative ones well, which explains the high success rates reported for each group in the section "Prediction Results". Moreover, since the distance of Eq. 3 is defined from the similarities of two drug compounds and two target proteins, the smaller the distance between two pairs, the more similar the two pairs are. It is interesting that, under our definition of distance, similar pairs tend to exhibit the same interaction status, i.e., they are both positive or both negative.
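The analysis above needs two steps: finding, for a given positive pair, the distance to its nearest positive and nearest negative neighbor, and then counting those distances in intervals of width 0.05. A minimal sketch is given below; the function names are illustrative, the toy data are 1-D points rather than drug-target pairs, and the handling of values that fall exactly on an interval boundary is a choice made here (the text does not specify it).

```python
def nearest_distances(index, samples, labels, distance):
    """Distances from samples[index] to its nearest positive and nearest
    negative sample, excluding the sample itself. Labels: 1 = positive."""
    d_pos = min(
        distance(samples[index], samples[j])
        for j in range(len(samples))
        if j != index and labels[j] == 1
    )
    d_neg = min(
        distance(samples[index], samples[j])
        for j in range(len(samples))
        if j != index and labels[j] == 0
    )
    return d_pos, d_neg


def interval_counts(values, width=0.05, upper=1.0):
    """Count values per interval of the given width between 0 and upper.
    A value on a boundary goes to the higher interval, subject to
    floating-point rounding; values at or above `upper` go to the last one."""
    n_bins = round(upper / width)
    counts = [0] * n_bins
    for value in values:
        counts[min(int(value / width), n_bins - 1)] += 1
    return counts


toy_samples = [0.0, 0.12, 0.5]
toy_labels = [1, 1, 0]
toy_d_pos, toy_d_neg = nearest_distances(
    0, toy_samples, toy_labels, lambda a, b: abs(a - b)
)
```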

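Table 6: Distribution of the distances from each positive pair to its nearest positive pair and its nearest negative pair for each group (frequencies)

| Interval | Enzyme group: nearest positive | Enzyme group: nearest negative | Ion channel group: nearest positive | Ion channel group: nearest negative | GPCR group: nearest positive | GPCR group: nearest negative | Nuclear receptor group: nearest positive | Nuclear receptor group: nearest negative |
|---|---|---|---|---|---|---|---|---|
| 0.00-0.05 | 1,818 | 340 | 614 | 143 | 63 | 6 | 21 | 1 |
| 0.05-0.10 | 228 | 322 | 364 | 144 | 38 | 9 | 26 | 3 |
| 0.10-0.15 | 187 | 226 | 113 | 144 | 347 | 43 | 11 | 2 |
| 0.15-0.20 | 69 | 287 | 101 | 236 | 35 | 69 | 0 | 6 |
| 0.20-0.25 | 10 | 160 | 3 | 37 | 5 | 26 | 0 | 2 |
| 0.25-0.30 | 26 | 165 | 49 | 71 | 27 | 345 | 0 | 0 |
| 0.30-0.35 | 13 | 108 | 1 | 50 | 2 | 63 | 3 | 1 |
| 0.35-0.40 | 219 | 1,078 | 112 | 534 | 74 | 37 | 25 | 71 |
| 0.40-0.45 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 0.45-0.50 | 3 | 0 | 1 | 0 | 1 | 0 | 0 | 0 |
| 0.50-0.55 | 3 | 0 | 0 | 0 | 2 | 0 | 0 | 0 |
| 0.55-0.60 | 100 | 0 | 0 | 0 | 2 | 0 | 0 | 0 |
| 0.60-0.65 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 0.65-0.70 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 0.70-0.75 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 0.75-0.80 | 1 | 0 | 0 | 0 | 2 | 0 | 0 | 0 |
| 0.80-0.85 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 0.85-0.90 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 0.90-0.95 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 0.95-1.00 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |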

As indicated in Table 2, Table 3, Table 4, and Table 5, the prediction accuracies for positive pairs under the best prediction model of each group are 89.54%, 89.77%, 82.78%, and 97.67%, respectively, which means that 2,405, 1,220, 495, and 84 positive pairs in the respective groups were classified correctly. For each of these positive pairs, we calculated the difference between its distance to the nearest negative pair and its distance to the nearest positive pair, $\Delta = D_{neg} - D_{pos}$. The distribution of $\Delta$ for each group is given in Table 7. In the four groups, 1,937 (80.54%), 980 (80.33%), 438 (88.49%), and 58 (69.05%) positive pairs, respectively, have $\Delta > 0.05$. This indicates that the distance defined by Eq. 3 (with w=0.4 for the enzyme group and w=0.6 for the ion channel, GPCR, and nuclear receptor groups) not only separates positive pairs from negative pairs well but also leaves a large gap between them.

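Table 7: Distribution of the difference between the nearest-negative distance and the nearest-positive distance for correctly classified positive pairs in each group (frequencies)

| Interval | Enzyme group | Ion channel group | GPCR group | Nuclear receptor group |
|---|---|---|---|---|
| 0.00-0.05 | 468 | 240 | 57 | 26 |
| 0.05-0.10 | 354 | 220 | 50 | 7 |
| 0.10-0.15 | 243 | 131 | 218 | 2 |
| 0.15-0.20 | 208 | 111 | 78 | 5 |
| 0.20-0.25 | 157 | 71 | 62 | 0 |
| 0.25-0.30 | 176 | 90 | 20 | 21 |
| 0.30-0.35 | 108 | 131 | 10 | 9 |
| 0.35-0.40 | 691 | 226 | 0 | 14 |
| Total | 2,405 | 1,220 | 495 | 84 |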

Like other prediction methods for biological problems, the method reported in this paper has its limitations. For example, since our method is based on the similarity between two pairs, its performance may be poor for a potential positive pair that has little or no similarity to any known positive pair in the training dataset. Our results for the enzyme group partly confirm this. According to Table 6, 116 positive pairs have a distance greater than 0.4 to their nearest positive pair, which means that these pairs are not very similar to any other positive pair. As a result, all of these pairs were misclassified. Although the distance defined by Eq. 3 performs well, as the prediction results show, it is not sufficient to handle all cases.

Conclusions

Identifying drug-target interaction networks is very helpful for drug discovery. Determining drug-target pairs experimentally is both time-consuming and costly, so computational methods for this task are desirable. The prediction model designed in this paper identifies drug-target interaction networks successfully, because the novel metric established here separates real interactions from false ones well. Our results also show that compound similarity and functional domain composition are very effective for predicting drug-target interaction networks. We hope that our contribution will be helpful for drug design.
