Protein methylation, one of the most important post-translational modifications, typically takes place on arginine or lysine residues. This reversible modification is involved in a range of basic cellular processes (e.g. repression or activation of gene expression). Identifying methylated proteins and their methylation sites will facilitate understanding of the molecular mechanism of protein methylation. As a complement to conventional experimental methods, computational prediction of methylated sites is highly desirable because it is convenient and fast. In this research, we propose a method dedicated to predicting the methylated sites of proteins. Features describing sequence conservation, physicochemical and biochemical properties, and structural disorder were selected by applying the Maximum Relevance Minimum Redundancy and Incremental Feature Selection methods. The prediction model was built with the nearest neighbor algorithm and evaluated by the jackknife test. The resulting prediction accuracies are 74.06% and 80.97% for the methylarginine and methyllysine training sets, respectively. Feature analysis suggests that evolutionary information, physicochemical and biochemical properties, and structural disorder all play important roles in the recognition of methylated sites. These findings may provide valuable information for exploring the mechanisms of methylation. Our method may serve as a useful tool for biologists to find potential methylated sites in proteins.
Post-translational modification (PTM) is the chemical modification of a protein after its translation, and is one of the later steps in the biosynthesis of many proteins. Post-translational modifications such as acylation, glycosylation, methylation, phosphorylation, and sulfation serve many functions. Consequently, the analysis of post-translational modifications is particularly significant for the study of diseases in which more than one gene is involved, such as cancer, diabetes, and heart disease. Protein methylation, one of the most important post-translational modifications, typically occurs on arginine or lysine residues in the protein sequence. An arginine residue can be methylated once or twice by protein arginine methyltransferases (PRMTs), with either both methyl groups on one terminal nitrogen or one on each of the two terminal nitrogens, while a lysine residue can be methylated once, twice, or three times by lysine methyltransferases 1.
Related research dates back to 1959, when Ambler and Rees 2 first discovered ε-N-methyllysine in the flagellar protein of S. typhimurium. Subsequently, during the early 1960s, Murray 3 found methylation in the acid hydrolysates of histones isolated from calf thymus, wheat germ, and various rabbit organs. Today, protein methylation has been most thoroughly studied in histones. The transfer of methyl groups from S-adenosyl methionine to histones is catalyzed by enzymes known as histone methyltransferases, which can act epigenetically to repress or activate gene expression 4,5. Protein methylation is a reversible type of PTM, just like phosphorylation and sumoylation. It has been reported that LSD1 (lysine-specific demethylase 1) is responsible for the demethylation of histone H3 K4 6, and that JHDM1 (JmjC domain-containing histone demethylase 1) is responsible for the demethylation of K36 7. Most histone lysine methylation, with the exception of histone H3 K79, has been shown to be catalyzed exclusively by the conserved SET domain superfamily proteins 8,9.
The full extent of the regulatory roles of protein methylation remains elusive and is still under investigation. Importantly, identifying methylated proteins and their methylation sites is a foundation for understanding the molecular mechanism of protein methylation. As a complement to conventional experimental methods, such as mutagenesis of potential methylated residues, methylation-specific antibodies 10, and mass spectrometry 11,12, computational prediction of methylated sites is highly desirable because it is convenient and fast. To date, only a few studies have addressed the prediction of protein methylation sites. One is MeMo, an online tool for predicting protein methylation sites based on Support Vector Machines (SVMs) 13; another group developed an approach called BPB-PPMS, which is also based on SVMs 14.
In this work, we present a new algorithm for predicting methylated sites based on amino acid factors 15-17 (which were derived from multivariate statistical analyses of amino acid attributes to resolve the sequence metric problem 15), Position-Specific Scoring Matrices (PSSM) 18-20, and structural disorder 21. The Incremental Feature Selection (IFS) method was used to construct the predictor with the best performance. On the training sets, the prediction accuracies evaluated by the jackknife test were 74.06% and 80.97% for methylarginine and methyllysine, respectively. An independent test was also used to evaluate the constructed predictors and to compare our predictor with other predictors.
Materials and Methods
The methylated proteins used in this research were extracted from UniProt/Swiss-Prot 22 (Release 57.9, 28-Jul-2009) by searching for "Methylarginine" or "Methyllysine" in the field "Sequence annotation [FT]" and keeping only entries with experimental verification. Protein sequences with fewer than 50 amino acids were excluded because they might be mere fragments 23,24, and protein sequences with more than 5000 amino acids were also excluded because they might be protein complexes. As a result, 86 methylarginine proteins and 162 methyllysine proteins were collected for the current study.
The 86 methylarginine proteins were randomly divided into two parts: 76 proteins for training the predictor and 10 proteins for testing it. In the same way, the 162 methyllysine proteins were divided into 142 proteins for training and 20 proteins for testing. In each dataset, peptides consisting of an arginine/lysine residue together with the 4 residues upstream and the 4 residues downstream of it were retrieved from the protein sequences. If a peptide extended beyond the boundary of the protein sequence, the non-existent residues were coded by '-' so that every peptide had 9 residues. Peptides with methylarginine (or methyllysine) as the middle residue were positive samples, and the remaining peptides, with non-methylated arginine (or non-methylated lysine) as the middle residue, were negative samples. For arginine methylation, the 76 training proteins contain 187 positive samples and 2174 negative samples, and the 10 test proteins contain 10 positive samples and 206 negative samples. For lysine methylation, the 142 training proteins contain 289 positive samples and 2733 negative samples, and the 20 test proteins contain 50 positive samples and 252 negative samples.
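The windowing step described above can be illustrated with a minimal Python sketch. The function name `extract_peptides`, its parameters, and the toy sequence are hypothetical and are not taken from the original software; only the window size (4 residues on each side) and the '-' padding come from the text.

```python
def extract_peptides(sequence, target, flank=4, pad="-"):
    """Return (1-based position, peptide) pairs for every `target` residue.

    Each peptide has 2 * flank + 1 residues and is centered on the target;
    positions beyond the sequence ends are filled with `pad`.
    """
    padded = pad * flank + sequence + pad * flank
    peptides = []
    for i, residue in enumerate(sequence):
        if residue == target:
            # Residue i of the original sequence sits at i + flank in `padded`,
            # so this slice is centered on it.
            peptides.append((i + 1, padded[i:i + 2 * flank + 1]))
    return peptides


def label_peptides(peptides, methylated_positions):
    """Label a peptide 1 if its central residue is an annotated methylation site, else 0."""
    return [(pep, 1 if pos in methylated_positions else 0) for pos, pep in peptides]


# Toy sequence (not a real protein); position 7 is treated as an annotated methylarginine.
windows = extract_peptides("MKRAAGRGGLK", "R")
samples = label_peptides(windows, methylated_positions={7})
```

In this toy case the arginine at position 3 lies close to the N-terminus, so its window starts with two padding characters.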
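To balance the positive and negative samples in the training sets, negative samples equal in number to the positive samples were randomly selected from all the negative samples. As a result, we obtained four datasets: a methylarginine learning (training) dataset, a methylarginine test dataset, a methyllysine learning dataset, and a methyllysine test dataset, built from the 76 methylarginine proteins, 10 methylarginine proteins, 142 methyllysine proteins, and 20 methyllysine proteins, respectively (Online Supporting Information S1). The numbers of positive and negative samples in each dataset are as follows.
Methylarginine learning dataset: 187 positive samples, 187 negative samples
Methylarginine test dataset: 10 positive samples, 206 negative samples
Methyllysine learning dataset: 289 positive samples, 289 negative samples
Methyllysine test dataset: 50 positive samples, 252 negative samples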
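A minimal sketch of the balancing step, assuming simple random undersampling without replacement, is shown below. The function name and the fixed seed are illustrative assumptions; the paper does not state how the random selection was implemented.

```python
import random


def balance_by_undersampling(positives, negatives, seed=0):
    """Randomly keep as many negatives as there are positives (no replacement).

    Assumes len(negatives) >= len(positives); `seed` is only here to make
    the illustration reproducible.
    """
    rng = random.Random(seed)
    return positives, rng.sample(negatives, k=len(positives))


# Placeholder sample identifiers with the arginine training-set sizes from the text.
pos = [("pos", i) for i in range(187)]
neg = [("neg", i) for i in range(2174)]
pos_kept, neg_kept = balance_by_undersampling(pos, neg)
```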
Representation of Peptides
In this research, amino acid factors, sequence conservation, and the structural disorder of each amino acid were used to encode the peptides.
Amino Acid Factors
Physicochemical and biochemical properties of amino acids can be described by the five multidimensional patterns (called "amino acid factors") proposed by Atchley et al. 15-17. The attributes of each amino acid are scored by five factors: codon diversity, electrostatic charge, molecular size or volume, polarity, and secondary structure. For non-existent residues, all five scores were set to 0. The factors used to encode each amino acid are detailed in Online Supporting Information S2.
Evolutionary information is an important characteristic of a protein, because residues conserved at specific sequence sites are under strong selective pressure and are therefore often functionally relevant. Here, sequence conservation was quantified by the Position Specific Scoring Matrix 25 (PSSM), which has been demonstrated to be effective for the identification of many post-translational modification sites 26,27. The PSSM depicts the conservation of each amino acid in the sequence by a 20-dimensional numerical vector, which measures the likelihood that the amino acid mutates to each of the 20 amino acids. A protein sequence with N amino acids therefore has an N × 20 PSSM. The PSSM of each protein sequence was produced by the sequence searching method Position Specific Iterated BLAST 28 (PSI-BLAST, Release 2.2.12). Based on experience, the expectation value (e-value) was set to 0.0001, the e-value threshold for inclusion in the multipass model was set to 0.0001, the maximum number of passes was set to 3, and default values were used for the other parameters. The UniProt Reference Clusters UniRef100 (Release 15.9), containing 9,385,165 clusters, was chosen as the database for alignments.
Intrinsically disordered regions 21 have been found to be rich in binding sites that are important loci for diverse protein post-translational modifications such as methylation and acetylation 29. Therefore, we took the structural disorder of each residue in the sequence as a feature to encode the peptides. Each residue was scored by VSL2 30 (one of the best predictors of disorder) to weight the likelihood that it lacks a fixed structure.
In the peptides, each amino acid can be represented by 20 conservation scores, 5 amino acid factors, and 1 disorder score. Because the central residue is the same in every peptide of a dataset, its amino acid factors are identical across samples, so it was represented only by its 20 conservation scores and 1 disorder score. In total, each 9-residue peptide was encoded by 8 × 26 + 21 = 229 numerical features. In other words, the feature space is 229-dimensional.
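The feature count can be checked with a small sketch of the encoding. The function name, the argument layout, and the order in which scores are concatenated are assumptions made for illustration; the paper specifies only which scores are used for each position.

```python
def encode_peptide(pssm_rows, factors, disorder):
    """Concatenate per-residue scores into one feature vector.

    pssm_rows: one list of 20 conservation scores per peptide position
    factors:   one list of 5 amino acid factor scores per position
    disorder:  one disorder score per position
    The central residue contributes no amino acid factors, as described above.
    """
    center = len(pssm_rows) // 2
    features = []
    for pos in range(len(pssm_rows)):
        features.extend(pssm_rows[pos])
        if pos != center:
            features.extend(factors[pos])
        features.append(disorder[pos])
    return features


# Dummy scores for a 9-residue peptide, used only to check the dimensionality.
vector = encode_peptide(
    pssm_rows=[[0.0] * 20 for _ in range(9)],
    factors=[[0.0] * 5 for _ in range(9)],
    disorder=[0.0] * 9,
)
```

With 8 flanking positions carrying 26 scores each and the central position carrying 21, the resulting vector has 229 entries.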
After the representation of the peptides, we first prioritized the 229 numerical features according to the Maximum Relevance, Minimum Redundancy (mRMR) criteria. Based on the order of the sorted features, we constructed 229 feature sets to recode the peptides in each dataset. A prediction model was then built for each feature set, following the Incremental Feature Selection method, and evaluated by the jackknife test. Among the 229 constructed predictors, we selected the model with the best performance to make predictions.
Feature selection involves two important concepts: relevance and redundancy. A good feature is strongly correlated with the target and has low redundancy with the features already selected. Accordingly, the Maximum Relevance, Minimum Redundancy 31 (mRMR) method was employed here to prioritize the 229 features. In this method, mutual information (MI) is used to quantify both relevance and redundancy. MI measures how much one feature is related to another, and is defined as follows.
$$I(x, y) = \iint p(x, y) \log \frac{p(x, y)}{p(x)\,p(y)} \, dx \, dy \qquad (1)$$
where $p(x, y)$ is the joint probability density, and $p(x)$ and $p(y)$ are the marginal probability densities of feature x and feature y.
Let us denote the whole feature set by $\Omega$, the already-selected feature set with m features by $\Omega_s$, and the set of n features still to be selected by $\Omega_t$. The relevance D between a feature f in $\Omega_t$ and the target c is calculated as follows.
$$D = I(f, c) \qquad (2)$$
The redundancy R of the feature f in $\Omega_t$ with all the features in $\Omega_s$ is calculated as follows.
$$R = \frac{1}{m} \sum_{f_i \in \Omega_s} I(f, f_i) \qquad (3)$$
To obtain the feature $f_j$ in $\Omega_t$ with maximum relevance to the target and minimum redundancy with $\Omega_s$, Eq. (2) and Eq. (3) are combined to generate the mRMR function:
$$\max_{f_j \in \Omega_t} \left[ I(f_j, c) - \frac{1}{m} \sum_{f_i \in \Omega_s} I(f_j, f_i) \right], \quad j = 1, 2, \ldots, n \qquad (4)$$
For a feature set with N features, the evaluation is executed for N rounds, yielding the following ordered feature set S.
$$S = \{ f'_1, f'_2, \ldots, f'_h, \ldots, f'_N \} \qquad (5)$$
where the subscript h indicates the round in which the feature is selected. The better a feature is, the earlier it satisfies Eq. (4), the earlier it is selected, and the smaller its subscript index is.
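The greedy ranking described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the mRMR program used in the study: it assumes the features have already been discretized (so that mutual information can be estimated from counts), uses the natural logarithm, and breaks ties by taking the lowest feature index. The function names and the toy data are hypothetical.

```python
from collections import Counter
from math import log


def mutual_information(xs, ys):
    """Estimate the mutual information of two discrete variables from counts."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    # p(x,y) * log(p(x,y) / (p(x) p(y))), written with raw counts.
    return sum((c / n) * log(c * n / (px[x] * py[y])) for (x, y), c in pxy.items())


def mrmr_rank(columns, target):
    """Return feature indices ordered by relevance minus mean redundancy."""
    relevance = [mutual_information(col, target) for col in columns]
    redundancy_sum = [0.0] * len(columns)  # running sum of MI with selected features
    remaining = list(range(len(columns)))
    selected = []
    while remaining:
        def score(j):
            mean_red = redundancy_sum[j] / len(selected) if selected else 0.0
            return relevance[j] - mean_red
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
        for j in remaining:
            redundancy_sum[j] += mutual_information(columns[j], columns[best])
    return selected


# Toy data: feature 1 duplicates feature 0, feature 2 is weaker but not redundant.
target = [0, 0, 0, 0, 1, 1, 1, 1]
columns = [
    [0, 0, 0, 0, 1, 1, 1, 0],
    [0, 0, 0, 0, 1, 1, 1, 0],
    [0, 0, 1, 0, 1, 0, 1, 1],
]
ranking = mrmr_rank(columns, target)
```

In this toy example the duplicated feature is ranked last even though it is individually more relevant than feature 2, which is the behavior the redundancy term is meant to produce.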
Nearest Neighbor Algorithm
The nearest neighbor algorithm (NNA) is a simple but effective and widely used machine learning method (see, e.g., a comprehensive review 32 and the references cited therein). The algorithm assigns an unknown sample to the same category as its nearest neighbor. The distance between two samples is defined as follows.
$$D(p_x, p_y) = 1 - \frac{p_x \cdot p_y}{\|p_x\| \, \|p_y\|}$$
where $\|p_x\|$ and $\|p_y\|$ are the moduli of the coding vectors of the two samples, and $p_x \cdot p_y$ is their dot product.
Given a query peptide with the coding vector $p_t$ and a training set consisting of n known peptides with the coding vectors $\{p_1, p_2, \ldots, p_n\}$, the query peptide is assigned to the category of the peptide $p_k$ that fulfils
$$D(p_t, p_k) = \min \{ D(p_t, p_1), D(p_t, p_2), \ldots, D(p_t, p_n) \} \qquad (8)$$
If more than one $p_k$ satisfies Eq. (8), the category of one of these peptides is randomly selected as the prediction for the query peptide.
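A minimal sketch of the classifier is given below. It assumes that the distance is one minus the cosine similarity, which is what the description in terms of the vector modulus and the dot product suggests, and that no coding vector is all zeros. The function names and the toy vectors are illustrative.

```python
import math
import random


def cosine_distance(px, py):
    """One minus the cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(px, py))
    norms = math.sqrt(sum(a * a for a in px)) * math.sqrt(sum(b * b for b in py))
    return 1.0 - dot / norms


def nearest_neighbor_predict(query, train_vectors, train_labels, rng=random):
    """Assign the query to the category of its nearest training sample.

    If several training samples are exactly equally near, one of their
    labels is chosen at random.
    """
    distances = [cosine_distance(query, v) for v in train_vectors]
    best = min(distances)
    tied_labels = [lab for d, lab in zip(distances, train_labels) if d == best]
    return rng.choice(tied_labels)


# Toy 2-dimensional example: label 1 = methylated, 0 = non-methylated.
train_vectors = [[1.0, 0.0], [0.0, 1.0]]
train_labels = [1, 0]
prediction = nearest_neighbor_predict([0.9, 0.1], train_vectors, train_labels)
```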
After training is completed, a machine learning method should be evaluated to ensure that it serves as a good prediction model. In this research, the jackknife test 33-35 was applied to evaluate the constructed NNA predictors because it has been widely used for evaluating diverse classifiers 36-39. In this validation, each sample is removed in turn from the dataset as a test sample and is then classified by the predictor trained on the remaining samples. Four measures, namely sensitivity (Sn), specificity (Sp), accuracy (AC), and the Matthews correlation coefficient (MCC), were used to quantify the capability of the NNA predictors. Sn, Sp, and AC indicate the prediction success rates on the positive, negative, and overall datasets, respectively. MCC is commonly used when the positive and negative datasets are imbalanced. It ranges from -1 to 1, and the larger the MCC, the better the prediction results. These four measures are defined as follows
$$Sn = \frac{TP}{TP + FN}, \qquad Sp = \frac{TN}{TN + FP}, \qquad AC = \frac{TP + TN}{TP + TN + FP + FN},$$
$$MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \qquad (9)$$
where TP, FP, TN, and FN indicate the numbers of true positive, false positive, true negative, and false negative samples, respectively.
Incremental Feature Selection
In a classification problem, feature selection can be treated as a combinatorial optimization problem: finding the feature set that maximizes the performance of the classifier. To find the optimal subset of k features by exhaustive search, all combinations of k features would need to be tried. Because this is computationally intractable, Incremental Feature Selection 40,41 (IFS), an effective feature selection method based on the mRMR ranking, was employed to tackle the problem.
According to the order of the features ranked by the mRMR procedure, 229 feature sets were constructed as follows
$$S_i = \{ f_1, f_2, \ldots, f_i \} \quad (1 \le i \le 229) \qquad (10)$$
where $f_i$ is the i-th feature in the mRMR-ranked list.
For each feature set, an NNA prediction model was built and then validated by the jackknife test. From the 229 overall prediction accuracies, a curve called the IFS curve was plotted, with the number of features in the feature set on the x-axis and the jackknife prediction accuracy on the y-axis. The feature set with the highest prediction accuracy was selected as the optimal feature set, and the corresponding predictor was used to predict the methylated sites of proteins.
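The IFS loop with jackknife (leave-one-out) validation can be sketched as follows. This is an illustration under assumptions rather than the original implementation: it uses a one-nearest-neighbor classifier with one minus cosine similarity as the distance, treats zero vectors as maximally distant, breaks distance ties by taking the first nearest sample instead of a random one, and keeps the smallest feature set when several reach the highest accuracy. All names and the toy data are hypothetical.

```python
import math


def cosine_distance(u, v):
    """One minus cosine similarity; zero vectors are treated as maximally distant."""
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    if norms == 0.0:
        return 1.0
    return 1.0 - sum(a * b for a, b in zip(u, v)) / norms


def jackknife_accuracy(vectors, labels):
    """Leave each sample out in turn and classify it by its nearest neighbor."""
    correct = 0
    for i, query in enumerate(vectors):
        others = [j for j in range(len(vectors)) if j != i]
        nearest = min(others, key=lambda j: cosine_distance(query, vectors[j]))
        correct += labels[nearest] == labels[i]
    return correct / len(vectors)


def incremental_feature_selection(samples, labels, ranked_features):
    """Evaluate the top-k ranked features for every k and return (best_k, curve)."""
    curve = []
    for k in range(1, len(ranked_features) + 1):
        subset = ranked_features[:k]
        vectors = [[row[f] for f in subset] for row in samples]
        curve.append(jackknife_accuracy(vectors, labels))
    best_k = curve.index(max(curve)) + 1
    return best_k, curve


# Toy data: 4 samples, 3 features, with an assumed ranking [0, 1, 2].
samples = [
    [1.0, 0.1, 5.0],
    [0.9, 0.2, 0.1],
    [0.1, 1.0, 4.0],
    [0.2, 0.9, 0.2],
]
labels = [1, 1, 0, 0]
best_k, curve = incremental_feature_selection(samples, labels, [0, 1, 2])
```

In the toy data the third feature is noisy, so adding it lowers the jackknife accuracy; this is the kind of drop an IFS curve makes visible.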
Results and Discussion
The sorted features by mRMR
After the mRMR procedure, we obtained the prioritized features listed in mRMR feature lists for the two datasets (see Online Supporting Information S3 and Online Supporting Information S4). Within the lists, a smaller index of a feature implies that the feature is more important for classifying the methylated sites and the non-methylated sites.
Performance of NNA predictors
Following the order of the features in the mRMR feature list, 229 feature sets were constructed according to Eq. (10). A prediction model was then built on each feature set to predict methylated sites using the nearest neighbor algorithm. The jackknife prediction accuracies of the 229 NNA predictors on the training sets of the two datasets are shown in the IFS curves (Figure 1). For the methylarginine dataset, the optimal feature set contained the first 38 features in the mRMR feature list (see Online Supporting Information S3), and the corresponding model achieved the highest accuracy, 74.06%, with Sn, Sp, and MCC of 75.40%, 72.73%, and 0.48, respectively. For the methyllysine dataset, the optimal feature set contained the first 51 features in the mRMR feature list (see Online Supporting Information S4), and the corresponding model achieved the highest accuracy, 80.97%, with Sn, Sp, and MCC of 82.70%, 79.24%, and 0.62, respectively. The performances of these predictors for the methylarginine and methyllysine datasets are listed in Online Supporting Information S5 and Online Supporting Information S6, respectively.
Independent Test and Comparison with the Existing Predictors
After the jackknife cross-validation, an independent test was applied to evaluate the constructed predictors. Generally, if the performance (Sn, Sp) in the independent test is much worse than that in the cross-validation, the trained predictor is likely to be over-fitted. In the independent tests, Sn and Sp are 100% and 69.9% for arginine methylation, and 88% and 71.03% for lysine methylation. These values are comparable to the cross-validation results, which suggests that the trained predictors do not suffer from serious over-fitting.
The independent test sets were also used to compare different predictors. Table 1 lists the performances of MeMo 13, BPB-PPMS 14, and our predictor on the test sets. For both the arginine and lysine methylation test sets, the prediction accuracies of our predictor on positive samples (i.e. Sn) are higher than those of the other predictors, although the prediction accuracies on negative samples (i.e. Sp) are lower. The mechanism of protein methylation is complex, and methylated sites are difficult to predict. In practice, we tend to care more about the accuracy on positive samples, because a predictor with a higher Sn will provide more potential methylated sites for experimental verification by biologists.
Biological feature analysis
Protein methylation, one of the most important post-translational modifications, typically takes place on arginine or lysine residues in the protein sequence. For this reason, we analyzed arginine and lysine methylation separately.
Figure 2 shows the position-specific distribution of the conservation features in the optimal feature set. All the surrounding positions play a role in the recognition of the substrate arginine. Conservation at sites AA3, AA5, and AA7 has more influence on the prediction of methylarginine than conservation at other sites. Figure 3 depicts the position-specific distribution of the amino acid factors in the optimal feature set. All 8 flanking positions, especially AA4, AA6, and AA7, contribute to the identification of methylated arginine residues. Arginine is unique among the amino acids in that its guanidino group provides 5 potential hydrogen bond donors, which are positioned for favorable interactions with hydrogen bond acceptors. In protein-DNA complexes in particular, arginine is the most frequent hydrogen bond donor to backbone phosphate groups and to adenine, guanine, and thymine bases 42. The context of methylated residues in these proteins differs from the original consensus for asymmetrically dimethylated proteins, suggesting the importance of residues in the 4th, 6th, and 7th positions for recognition of the substrate arginine 43. From the two distributions (Figure 2 and Figure 3), we found that the residues near the center are more important for the prediction of methylarginine than the remaining residues. Figure 4 displays the distribution of each amino acid factor in the optimal feature set. Among the 5 amino acid factors, electrostatic charge and secondary structure contribute substantially to determining methylation. PRMT, formerly known as protein methylase I, is not only specific for Gly-Arg-Gly or Gly-Ala-Arg primary sequences but is also highly specific for the higher-order structure of proteins. The diversity of these enzymes is increased by alternative splicing, which gives rise to amino acid sequence variants 44.
The PRMT family currently consists of nine highly related members, all of which harbor the signature motifs I, post-I, II, and III and the conserved THW loop. Few consensus recognition sequences have emerged from the known substrates. The structures of the conserved core region, determined for Hmt1 45, PRMT3 46, and PRMT1 47, in some cases in complex with S-adenosylhomocysteine and/or substrate peptides, show that the core region comprises two domains assembled into an integral structure. The N-terminal half of the core, which consists of a typical Rossmann fold and two α-helices, is the AdoMet-binding domain and is the most conserved region among PRMTs. The C-terminal half of the core forms a barrel-like structure that is unique to the PRMT family and folds against the N-terminal AdoMet-binding region. The resulting cleft provides the binding site and the catalytic site for the protein substrate. This three-dimensional structural analysis is also consistent with our finding that electrostatic charge and secondary structure properties are essential for the enzymatic activity.
Figure 5 shows the position-specific distribution of the conservation features in the optimal feature set. The roles of the surrounding positions in lysine methylation differ somewhat 48,49. Conservation at sites AA1, AA3, AA5, and AA7 contributes more to the prediction of methyllysine than conservation at other sites, while conservation at AA4 and AA6 shows no influence on the recognition of methyllysine sites. Figure 6 depicts the position-specific distribution of the amino acid factors in the optimal feature set. The factors at all 8 flanking positions contribute to the identification of methyllysine residues. Figure 7 displays the distribution of each amino acid factor in the optimal feature set. All 5 physicochemical and biochemical properties, especially amino acid composition, secondary structure, and molecular size or volume, play a role in determining lysine methylation. Regarding protein lysine methylation, the bulk of recent research has focused primarily on histone methylation 50,51. This is undoubtedly a result of the appeal of histones, as components of the nucleosome, in controlling gene expression. The chemical mechanisms and biological importance of methylation at each particular lysine residue in the H3 and H4 histone subunits continue to be unraveled. Methylation of several lysine residues in the H3 subunit (e.g. Lys36 and Lys79) is associated with euchromatin and transcriptional activation, whereas methylation of other residues in H3 and H4 (e.g. H3 Lys4, H3 Lys9, H3 Lys27, and H4 Lys20) is associated with heterochromatin and transcriptional repression 52,53. Demethylation by JmjC domain-containing enzymes requires iron and α-ketoglutarate as cofactors. Structural analyses reveal that histones are bound predominantly through backbone interactions. The catalytic center is located in a deep pocket, and the peptide chain must be bent to fit into this cavity. As a result, the secondary structure around the target motif is essential for this reaction.
Extensive site-directed mutagenesis shows that this binding mode depends critically on the presence of flexible amino acid residues, which allow the peptide to bend properly and reach a catalytically productive position 54. The importance of amino acid composition at the enzymes' reaction site is also understandable. Most enzymes bind the methylated lysine in a polar environment, which resembles the 'carbonyl cage' of SET domains rather than the hydrophobic pockets of chromo domain-related motifs 55. The methyl groups are coordinated by a set of electrostatic interactions between polar residues of the protein and the trimethylammonium group. CH···O hydrogen bonds form between oxygen atoms on the enzyme's side chains and the methyl groups of the methyllysine 56,57. Cumulatively, these interactions position one of the methyl groups in the vicinity of the iron so that hydroxylation can occur. All these studies support the role of the surrounding sites in enzyme recognition.
Disorder and Methylation
Besides conservation and physicochemical and biochemical properties, structural disorder also contributes to the identification of methylated sites. In the optimal feature set for arginine methylation, the 10th feature is the disorder of AA8. In the optimal feature set for lysine methylation, the 6th, 21st, 34th, and 48th features are the disorder of AA1, AA9, AA6, and AA5, respectively. These results reveal that protein methylation has much to do with the structural disorder of proteins. This agrees with a previous study 58 indicating that both methylarginine and methyllysine sites are likely to be structurally disordered. Our results also imply that structural disorder has more influence on lysine methylation than on arginine methylation. This may provide a clue for biologists designing experiments to investigate the relation between intrinsic disorder and methylation.
In this research, we proposed a new method dedicated to predicting the methylated sites of proteins, covering both arginine and lysine residues. The physicochemical and biochemical properties, conservation, and structural disorder of the flanking sequences were used to investigate their relation to the methylated sites. Based on the optimal feature sets, the overall success rates on the methylarginine and methyllysine training datasets are 74.06% and 80.97%, respectively. These statistical results may imply that the flanking sequences play an important role in the methylation of arginine and lysine residues. Biological feature analysis suggests that evolutionary information, physicochemical and biochemical properties, and structural disorder play important roles in the recognition of methylated sites. These promising results may provide clues for further in-depth investigation of the effects of flanking sequences. We believe that our method may serve as a useful tool (the software is available on request) for biologists to find potential methylarginine and methyllysine sites in proteins.
This research was supported by grants from the National Basic Research Program of China (2011CB510102, 2011CB510101).