Ubiquitination, one of the most important post-translational modifications of proteins, occurs when ubiquitin (a small 76-amino-acid protein) is attached to a lysine residue on a target protein. It often commits the labeled protein to degradation and plays important roles in regulating many cellular processes implicated in a variety of diseases. Because ubiquitination is rapid and reversible, identifying ubiquitination sites with conventional experimental approaches is time-consuming and labor-intensive. To discover lysine-ubiquitination sites efficiently, we developed a sequence-based predictor of ubiquitination sites based on the Nearest Neighbor Algorithm (NNA). The Maximum Relevance & Minimum Redundancy (mRMR) principle and the Incremental Feature Selection (IFS) procedure were used to optimize the prediction engine. PSSM conservation scores, amino acid factors and disorder scores of the surrounding sequence formed the optimized set of 456 features. The predictor achieved a Matthews correlation coefficient (MCC) of 0.142 in a jackknife cross-validation test on a large benchmark dataset. In an independent test, the MCC of our method was 0.139, higher than that of the existing ubiquitination site predictors UbiPred and UbPred, whose MCCs on the same test set were 0.135 and 0.117, respectively. Our analysis shows that the conservation of amino acids at and around the lysine plays an important role in ubiquitination site prediction. Moreover, disorder and ubiquitination are strongly related. These findings might provide useful insights for studying the mechanisms of ubiquitination and modulating the ubiquitination pathway, potentially leading to therapeutic strategies in the future.
In the post-genomic era, knowledge of post-translational modifications (PTMs) of proteins is crucial for understanding the dynamic proteome and various signaling pathways or networks in cells (Aguilar and Wendland 2003; Saghatelian and Cravatt 2005; Herrmann et al. 2007; Hicke and Dunn 2003; Welchman et al. 2005). One of the most important and universal post-translational modifications, protein ubiquitination is a rapid and reversible biochemical process in which an isopeptide bond forms covalently between the C-terminal double-glycine carboxy group of a ubiquitin protein and the ε-amino group of a lysine residue of a substrate protein (Pickart 2001). Ubiquitination regulates a variety of biological processes, such as signal transduction, cell division/mitosis, apoptosis, and endocytosis (Sun and Chen 2004; Reinstein and Ciechanover 2006; Hoeller et al. 2006; Hicke 2001). Aberration of the ubiquitin-proteasome system (UPS) is associated with numerous pathological conditions, such as inflammatory diseases, neurodegenerative disorders, and cancers (Hoeller et al. 2006; Reinstein and Ciechanover 2006).
Identification of ubiquitination sites is one of the greatest challenges in gaining a full understanding of the regulatory roles of ubiquitination and the molecular mechanism of the ubiquitin system. Conventional experimental approaches for identifying potential ubiquitination sites, such as site-directed mutagenesis (Lin et al. 2005), anti-ubiquitin antibodies (anti-Ub) (Gentry et al. 2005), and high-throughput mass spectrometry (MS) (Kirkpatrick et al. 2005), are time-consuming and labor-intensive. In silico algorithms therefore offer a convenient and efficient way to predict ubiquitination sites.
In this work, we developed a new computational method to predict lysine ubiquitination. Specifically, we used a machine learning approach (the Nearest Neighbor Algorithm) combined with feature selection (IFS based on mRMR (Peng et al. 2005b)). Twenty-six parameters were used to describe each amino acid surrounding the lysine site (from -10 to +10). The twenty-six parameters fall into three categories: twenty PSSM conservation scores, five amino acid factors and one disorder score. A score assigned using Position-Specific Scoring Matrices (PSSM) represents the conservation status of each amino acid in the protein sequence (Altschul et al. 1997). Amino acid factors were defined by Atchley et al. (Atchley et al. 2005) through multivariate statistical analyses on AAIndex (Kawashima and Kanehisa 2000), which produced five factors reflecting polarity (AAFactor 1), secondary structure (AAFactor 2), molecular volume (AAFactor 3), codon diversity (AAFactor 4), and electrostatic charge (AAFactor 5). The disorder score (Peng et al. 2006) quantifies the disorder status of each amino acid in the protein sequence. Disordered regions in proteins lack fixed three-dimensional structures under physiological conditions, but they play important roles in regulation, signaling, and control.
This study focuses on the computational identification of lysine (K) ubiquitination. The Matthews correlation coefficient (MCC) of lysine (K) ubiquitination site predictions was 0.142 on the training set, evaluated by jackknife cross-validation, and 0.139 on the independent test set. The following features distinguish our study from previous ubiquitination prediction models (Radivojac et al. 2010; Tung and Ho 2008): (1) a larger benchmark dataset was used; (2) the feature set was much smaller and more compact; (3) jackknife cross-validation and an independent test were used to evaluate the performance of our classifier effectively and objectively; (4) the applied prediction model, the nearest neighbor algorithm, was much simpler and faster than SVM (Tung and Ho 2008) or random forest (Radivojac et al. 2010), both of which could easily have introduced over-fitting problems; and (5) on the independent test our model performed better than two existing predictors, UbiPred and UbPred. Our analysis shows that the conservation of amino acids at and around the lysine site plays an important role in ubiquitination site prediction. It also shows that electrostatic charge, molecular volume, secondary structure, codon diversity, and polarity of amino acids in the flanking sequences are important for the ubiquitination process. Interestingly, disorder and ubiquitination are strongly related.
Materials and Methods
The ubiquitinated protein sequences we used for training come from SysPTM (Li et al. 2009). Peptides containing lysine (K) were extracted as our training samples. According to Tung's work (Tung and Ho 2008), the best window size for ubiquitination site prediction is 21. We therefore adopted their window size and represented each lysine ubiquitination site with a peptide fragment consisting of 21 residues, with 10 residues upstream and 10 residues downstream of the lysine (K). The original dataset downloaded from SysPTM has 514 lysine-ubiquitination sites from 349 proteins. After removing redundancy among the 349 protein sequences with the program cd-hit (Li and Godzik 2006) to avoid homology bias, we obtained 273 distinct sequences among which the sequence identity was lower than 0.6. We randomly selected 12 proteins to form the independent test set and used the remaining 271 proteins to construct the training set. Because ubiquitinated and non-ubiquitinated lysine sites were highly imbalanced, we randomly selected three times as many negative samples as positive ones for the training set. In the independent test set, we retained all the positive and negative samples to keep it close to the real situation. There were 364 positive samples (ubiquitinated lysine fragments) and 1,092 negative samples (non-ubiquitinated lysine fragments) in the training set; in the independent test set, there were 14 positive samples and 267 negative samples. The benchmark dataset we used was larger than Tung's 157 ubiquitination sites (Tung and Ho 2008) or Radivojac's 272 ubiquitinated fragments (Radivojac et al. 2010). Both the positive and negative lysine samples for training and independent test can be found in Dataset S1.
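The fragment extraction and negative subsampling described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the random seed, and in particular the use of "X" to pad lysines closer than 10 residues to a terminus are assumptions, since the text does not say how such sites were handled.

```python
import random

def lysine_windows(sequence, flank=10, pad="X"):
    """Yield (1-based position, fragment) for every lysine in `sequence`.

    Each fragment has length 2 * flank + 1 with the lysine at its centre.
    Padding with `pad` near the termini is an assumption of this sketch.
    """
    padded = pad * flank + sequence + pad * flank
    for i, residue in enumerate(sequence):
        if residue == "K":
            # sequence[i] sits at padded[i + flank], so the window starts at i
            yield i + 1, padded[i : i + 2 * flank + 1]

def sample_negatives(positives, negatives, ratio=3, seed=0):
    """Randomly keep `ratio` negatives per positive (fewer if not enough exist)."""
    rng = random.Random(seed)
    k = min(len(negatives), ratio * len(positives))
    return rng.sample(negatives, k)
```

With `flank=10` every fragment has 21 residues, matching the window size adopted from Tung and Ho (2008); a 3:1 ratio applied to 364 positives gives the 1,092 negatives reported for the training set.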
The features of PSSM conservation scores
Evolutionary conservation is one of the most important concepts in biology. If an amino acid at a particular position of a particular protein is conserved, this indicates that the amino acid may be located in an important or functional region of the protein.
Position Specific Iterated BLAST (PSI-BLAST) (Altschul et al. 1997) can measure residue conservation at a given position. Each residue can be encoded as a 20-dimensional vector that represents the probabilities of conservation against mutations to the 20 different amino acids. The Position Specific Scoring Matrix (PSSM) (Ahmad and Sarai 2005) is the matrix of such vectors for all residues in a given sequence. If a residue is conserved in PSI-BLAST, it is likely to be important for biological function. In this study, we used the PSSM conservation score to quantify the conservation status of each amino acid in the protein sequence. The program "blastpgp", downloaded from ftp://ftp.ncbi.nlm.nih.gov/blast, was used to calculate the PSSM conservation scores with three iterations (-j 3) and an e-value threshold of 0.0001 for inclusion in the multipass model (-h 0.0001).
The features of amino acid factors
AAIndex (Kawashima and Kanehisa 2000) is a database of numerical indices representing various physicochemical and biochemical properties of amino acids or pairs of amino acids. Atchley et al. (Atchley et al. 2005) performed multivariate statistical analyses on AAIndex to produce five multidimensional patterns of attribute covariation reflecting polarity (AAFactor 1), secondary structure (AAFactor 2), molecular volume (AAFactor 3), codon diversity (AAFactor 4), and electrostatic charge (AAFactor 5). These five transformed scores (called "amino acid factors" here) have been used successfully to address several difficult biological problems, such as deleterious non-synonymous SNP identification (Huang et al. 2010b) and B-cell epitope prediction (Rubinstein et al. 2009). Here, we used these five amino acid factors to encode each amino acid in the lysine fragment.
The features of disorder score
Disordered regions in proteins lack fixed three-dimensional structures under physiological conditions, but they play important roles in regulation, signaling, and control. These activities are achieved through high-specificity, low-affinity interactions and the binding of multiple protein partners (Sickmeier et al. 2007). In this study, we used the disorder score to quantify the disorder status of each amino acid in the protein sequence. VSL2 (Peng et al. 2006) was used to calculate the disorder score. The VSL2 predictors can predict disordered regions of any length and can accurately identify short disordered regions. The disorder scores of the lysine site and its surrounding amino acids formed the disorder features.
The feature space
The lysine (K) ubiquitination site itself was encoded by 20 PSSM conservation scores and 1 disorder score, 21 features in total. Each of its surrounding amino acids (10 residues upstream and 10 residues downstream) was encoded by 26 features: 20 PSSM conservation scores, 5 amino acid factors, and 1 disorder score. Overall, each sample was represented by 21 + 26 × 20 = 541 features.
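As a quick check of this feature layout, the sketch below enumerates one label per feature. The label format and the alphabetical ordering of the 20 amino acids are hypothetical choices for illustration; only the counts come from the text.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # ordering is an assumption of this sketch

def feature_names(flank=10):
    """Return one label per feature for a window of 2 * flank + 1 residues."""
    centre = flank + 1  # AA11 is the lysine when flank is 10
    names = []
    for site in range(1, 2 * flank + 2):
        names += [f"AA{site}_PSSM_{aa}" for aa in AMINO_ACIDS]
        if site != centre:
            # the central residue is always K, so its amino acid factors
            # carry no information and are not used
            names += [f"AA{site}_AAFactor{k}" for k in range(1, 6)]
        names.append(f"AA{site}_disorder")
    return names

names = feature_names()
print(len(names))  # 21 * 20 PSSM + 20 * 5 factors + 21 disorder scores
```

The total of 541 matches the length of the ranked mRMR list reported later in the text.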
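The Maximum Relevance, Minimum Redundancy (mRMR) method was originally developed by Peng et al. (Peng et al. 2005a) for processing microarray data. In this method, each feature is ranked based on its relevance to the target, and the ranking process considers the redundancy among the features at the same time. A "good" feature is defined as one that has the best trade-off between maximum relevance to the target and minimum redundancy within the features. To quantify both relevance and redundancy, mutual information (MI), which estimates how much one vector is related to another, is defined as follows:

$$I(x, y) = \iint p(x, y) \log \frac{p(x, y)}{p(x)\,p(y)} \, dx \, dy \qquad (1)$$

where $x$ and $y$ are vectors, $p(x, y)$ is the joint probabilistic density, and $p(x)$ and $p(y)$ are the marginal probabilistic densities. Given $M$ data points $(x_u, y_u)$, $u = 1, \ldots, M$, drawn from the joint probability distribution, the joint and marginal densities can be estimated by the Gaussian kernel estimator as follows (Beirlant et al. 1997; Qiu et al. 2009):

$$p(x, y) = \frac{1}{M} \sum_{u=1}^{M} \frac{1}{2\pi\sigma^{2}} \exp\left(-\frac{(x - x_u)^{2} + (y - y_u)^{2}}{2\sigma^{2}}\right) \qquad (2)$$

$$p(x) = \frac{1}{M} \sum_{u=1}^{M} \frac{1}{\sqrt{2\pi\sigma^{2}}} \exp\left(-\frac{(x - x_u)^{2}}{2\sigma^{2}}\right) \qquad (3)$$

$$p(y) = \frac{1}{M} \sum_{u=1}^{M} \frac{1}{\sqrt{2\pi\sigma^{2}}} \exp\left(-\frac{(y - y_u)^{2}}{2\sigma^{2}}\right) \qquad (4)$$

where $\sigma$ is a tuning parameter that controls the width of the kernels.

Let $\Omega$ denote the whole feature set, $\Omega_s$ the already-selected feature set, which contains $m$ features, and $\Omega_t$ the to-be-selected feature set, which contains $n$ features. The relevance $D$ of a feature $f$ in $\Omega_t$ to the target $c$ can be calculated by:

$$D = I(f, c) \qquad (5)$$

The redundancy $R$ of the feature $f$ in $\Omega_t$ with all the features in $\Omega_s$ can be calculated by:

$$R = \frac{1}{m} \sum_{f_i \in \Omega_s} I(f, f_i) \qquad (6)$$

To obtain the feature $f_j$ in $\Omega_t$ with maximum relevance and minimum redundancy, Eq. (5) and Eq. (6) are combined in the mRMR function:

$$\max_{f_j \in \Omega_t} \left[ I(f_j, c) - \frac{1}{m} \sum_{f_i \in \Omega_s} I(f_j, f_i) \right] \quad (j = 1, 2, \ldots, n) \qquad (7)$$

For a feature set with $N = m + n$ features, the feature evaluation continues for $N$ rounds. After these evaluations, the mRMR method yields the ranked feature set $S$:

$$S = \{ f'_1, f'_2, \ldots, f'_h, \ldots, f'_N \} \qquad (8)$$

In this feature set, each feature has an index $h$, which indicates the round in which the feature was selected. The better a feature is, the earlier it is selected and the smaller its index $h$.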
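The greedy ranking described above can be sketched in a few lines of NumPy. This is an illustration under stated assumptions rather than the mRMR program used in the study: mutual information is estimated here with a simple 2-D histogram instead of the Gaussian kernel estimator, and the function names and the `bins` parameter are hypothetical.

```python
import numpy as np

def mutual_information(x, y, bins=8):
    """Histogram estimate of I(x; y) in nats (stand-in for the kernel estimator)."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()
    px = pxy.sum(axis=1, keepdims=True)   # marginal of x, shape (bins, 1)
    py = pxy.sum(axis=0, keepdims=True)   # marginal of y, shape (1, bins)
    mask = pxy > 0                        # skip empty cells to avoid log(0)
    return float(np.sum(pxy[mask] * np.log(pxy[mask] / (px @ py)[mask])))

def mrmr_rank(X, target, bins=8):
    """Return feature indices in the order they are selected by mRMR."""
    n_features = X.shape[1]
    relevance = np.array(
        [mutual_information(X[:, j], target, bins) for j in range(n_features)]
    )
    # Round 1: nothing is selected yet, so only relevance counts.
    selected = [int(np.argmax(relevance))]
    remaining = [j for j in range(n_features) if j != selected[0]]
    redundancy_sum = np.zeros(n_features)
    while remaining:
        last = selected[-1]
        for j in remaining:
            # accumulate MI with the most recently selected feature
            redundancy_sum[j] += mutual_information(X[:, j], X[:, last], bins)
        scores = [relevance[j] - redundancy_sum[j] / len(selected) for j in remaining]
        best = remaining[int(np.argmax(scores))]
        selected.append(best)
        remaining.remove(best)
    return selected
```

The position of a feature in the returned list plays the role of the index h: a feature that is informative but duplicates an already-selected one is pushed down the list by the redundancy term.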
Nearest Neighbor Algorithm
In our study, the Nearest Neighbor Algorithm (NNA) is used as the prediction model. NNA makes its decision by calculating similarities between the test sample and all the training samples. The distance between vectors $v_1$ and $v_2$ is defined as follows (Qian et al. 2006; Huang et al. 2009; Huang et al. 2010a):

$$D(v_1, v_2) = 1 - \frac{v_1 \cdot v_2}{\|v_1\| \, \|v_2\|} \qquad (9)$$

where $v_1 \cdot v_2$ is the inner product of the two vectors and $\|v\|$ is the modulus of vector $v$.
In NNA, the query vector is assigned to the class of its nearest neighbor, that is, the training sample of known class that has the smallest distance to the query.
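A minimal sketch of this classifier is given below, assuming the distance is one minus the cosine similarity of the two feature vectors. The function names are illustrative and the code is not taken from the original study.

```python
import numpy as np

def cosine_distance(u, v):
    """1 - cos(angle between u and v); undefined if either vector is all zeros."""
    return 1.0 - float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def nearest_neighbor_predict(train_X, train_y, query):
    """Assign `query` the label of the training sample at the smallest distance."""
    distances = [cosine_distance(row, query) for row in train_X]
    return train_y[int(np.argmin(distances))]

# Toy usage with two training samples of known class
train_X = np.array([[1.0, 0.0], [0.0, 1.0]])
train_y = np.array([1, 0])
print(nearest_neighbor_predict(train_X, train_y, np.array([0.9, 0.1])))
```

Because there is no training step beyond storing the samples, the model has no fitted parameters, which is the simplicity the text contrasts with SVM and random forest.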
Jackknife cross-validation and independent test
We used the jackknife cross-validation method, also known as Leave-One-Out Cross-Validation (LOOCV) (Li et al. 2007; Cai et al. 2009; Huang et al. 2008), one of the most effective and objective ways to evaluate the performance of a classifier, on the training set. With jackknife cross-validation, every sample is tested by the predictor trained with all the other samples. Besides the jackknife cross-validation on the training set, we also performed an independent test. Since the positive and negative samples are highly imbalanced in both the training set and the independent test set, the Matthews correlation coefficient (MCC) (Baldi et al. 2000) was used to evaluate the prediction performance. It is defined as

$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \qquad (10)$$

where TP, TN, FP and FN stand for true positive, true negative, false positive and false negative, respectively.
Because it takes both sensitivity and specificity into account, MCC is considered a balanced measure for dealing with imbalanced data (Baldi et al. 2000; Han et al. 2008).
Sensitivity (Sn), specificity (Sp) and accuracy (ACC) were also calculated, defined as follows:

$$\mathrm{Sn} = \frac{TP}{TP + FN}, \qquad \mathrm{Sp} = \frac{TN}{TN + FP}, \qquad \mathrm{ACC} = \frac{TP + TN}{TP + TN + FP + FN} \qquad (11)$$
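For reference, these measures can be computed from the four confusion-matrix counts as sketched below, using the standard definitions of MCC, sensitivity, specificity and accuracy. The counts in the usage line are made up for illustration and are not results from this study.

```python
import math

def classification_metrics(tp, tn, fp, fn):
    """Return MCC, Sn, Sp and ACC; assumes at least one positive and one negative."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when any marginal count is zero
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "MCC": mcc,
        "Sn": tp / (tp + fn),                    # fraction of true sites recovered
        "Sp": tn / (tn + fp),                    # fraction of non-sites rejected
        "ACC": (tp + tn) / (tp + tn + fp + fn),
    }

# Hypothetical counts, for illustration only
print(classification_metrics(tp=5, tn=80, fp=10, fn=5))
```

On imbalanced data a predictor that labels everything negative reaches high ACC but an MCC of 0, which is why MCC is the headline measure here.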
Incremental Feature Selection (IFS)
Although mRMR could rank the features based on their importance, we do not know how many features in the list should be chosen. In our study, Incremental Feature Selection (IFS) (Huang et al. 2009; Huang et al. 2010a) was used to determine the optimal number of features.
Incremental feature selection is conducted with the ranked features. Features are added one by one, from higher to lower rank. Each time a feature is added, a new feature set is obtained, so we get $N$ feature sets, where $N$ is the number of features, and the $i$-th feature set is:

$$S_i = \{ f_1, f_2, \ldots, f_i \} \quad (1 \le i \le N)$$

For each of the $N$ feature sets, an NNA predictor was constructed and tested by jackknife cross-validation on the training set. With the MCC of jackknife cross-validation calculated for each set, we obtain an IFS table listing the number of features and the corresponding performance. $S_{\mathrm{optimal}}$ is the feature set that achieves the highest MCC.
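The IFS loop can be sketched as follows. This is an illustration, not the original implementation: it assumes a nearest-neighbor predictor with a one-minus-cosine distance, binary labels coded 0/1, and a precomputed ranking `ranked` (a list of column indices, best first); all function names are hypothetical.

```python
import math
import numpy as np

def loocv_nn_predictions(X, y):
    """Jackknife: label each sample by its nearest neighbour among all the others."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    unit = X / np.where(norms == 0, 1.0, norms)   # guard against all-zero rows
    dist = 1.0 - unit @ unit.T                    # pairwise 1 - cosine similarity
    np.fill_diagonal(dist, np.inf)                # a sample cannot pick itself
    return y[np.argmin(dist, axis=1)]

def mcc(y_true, y_pred):
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

def incremental_feature_selection(X, y, ranked):
    """Evaluate the top-1, top-2, ... feature sets; return the best one and the curve."""
    curve = []
    for i in range(1, len(ranked) + 1):
        subset = X[:, ranked[:i]]                 # the i best-ranked features
        curve.append(mcc(y, loocv_nn_predictions(subset, y)))
    best_size = int(np.argmax(curve)) + 1         # first maximum of the IFS curve
    return ranked[:best_size], curve
```

Plotting `curve` against the number of features gives an IFS curve of the kind shown in Figure 1. With N features and n training samples the loop runs N jackknife evaluations, each over all n samples, which is affordable here only because the nearest-neighbor model needs no training.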
Using the mRMR program downloaded from http://penglab.janelia.org/proj/mRMR, we obtained the ranked mRMR list of 541 features. A smaller index indicates that a feature plays a more important role in discriminating positive samples from negative ones. The mRMR list was used in the IFS procedure for feature selection and analysis.
Based on the outputs of mRMR, we built 541 individual predictors for the 541 sub-feature sets to predict the lysine-ubiquitination sites. As described in the Materials and Methods section, we tested the predictors with one feature, two features, three features, etc., and obtained the IFS result, which can be found in Table S1.
Figure 1 shows the IFS curve plotted from Table S1. The highest MCC was 0.142, obtained when 456 features were used. These 456 features were therefore considered the optimal feature set of our classifier. The 456 optimal features are given in Table S2.
Independent test and comparison with other methods
We tested our model on an independent dataset containing 14 positive samples and 267 negative samples. The MCC of our method on the independent test was 0.139. We also ran two existing ubiquitination site predictors, UbiPred (Tung and Ho 2008) and UbPred (Radivojac et al. 2010), on the same independent test set; their MCCs were 0.135 and 0.117, respectively. Our model thus performed better than both UbiPred and UbPred on an independent test set in which the positive and negative samples are highly imbalanced and close to the real situation.
The distribution of the optimized feature set
As described in the Materials and Methods section, there were three kinds of features: PSSM conservation scores, amino acid factors and disorder scores. The number of features of each type in the optimal feature set is shown in Figure 2A, and the number of features at each site is shown in Figure 2B. Among the 456 optimized features, there were 100 amino acid factor features, 8 disorder score features and 348 PSSM conservation score features. This suggests that conservation played an important role in ubiquitination site prediction. Similar evolutionary information exploited through position-specific scoring matrices (PSSMs) was also used in two previous prediction models of ubiquitination (Radivojac et al. 2010; Tung and Ho 2008).
Since the 348 PSSM conservation score features account for a large proportion of the 456 optimized features, we investigated the number of PSSM features for each kind of amino acid (Figure 3A) and at each site (Figure 3B). The conservation of the lysine site (AA11) was the most important for ubiquitination, and there were more PSSM conservation score features at the nearby sites AA7, AA8, AA9, AA12 and AA14 and at the remote sites AA1, AA18, AA19 and AA21 than at other sites. The importance of the remote sites explains why Tung found that the proper window size for ubiquitination site prediction is 21 (Tung and Ho 2008). In addition, conservation against mutations to the 20 amino acids played different roles. Mutations to amino acids A, C, F, H, I, L, M, S, T, V, W and Y had more influence on ubiquitination than other kinds of mutations.
The number of amino acid factor features in the optimal feature set was 100, which means that every amino acid factor feature (5 factors at each of the 20 flanking sites) was selected, suggesting that all five amino acid factors were equally important.
There were 8 disorder scores selected in the optimal feature set: the disorder scores at sites AA6, AA7, AA8, AA9, AA10, AA14, AA17 and AA18. The disorder score of AA7 ranked first in the mRMR list. This indicates that the disorder status of amino acids around the ubiquitination site could affect the ubiquitination process. It has been reported that disordered proteins have a greater proportion of predicted ubiquitination sites (Edwards et al. 2009). To better investigate the relationship between disorder and ubiquitination, we averaged the disorder scores at each site in ubiquitinated fragments and non-ubiquitinated fragments and compared them in Figure 4. In Figure 4, the red and blue dots are the means of the disorder scores at each site in ubiquitinated fragments and non-ubiquitinated fragments, respectively. The width of the error bar represents the standard error of the mean. It is quite clear that the ubiquitinated and non-ubiquitinated fragments have very different disorder score patterns. The mean disorder score at each site in the ubiquitinated fragments is higher than that in the non-ubiquitinated fragments.
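The per-site summary plotted in Figure 4 amounts to a column-wise mean and standard error of the mean. A minimal sketch follows, assuming the disorder scores of one group of fragments are stored as an array with one row per fragment and one column per site; the use of the sample standard deviation (`ddof=1`) is an assumption, since the text does not state which estimator was used.

```python
import numpy as np

def mean_and_sem(scores):
    """Column-wise mean and standard error of the mean.

    `scores` has shape (n_fragments, n_sites); each column is one window position.
    """
    scores = np.asarray(scores, dtype=float)
    mean = scores.mean(axis=0)
    sem = scores.std(axis=0, ddof=1) / np.sqrt(scores.shape[0])
    return mean, sem

# Toy input: three fragments, two sites (values are illustrative only)
mean, sem = mean_and_sem([[0.2, 0.4], [0.4, 0.8], [0.6, 0.6]])
```

Calling the function once for the ubiquitinated fragments and once for the non-ubiquitinated ones gives the two series of dots and error bars that are compared site by site.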
Proteins are targeted for degradation by the covalent ligation to ubiquitin, a small 76-amino-acid residue protein. Ubiquitination of target substrates is a highly collaborative process involving a three-step cascade mechanism between the ubiquitin-activating enzyme (E1), ubiquitin-conjugating enzymes (E2), and ubiquitin ligases (E3) (Hershko and Ciechanover 1998).
Within the selected physicochemical property parameters, we show that polarity (AAFactor 1), secondary structure (AAFactor 2), molecular volume (AAFactor 3), codon diversity (AAFactor 4), and electrostatic charge (AAFactor 5) play similar roles in the selection of ubiquitination sites. The most pronounced feature of Ub sites is the abundance of charged and polar amino acids, especially negatively charged D and E, and the depletion of hydrophobic residues, such as L, I, F, and P, around Ub sites (Nonaka et al. 2005; Radivojac et al. 2010). These parameters are highly related to electrostatic charge and amino acid composition in the adjacent sequence. The known E3 enzymes can be separated into two protein families: HECT domain and RING E3s. The crystal structures of these complexes reveal extraordinary specificity of interaction by a small set of loops at the end of the UbcH7 β-sheet (a subset of secondary structure) (Zheng et al. 2000; Huang et al. 1999). From these results, it is easier to understand how the presence of a few divergent surface residues could modulate the catalytic properties of ubiquitination. The similar positions of the three substrate binding domains support the view that RING E3s promote ubiquitin transfer by positioning the substrate such that the lysine is optimally placed relative to the E2 active site (Zheng et al. 2002; Schulman et al. 2000). Spacing between the destruction motif and the ubiquitin-acceptor lysine residue has also been reported as a parameter that affects the rate of substrate ubiquitination, further supporting the positioning model (Wu et al. 2003). These structural analyses emphasize the importance of secondary structure and molecular size or volume to the ubiquitination process.
The relationship between ubiquitination and protein disorder is complex and remains unclear, but researchers have observed that the percentage of residues predicted as possible ubiquitination sites increases with increasing amounts of disorder (Edwards et al. 2009). A large proportion of disordered proteins are highly expressed in many tissues (Edwards et al. 2009). These proteins may have a higher chance of degradation, as they are likely to have a higher density of ubiquitination sites.
Although much knowledge about ubiquitination has been accumulated to date, it is difficult to assume that all substrates carry a similar preexisting structure before they bind to the components of the ubiquitination machinery. Here, we examine sequence and structural preferences of all available ubiquitination sites and show that they have selected physicochemical property parameters. Regulated protein targeting and turnover through the ubiquitin-proteasome system underlies a host of critical physiological and pathological states in humans. The ability to modulate the individual steps in the ubiquitination pathway offers potential therapeutic strategies in the future.
A novel sequence-based predictor was developed for identifying ubiquitination at lysine sites. With the IFS feature selection procedure based on mRMR analysis, the predictor achieved an MCC of 0.142 in a jackknife cross-validation test on the benchmark dataset. In the independent test, the MCC of our predictor was 0.139, higher than that of the existing ubiquitination site prediction tools UbiPred and UbPred. Our analysis shows that the conservation of amino acids at and around the lysine plays an important role in ubiquitination site prediction. It also shows that electrostatic charge, molecular volume, secondary structure, codon diversity, and polarity of amino acids in the flanking sequences are important for the ubiquitination process. Interestingly, disorder and ubiquitination are strongly related. Although the results reported here are quite encouraging, the present study is merely a preliminary one. Further investigation is needed to clarify the predicted relationship between conservation, disorder and ubiquitination.
Figure 1 - The IFS curve of predictors
In the IFS curve, the x-axis is the number of features and the y-axis is the MCC of jackknife cross-validation. The highest MCC was 0.142 when 456 features were used. So these 456 features were considered as the optimal feature set of our classifier.
Figure 2 - The number of each type or each site of features in optimal feature set
(A) The number of each type of features in optimal feature set. There were 100 amino acid factor features, 8 disorder score features and 348 PSSM conservation score features. (B) The number of each site of features in optimal feature set. From 10 residues upstream to 10 residues downstream ("AA1", "AA2", ..., "AA20", "AA21"), there were 23, 20, 21, 21, 20, 21, 23, 23, 24, 22, 20, 23, 19, 24, 20, 22, 21, 24, 22, 21 and 22 features, respectively.
Figure 3 - The number of each type or each site of PSSM features in optimal feature set
(A) The number of each type of PSSM features in optimal feature set. (B) The number of each site of PSSM features in optimal feature set. The conservation of lysine site (AA11) was most important for the ubiquitination, and there were more PSSM conservation score features at nearby site AA7, AA8, AA9, AA12, AA14 and remote site AA1, AA18, AA19, AA21 than others.
Figure 4 - The disorder scores at each site in ubiquitinated fragments and non-ubiquitinated fragments
The red and blue dots are the means of the disorder scores at each site in ubiquitinated fragments and non-ubiquitinated fragments, respectively. The width of the error bar represents the standard error of the mean.
Dataset S1 - Benchmark dataset.
Table S1 - The IFS result.
Table S2 - The 456 optimal features.