Protein Tyrosine Sulfation Analysis
Protein tyrosine sulfation is a ubiquitous post-translational modification (PTM) of secreted and transmembrane proteins that pass through the Golgi apparatus. In this study, we developed a new method for protein tyrosine sulfation prediction based on the nearest neighbor algorithm, with the Maximum Relevance Minimum Redundancy (mRMR) method followed by Incremental Feature Selection (IFS). We incorporated features of sequence conservation, residue disorder and amino acid factors, 229 features in total, to predict tyrosine sulfation sites. From these 229 features, 145 were selected as the optimal features for the prediction. The prediction model achieved an accuracy of 90.01% using the optimal 145-feature set. Feature analysis showed that conservation, disorder and the physicochemical/biochemical properties of amino acids all contributed to the sulfation process. Site-specific feature analysis showed that features derived from the surrounding sites contributed substantially to sulfation site determination, in addition to features derived from the sulfation site itself. The detailed feature analysis in this paper might help to better understand the sulfation mechanism and guide related experimental validation.
Sulfation, Maximum Relevance Minimum Redundancy, Incremental Feature Selection, nearest neighbor algorithm
Various post-translational modifications (PTMs) of proteins play important roles in the structural and functional diversity of the proteome and regulate various biological processes. Tyrosine sulfation, one such PTM, occurs in various species and cell types 1-5. It is catalyzed by one of the two tyrosylprotein sulfotransferases (TPSTs; TPST-1 6 and TPST-2 7-8) through transfer of the sulfuryl group from 3'-phosphoadenosine-5'-phosphosulfate (PAPS) to the phenol group of tyrosine 1. As one of the most common PTMs in secreted and transmembrane proteins, tyrosine sulfation has been experimentally demonstrated to be essential to extracellular protein-protein interactions, modulation of intracellular protein transport and regulation of proteolytic processing 3, 9-11, and has been implicated in various pathophysiological processes such as atherosclerosis, lung disease and HIV infection 12-14. Up to approximately 1% of the tyrosines in the total protein content of a cell can be sulfated 15. Identification of protein tyrosine sulfation sites is therefore of fundamental importance for understanding the molecular mechanism of tyrosine sulfation in biological systems. Because sulfotyrosine is labile, it is difficult to determine tyrosine sulfation sites using conventional experimental approaches, including chemical sequencing and mass spectrometry analysis 15-17. Although many sulfated proteins have been identified, few sulfotyrosine sites have been exactly determined 1, 3. In addition, determining tyrosine sulfation sites by conventional experimental approaches can be time-consuming and labor-intensive, especially for large-scale datasets. It is therefore much more convenient and efficient to predict tyrosine sulfation sites using in silico algorithms, especially at the proteome level.
Since there are no specific sequence conservation patterns around tyrosine sulfation sites, it is difficult to predict sulfotyrosine sites 9, 18. Rosenquist and Nicholas studied the effect of basic, hydrophobic and small amino acids, disulfide, N-glycosylation (sugar) and acidic sites surrounding the tyrosine sites and found that some sulfation sites were surrounded by acidic amino acids 19-20. However, many other sulfotyrosine sites have no acidic residues in their flanking regions; for instance, Tyr30 in mouse lumican can be sulfated although no acidic amino acids exist within 5 residues 9. Further, Yu et al. developed a position-specific scoring matrix (PSSM) to predict tyrosine sulfation sites in seven-transmembrane peptide receptors 21. In 2002, Sulfinator 22 was developed; it predicts sulfotyrosine sites with four Hidden Markov Models built from sequence alignments of 68 sequence windows containing tyrosine sulfation sites. Chang et al. developed a method called SulfoSite 18, an SVM model that considers both structural information, such as secondary structure and accessible surface area (ASA), and other sequence information, and that used 162 experimentally verified tyrosine sulfation sites as positive samples. However, most existing methods have limitations. For example, Sulfinator cannot identify certain kinds of sulfated tyrosines, such as the sulfotyrosines in extracellular class II leucine-rich repeat (LRR) proteins that were identified by mass spectrometry 15, 23. The SulfoSite prediction model is a "black box" from which no useful biological information can be obtained, and no biological analysis of the features it uses was provided.
In this work, we developed a new computational method to predict tyrosine sulfation sites, based on the Nearest Neighbor Algorithm (NNA) machine learning approach combined with feature selection (IFS based on mRMR). The features we used can be grouped into three categories: Position-Specific Scoring Matrix (PSSM) conservation scores, amino acid factors and disorder scores. Our study has four characteristics: (1) three kinds of features were considered; (2) NNA was used as the prediction model, which is more effective than the HMM (Hidden Markov Model) and SVM (Support Vector Machine) used by Sulfinator and SulfoSite; (3) the Jackknife cross-validation method was used to evaluate the performance of our classifier; (4) features were selected and analyzed. Feature analysis shows that the conservation of amino acids at certain residue sites around the tyrosine plays an important role in sulfation site prediction. It also shows that the secondary structure, codon diversity, molecular volume, polarity and electrostatic charge of amino acids in the flanking sequences are important for the sulfation process, and that the structural disorder of the flanking sequence is strongly related to sulfation.
2 Materials and Methods
We downloaded protein sequences containing sulfated tyrosine sites from SysPTM (version 1.1) 24 and UniProt (version 2010_06) 25-26. After removing redundant sequences and sequences shorter than 50 amino acids, 75 protein sequences remained and were used in our study. We randomly separated these 75 protein sequences into two parts: 60 sequences as the training dataset and 15 sequences as the independent test dataset. We then extracted consecutive 9-residue peptides, each consisting of a tyrosine (Y), the 4 residues upstream and the 4 residues downstream of it, for the training and independent test datasets separately. Sulfated tyrosine sites are considered positive samples and unsulfated tyrosine sites are considered negative samples. The training dataset contains 731 samples in total: 102 positive samples and 629 negative samples. The independent test dataset contains 96 samples in total: 27 positive samples and 69 negative samples. The training and independent test datasets are given in Dataset S1 and Dataset S2 respectively.
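The window extraction described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `extract_windows`, the toy sequence and the 0-based `sulfated_positions` set are hypothetical, and the sketch simply skips tyrosines that lack 4 residues on either side, since the text does not say how such sites were handled.

```python
def extract_windows(sequence, sulfated_positions, flank=4):
    """Return (peptide, label) pairs for every tyrosine with a complete window.

    sulfated_positions holds 0-based indices of sulfated tyrosines (assumed input).
    Tyrosines closer than `flank` residues to either terminus are skipped here;
    the treatment of such sites in the original study is not stated.
    """
    samples = []
    for i, residue in enumerate(sequence):
        if residue != "Y":
            continue
        if i < flank or i + flank >= len(sequence):
            continue  # incomplete window near a terminus
        peptide = sequence[i - flank : i + flank + 1]
        label = 1 if i in sulfated_positions else 0  # 1 = positive, 0 = negative
        samples.append((peptide, label))
    return samples


# Made-up sequence for demonstration; the tyrosine at index 5 is marked as sulfated.
toy_sequence = "MKDDEYEDLAGGSYPTAAAK"
toy_samples = extract_windows(toy_sequence, sulfated_positions={5})
print(toy_samples)
```

Each returned peptide has length 2 × 4 + 1 = 9 with the tyrosine in the middle, matching the sample definition in the text.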
2.2 Feature Construction
2.2.1 PSSM conservation score features
In biological analysis, evolutionary conservation is one of the most important aspects to consider. Residues that are more conserved at specific protein sites may be under stronger selective pressure and are therefore likely to be more important for protein function. Several previous works have demonstrated that sulfotyrosine sites are surrounded by imperfectly conserved flanking sequences 18-19, 22. For example, D residues are much more frequent at the site directly upstream of sulfotyrosine sites. Both our study and previous studies considered sequence conservation features as a primary factor for the prediction of tyrosine sulfation sites 18, 21-22.
Position-Specific Iterative BLAST (PSI-BLAST) 27 can be used to measure the conservation status of a specified location. For a specific residue, it gives normalized probabilities (log-odds scores) of conservation against transitions to the 20 different amino acids as a 20-dimensional vector. The 20-dimensional vectors for all residues in a given sequence compose a matrix called the PSSM (Position-Specific Scoring Matrix). Residues conserved through cycles of PSI-BLAST are suggested to be important for biological function. In our study, the PSSM conservation score was used to quantify the conservation status of each residue in the protein sequence against the 20 different amino acids.
2.2.2 Amino acid factor features
The specificity and diversity of protein structure and function are largely attributed to the composition of various properties of each of the 20 amino acids. Previous studies have shown the important effect of individual amino acid physicochemical properties in discriminating the sulfotyrosine from those non-sulfated tyrosines. Rosenquist et al. had demonstrated the effect of basic, hydrophobic, small amino acids, disulfide, N-glycosylation (sugar) and especially acidic sites surrounding the tyrosine sites on the prediction of sulfotyrosine sites 19-20. The effect of polarity, secondary structure, charge distribution on the determination of tyrosine sulfation had also been demonstrated in 20.
Atchley et al. 28 performed multivariate statistical analyses on AAIndex 29, a database of the biochemical and physicochemical properties of amino acids. They summarized and transformed AAIndex into five highly interpretable, multidimensional numeric indices reflecting secondary structure, polarity, molecular volume, electrostatic charge and codon diversity. We used these five numerical index scores (which we call "amino acid factors") to represent the respective properties of each amino acid in this research.
2.2.3 Disorder score features
The functional importance of protein segments that lack fixed 3-D structures under physiological conditions has been increasingly recognized 30-31. Disordered regions of proteins often contain sorting signals, PTM sites and protein ligands, and consist of unstructured, flexible segments without regular secondary structure. Protein disorder in non-globular segments allows for more modification sites and interaction partners, so it is of great importance for protein structure and function 30, 32-33. In this study, we used VSL2 34 to calculate a disorder score that represents the disorder status of each amino acid in a given protein sequence. The VSL2 predictors can accurately predict both long and short disordered regions in proteins 35-36. The disorder features consist of the disorder scores of the tyrosine site and of the 4 flanking sites on both the C-terminal and N-terminal sides.
2.2.4 The feature space
The feature space of our samples consists of PSSM conservation scores, amino acid factors and disorder scores. For the tyrosine (Y) site, 21 features were used in total: 20 PSSM conservation scores and 1 disorder score. For each of the 4 surrounding amino acids on both the C-terminal and N-terminal sides (8 sites), 26 features were used in total: 20 PSSM conservation scores, 5 amino acid factors and 1 disorder score. Overall, each sample peptide was encoded by 21 + 8 × 26 = 229 features.
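As a quick check on the size of the feature space, the following sketch enumerates one name per feature and counts them. The naming pattern follows the labels quoted later in the text ("pssm4.3", "aai4.4", "disorder2"); the zero-based column index and the placement of the tyrosine at site 5 of the 9-residue window are assumptions made for this illustration.

```python
PSSM_COLUMNS = 20         # one conservation score per amino acid type
AA_FACTORS = 5            # five amino acid factors per residue
WINDOW_SITES = range(1, 10)  # sites 1..9 of the 9-residue peptide
CENTER_SITE = 5           # assumed position of the tyrosine in the window


def feature_names():
    """Enumerate hypothetical feature labels for one sample peptide."""
    names = []
    for site in WINDOW_SITES:
        names += [f"pssm{site}.{k}" for k in range(PSSM_COLUMNS)]
        if site != CENTER_SITE:
            # The central residue is always tyrosine, so its amino acid
            # factors are constant and are not used as features.
            names += [f"aai{site}.{k}" for k in range(AA_FACTORS)]
        names.append(f"disorder{site}")
    return names


names = feature_names()
print(len(names))  # 9*20 + 8*5 + 9 = 229
```

The count decomposes as 180 PSSM scores, 40 amino acid factors and 9 disorder scores, which agrees with the 21 + 8 × 26 breakdown given in the text.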
2.3 mRMR method
To rank the importance of the 229 features, we used the Maximum Relevance Minimum Redundancy (mRMR) method, which was first developed by Peng et al. 37 for the analysis of microarray data. The mRMR method ranks features based on their relevance to the target while also taking the redundancy among features into account. Features with the best trade-off between maximum relevance to the target and minimum redundancy are considered good features.
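To quantify both relevance and redundancy, mutual information (MI), which quantifies the relationship between two vectors, is defined as follows:
$$I(x, y) = \iint p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)} \, dx \, dy \qquad (1)$$
where $x$, $y$ are vectors, $p(x, y)$ is the joint probabilistic density, and $p(x)$ and $p(y)$ are the marginal probabilistic densities.
Let $\Omega$ denote the whole feature set, while $\Omega_s$ denotes the already-selected feature set which contains m features and $\Omega_t$ denotes the to-be-selected feature set which contains n features. $c$ denotes the class of a sample tyrosine, i.e. whether it was sulfated or not. The relevance $D$ of the feature $f$ in $\Omega_t$ with the target $c$ can be calculated by:
$$D = I(f, c) \qquad (2)$$
And the redundancy $R$ of the feature $f$ in $\Omega_t$ with all the features in $\Omega_s$ can be calculated by:
$$R = \frac{1}{m} \sum_{f_i \in \Omega_s} I(f, f_i) \qquad (3)$$
To obtain the feature $f_j$ in $\Omega_t$ with maximum relevance and minimum redundancy, Eq(2) and Eq(3) are combined with the mRMR function:
$$\max_{f_j \in \Omega_t} \left[ I(f_j, c) - \frac{1}{m} \sum_{f_i \in \Omega_s} I(f_j, f_i) \right] \quad (j = 1, 2, \ldots, n) \qquad (4)$$
For a feature set with $N$ ($= m + n$) features, the feature evaluation will continue N rounds. After these evaluations, we will get a feature set $S$ by the mRMR method:
$$S = \{ f'_1, f'_2, \ldots, f'_h, \ldots, f'_N \} \qquad (5)$$
In this feature set $S$, each feature $f'_h$ has an index h indicating the round number in which the feature is selected. The earlier a feature is selected, the better it is, and the smaller its index h will be.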
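The greedy ranking described in this section can be sketched as follows. This is a minimal illustration, not the mRMR program used in the study: it assumes the feature values have already been discretized, estimates mutual information from simple frequency counts, and uses hypothetical names (`mutual_information`, `mrmr_rank`) and made-up toy data.

```python
from collections import Counter
from math import log


def mutual_information(xs, ys):
    """Mutual information (natural log) between two discrete sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    return sum(
        (c / n) * log((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in joint.items()
    )


def mrmr_rank(features, target):
    """Greedy mRMR ordering; `features` maps a name to a list of discrete values."""
    relevance = {name: mutual_information(col, target) for name, col in features.items()}
    selected, remaining = [], list(features)
    while remaining:
        def score(name):
            if not selected:
                return relevance[name]  # first round: relevance only
            redundancy = sum(
                mutual_information(features[name], features[s]) for s in selected
            ) / len(selected)
            return relevance[name] - redundancy  # relevance minus mean redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy data: "b" duplicates "a", while "c" is weaker but carries different information.
target = [0, 0, 0, 0, 1, 1, 1, 1]
features = {
    "a": [0, 0, 0, 1, 1, 1, 1, 1],
    "b": [0, 0, 0, 1, 1, 1, 1, 1],
    "c": [0, 1, 0, 0, 1, 0, 1, 1],
}
ranking = mrmr_rank(features, target)
print(ranking)
```

In this toy case the duplicate feature "b" is ranked last even though it is as relevant as "a", because its redundancy with the already-selected "a" outweighs its relevance.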
2.4 Nearest Neighbor Algorithm
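In our study, the Nearest Neighbor Algorithm (NNA) is used as the prediction model. NNA makes its decision by calculating similarities between the test sample and all the training samples. In our study, the distance between vectors $x_1$ and $x_2$ is defined as follows 38-39:
$$D(x_1, x_2) = 1 - \frac{x_1 \cdot x_2}{\|x_1\| \, \|x_2\|} \qquad (6)$$
where $x_1 \cdot x_2$ is the inner product of $x_1$ and $x_2$, and $\|x\|$ represents the module of vector $x$. The smaller $D(x_1, x_2)$ is, the more similar $x_1$ is to $x_2$.
In NNA, given a vector $x_t$ and a training set $P = \{x_1, x_2, \ldots, x_N\}$ (here N is the number of training vectors), $x_t$ will be designated to the same class as its nearest neighbor $x_n$ in $P$, i.e. the vector having the smallest $D(x_n, x_t)$:
$$D(x_n, x_t) = \min \{ D(x_1, x_t), D(x_2, x_t), \ldots, D(x_N, x_t) \} \qquad (7)$$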
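A minimal sketch of this decision rule is given below, assuming the distance is one minus the cosine of the angle between two feature vectors, as the definitions in this section describe. The function names and the two-dimensional toy vectors are hypothetical, and the sketch does not guard against zero-length vectors.

```python
from math import sqrt


def cosine_distance(u, v):
    """One minus the cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def nearest_neighbor_predict(query, train_vectors, train_labels):
    """Assign the query the label of the training vector with the smallest distance."""
    distances = [cosine_distance(query, v) for v in train_vectors]
    nearest = min(range(len(distances)), key=distances.__getitem__)
    return train_labels[nearest]


# Made-up training data: two "positive" vectors and one "negative" vector.
train_vectors = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0]]
train_labels = [1, 1, 0]
prediction = nearest_neighbor_predict([0.2, 0.8], train_vectors, train_labels)
print(prediction)
```

Because the distance depends only on the angle between vectors, rescaling a feature vector by a positive constant does not change its nearest neighbor.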
2.5 Jackknife Cross-Validation Method
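We used the Jackknife Cross-Validation Method 40-42 (also called leave-one-out cross-validation, LOOCV), which is an objective and effective way to evaluate the performance of a classifier. In the Jackknife Cross-Validation Method, every sample is tested by the predictor that is trained with all the other samples. To evaluate the performance of our sulfation site predictor, we calculated the accuracy rates for the positive, negative and total samples separately as follows:
$$\begin{cases} \text{accuracy for positive samples} = \dfrac{\text{correctly predicted positive samples}}{\text{positive samples}} \\[2ex] \text{accuracy for negative samples} = \dfrac{\text{correctly predicted negative samples}}{\text{negative samples}} \\[2ex] \text{overall accuracy} = \dfrac{\text{correctly predicted samples}}{\text{total samples}} \end{cases} \qquad (8)$$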
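The leave-one-out procedure and the three accuracy rates can be sketched as below. The code is an illustration only: `jackknife_accuracy` accepts any prediction function, and the one-dimensional `nearest_value_predict` rule and integer toy data are stand-ins invented for the demonstration rather than the predictor used in the study.

```python
def jackknife_accuracy(samples, labels, predict):
    """Leave-one-out test; predict(query, train_samples, train_labels) -> 0 or 1."""
    correct_pos = correct_neg = 0
    for i, (query, truth) in enumerate(zip(samples, labels)):
        # Train on every sample except the one being tested.
        train_samples = samples[:i] + samples[i + 1:]
        train_labels = labels[:i] + labels[i + 1:]
        if predict(query, train_samples, train_labels) == truth:
            if truth == 1:
                correct_pos += 1
            else:
                correct_neg += 1
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    return {
        "positive": correct_pos / n_pos,
        "negative": correct_neg / n_neg,
        "overall": (correct_pos + correct_neg) / len(labels),
    }


def nearest_value_predict(query, train_samples, train_labels):
    # Stand-in 1-D nearest-neighbor rule, used only for this demonstration.
    nearest = min(range(len(train_samples)), key=lambda j: abs(train_samples[j] - query))
    return train_labels[nearest]


toy_samples = [0, 1, 3, 10, 11, 4]   # made-up one-dimensional "features"
toy_labels = [0, 0, 0, 1, 1, 1]
rates = jackknife_accuracy(toy_samples, toy_labels, nearest_value_predict)
print(rates)
```

In the toy data the samples 3 and 4 are each other's nearest neighbor but carry different labels, so both are misclassified and each accuracy rate comes out at two thirds.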
2.6 Incremental Feature Selection (IFS)
Although mRMR could rank the features based on their importance, the number of features to be used to optimize the discrimination between sulfated and non-sulfated samples was not known. In this study, we used Incremental Feature Selection (IFS) 39, 43 to determine the optimal number of features.
An incremental feature selection is conducted for the ranked features. Features in the ranked feature set are added one by one from higher to lower rank. When one feature is added, a new feature set is obtained. Thus we get N feature sets, where N is the number of features, and the i-th feature set is:
$$S_i = \{ f'_1, f'_2, \ldots, f'_i \} \quad (1 \le i \le N) \qquad (9)$$
Based on each of the N feature sets, an NNA predictor was constructed and tested with the Jackknife cross-validation test. With the N overall accuracy rates, positive accuracy rates and negative accuracy rates calculated, we obtain an IFS table with one column being the index i and the other column being the overall accuracy rate. The optimal feature set $S_{\text{optimal}}$ is the $S_i$ that achieves the highest overall accuracy rate.
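The IFS loop can be sketched as follows. The `evaluate` callback stands for whatever accuracy estimate is used (in the study, a jackknife test of an NNA predictor); here it is replaced by a lookup of made-up accuracies, and all names are hypothetical.

```python
def incremental_feature_selection(ranked_features, evaluate):
    """Evaluate every prefix of the ranked list and return the best one.

    evaluate(feature_subset) -> overall accuracy for that subset.
    """
    table = []
    for i in range(1, len(ranked_features) + 1):
        subset = ranked_features[:i]          # top-i ranked features
        table.append((i, evaluate(subset)))   # one row of the IFS table
    best_i, best_accuracy = max(table, key=lambda row: row[1])
    return ranked_features[:best_i], best_accuracy, table


# Made-up accuracies, indexed by subset size, for demonstration only.
toy_accuracy = {1: 0.80, 2: 0.86, 3: 0.91, 4: 0.89, 5: 0.90}
ranked = ["f1", "f2", "f3", "f4", "f5"]
optimal, accuracy, table = incremental_feature_selection(
    ranked, lambda subset: toy_accuracy[len(subset)]
)
print(optimal, accuracy)
```

If several subset sizes tie for the highest accuracy, this sketch keeps the smallest one, which is a design choice of the illustration and not something stated in the text.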
3 Results and Discussion
3.1 mRMR result
Using the mRMR program, we obtained the ranked mRMR list of 229 features. Within the list, a smaller index of a feature indicates that it is deemed as a more important feature in discriminating the positive samples from the negative ones. The mRMR list was provided in Table S1 and was used in IFS procedure for feature selection and analysis.
3.2 IFS result
Based on the outputs of mRMR, we built 229 individual predictors for the 229 sub-feature sets to predict tyrosine sulfation sites. As described in the Materials and Methods section, we tested the predictors with one feature, two features, three features, etc and the IFS results can be found in Table S2. Figure 1 shows the IFS curve plotted based on Table S2. The maximum accuracy is 0.9001 when 145 features are included. These 145 features were considered as the optimal feature set of our classifier. Based on these 145 features, the prediction accuracies of the positive samples and negative samples were 0.6667 and 0.9380 respectively. The 145 optimal features were given in Table S3.
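The three reported accuracies can be cross-checked against the training set sizes given earlier (102 positive and 629 negative samples). The short sketch below infers the numbers of correctly predicted samples by rounding, which is an assumption of this check, not a figure taken from the supplementary tables.

```python
n_pos, n_neg = 102, 629               # training set sizes stated in the text
acc_pos, acc_neg = 0.6667, 0.9380     # reported per-class accuracies

# Inferred counts of correctly predicted samples (rounded to whole samples).
correct_pos = round(acc_pos * n_pos)
correct_neg = round(acc_neg * n_neg)

overall = (correct_pos + correct_neg) / (n_pos + n_neg)
print(correct_pos, correct_neg, round(overall, 4))  # 68 590 0.9001
```

Under this reading, 68 of 102 positive samples and 590 of 629 negative samples were predicted correctly, which reproduces the overall accuracy of 0.9001.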
3.3 Optimal feature set analysis
As described in the Materials and Methods section, there were three kinds of features: PSSM conservation scores, amino acid factors and disorder scores. The number of features of each type among the 145 optimal features was investigated and is shown in Figure 2A. Among the 145 optimal features, there were 34 amino acid factor features, 6 disorder score features and 105 PSSM conservation score features. This suggests that all three kinds of features contribute to the prediction of protein tyrosine sulfation sites and that conservation scores may play an irreplaceable role in sulfation site prediction. Although there are only 9 disorder scores in the initial set of 229 features, 6 of them were selected in the optimal feature set. This indicates the important role of disorder status in tyrosine sulfation determination. The site-specific distribution of the optimal feature set, shown in Figure 2B, demonstrates that sites 1 and 2 have the greatest influence on the prediction of tyrosine sulfation sites. Sites at the center (sites 6 and 7) and site 9 have a relatively small effect on tyrosine sulfation, and sites 3, 4, 5 and 8 have the smallest effect. The site-specific distribution of the optimal feature set is quite interesting: it reveals that the residues at the two distal sides and those relatively close to the center are more important for tyrosine sulfation prediction than the remaining residues.
3.3.1 PSSM conservation feature analysis
Since the 105 PSSM conservation score features account for the greatest proportion of the 145 optimal features, we investigated the number of PSSM features for each kind of amino acid (Figure 3A) and found that conservation against mutations to the 20 amino acids influences sulfation differently. Mutations to amino acids C, A, W, K and M influence sulfation more than mutations to other amino acids. We also investigated the number of PSSM features at each site (Figure 3B). The conservation status of the "AA1", "AA2", "AA5" and "AA6" sites was most important for sulfation site prediction, as shown in Figure 3B. In particular, the amino acid at site 4 has been shown to be imperfectly conserved and in most cases is a D residue 18. The first feature in the mRMR feature list is the PSSM feature at site 4 against transition to amino acid D, indicating that it is the most important feature for the prediction of tyrosine sulfation sites, which is consistent with previous studies. In addition, the top 10 features in the optimal feature list contain four other PSSM conservation features: the conservation status against residue R at site 9 (index 3, "pssm9.1"), the conservation status against residue T at site 1 (index 8, "pssm1.16"), the conservation status against residue Y at site 7 (index 9, "pssm7.18") and the conservation status against residue I at site 1 (index 10, "pssm1.9").
3.3.2 Amino acid factor analysis
The number of each type of amino acid factor features (Figure 4A) and the number of amino acid factor features at each site (Figure 4B) were analyzed. It was found that secondary structure, molecular volume, codon diversity, polarity were almost equally important features to the sulfation site prediction. Electrostatic charged amino acid factor feature has a little influence on sulfation site prediction. In Figure 4B, residues at site 2, 4 and 9 have the most important effect on sulfation site prediction, and residues at site 1, 6, 7 and 8 have less effect on sulfation site prediction. Residues at site 3 have the least effect on sulfation site prediction. The site-specific distribution of the amino acid factor features is consistent with the results of previous work showing that the neighboring residues contribute moderately to sulfation with some sites with relatively more influence on sulfation site determination such as site 4 20. Previous study demonstrated that the charge of the residue at site 4 is critical for tyrosine sulfation 20. The electrostatic feature of this site has an index of 5 in our mRMR feature list indicating it is one of the most important features for the tyrosine sulfation site prediction. Residue at site 2 can influence the sulfation degree of the tyrosine site 20. The index of the polarity feature of the amino acid at this site is 2 in the mRMR feature list. This indicates that the influence of residue at this site on tyrosine sulfation degree may be mediated by its polarity status. The existence of amino acid polarity, secondary structure, molecular volume and electrostatic charge features in the optimal feature set had all been supported by the effect of these physicochemical properties on tyrosine sulfation process demonstrated by 19-20, 44.
3.3.3 Disorder score analysis
An NMR study of hirudin found that the peptide chain at the tyrosine sulfation site is too flexible and disordered to determine a structure in that region 45. Rosenquist et al. also demonstrated that small amino acids near the tyrosine sulfation sites were nonuniformly distributed and should make the peptide chain following the tyrosine become very flexible, which suggest that TPST may require a substrate that can make a sharp turn when binding to the enzyme 19. The effects of coil structures and turn-inducing residues on tyrosine sulfation site determination had also been demonstrated by various studies 18, 44, 46.
Within the optimal feature set, six disorder scores were selected: the disorder scores at sites 1, 2, 4, 5, 6 and 7. The selection of 6 of the 9 disorder scores indicates that the disorder status within the tyrosine region is quite important for the tyrosine sulfation process. The site distribution of the six disorder scores shows that the disorder status of the Y site and 3 adjacent sites may have a greater effect on the sulfation process. This is consistent with the study carried out by Rosenquist et al., which showed that the peptide chain immediately following the sulfated tyrosine should be very flexible to satisfy the requirement of a sharp turn when the substrate binds to the enzyme 19.
3.4 Comparisons with existing methods
We used an independent test dataset containing 96 samples: 27 positive samples and 69 negative samples. We applied both our method and two previously developed methods, Sulfinator and SulfoSite, to this dataset. We also applied an SVM-based method to our training dataset and independent test dataset. The prediction accuracies for the positive, negative and total samples are shown in Table 1.
As shown in Table 1, the overall prediction accuracy of our method on the independent test dataset is 0.9479, which is better than Sulfinator (0.9063) and SulfoSite (0.9063). The overall prediction accuracy of the SVM-based method on the training dataset (0.9015) is slightly better than that of our method (0.9001), but on the independent test dataset our method (0.9479) is better than the SVM-based method (0.9167). Overall, our method performs slightly better than the SVM-based method for tyrosine sulfation site prediction.
3.5 Directions for experimental validation
The selected features at different sites may provide guidelines for researchers to find or validate new determinants of protein tyrosine sulfation. For example, among the top 10 features in the optimal feature set, two have been explicitly validated by researchers: the conservation status against residue D at site 4 (index 1, "pssm4.3") 18 and the electrostatic charge property of residues at site 4 (index 5, "aai4.4") 20. The disorder status at site 2 (index 6, "disorder2") is consistent with the observation that the peptide chain at the tyrosine sulfation site is disordered 19, 45. The polarity property of the residue at site 2 (index 2, "aai2") suggests that the previously observed influence of site 2 on the sulfation degree 20 may be mediated by its polarity status. The remaining 6 features are yet to be validated by experiments: the conservation status against residue R at site 9 (index 3, "pssm9.1"), the codon diversity property of residues at site 8 (index 4, "aai8.3"), the molecular volume of residues at site 6 (index 7, "aai6.2"), the conservation status against residue T at site 1 (index 8, "pssm1.16"), the conservation status against residue Y at site 7 (index 9, "pssm7.18") and the conservation status against residue I at site 1 (index 10, "pssm1.9").
In this study, we developed a method for the prediction of protein tyrosine sulfation sites. Our approach considered not only sequence conservation information but also the physicochemical features of individual amino acids and the residue disorder status within the tyrosine regions. Our method achieved an overall accuracy of 90.01%. Based on the feature selection algorithm, a compact set of features was selected; these are deemed to be the features that contribute significantly to the prediction of protein tyrosine sulfation. The selected features may provide important clues about the sulfation mechanism and guide related experimental validation.
Dataset S1. Training dataset used in this study.
Dataset S2. Independent test dataset used in this study.
Table S1. mRMR list.
Table S2. IFS result.
Table S3. The optimal feature set.
Figure 1. Distribution of prediction accuracy against feature numbers
IFS prediction accuracy was plotted against feature numbers based on Table S2. The maximum accuracy is 0.9001 when 145 features are included. These 145 features were considered as the optimal feature set of our classifier.
Figure 2. Feature and site specific distribution of the optimal feature set
(A) Feature distribution of the optimal feature set. Among the 145 optimal features, there were 105 PSSM conservation score features, 34 amino acid factor features and 6 disorder score features. (B) Site-specific distribution of the optimal feature set. Sites 1 and 2 have the greatest influence on the prediction of tyrosine sulfation. Sites at the center (sites 6 and 7) and site 9 have a relatively small effect on tyrosine sulfation, and sites 3, 4, 5 and 8 have the smallest effect.
Figure 3. Feature and site specific distribution of the PSSM features in the optimal feature set
(A) Feature distribution of the PSSM features in the optimal feature set. We investigated the number of PSSM features for each kind of amino acid and found that conservation against mutations to the 20 amino acids influences tyrosine sulfation differently. Mutations to amino acids C, A, W, K and M influence sulfation more than mutations to other amino acids. (B) Site-specific distribution of the PSSM features in the optimal feature set. The conservation status of sites 1, 2, 5 and 6 was most important for sulfation site prediction.
Figure 4. Feature and site specific distribution of the amino acid factor features in the optimal feature set
(A) Feature specific distribution of the amino acid factor features in the optimal feature set. It was found that secondary structure, molecular volume, codon diversity, polarity were almost equally important features to the sulfation site prediction. Electrostatic charged amino acid factor feature has a little influence on sulfation site prediction. (B) Site specific distribution of the amino acid factor features in the optimal feature set. Residues at site 2, 4 and 9 have the most important effect on sulfation site prediction, and residues at site 1, 6, 7 and 8 have less effect on sulfation site prediction. Residues at site 3 have the least effect on sulfation site prediction.
In this study, we developed a new method for protein tyrosine sulfation prediction based on the nearest neighbor algorithm, with the Maximum Relevance Minimum Redundancy (mRMR) method followed by Incremental Feature Selection (IFS), using features of sequence conservation, residue disorder and amino acid factors. The prediction accuracy reached 90.01% using 145 features. Detailed feature analysis might help to better understand the sulfation mechanism and guide related experimental validation.