Translation Rate Based On Sequential And Functional Features Biology Essay

In many cases, mRNA level is used as a substitute for protein abundance. However, many large-scale genomic and proteomic studies either could not find the assumed correlation between mRNA level and protein abundance or found only a very weak one. The reason is that protein concentration depends not only on the mRNA level but also on the translation rate and the degradation rate. Determining the translation rate of an mRNA would therefore provide valuable information for an in-depth understanding of the translation mechanism and the dynamic proteome. With ribosome-profiling technology, ribosome-protected mRNA fragments can be deep sequenced and the translation rate can be monitored, but the approach is time-consuming and expensive, and on its own it does not explain the translation mechanisms. In this study, we developed a new computational model to predict the translation rate. The model (1) integrates various properties related to mRNA translation, such as codon usage frequency features, Gene Ontology enrichment score features, biochemical and physicochemical features, start codon features, coding sequence length, minimum free energy, 5'UTR length, 3'UTR length, number of transcription factors known to bind at the promoter region, number of RNA binding proteins known to bind the mRNA product, protein abundance, mRNA half-life, protein half-life and 5'UTR free energy; (2) applies the mRMR (Maximum Relevance Minimum Redundancy) method and the IFS (Incremental Feature Selection) procedure to select features and optimize the prediction model; and (3) classifies the translation rate of an mRNA as high or low.
It was found that the following features played major roles in determining the translation rate: (1) codon usage frequency, (2) Gene Ontology enrichment scores, (3) protein features (such as amino acid composition, polarity, normalized van der Waals volume, hydrophobicity, polarizability and secondary structure) and (4) other features (such as the number of RNA binding proteins known to bind the mRNA product, coding sequence length, protein abundance and 5'UTR free energy). These findings might provide useful information for understanding the mechanisms of translation. Our translation rate prediction model might also become a high-throughput tool for large-scale annotation of the translation rates of mRNAs.

Introduction

It is often assumed that a highly expressed mRNA will necessarily yield a high protein abundance, and mRNA level is often used as a substitute for protein abundance, especially in microarray studies. However, the regulation of gene expression takes place at many levels, from transcription to translation and post-translational modification. Many studies either could not find the assumed correlation between mRNA level and protein abundance (1) or found only a very weak correlation (2-3). By estimation, only 20% to 40% of protein abundance is determined by the concentration of the corresponding mRNA (4-5). The reason for the weak correlation between protein and mRNA levels is that protein concentration depends not only on the mRNA level but also on the translation rate and the degradation rate. The translation rate of an mRNA has a great influence on the actual protein abundance. Dysregulation of translation can result in various diseases, such as cancer and neurological disorders (6). The regulation of translation plays as important a role as transcriptional regulation in the control of gene expression.

To study translation and predict the translation rate in Saccharomyces cerevisiae, one of the most studied model organisms, especially in translation research, we used the ribosome-profiling data from Ingolia's work (7), in which the read density of mRNA was measured by deep sequencing of ribosome-protected mRNA fragments under both rich and starvation conditions. The translation rate is defined as the normalized read density of translation (footprints) divided by the normalized read density of transcription (mRNA). With this dataset, a computational model to predict the translation rate was constructed with the Nearest Neighbor Algorithm (NNA) and cross-validated. More specifically, to identify the most important features regulating translation rates under different conditions, we applied the mRMR (Maximum Relevance Minimum Redundancy) method and the IFS (Incremental Feature Selection) procedure to analyze which kinds of features are important for determining the translation rate in rich and starvation conditions, respectively. Our results suggest that the following features play the major roles in determining the translation rate: (1) codon usage frequency, (2) Gene Ontology (GO) enrichment scores, (3) protein features (such as amino acid composition, polarity, normalized van der Waals volume, hydrophobicity, polarizability and secondary structure) and (4) other features (such as the number of RNA binding proteins known to bind the mRNA product, coding sequence length, protein abundance and 5'UTR free energy). These findings might provide useful information for understanding the mechanisms of translation. Our translation rate prediction model might also become a high-throughput tool for large-scale annotation of the translation rates of mRNAs.

Materials and Methods

Dataset

The ribosome-profiling data we used are from Ingolia's work (7) and are publicly available at GEO: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE13750. With ribosome-profiling technology, Ingolia et al. (7) deep sequenced the ribosome-protected mRNA fragments and monitored genome-wide translation with subcodon resolution in Saccharomyces cerevisiae under both rich and starvation conditions. To obtain the translation rate, we divided the normalized read density of translation (footprints) by the normalized read density of transcription (mRNA). This ratio represents the translation rate, and we divided the ratios into two groups according to their values: (1) smaller than or equal to the median, and (2) greater than the median. ORFs in the former group have a low translation rate, while ORFs in the latter group have a high translation rate. We categorized the translation rates in the rich condition and the starvation condition separately. In each condition, there were 1334 ORFs with low translation rates and 1333 ORFs with high translation rates. The number of ORFs with low translation rates in both conditions was 1125, while the number of ORFs with high translation rates in both conditions was 1124. A further 209 ORFs had low translation rates in the rich condition but high translation rates in the starvation condition, and 209 ORFs had high translation rates in the rich condition but low translation rates in the starvation condition.
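The labelling step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names and the toy densities are hypothetical and are not taken from GSE13750, and the normalization of read densities is assumed to have been done beforehand.

```python
from statistics import median

def translation_rates(footprint_density, mrna_density):
    """Ratio of normalized footprint density to normalized mRNA density per ORF."""
    return {
        orf: footprint_density[orf] / mrna_density[orf]
        for orf in footprint_density
        if orf in mrna_density and mrna_density[orf] > 0  # skip ORFs without mRNA signal
    }

def label_by_median(rates):
    """Label an ORF 'low' if its rate is <= the median, 'high' otherwise."""
    cutoff = median(rates.values())
    return {orf: ("low" if rate <= cutoff else "high") for orf, rate in rates.items()}

# Hypothetical toy values (already normalized), for illustration only.
footprints = {"ORF_A": 8.0, "ORF_B": 2.0, "ORF_C": 3.0, "ORF_D": 9.0, "ORF_E": 5.0}
mrna = {"ORF_A": 4.0, "ORF_B": 4.0, "ORF_C": 3.0, "ORF_D": 3.0, "ORF_E": 10.0}

rates = translation_rates(footprints, mrna)
labels = label_by_median(rates)
```

With an odd number of ORFs, the ORF sitting exactly at the median falls into the low group, which is consistent with the low group being one ORF larger than the high group in the dataset described above.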

Feature Construction

Codon usage frequency features

We downloaded the ORF coding sequences from the Saccharomyces Genome Database (SGD) (8) and calculated the codon relative frequencies with seqinR (9). It has been reported that highly expressed genes have a much more extreme synonymous codon preference and that the pattern of codon usage can be used to predict the gene expression level in yeast (10). It is therefore highly possible that ORFs with different translation rates also have different codon usage patterns. There were 64 codon usage frequency features.
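As an illustration of what a 64-dimensional codon usage vector looks like, the following Python sketch counts in-frame codons and divides by the total number of codons. It is not the seqinR code used in the study; the function name, the codon ordering and the choice to skip codons containing ambiguous bases are assumptions made for this example.

```python
from itertools import product

# All 64 codons in a fixed (alphabetical) order; the ordering is an arbitrary choice.
CODONS = ["".join(bases) for bases in product("ACGT", repeat=3)]

def codon_frequencies(cds):
    """Return the relative frequency of each of the 64 codons in a coding sequence."""
    cds = cds.upper().replace("U", "T")
    counts = dict.fromkeys(CODONS, 0)
    total = 0
    # Walk the sequence codon by codon, ignoring a trailing partial codon.
    for i in range(0, len(cds) - len(cds) % 3, 3):
        codon = cds[i:i + 3]
        if codon in counts:  # skip codons with ambiguous bases such as N
            counts[codon] += 1
            total += 1
    if total == 0:
        return [0.0] * len(CODONS)
    return [counts[codon] / total for codon in CODONS]

# Toy coding sequence: ATG GCT GCT AAA TAA
features = codon_frequencies("ATGGCTGCTAAATAA")
```

Here `features` has 64 entries that sum to 1, and the entry for GCT is 2/5 because two of the five codons are GCT.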

Gene Ontology features

Proteins are produced to achieve their biological functions. As demand determines production, the translation rate of an ORF is likely to be correlated with its biological functions. The function of a protein can be better described in the context of a protein interaction network, which gives a comprehensive and robust description of its function. In this study, the network context we used was STRING (11). The Gene Ontology enrichment score of a protein for a given GO term was defined as the -log10 of the p-value generated by the hypergeometric test on its neighbors in the STRING network. The larger the enrichment score of a Gene Ontology term, the more overrepresented this term is. There were 4148 Gene Ontology (GO) enrichment score features.
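A minimal sketch of such an enrichment score is given below. The text does not specify the background set or the tail of the test, so the sketch assumes a one-sided (over-representation) hypergeometric test against a user-supplied background of proteins; all names and the toy protein identifiers are hypothetical.

```python
from math import comb, log10

def hypergeom_upper_tail(total, annotated, drawn, hits):
    """P(X >= hits) when `drawn` proteins are sampled without replacement from
    `total` proteins, of which `annotated` carry the GO term."""
    denominator = comb(total, drawn)
    most_possible = min(drawn, annotated)
    numerator = sum(
        comb(annotated, i) * comb(total - annotated, drawn - i)
        for i in range(hits, most_possible + 1)
    )
    return numerator / denominator

def go_enrichment_score(neighbors, term_members, background):
    """-log10 of the hypergeometric p-value for one GO term among a protein's neighbors."""
    background = set(background)
    neighbors = set(neighbors) & background
    term_members = set(term_members) & background
    p_value = hypergeom_upper_tail(
        len(background), len(term_members), len(neighbors), len(neighbors & term_members)
    )
    if p_value == 0.0:  # guard against floating-point underflow
        return float("inf")
    return -log10(p_value)

# Toy example: 10 background proteins, 3 annotated with the term,
# 4 network neighbors of which 3 are annotated.
background = {f"P{i}" for i in range(10)}
term_members = {"P0", "P1", "P2"}
neighbors = {"P0", "P1", "P2", "P3"}
score = go_enrichment_score(neighbors, term_members, background)
```

In this toy case the p-value is 7/210 = 1/30, so the score is log10(30), about 1.48. Repeating the calculation for every GO term gives one enrichment score feature per term.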

Biochemical and physicochemical features of proteins

To encode proteins of different sequence lengths with a uniform dimensional vector, we adopted the notion of pseudo amino acid composition (PseAAC) (12-13). Each protein sequence was represented by 132 biochemical and physicochemical properties according to the following seven aspects: (1) amino acid composition (AAC, the occurrence frequencies of the 20 native amino acids for a given protein) (14-15), (2) solvent accessibility, (3) normalized van der Waals volume, (4) polarizability, (5) secondary structure, (6) hydrophobicity, and (7) polarity (16). Except for AAC, each of the other six properties is associated with a single amino acid at a given position in the protein sequence, and each amino acid can be classified into two or three pseudo groups. For secondary structure, each amino acid is predicted by SSpro (17) as helix, strand or coil. For solvent accessibility, each amino acid is predicted by ACCpro (18) as exposed to or buried from the solvent. For the other four types of properties, each amino acid is classified into three categories in a similar way according to its value. In terms of hydrophobicity, there are three groups of amino acids: polar (R, K, E, D, Q, N), neutral (G, A, S, T, P, H, Y) and hydrophobic (C, V, L, I, M, F, W) (19). For polarizability: 0-0.108 (G, A, S, D, T), 0.128-0.186 (C, P, N, V, E, Q, I, L) and 0.219-0.409 (K, M, H, F, R, Y, W) (20). For normalized van der Waals volume: 0-2.78 (G, A, S, C, T, P, D), 2.95-4.0 (N, V, E, Q, I, L) and 4.43-8.08 (M, H, K, F, R, Y, W) (21). For polarity: 4.9-6.2 (L, I, F, W, C, M, V, Y), 8.0-9.2 (P, A, T, G, S) and 10.4-13.0 (H, Q, R, K, N, E, D) (22).

To generate the corresponding global features by integrating the local quantities of amino acids over the entire protein sequence, the following three quantities are calculated: C (composition), T (transition), and D (distribution) (23). C refers to the global percent composition of each of the three or two groups in the pseudo sequence; T to the percent frequencies with which the pseudo code letter changes to another along the entire sequence length; and D to the percentage of sequence length within which the first, 25%, 50%, 75%, and 100% of each kind of pseudo letter is located.

For normalized van der Waals volume, polarizability, secondary structure, hydrophobicity and polarity, each amino acid is classified into three categories, which generates 21 features per property. For solvent accessibility, each amino acid is classified into two categories, and the combination of C, T and D for the sequence coded according to solvent accessibility generates only 7 features.

For AAC we thus have 20 features; for solvent accessibility, 7 features; and for each of the other five properties, 21 features. Combining all these features, each protein has 20 + 7 + 21 × 5 = 132 features. A detailed explanation of the 132 biochemical and physicochemical features can be found in our previous work (23).
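For a three-group property, the composition, transition and distribution quantities can be sketched as follows, using the hydrophobicity grouping given above. This is one plausible reading of the definitions, not the implementation from reference (23): the function names, the group letters "1", "2", "3", the breakdown into 3 + 3 + 15 values, and the rounding rule for the 25%, 50% and 75% occurrences are all assumptions. The two-group solvent accessibility encoding, which yields 7 features, is not covered.

```python
from math import ceil

# Hydrophobicity groups from the text: 1 = polar, 2 = neutral, 3 = hydrophobic.
HYDROPHOBICITY_GROUPS = {"1": "RKEDQN", "2": "GASTPHY", "3": "CVLIMFW"}

def encode(sequence, groups):
    """Translate an amino acid sequence into a pseudo sequence of group letters."""
    lookup = {aa: letter for letter, members in groups.items() for aa in members}
    return "".join(lookup[aa] for aa in sequence.upper() if aa in lookup)

def ctd_features(pseudo, letters="123"):
    """Return composition (3), transition (3) and distribution (15) percentages."""
    n = len(pseudo)
    if n < 2:
        raise ValueError("pseudo sequence must contain at least two residues")
    # C: percent composition of each group.
    composition = [100.0 * pseudo.count(g) / n for g in letters]
    # T: percent of adjacent pairs in which one group changes to another,
    # counted per unordered pair of groups.
    pairs = [(a, b) for i, a in enumerate(letters) for b in letters[i + 1:]]
    transition = [
        100.0 * sum(1 for x, y in zip(pseudo, pseudo[1:]) if {x, y} == {a, b}) / (n - 1)
        for a, b in pairs
    ]
    # D: position (as percent of length) of the first, 25%, 50%, 75% and last
    # occurrence of each group.
    distribution = []
    for g in letters:
        positions = [i + 1 for i, x in enumerate(pseudo) if x == g]
        for fraction in (0.0, 0.25, 0.5, 0.75, 1.0):
            if not positions:
                distribution.append(0.0)
                continue
            k = max(1, ceil(fraction * len(positions)))
            distribution.append(100.0 * positions[k - 1] / n)
    return composition + transition + distribution

pseudo_sequence = encode("MKVLAAGE", HYDROPHOBICITY_GROUPS)
ctd = ctd_features(pseudo_sequence)
```

For the toy peptide MKVLAAGE the pseudo sequence is "31332221", and `ctd` holds 21 values, which matches the per-property feature count stated above.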

Start codon features

During translation initiation, the 40S ribosomal subunit binds to a site upstream of the start codon. It proceeds downstream until it encounters the start codon and forms the translation initiation complex. The start codon is typically AUG (ATG in DNA) and is associated with translation initiation. We extracted the 3 bp of untranslated sequence immediately upstream of the initial ATG and the 3 bp of coding sequence immediately downstream of it. We encoded these 6 bp of DNA sequence flanking the start codon in binary form, with each base represented by a 4-dimensional vector, one distinct vector for each of the four bases.
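The binary encoding of the start codon context can be illustrated with a one-hot scheme. The specific base-to-vector assignment below is an assumption made for this sketch, as are the function name and the toy sequence.

```python
# Assumed one-hot assignment; the actual vectors used in the study are not given here.
ONE_HOT = {
    "A": (1, 0, 0, 0),
    "C": (0, 1, 0, 0),
    "G": (0, 0, 1, 0),
    "T": (0, 0, 0, 1),
}

def encode_start_context(sequence, atg_index, flank=3):
    """One-hot encode `flank` bases on each side of the ATG at `atg_index`."""
    sequence = sequence.upper()
    if sequence[atg_index:atg_index + 3] != "ATG":
        raise ValueError("no ATG at the given index")
    if atg_index < flank or atg_index + 3 + flank > len(sequence):
        raise ValueError("not enough flanking sequence")
    upstream = sequence[atg_index - flank:atg_index]
    downstream = sequence[atg_index + 3:atg_index + 3 + flank]
    # Concatenate the 4-dimensional vectors of the six flanking bases.
    return [bit for base in upstream + downstream for bit in ONE_HOT[base]]

# Toy sequence: upstream AAA, start codon ATG, downstream TCT.
vector = encode_start_context("GGAAAATGTCTGG", atg_index=5)
```

The sketch returns 6 × 4 binary values, with exactly one 1 per flanking base.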

Other features

Coding sequence length

We calculated the coding sequence length of each ORF as a potential feature for translation rate prediction.

Minimum free energy

The minimum free energy of RNA structure was calculated with RNAfold (24).

Various parameters of untranslated regions from Tuller's study

Tuller et al. (25) collected various properties of untranslated regions of the S. cerevisiae genome, and we used the following 8 features from Tuller's study: 5'UTR length, 3'UTR length, number of transcription factors known to bind at the promoter region, number of RNA binding proteins known to bind the mRNA product, protein abundance, mRNA half-life (26), protein half-life and 5'UTR free energy (27).

Feature space of ORF

As mentioned above, there are 64 codon usage frequency features, 4148 Gene Ontology (GO) enrichment score features, 132 biochemical and physicochemical features, 4 start codon features and 10 other features. The total number of features used in this study to represent an ORF sample is the sum of these counts.

mRMR method

The Maximum Relevance Minimum Redundancy (mRMR) method (28-29) was originally developed by Peng et al., and the mRMR program used in this paper was downloaded from http://penglab.janelia.org/proj/mRMR. It ranks each feature according to both its relevance to the class labels and its redundancy with the other features. "Good" features have maximum relevance to the target class and, at the same time, minimum redundancy, i.e., they are maximally dissimilar to each other. Both relevance and redundancy are defined by mutual information (MI), which measures how much one vector is related to another. MI is defined as follows:

$$I(x, y) = \iint p(x, y) \log \frac{p(x, y)}{p(x)\,p(y)} \, dx \, dy \qquad (2)$$

where $x$ and $y$ are two vectors, $p(x, y)$ is their joint probabilistic density, and $p(x)$ and $p(y)$ are the marginal probabilistic densities.

Let $\Omega$ denote the whole vector set containing all the features, $\Omega_s$ denote the selected vector set with $m$ vectors, and $\Omega_t$ denote the to-be-selected vector set with $n$ vectors. The relevance $D$ of a feature $f$ in $\Omega_t$ with the target class variable $c$ can be computed by equation (3):

$$D = I(f, c) \qquad (3)$$

The redundancy $R$ of a feature $f$ in $\Omega_t$ with all the features in $\Omega_s$ can be computed by equation (4):

$$R = \frac{1}{m} \sum_{f_i \in \Omega_s} I(f, f_i) \qquad (4)$$

To obtain a feature $f_j$ in $\Omega_t$ with maximum relevance and minimum redundancy, the mRMR function is obtained by integrating equation (3) and equation (4):

$$\max_{f_j \in \Omega_t} \left[ I(f_j, c) - \frac{1}{m} \sum_{f_i \in \Omega_s} I(f_j, f_i) \right] \quad (j = 1, 2, \ldots, n) \qquad (5)$$

For a feature pool containing $N$ ($N = m + n$) features, feature evaluation will be executed in $N$ rounds. After these evaluations, a feature set $S$ will be obtained:

$$S = \{ f'_1, f'_2, \ldots, f'_h, \ldots, f'_N \} \qquad (6)$$

where each feature in $S$ has a subscript index $h$ indicating the round in which the feature was selected. The earlier a feature is selected, the better it is and the smaller its subscript index.
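The greedy ranking described in this section can be sketched in plain Python. This is not the mRMR program by Peng et al.: it assumes that all features and the class labels are discrete, so that mutual information can be estimated from counts, and it scores each candidate as its relevance minus its mean redundancy with the features already selected. The function names and the toy feature names are hypothetical.

```python
from collections import Counter
from math import log2

def mutual_information(x, y):
    """Mutual information (in bits) between two equally long discrete vectors."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    return sum(
        (count / n) * log2((count / n) / ((px[a] / n) * (py[b] / n)))
        for (a, b), count in pxy.items()
    )

def mrmr_rank(features, target):
    """Return the feature names in the order in which greedy mRMR selects them."""
    relevance = {name: mutual_information(values, target) for name, values in features.items()}
    selected, remaining = [], list(features)
    while remaining:
        def score(name):
            if not selected:  # first round: relevance only
                return relevance[name]
            redundancy = sum(
                mutual_information(features[name], features[s]) for s in selected
            ) / len(selected)
            return relevance[name] - redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy data: `a_copy` duplicates `a`, while `b` is weakly relevant but less redundant.
target = [0, 0, 0, 0, 1, 1, 1, 1]
toy_features = {
    "a": [0, 0, 0, 1, 1, 1, 1, 1],
    "a_copy": [0, 0, 0, 1, 1, 1, 1, 1],
    "b": [0, 0, 1, 1, 0, 1, 1, 1],
}
ranking = mrmr_rank(toy_features, target)
```

In the toy example `a` is selected first; in the second round the exact duplicate `a_copy` is penalized for its redundancy with `a`, so the less relevant but less redundant `b` is ranked ahead of it.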

Nearest Neighbor Algorithm

In our work, the Nearest Neighbor Algorithm (NNA) was used to classify samples into different categories. Its basic idea is to assign a new sample to a category by comparing its features with those of samples whose categories are known. The distance between two sample vectors $p_1$ and $p_2$ in this study is defined as (23,30):

$$D(p_1, p_2) = 1 - \frac{p_1 \cdot p_2}{\|p_1\| \, \|p_2\|} \qquad (7)$$

where $p_1 \cdot p_2$ is the inner product of $p_1$ and $p_2$, and $\|p\|$ is the modulus of vector $p$. $p_1$ and $p_2$ are considered to be more similar if $D(p_1, p_2)$ is smaller.

In NNA, a vector $p_t$ will be designated as having the same class as its nearest neighbor $p_n$, which has the smallest $D(p_n, p_t)$. That is

$$D(p_n, p_t) = \min \left( D(p_1, p_t), D(p_2, p_t), \ldots, D(p_N, p_t) \right) \qquad (8)$$

where $N$ represents the number of training samples.
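A compact Python sketch of this classifier is shown below, using one minus the cosine similarity as the distance. The function names and the toy vectors are hypothetical, and the sketch assumes that no sample is an all-zero vector, since the distance is undefined in that case.

```python
from math import sqrt

def cosine_distance(p1, p2):
    """One minus the cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(p1, p2))
    norm1 = sqrt(sum(a * a for a in p1))
    norm2 = sqrt(sum(b * b for b in p2))
    return 1.0 - dot / (norm1 * norm2)

def nearest_neighbor_predict(query, train_vectors, train_labels):
    """Return the label of the training vector closest to `query`."""
    distances = [cosine_distance(query, p) for p in train_vectors]
    nearest = min(range(len(distances)), key=distances.__getitem__)
    return train_labels[nearest]

# Toy training set with two samples, one per class.
train_vectors = [[1.0, 0.1], [0.1, 1.0]]
train_labels = ["high", "low"]
prediction = nearest_neighbor_predict([0.9, 0.2], train_vectors, train_labels)
```

The query vector points in almost the same direction as the first training vector, so `prediction` is "high". Because the distance depends only on the angle between vectors, rescaling a whole sample vector does not change its nearest neighbor.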

Jackknife Cross-Validation Method

The Jackknife Cross-Validation Method (30-31) is one of the most objective and effective methods to evaluate prediction performance. During Jackknife Cross-Validation, each sample in the dataset is tested in turn by the predictor, which is trained on all the other samples in the dataset. During this process, each sample is involved in training $N - 1$ times, where $N$ is the number of samples in the dataset, and is tested exactly once. To evaluate the performance of the predictor, the accuracy rate for the overall samples can be calculated as:

$$\text{Accuracy} = \frac{\sum_i T_i}{\sum_i N_i} \qquad (9)$$

where $T_i$ and $N_i$ stand for the number of correctly predicted samples and overall samples in class $i$.
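The leave-one-out loop behind the jackknife test can be sketched as follows. The `predict` callback is a hypothetical stand-in for any classifier; the one-dimensional nearest-neighbor rule used in the example is a placeholder chosen to keep the sketch short, not the distance used in the study.

```python
def jackknife_accuracy(samples, labels, predict):
    """Leave-one-out accuracy: each sample is predicted from all the others."""
    correct = 0
    for i in range(len(samples)):
        train_samples = samples[:i] + samples[i + 1:]
        train_labels = labels[:i] + labels[i + 1:]
        if predict(samples[i], train_samples, train_labels) == labels[i]:
            correct += 1
    return correct / len(samples)

def nearest_1d(query, train_samples, train_labels):
    """Placeholder classifier: label of the closest training value on a line."""
    nearest = min(range(len(train_samples)), key=lambda j: abs(train_samples[j][0] - query[0]))
    return train_labels[nearest]

# Toy data: the last sample is an outlier that the classifier gets wrong.
samples = [[0.0], [0.1], [0.2], [1.0], [1.1], [5.0]]
labels = ["low", "low", "low", "high", "high", "low"]
accuracy = jackknife_accuracy(samples, labels, nearest_1d)
```

In this toy run three of the four "low" samples and both "high" samples are predicted correctly, so the overall accuracy is (3 + 2) / (4 + 2) = 5/6.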

Incremental Feature Selection (IFS)

After the mRMR step, we obtained a list of features in their order of selection. However, we still did not know how many features in the list should be chosen. In our study, Incremental Feature Selection (IFS) (23,30) was used to determine the optimal number of features. We constructed $N$ feature subsets from the mRMR feature list defined in equation (6) by adding one feature at a time to the candidate feature subset, starting from an initial subset containing only the first feature. The $i$-th feature subset is defined as:

$$S_i = \{ f'_1, f'_2, \ldots, f'_i \} \quad (1 \le i \le N) \qquad (10)$$

by adding feature $f'_i$, the $i$-th feature in the mRMR list, to the previous subset $S_{i-1}$.

For each feature subset, the Jackknife Cross-Validation Method was used to obtain the accuracy rate. The results were plotted to produce an IFS curve with the index $i$ as its x-axis and the overall accuracy as its y-axis.
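The IFS procedure reduces to a short loop over growing prefixes of the ranked feature list. In the sketch below, `evaluate` is a hypothetical callback that returns the cross-validated accuracy of a feature subset; the toy accuracies are invented for illustration, and breaking ties in favor of the smaller subset is an assumption of this sketch.

```python
def incremental_feature_selection(ranked_features, evaluate):
    """Evaluate the first 1, 2, ..., N ranked features and keep the best prefix."""
    curve = []
    for i in range(1, len(ranked_features) + 1):
        subset = ranked_features[:i]
        curve.append((i, evaluate(subset)))
    # max() returns the first maximum, so ties go to the smaller subset.
    best_size, best_accuracy = max(curve, key=lambda point: point[1])
    return ranked_features[:best_size], best_accuracy, curve

# Hypothetical ranked list and invented accuracies, for illustration only.
ranked = ["f1", "f2", "f3", "f4", "f5"]
toy_accuracy = {1: 0.60, 2: 0.66, 3: 0.70, 4: 0.69, 5: 0.70}

best_subset, best_accuracy, ifs_curve = incremental_feature_selection(
    ranked, lambda subset: toy_accuracy[len(subset)]
)
```

Here `ifs_curve` holds the (number of features, accuracy) points of the IFS curve, and `best_subset` is the prefix at its peak, in this toy case the first three features.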

Results

Identification of relevant features and construction of the translation rate prediction model

Using the mRMR program downloaded from http://penglab.janelia.org/proj/mRMR/, we ranked and analyzed the top 500 features most relevant to translation rate. Each of them has maximal relevance to the translation rate and minimal redundancy with the other features. Accordingly, 500 prediction models were constructed with 1, 2, 3, ..., 499 and 500 features, respectively, and tested as described above. As shown in Figure 1 (A), the translation rate prediction model for the rich condition achieved its peak accuracy of 68.8% with 37 features, which are listed in Table S1 (A). These 37 features formed the optimal feature set for the translation rate prediction model of the rich condition. Similarly, as shown in Figure 1 (B), the translation rate prediction model for the starvation condition achieved its highest accuracy of 70.0% with 86 features, which are listed in Table S1 (B). These 86 features formed the optimal feature set for the translation rate prediction model of the starvation condition.

Analysis of optimal feature sets in rich and starvation conditions

We compared the optimal 37-feature set of the rich condition with the optimal 86-feature set of the starvation condition and found 27 features common to both. These 27 common features are listed in Table S1 (C). To investigate which kinds of features are critical for the translation rate, we extracted the optimal features and counted the number of features of each kind. Figure 2 shows the number of features of each kind in (A) the optimal 37-feature set of the rich condition and (B) the optimal 86-feature set of the starvation condition. As can be seen from Figure 2 and Table S1, the following kinds of features play the major roles in affecting the translation rate: (1) codon usage frequency, (2) Gene Ontology (GO) enrichment scores, (3) protein features (such as amino acid composition, polarity, normalized van der Waals volume, hydrophobicity, polarizability and secondary structure) and (4) other features (such as the number of RNA binding proteins known to bind the mRNA product, coding sequence length, protein abundance and 5'UTR free energy).

Discussion

Several studies have reported that codon bias is the major determinant of translation efficiency (32-33). The more efficient codons an mRNA uses, the higher its elongation rate. Assuming the flux of ribosomes is constant, greater usage of efficient codons results in fewer ribosomes on the mRNA and thus a better allocation of ribosomes. As a result, the translation rate increases (33). Our analysis also found a strong relationship between codon bias and translation efficiency.

In a recent study by Tuller (34), it was reported that lower folding energies (which correspond to more elaborate mRNA structures) can slow down the movement of ribosomes along the mRNA. Under the assumption of a constant flux of ribosomes, the density of ribosomes is higher when ribosome velocity is lower. Folding energy therefore influences not only translation initiation but also the rate of translation elongation, and it plays an important role in determining the global translation rate. Our results are consistent with this finding: the 5'UTR free energy was found to be important for determining the translation rate. Various protein features, such as amino acid composition, polarity, normalized van der Waals volume, hydrophobicity, polarizability and secondary structure, were also very important. A possible reason for the importance of protein features in determining translation efficiency is that the translated protein can interact with the mRNA being translated and thus slow down translation elongation. Brockmann et al. found that ORF-specific translation elongation velocity depends on the amino acid composition of the corresponding protein (35).

Interestingly, our results indicate that ORFs with different functions or subcellular locations have different translation rates. ORFs with cellular response functions may have higher translation rates under extreme conditions, such as starvation, cold, heat or exposure to metal ions. ORFs that interact with RNA modification proteins or with proteins that have asparagine-tRNA ligase activity may differ in translation efficiency from others. ORFs located in the cytoplasm may have a higher translation rate than ORFs in the nucleus, since ORFs in the cytoplasm have more chance to bind ribosomes.

Conclusion

We have developed a new method to predict the translation rate by integrating various sequential and functional features. In a rigorous jackknife cross-validation test, the predictor achieved overall prediction accuracies of 68.8% and 70.0% in the rich and starvation conditions, respectively. With feature selection based on the mRMR method and the IFS procedure, we found that the following features played the major roles in determining the translation rate: (1) codon usage frequency, (2) Gene Ontology (GO) enrichment scores, (3) protein features (such as amino acid composition, polarity, normalized van der Waals volume, hydrophobicity, polarizability and secondary structure) and (4) other features (such as the number of RNA binding proteins known to bind the mRNA product, coding sequence length, protein abundance and 5'UTR free energy). These findings might provide useful information for understanding the mechanisms of translation. Our translation rate prediction model might also become a high-throughput tool for large-scale annotation of the translation rates of mRNAs.

Figure Legends

Figure 1 - The IFS curves of translation rate prediction in rich and starvation conditions.

The IFS curves for (A) the translation rate prediction model of the rich condition, which achieved its peak accuracy of 68.8% with 37 features, and (B) the translation rate prediction model of the starvation condition, which achieved its highest accuracy of 70.0% with 86 features.

Figure 2 - The number of features of each kind in the optimal feature sets.

The number of features of each kind in (A) the optimal 37-feature set of the rich condition and (B) the optimal 86-feature set of the starvation condition.
