Functional Non-Protein-Coding RNAs Biology Essay

Non-coding RNA genes are genomic regions that are transcribed, but whose function is to encode not a protein but a functional RNA molecule that folds and carries out a function by itself. RNA genes are increasingly recognized as important players in cells.

The goal is to study the various experimental and computational approaches that exist to model and identify RNA genes. These include an array of computational techniques (e.g. dynamic programming algorithms, sampling algorithms, context-free grammars, machine learning, etc.). Given the envisioned Ph.D. project of the student, particular attention will be given to the prediction of microRNAs in plant genomes.


In eukaryotes, only a very small part of the DNA codes for proteins. In mammals, only 2% of the genome encodes messenger RNAs, which are transcribed and translated into proteins. The vast majority of the remaining DNA codes for short and long non-coding RNAs [17]. After extensive research to annotate the protein-coding exons of genomes, which represent the genes, attention turned to the annotation of the non-coding regions. New techniques have been developed for this purpose, such as functional analysis and comparative sequence analysis [1].

Thousands of regulatory non-protein-coding RNAs (ncRNAs) are transcribed from the human genome. These include microRNAs, small interfering RNAs (siRNAs), piwi-interacting RNAs (piRNAs) and long ncRNAs [17].

Studies on ncRNAs are important for understanding the role of the non-protein-coding part of the DNA, and we hope that they can also be used to develop a new generation of diagnostic biomarkers or therapeutic targets [17]. For example, expression profiling of a particular miRNA can accurately identify the origin of a tumor. The profiling can be done directly from blood, saliva or tissue samples. One therapeutic approach would be to artificially restore the expression of a particular miRNA. This topic has only recently opened up in the literature [17]. On the structural side, non-coding regions contain many large blocks that vary between individuals. Some of these structural variants have been linked to disease in humans [1].

Functional (non-protein-coding) RNAs lack exploitable signatures in their sequence, which makes it difficult to design detection algorithms. However, many of them have a particular secondary structure tied to a biological function. These functions are often shared between species through evolution, which allows a structure-based comparative genomics approach to reveal functional RNAs [18].

Today, deep sequencing of a transcriptome makes it possible to identify variations in ncRNA expression. Rather than searching for an alteration in a protein-coding gene, it is now possible to focus on a probable aberrant deregulation of this gene that is directly linked to particular symptoms [17].

Biology of non-coding RNAs

The different classes of ncRNAs usually differ in sequence length and in target type [17].

In eukaryotic organisms, small ncRNAs are involved in various epigenetic processes, such as transcriptional and post-transcriptional silencing, germ cell reprogramming, germline maintenance, development and differentiation, antiviral defense, transposon silencing, chromatin remodelling and X chromosome inactivation [17]. They are tissue and developmental stage specific. Their most fundamental role is the capacity to silence genes, called RNA interference (RNAi) [17]. A silenced gene is an inactivated gene. Among the ncRNA classes, the known RNAi effectors are small interfering RNAs (siRNAs), microRNAs (miRNAs) and piwi-interacting RNAs (piRNAs), which are present in both plants and animals [17]. New RNAi classes that can act next to DNA transcription sites were recently identified, including promoter-associated small RNAs (PASRs) and transcription initiation RNAs (tiRNAs) [17]. To give an idea of scale, in humans almost 2000 miRNAs, hundreds of siRNAs and millions of piRNA sequences have been identified to date. All these sequences are unique, which indicates that these RNAs have a wide range of regulatory functions facilitated by sequence-specific interactions [17]. New ones are presumably still to be discovered [17].

Long non-coding RNAs (lncRNAs) form a particular ncRNA class; they are abundantly transcribed in mammalian genomes and make up 80% of the RNA transcriptome [17]. Recent studies suggest that they mirror protein-coding genes, since they share common characteristics, such as their length (from 2 to 100 kb) or the presence of polyadenylation signals. Several thousand lncRNAs have currently been identified in mammals. Like small ncRNAs, they are tissue and developmental stage specific, and their central role is the regulation of protein-coding gene expression [17].

Implication in diseases

Because of their important role in the development and physiology of the host, dysfunctional ncRNAs lead directly to diseases [17]. In humans, studies have found aberrantly expressed miRNAs, globally down-regulated, in a long list of organs developing cancer. They are also involved in diseases of the central nervous system (e.g. the well-known schizophrenia and Alzheimer's disease), in cardiovascular diseases and in various syndromes. Considering the growing list of disease associations with ncRNAs, it is not impossible that every illness could be linked to the dysfunction or deregulation of particular ncRNAs [17]. In many cases, the fault lies in the loss of a small RNA locus. Since small RNAs act as either activators or inhibitors, their absence leads to the deregulation of the targeted proteins. When the ncRNA gene itself is intact, the problem can come from the proteins that process its transcript (e.g. Drosha and Dicer for miRNAs). In that case, the defect is lethal at the embryonic stage [17].

Biology of microRNAs

The large majority of microRNAs are 21-22 nucleotides (nt) long, but the main database, miRBase, contains microRNAs of between 16 and 30 nt. They are specialized in the post-transcriptional regulation of gene expression by targeting mRNAs. These mRNAs contain the complementary sequence that the miRNA needs to attach to them in order to block translation into protein [10]. In mammals, miRNAs control the activity of ~50% of all protein-coding genes. A single miRNA can control many genes by targeting different mRNAs [10].

The biogenesis of miRNAs works as follows: miRNAs are processed from precursor sequences, whose length varies from 40 nt to almost 1000 nt depending on the species. Precursors (pre-miRNAs) come from pri-miRNAs, which are transcribed by RNA polymerase II from independent genes or intronic regions. Pri-miRNAs fold into hairpins and are cleaved by the Drosha enzyme to give a hairpin precursor. The Dicer enzyme then cuts a duplex out of the precursor, containing the miRNA and the miRNA* (miRNA star) sequences [10]. The miRNA is located on either the 5' or the 3' arm of the hairpin [7]. In plants, Drosha and Dicer are replaced by Dicer-like 1 (DCL1) [3]. The miRNA is finally carried by the miRISC protein complex to the target mRNA. Repression is achieved by base pairing [10].

Formation and evolution of microRNAs

miRNA studies offer an opportunity to carry out evolutionary studies [4].

Plant and animal miRNAs do not share any conservation [4].

In plants, miRNA genes can form by inverted duplication events [4], which result in a sequence with perfect or near-perfect self-complementarity [3]. Another possibility is that miRNAs are formed by the accumulation of mutations within inverted repeats [3]. Over time, the mutations accumulated in the foldback arms of these new miRNA genes make them compatible with the miRNA biogenesis machinery [3]. Assuming a certain coevolution with the miRNA target(s), the miRNA can become an essential node of a new or existing regulatory network [3].

In animals, miRNAs are more often the product of spontaneous formation: they emerge from the many existing hairpins in the genome through the random acquisition of miRNA-processing characteristics and expression profiles [4]. In both plants and animals, inverted repeats in transposable elements could also form miRNAs.

A mature miRNA sequence found in different closely related species is called conserved. In contrast, non-conserved, lineage-specific sequences are considered young [4]. The majority of miRNAs are species specific, especially in plants, where the proportion of conserved miRNAs is lower than in animals [3]. Young miRNA genes are born frequently and die under selective pressure [4]. They arise by inverted duplication and can have various effects, deleterious or not, and can perturb existing regulatory networks [3]. They are usually weakly expressed and imprecisely processed, and a majority of them lack targets, which indicates neutral evolution [4]. Also, when the genomes of closely related species are compared, young miRNA genes are associated with regions of higher variability than conserved ones [4]. The nucleotide divergences arise essentially outside the miRNA-miRNA* region, that is, in the loop and the loop-distal stem. The targeting ability of the miRNA is thus preserved [3].

Finally, the expected number of miRNAs present in each species is not known. In humans, for example, this question remains controversial, with estimates varying from a few hundred to tens of thousands [6]. Furthermore, this number would not be the same between individuals, because of the regular birth and death of miRNAs.

Approaches to fold RNAs

In living organisms, RNAs fold into a specific structure that gives them their function. This structure is three-dimensional and very hard to compute. That is why a two-dimensional (secondary) structure is calculated first, giving a plan of the organization of the molecule [19]. RNA secondary structure prediction relies on topological and thermodynamic rules (stacking, stabilizing and destabilizing energies) and on temperature to find the most energetically stable structure, the minimum free energy (MFE) structure, which is taken as the optimal structure. One difficulty is that such a structure is not unique, and predicted structures sometimes lack biological reality since they do not agree with experimental studies [19]. A few suboptimal structures can be produced, but it is challenging to select the native one [15].

A structure can be folded from multiple sequences, and their number increases exponentially with sequence length [<durbinbiological>].

Covariation analysis

A few types of non-coding RNAs have a characteristic structure. One way to detect them is to find sequences that fold into that structure.

Nussinov algorithm

The Nussinov approach maximizes the number of base pairs of a given RNA sequence with a recursive dynamic programming algorithm [<durbinbiological>]. Only stable A-U, G-C and G-U pairs are accepted. The algorithm is as follows: given an RNA sequence $S$ of length $n$, a matrix $W$ is calculated, where $W_{i,j}$ is the maximal number of paired bases among all possible foldings of the subsequence $S[i \ldots j]$, so that $W_{1,n}$ is the maximum for the whole sequence. Let $d_{i,j} = 1$ if $S_i$ and $S_j$ form a complementary base pair, and 0 otherwise. The algorithm works in two stages, the matrix fill stage and the traceback stage. In the first stage, the matrix is initialized as follows:

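$$W_{i,i-1} = 0 \quad \text{for } i = 2 \text{ to } n$$

$$W_{i,i} = 0 \quad \text{for } i = 1 \text{ to } n$$

Then, the recursion starts:

$$W_{i,j} = \max \begin{cases} W_{i+1,j} \\ W_{i,j-1} \\ W_{i+1,j-1} + d_{i,j} \\ \max_{i<k<j} \left[ W_{i,k} + W_{k+1,j} \right] \end{cases} \qquad \text{for } 1 \le i < j \le n$$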

In the second stage, the structure is recovered by a traceback through the matrix $W$, beginning from $W_{1,n}$.
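The fill and traceback stages described above can be sketched in Python as follows. This is a minimal illustration rather than a reference implementation: the function names (`can_pair`, `fill_matrix`, `traceback`, `fold`) are hypothetical, indices are 0-based rather than 1-based, and, as in the description above, no minimum hairpin loop length is enforced. When several optimal structures exist, the traceback returns only one of them.

```python
# Accepted base pairs: Watson-Crick (A-U, G-C) and wobble (G-U).
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def can_pair(a, b):
    """d(i,j): 1 if the two bases form an accepted pair, 0 otherwise."""
    return 1 if (a, b) in PAIRS else 0


def fill_matrix(seq):
    """Fill stage: W[i][j] is the maximal number of pairs in seq[i..j]."""
    n = len(seq)
    # All cells start at 0, which covers the initialization W(i,i) = W(i,i-1) = 0.
    W = [[0] * n for _ in range(n)]
    for span in range(1, n):              # shortest subsequences first
        for i in range(n - span):
            j = i + span
            best = max(W[i + 1][j],       # i left unpaired
                       W[i][j - 1],       # j left unpaired
                       W[i + 1][j - 1] + can_pair(seq[i], seq[j]))  # i pairs with j
            for k in range(i + 1, j):     # bifurcation into two substructures
                best = max(best, W[i][k] + W[k + 1][j])
            W[i][j] = best
    return W


def traceback(seq, W):
    """Traceback stage: recover one optimal structure in dot-bracket notation."""
    structure = ["."] * len(seq)
    stack = [(0, len(seq) - 1)]
    while stack:
        i, j = stack.pop()
        if i >= j:
            continue
        if W[i][j] == W[i + 1][j]:
            stack.append((i + 1, j))
        elif W[i][j] == W[i][j - 1]:
            stack.append((i, j - 1))
        elif W[i][j] == W[i + 1][j - 1] + can_pair(seq[i], seq[j]):
            if can_pair(seq[i], seq[j]):
                structure[i], structure[j] = "(", ")"
            stack.append((i + 1, j - 1))
        else:
            for k in range(i + 1, j):
                if W[i][j] == W[i][k] + W[k + 1][j]:
                    stack.append((k + 1, j))
                    stack.append((i, k))
                    break
    return "".join(structure)


def fold(seq):
    """Return (maximal number of pairs, one optimal dot-bracket structure)."""
    if not seq:
        return 0, ""
    W = fill_matrix(seq)
    return W[0][len(seq) - 1], traceback(seq, W)
```

For the sequence GGGAAAUCC, `fold` reports 3 base pairs. This is the most that can be formed, since every accepted pair involves a C or a U and the sequence contains only two C's and one U.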

Zuker and Stiegler algorithm

The Zuker algorithm [19] is a method for folding a given RNA. It builds on the previous work of Salser [16] and Nussinov et al. [14]. Compared with previous folding algorithms in the literature, it includes new features, such as the reactivity of nucleotides to chemical modification or the influence of enzymes on the RNA. It is based on dynamic programming and computes the optimal minimum free energy secondary structure of a sequence $S$ of length $n$ in time $O(n^3)$. The algorithm uses a defined group of specific substructures (loops, bulges, stacks, etc.), each assigned an energy that depends on its nucleotide composition. The total energy of $S$ is the sum of the energies of all its substructures.

The recursive algorithm works by adding nucleotides one by one and finding the best structure at each step. It runs as follows: the nucleotides of the RNA molecule are numbered from 5' to 3', with $S_i$ denoting the $i$-th nucleotide for $1 \le i \le n$. The main technique is to compute two energies for each subsequence $S_{ij}$. For all pairs $i, j$ with $1 \le i < j \le n$, let $W(i,j)$ be the MFE over all possible structures formed from $S_{ij}$, and let $V(i,j)$ be the same quantity restricted to the structures in which $i$ and $j$ base pair with each other. When $i$ and $j$ cannot form a base pair, $V(i,j) = \infty$. If the distance $d = j - i$ between $i$ and $j$ satisfies $d \le 4$, then $W(i,j) = 0$; otherwise $V(i,j)$ and $W(i,j)$ are computed in terms of $V(i',j')$ and $W(i',j')$, with $i < i' < j' < j$ and $j' - i' < d$. Then, for each specified structure,

$$V(i,j) = \min \begin{cases} E(FH(i,j)) \\ \min_{i<i'<j'<j} \left\{ E(FL(i,j,i',j')) + V(i',j') \right\} \\ \min_{i+1<i'<j-2} \left\{ W(i+1,i') + W(i'+1,j-1) \right\} \end{cases} \qquad \text{for } i < j$$

Parisien and Major algorithm

Parisien and Major created MC-Fold to infer RNA secondary and 3D structures from sequence data [15]. The algorithms presented above use canonical base pairs: Watson-Crick (A-U, G-C) and wobble (G-U). MC-Fold instead describes the relationships between nucleotides in a structure with nucleotide cyclic motifs (NCMs) and a scoring function. These motifs cover all types of loops, bulges and base pairs and are stored in a database. The algorithm also accepts all non-canonical base pairs (e.g. A-A), since they contribute to the energy of the structure.


Annotating non-coding RNAs

In genetics, annotation is the process of finding the precise location, function and all other pertinent information attributable to a DNA region.

The annotation process includes comparative analysis and functional analysis. Comparative analysis performs large-scale sequence similarity comparisons on the genome sequence and identifies structural variations relative to the published reference genome, to different individual genomes of the same species, and to the reference genomes of closely related species. This analysis relies on sequence similarity and focuses essentially on repeated sequences, including segmental duplications, simple and tandem repeats, transposons (copy-paste or cut-paste sequences) and pseudogenes (dysfunctional gene copies) [1]. Other elements whose mechanisms of formation or structure are known are identified by model-based comparison techniques [1].

Functional analysis processes the raw experimental data using a signal-processing paradigm. The raw signal generated by an experiment is analysed by smoothing, thresholding and segmenting it into discrete units of initial annotation [1]. Smoothing removes signal noise. Thresholding then sets a value that separates an activation signal from an inactivation signal. Finally, segmenting defines the regions of the genome that are active or not. These steps are important to produce clean signal tracks [1] that can be superposed on the genome and are compatible with the UCSC genome browser [9]. This cleaning is essential since experiments generate a lot of short sequences, which are sometimes difficult to align to a reference genome because of sequencing errors or because they cannot be mapped uniquely in highly repeated regions. Also, the interpretation of the signal depends on the type of experiment, whether it involves transcription (DNA to RNA) or immunoprecipitation (protein binding) [1].

One difficulty in the annotation process is the relationship between sequence conservation and function. Usually, comparison between species makes it possible to identify conserved non-coding elements that are candidates for function. This is based on the principle that conservation through evolution implies the conservation of a function, which is useful for detecting functional elements [1]. Nevertheless, many conserved elements have no experimental evidence of function, and many experimentally validated elements are not conserved. This could lead to a revision of evolutionary models [1].
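The three signal-processing steps can be illustrated with a small Python sketch. It is only a toy under stated assumptions: the source does not say which smoothing method or threshold is used, so the centered moving average, the `window` and `cutoff` parameters and the function names below are all hypothetical choices made for illustration.

```python
def smooth(signal, window=3):
    """Smoothing: centered moving average (the window shrinks at the edges)."""
    half = window // 2
    smoothed = []
    for i in range(len(signal)):
        lo, hi = max(0, i - half), min(len(signal), i + half + 1)
        smoothed.append(sum(signal[lo:hi]) / (hi - lo))
    return smoothed


def threshold(signal, cutoff):
    """Thresholding: mark each position as active (True) or inactive (False)."""
    return [value >= cutoff for value in signal]


def segment(active):
    """Segmenting: merge consecutive active positions into (start, end) regions.

    Regions are half-open intervals, so `end` is one past the last active position.
    """
    regions, start = [], None
    for i, is_active in enumerate(active):
        if is_active and start is None:
            start = i
        elif not is_active and start is not None:
            regions.append((start, i))
            start = None
    if start is not None:
        regions.append((start, len(active)))
    return regions


# Toy per-position signal: one broad peak and one isolated spike (noise).
raw = [0, 0, 9, 9, 9, 0, 0, 0, 9, 0]
regions = segment(threshold(smooth(raw, window=3), cutoff=5))
```

With this toy input, the isolated spike at position 8 is averaged below the cutoff, and a single active region `(2, 5)` remains.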

Prediction of ncRNAs

There are various ways to predict non-coding RNAs, but all techniques revolve around the search for local structural motifs in a set of input RNAs, based on free energy minimization procedures. This can be done by conservation analysis, using alignments to build consensus structures, or by simple RNA folding prediction [[].

Function can be linked to the location of an ncRNA on the genome relative to a nearby protein-coding gene, since the transcription of ncRNAs affects flanking coding genes. But that is far from enough to predict function, since the effect depends on the relative distance and on the ncRNA type [12].

Most functional RNAs are known to have particular expression signals and a high GC content, and to have a sequence or structure conserved across evolution [13].

ncRNA prediction by secondary structure conservation

An efficient method for detecting functional RNAs is implemented in RNAz [18]. It takes as input an alignment of sequences, usually genome fragments, and measures the conservation of RNA secondary structure based on the consensus structure and on thermodynamic stability. This approach classifies alignments as functional RNAs or other sequences using two values: the structure conservation index (SCI) and a normalized z-score. The SCI is obtained by comparing the MFE of the consensus structure, $E_{cons}$, with the average $\bar{E}$ of the MFEs of the individual structures of the alignment: $\mathrm{SCI} = E_{cons} / \bar{E}$. A perfectly conserved secondary structure is indicated by $\mathrm{SCI} \approx 1$. The normalized z-score $z$ measures the significance of a predicted MFE value $E$ by comparison with a large set of randomly generated sequences of the same length and the same single-nucleotide or dinucleotide composition. This method rests on the fact that the MFE of a structure alone is not sufficient to detect functional RNAs. However, studies show that functional RNAs are more stable than random sequences, hence the comparison against random sequences. The mean $\mu$ and standard deviation $\sigma$ of the MFEs of these random sequences are calculated to obtain $z = (E - \mu)/\sigma$. Finally, an SVM is used to classify the aligned sequences in the SCI/z-score plane.

An advantage of this approach is that the SVM is not trained on specific structure and composition characteristics, so it does not contain specific information about particular ncRNAs. Machine learning is used here as an aid to handle the SCI and the z-score. Finally, the accuracy of this approach depends greatly on the type of ncRNA [18] and on the window and alignment lengths [13].
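The two quantities can be computed in a few lines of Python once the energies are available. The sketch below is not RNAz itself: it assumes the MFE values (in kcal/mol) have already been obtained from some folding program, the function names are hypothetical, and the use of the sample standard deviation is an assumption, since the text does not say which estimator is used.

```python
from statistics import mean, stdev


def structure_conservation_index(e_cons, individual_mfes):
    """SCI: consensus-structure MFE divided by the mean MFE of the aligned sequences."""
    return e_cons / mean(individual_mfes)


def z_score(e, random_mfes):
    """z = (E - mu) / sigma, with mu and sigma taken from MFEs of random sequences."""
    mu = mean(random_mfes)
    sigma = stdev(random_mfes)  # sample standard deviation (an assumption)
    return (e - mu) / sigma


# Hypothetical energies, for illustration only.
sci = structure_conservation_index(-20.0, [-20.0, -22.0, -18.0])
z = z_score(-30.0, [-10.0, -12.0, -8.0])
```

With these made-up numbers the SCI is 1 (the consensus structure is as stable as the individual structures on average), and the z-score is strongly negative, meaning that the sequence is far more stable than the random set.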

Prediction by combination of conservation, structure, HTS and array data

Functional RNAs can usually be distinguished from coding sequences by their generally stronger small RNA-seq expression signals, lower poly-A+ signals, and high secondary structure conservation combined with low amino acid conservation. One way to integrate all these characteristics and improve the identification of ncRNAs is therefore to use RNA secondary structure, evolutionary conservation and large amounts of expression data from tiling arrays and high-throughput sequencing [13]. Many features based on sequence, structure and expression are extracted from these data and classified with a Random Forest classifier. This method gives a very robust model. It revealed that many ncRNA candidates lie inside introns of coding genes, antisense to exons [13].

ncRNA function prediction by co-expression network

One class of ncRNAs, the long non-coding RNAs (lncRNAs), seems to evolve more quickly than protein-coding genes, which leads to poor conservation and makes them difficult to detect by genomic comparisons. Coding-non-coding co-expression (CNC) networks can instead be constructed to predict the function of lncRNAs [12]. The networks are composed of nodes, representing genes, and edges, representing expression correlations in the same direction (positive or negative). Gene functions are determined by Gene Ontology annotations. The networks are built from multiple microarray datasets obtained under similar experimental conditions, to avoid noise and maximize the probability of finding similar functions between lncRNAs. To simplify the problem, only genes that are highly differentially expressed or co-expressed in several datasets are included in the network. To measure the performance of the network, a random network satisfying a few conditions is constructed and compared with one built from a database of known lncRNA functions [12].
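As a rough illustration of how edges of such a network could be derived from expression profiles, the Python sketch below links two genes when their profiles are strongly correlated. The choice of the Pearson correlation, the 0.9 cutoff, the gene names and the function names are all assumptions made for this example; the source does not specify the correlation measure or threshold used in [12].

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant profile carries no correlation information
    return cov / sqrt(var_x * var_y)


def coexpression_edges(profiles, cutoff=0.9):
    """Return (gene1, gene2, direction) for every strongly correlated gene pair."""
    edges = []
    for g1, g2 in combinations(sorted(profiles), 2):
        r = pearson(profiles[g1], profiles[g2])
        if abs(r) >= cutoff:
            edges.append((g1, g2, "positive" if r > 0 else "negative"))
    return edges


# Hypothetical expression values across four conditions.
profiles = {
    "lncA": [1, 2, 3, 4],
    "geneB": [2, 4, 6, 8],
    "geneC": [4, 3, 2, 1],
    "geneD": [1, 3, 1, 3],
}
edges = coexpression_edges(profiles)
```

In a network of this kind, an lncRNA such as the hypothetical `lncA` would then inherit candidate functions from the Gene Ontology annotations of the coding genes it is connected to.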

ncRNA prediction by stochastic context-free grammars

Identification of functional RNAs

Prediction of targeted mRNA by ncRNAs

The classes of ncRNAs involved in post-transcriptional gene regulation, such as miRNAs and siRNAs, work by targeting mRNAs [17]. Their linear nucleotide sequence and secondary structure are as important as those of the mRNAs. One approach is therefore to identify recurring patterns (motifs) on mRNAs instead of focusing on ncRNAs [<rabani2008computational>]. The motifs studied relate to mRNA decay rate (measured by half-life), RNA-binding proteins and cellular localization. From a set of RNA sequences and structures assumed to share a motif, short candidate motifs that appear in as many inputs as possible are identified. These motifs then serve as seeds to train a probabilistic model in order to detect new ones.

MicroRNAs identification and prediction

Because detecting miRNAs by experimental techniques is expensive, computational methods have been developed. Since 2003 [11], about twenty miRNA prediction tools have been published. They make predictions in various ways, such as integrating characteristic structural features into different machine learning classifiers [8] or processing deep sequencing data [6]. In all cases, these tools rely on known miRNAs from miRBase to predict new ones, whether or not they are restricted to specific species. We present here miPred [8], one of the most popular and efficient machine-learning-based predictors. Two deep sequencing analysis pipelines created to discover miRNAs, miRDeep [6] and miRanalyzer [7], are also described.

Machine learning approach with miPred

As with other miRNA prediction tools, it is not the mature miRNA sequence itself that is predicted, but the pre-miRNA sequence. Genomes contain many stem-loop hairpin structures, and the challenge is to distinguish real pre-miRNA hairpins from false ones (pseudo hairpins). miPred uses a combination of local contiguous triplet structure-sequence composition, the MFE of the secondary structure and the P-value of a randomization test to distinguish pseudo from real pre-miRNAs. All these features are fed into a random forest machine learning algorithm.

A triplet describes the status of a nucleotide in the structure: paired, represented by a parenthesis in the secondary structure, or unpaired, represented by a dot. The structures of the adjacent nucleotides are included, giving a total of 8 possible compositions: "(((", "((.", "(..", "...", ".((", "..(", ".(." and "(.(". The middle nucleotide is then added, giving a total of 32 combinations (e.g. "A(((", "C.(."). Finally, the local contiguous triplet structure-sequence composition is calculated as the percentage of occurrence of each combination.
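A minimal Python sketch of this feature extraction is given below. It makes a few assumptions that the text does not settle: the structure is supplied in dot-bracket notation, opening and closing brackets are both treated as "paired", the first and last nucleotides are skipped because they lack one neighbour, and frequencies are returned as fractions rather than percentages. The function name `triplet_features` is hypothetical.

```python
from itertools import product


def triplet_features(seq, structure):
    """Frequencies of the 32 'middle nucleotide + structure triplet' combinations."""
    if len(seq) != len(structure):
        raise ValueError("sequence and structure must have the same length")
    # Both '(' and ')' mean "paired", so map them to a single symbol.
    states = structure.replace(")", "(")
    # 4 nucleotides x 8 structure triplets = 32 feature names such as "A(((".
    keys = [nt + "".join(t) for nt in "ACGU" for t in product("(.", repeat=3)]
    counts = dict.fromkeys(keys, 0)
    total = 0
    for i in range(1, len(seq) - 1):      # positions that have both neighbours
        key = seq[i] + states[i - 1:i + 2]
        if key in counts:                 # ignore non-ACGU symbols
            counts[key] += 1
            total += 1
    return {k: (counts[k] / total if total else 0.0) for k in keys}


# Tiny hairpin: G pairs with C, enclosing three unpaired A's.
features = triplet_features("GAAAC", "(...)")
```

For this tiny hairpin, the three inner positions give "A(..", "A..." and "A..(", each with a frequency of one third; the remaining 29 features are zero.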

The P-value of the randomization test determines whether the MFE of a given sequence is significantly different from that of randomly generated RNA sequences. A Monte Carlo randomization test is used to obtain this P-value:

1) Calculate the MFE $M$ of the original sequence.

2) Shuffle the sequence while keeping the dinucleotide distribution constant, and recompute the MFE. Repeat this step $N$ times ($N$ should be equal to 1000).

3) Compute $\text{P-value} = R/(N+1)$, where $R$ is the number of randomized sequences with an MFE $\le M$.

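The randomization test can be sketched in Python as below. Two simplifications are made and should be read as assumptions: the energy function is passed in as a parameter (`mfe`) because computing a real MFE requires a folding program, and the default shuffle only preserves the mononucleotide composition. A faithful implementation would plug in a shuffle that keeps the dinucleotide distribution constant, as the procedure requires. All function and parameter names are hypothetical.

```python
import random


def mononucleotide_shuffle(seq, rng):
    """Stand-in shuffle: keeps base composition only, NOT dinucleotide frequencies."""
    chars = list(seq)
    rng.shuffle(chars)
    return "".join(chars)


def randomization_p_value(seq, mfe, shuffle=mononucleotide_shuffle, n=1000, seed=0):
    """Monte Carlo P-value R / (N + 1).

    mfe     -- callable returning the minimum free energy of a sequence
    shuffle -- callable (seq, rng) -> randomized sequence
    n       -- number of randomized sequences (N)
    """
    rng = random.Random(seed)
    m = mfe(seq)                                   # MFE of the original sequence
    r = sum(1 for _ in range(n) if mfe(shuffle(seq, rng)) <= m)
    return r / (n + 1)


def toy_energy(seq):
    """Toy energy that depends only on GC content (NOT a real MFE)."""
    return -float(seq.count("G") + seq.count("C"))


p = randomization_p_value("GGGAAAUCC", toy_energy)
```

Because the toy energy depends only on composition, every shuffled sequence ties with the original and the P-value is N/(N+1), that is, not significant. With a real MFE function, a genuine pre-miRNA is expected to be more stable than most of its shuffled versions, which gives a small P-value.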
The concept of random forests (forests of random trees) was developed by Leo Breiman [2]. A random forest is a combination of tree predictors, each trained on a random subset of features sampled independently. This classifier is able to learn only the important features and ignore irrelevant ones, which avoids a separate feature selection step (e.g. by information gain). This ability comes from randomly selecting a subset of the features, instead of all of them, during tree construction.

The efficiency of the classifier is measured by four values: sensitivity (Se) and specificity (Sp), which measure the proportion of positive and negative instances, respectively, that are correctly classified, the total prediction accuracy (ACC) and the Matthews correlation coefficient (MCC), calculated with the formulas:

$$Se = \frac{TP}{TP + FN} \qquad Sp = \frac{TN}{TN + FP}$$

$$ACC = \frac{TP + TN}{TP + TN + FP + FN}$$

$$MCC = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$

where TP denotes true positives, FP false positives, TN true negatives and FN false negatives.
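For reference, the four measures can be computed from the confusion-matrix counts with the standard definitions, as in the short Python sketch below. The function names are hypothetical, and returning 0 when a denominator is zero is a convention chosen here for illustration.

```python
from math import sqrt


def _ratio(numerator, denominator):
    """Safe division: returns 0.0 when the denominator is zero (a convention)."""
    return numerator / denominator if denominator else 0.0


def classifier_metrics(tp, fp, tn, fn):
    """Sensitivity, specificity, accuracy and Matthews correlation coefficient."""
    se = _ratio(tp, tp + fn)                       # correctly classified positives
    sp = _ratio(tn, tn + fp)                       # correctly classified negatives
    acc = _ratio(tp + tn, tp + fp + tn + fn)       # overall accuracy
    mcc = _ratio(tp * tn - fp * fn,
                 sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    return {"Se": se, "Sp": sp, "ACC": acc, "MCC": mcc}


# Example: 40 real pre-miRNAs found, 10 missed, 40 pseudo hairpins rejected, 10 accepted.
metrics = classifier_metrics(tp=40, fp=10, tn=40, fn=10)
```

In this made-up example Se, Sp and ACC are all 0.8 while the MCC is 0.6, which shows that the MCC is a stricter summary than accuracy alone.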

This approach gives good results, but the false positive rate stays too high [6]. This can be explained by the fact that only the precursor is validated, never the mature sequence itself.

Deep sequencing analysis

Non-coding RNAs, especially miRNAs, tend to have a highly variable expression due to various factors such as cell type of origin, developmental phase and environmental influences. The expression of a single miRNA varies from a few to tens of thousands of copies per cell [7]. One of the challenges here is to analyse the several gigabytes of data generated by each sequencing experiment [7]. Another is to discriminate miRNAs from other non-coding RNAs or degradation products [6].

When high-throughput sequencing data are mapped, stacks of reads (expressed sequences) appear. Expression profiles and read stacks offer a great opportunity to identify potential miRNAs. miRDeep was developed to process this information. It uses a probabilistic model of RNA biogenesis to score the compatibility of the position and frequency of sequenced small RNAs with the secondary structure of the precursor [6]. An experiment produces many degradation products. Among them are the Dicer residues, such as hairpin loops and the precursor extensions around the miRNA/miRNA* duplex. Everything except the miRNA is partially degraded. miRDeep uses the positions and frequencies of these elements as a signature to discover miRNAs. The pipeline works as follows: 1) Reads are mapped against the reference genome. 2) Precursors are extracted around the mapped reads and folded; non-hairpin sequences are discarded. 3) The precursors are submitted to the miRDeep model, which is based on the positions and frequencies of the Dicer residues mapped on the precursor. 4) Conserved miRNAs are identified in miRBase by BLAST, and the false positive rate is estimated by permuting the pairing of structures and signatures [6].

Another approach that uses Dicer products to predict miRNAs is miRanalyzer [7]. The difference from miRDeep is that reads are first mapped against various known RNAs stored in databases such as Rfam and Repbase. This reduces the number of sequenced reads to analyse and lowers the number of false positive miRNAs. In addition, instead of a probabilistic model, miRanalyzer implements a machine learning method based on a random forest built from a broad variety of features associated with nucleotide sequence, structure and energy.

The ratio of miRNA to miRNA* varies between tissues and developmental stages [10].

Identification by conservation

Such RNAs can be identified by comparing genomes. For this purpose, two genomes must be aligned to find orthologous regions that fold with the characteristics of miRNA precursors. De novo miRNAs are then found within these orthologous regions with the help of small RNA deep sequencing data and specific filters [4].

The identity of the 5'-terminal nucleotide of the miRNA influences its loading by miRISC, with an observed preference for U or A [5].