RNA Structural-BLAST: A New RNA Structure Database Searching Service

Building on the XIOS framework, we present RNA Structural-BLAST, a new topology-based tool for indexing and comparing RNA structures. It can index any RNA structure and produce a unique structural spectral fingerprint for it, which enables fast structural comparison. It also supports fast database searching, matching, and classification of RNAs.

Keywords: RNA fingerprint; XIOS; RNA structure; Graph Matching; RNA Classification;

1. Introduction

Like proteins, RNAs perform important cellular functions, and our understanding of these functions is growing rapidly. As knowledge about RNA grows, the large-scale characterization and analysis of RNA structures and functions, namely the structural genomics of RNA, becomes increasingly important. The core aim of RNA structural genomics is to find all unique structural motifs and 3D folds, because molecular structure determines function. Current RNA function prediction is mostly based on finding conserved sequence motifs, as is done for proteins. To identify sequence motifs, multiple sequence alignments must first be generated; conserved motifs are then identified from the alignments and used to predict function. The problem is that, compared with proteins, few RNA classes are currently known, and RNA sequences that share the same structural motifs may have no detectable primary sequence similarity, which makes them impossible to align. Secondary structural information can help address the RNA function prediction problem, and it can also aid the prediction and determination of tertiary structure. However, accurately predicting RNA secondary structure from sequence information alone is not trivial. While there are many RNA secondary structure prediction programs, few of them can predict the key elements called pseudoknots. Pseudoknots are the most prevalent RNA structural motif in many RNA classes, such as self-splicing introns and telomerase. They play important roles in RNA function, such as forming the catalytic core of various ribozymes and altering gene expression by inducing ribosomal frameshifting in many viruses. Finding novel RNA secondary structures can give insight into the possible functions and roles of RNA. In addition, it can help with the design of pharmaceuticals by providing an accurate target site for drug recognition.

Our group developed a framework called XIOS, which represents an ensemble of RNA secondary structures, specifically including pseudoknots, in a single XIOS graph (Li et al). Each node of the graph represents an RNA stem, and each edge (link) represents a spatial stem-stem relationship. The XIOS graph is then converted into a minimum Depth First Search (minimum DFS) code for fast RNA structure comparison. In contrast to traditional sequence-based approaches, XIOS is topology based, which allows it to explore the RNA structure space comprehensively and efficiently. Some RNA motif databases based on graph theory are currently available, but none provides an RNA structural topology search service that includes pseudoknot topology. Additionally, techniques for efficiently identifying structural similarity between RNAs are not well developed.
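
The stem-to-graph step can be sketched in a few lines of Python. This is a minimal illustration and not the XIOS implementation: the `Stem` coordinates, the function names, and the reading of the four edge labels as eXclusive, Included, Overlapping, and Serial are all assumptions made for the example.

```python
from itertools import combinations
from typing import NamedTuple


class Stem(NamedTuple):
    """A stem as two paired half-stems: left [l1, l2] and right [r1, r2]."""
    l1: int
    l2: int
    r1: int
    r2: int


def _share_bases(p, q):
    """True if two closed intervals share at least one position."""
    return p[0] <= q[1] and q[0] <= p[1]


def relationship(a: Stem, b: Stem) -> str:
    """Classify the spatial relationship between two stems (assumed labels)."""
    if b.l1 < a.l1:
        a, b = b, a  # order the stems by their 5' start
    halves_a = [(a.l1, a.l2), (a.r1, a.r2)]
    halves_b = [(b.l1, b.l2), (b.r1, b.r2)]
    if any(_share_bases(p, q) for p in halves_a for q in halves_b):
        return "X"  # exclusive: the stems compete for the same bases
    if a.r2 < b.l1:
        return "S"  # serial: a closes before b opens
    if b.r2 < a.r1:
        return "I"  # included: b is nested inside a
    return "O"      # overlapping: the stems interleave (pseudoknot)


def build_graph(stems):
    """Return {(i, j): label} for every pair of stems with i < j."""
    return {
        (i, j): relationship(stems[i], stems[j])
        for i, j in combinations(range(len(stems)), 2)
    }


# Toy coordinates, invented for illustration only.
stems = [
    Stem(1, 5, 20, 24),    # stem 0
    Stem(10, 14, 30, 34),  # stem 1 interleaves with stem 0 (pseudoknot)
    Stem(40, 44, 60, 64),  # stem 2 follows both
    Stem(46, 49, 52, 55),  # stem 3 is nested inside stem 2
]
for pair, label in build_graph(stems).items():
    print(pair, label)
```

Because stems from different suboptimal structures can claim the same bases, the "X" label lets one graph hold an ensemble of structures rather than a single fold.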

Building on the XIOS framework, we present RNA Structural-BLAST, a topology-based tool for indexing RNA structures and comparing their structural fingerprints.

2. RNA Structural-BLAST

Traditional sequence-level BLAST is based on sequence conservation, and sequence similarity often translates into functional similarity. When sequence similarity is not high enough, we can use structure to identify related molecules. This is particularly important for RNA, where molecules with similar functions often have no detectable sequence similarity. Because molecular structure determines function, a structural-level BLAST provides an additional level of information for functional annotation that is not present in the sequence.

RNA Structural-BLAST is a topology-based indexing framework (as opposed to a shape-based indexing framework [abstract shape paper] or a sequence-based indexing framework [blast-filter, HMM-filter paper]) that can (1) identify biologically related sequences; (2) identify structural motifs with statistical significance; (3) identify additional candidate sequences by searching with those structural motifs, iterating to convergence; and (4) provide structural motifs as theoretical starting points for laboratory experiments. Our package uses the suboptimal structure prediction of UNAFOLD to generate an RNA structural fingerprint: a spectrum of all enumerated structural motifs found in an RNA structure, which captures its spatial structural information. This allows us to index the sequences and known structures of RNA families (and even novel RNA sequences of unknown function), to produce a unique structural fingerprint for each RNA molecule, including those with pseudoknots, and to build an RNA topology database. If a query RNA belongs to a given RNA family, it should share a similar structural fingerprint pattern with the other RNAs in that family. The fingerprint comparison tool therefore provides fast structural database searching and matching with a simple pattern-matching strategy. A query RNA sequence can be classified into a family by fingerprint comparison, even when it has low or no primary sequence similarity to the other members, and all structurally similar structures are retrieved. We call this approach RNA Structural-BLAST.

The RNA Structural-BLAST search rests on the assumption that different RNA molecules contain different structural motifs (a normal BLAST search is based on primary sequence similarity, whereas here the basis is structural similarity). Using a graph-theoretic approach, we have enumerated all physically possible small RNA structures or, more precisely, all of the small graphs that represent those structures (Li et al. 2008). We define these structures as structural motifs, which should be conserved within a group of RNA molecules with similar structures and functions. Structural motifs in our library range from 1 to 7 nodes (containing up to 7*6/2 = 21 edges, that is, 21 spatial stem-stem relationships), and a minimum DFS code was extracted from each motif. Instead of focusing on sequence similarity, we use the presence of common structural motifs to identify similar molecules. The library currently contains 55,728 structural motifs, organized into a tree structure that can be searched rapidly. A larger set of structural motifs could be enumerated, but only at a large cost in computation time and disk space.

Based on the RNA XIOS graph framework [Li et al. 2008], we have implemented a complete graph matching approach for fast structure-to-structure comparison. Structural motifs can be thought of as the building blocks of RNA structures: each RNA structure contains specific motifs in specific numbers. For any biological RNA, we use a XIOS graph representing the ensemble of suboptimal secondary structures (predicted by UNAFOLD) as a query against the search tree. The XIOS graph is used to enumerate the types and counts of structural motifs in the query, producing a 55,728-element structural feature vector that retains the structural features of the query RNA. We call this vector the RNA fingerprint.
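
The idea of a fingerprint as a vector of motif counts can be illustrated with a toy sketch. Several simplifications are assumed here and none of them reflects the actual implementation: motifs are limited to 3 nodes rather than 7, a brute-force canonical key stands in for the minimum DFS code, exhaustive subset enumeration stands in for the tree search, edge labels are treated as undirected, and the small motif library is derived from the example itself.

```python
from collections import Counter
from itertools import combinations, permutations


def canonical_key(nodes, edges):
    """Brute-force canonical form: the smallest relabelled edge list."""
    best = None
    for order in permutations(nodes):
        pos = {node: i for i, node in enumerate(order)}
        code = tuple(sorted(
            (min(pos[a], pos[b]), max(pos[a], pos[b]), label)
            for (a, b), label in edges.items()
        ))
        if best is None or code < best:
            best = code
    return best


def is_connected(nodes, edges):
    """Depth-first reachability over the undirected induced edges."""
    nodes = list(nodes)
    seen, stack = {nodes[0]}, [nodes[0]]
    while stack:
        cur = stack.pop()
        for a, b in edges:
            if cur in (a, b):
                nxt = b if a == cur else a
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
    return len(seen) == len(nodes)


def count_motifs(n_nodes, edges, max_size=3):
    """Count connected induced subgraphs, grouped by canonical key."""
    counts = Counter()
    for size in range(1, max_size + 1):
        for subset in combinations(range(n_nodes), size):
            members = set(subset)
            induced = {pair: label for pair, label in edges.items()
                       if pair[0] in members and pair[1] in members}
            if is_connected(subset, induced):
                counts[(size, canonical_key(subset, induced))] += 1
    return counts


def fingerprint(counts, library):
    """Project motif counts onto a fixed, ordered motif library."""
    return [counts.get(motif, 0) for motif in library]


# Toy graph: four stems in a chain with hypothetical edge labels.
edges = {(0, 1): "I", (1, 2): "O", (2, 3): "I"}
counts = count_motifs(4, edges)
library = sorted(counts)  # stand-in for the precomputed motif library
print(fingerprint(counts, library))  # [4, 2, 1, 2]
```

In this toy case the four vector elements count single stems, "I" pairs, "O" pairs, and three-stem "I"-"O" chains. With a fixed library, every RNA maps to a vector of the same length, which is what makes fingerprints directly comparable.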

3. Conclusion

Some observations: a) as the number of input molecules goes up, the number of structural motifs shared by all of them decreases; b) structural fingerprint patterns are very similar among RNAs from the same family (even from different biological species), because the function of a molecule is determined by its structure and the members of each RNA family fold into similar structures to perform family-specific functions; c) structural fingerprints of RNAs from different families show different spectral patterns.
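
Observation (a) follows from how a shared motif set is formed: it is the intersection of the motifs present in each molecule, so adding a molecule can only leave the set unchanged or shrink it. A minimal sketch, assuming fingerprints are stored as dictionaries from a motif identifier to a count (the identifiers and counts below are invented):

```python
from functools import reduce


def shared_motifs(fingerprints):
    """Motifs with a non-zero count in every fingerprint."""
    present = [{m for m, c in fp.items() if c > 0} for fp in fingerprints]
    return reduce(set.intersection, present) if present else set()


# Hypothetical fingerprints for three molecules.
fps = [
    {"m1": 3, "m2": 1, "m3": 2},
    {"m1": 1, "m3": 4},
    {"m3": 1, "m4": 2},
]

# Size of the shared set as molecules are added one at a time.
sizes = [len(shared_motifs(fps[:k])) for k in range(1, len(fps) + 1)]
print(sizes)  # [3, 2, 1]
```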

The current structural motif database was constructed from existing RNA family databases. Biological RNA sequence data were downloaded from the Rfam database “seed” alignments, and RNA secondary structure data were collected from the STRAND database and the RNase P database. RNA structural fingerprints were generated for all structures. More biological data are being converted into our RNA structural fingerprint database.

4. Discussion

We are continuously retrieving RNA sequence and structure information from available public databases (NCBI, Rfam, the RNase P database, etc.) and building a comprehensive RNA fingerprint database containing the structural features, sequence, and function of each entry. Query RNAs can be matched against this database of biological RNA fingerprints, using a simple vector distance to identify biological molecules that share structural similarity. The sequence and function information of the database hits is available to aid in assigning function to novel queries. Search results can suggest new functional hypotheses for novel RNAs, give insight into structure-function relationships, and aid in designing functional RNA molecules.
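
A fingerprint search of this kind can be sketched as a ranked nearest-neighbour lookup. The text does not specify which vector distance is used, so Euclidean distance is assumed here; the record layout, entry names, and annotations are hypothetical, and fingerprints are stored as sparse dictionaries for brevity.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two sparse motif-count vectors."""
    keys = set(u) | set(v)
    return math.sqrt(sum((u.get(k, 0) - v.get(k, 0)) ** 2 for k in keys))


def search(query, database, top=3):
    """Return the closest entries as (distance, name, function) tuples."""
    hits = [
        (euclidean(query, record["fingerprint"]), name, record["function"])
        for name, record in database.items()
    ]
    hits.sort(key=lambda hit: (hit[0], hit[1]))  # ties broken by name
    return hits[:top]


# Toy database; every value here is invented for illustration.
database = {
    "rna_a": {"fingerprint": {"m1": 4, "m2": 2}, "function": "annotation A"},
    "rna_b": {"fingerprint": {"m1": 1, "m3": 5}, "function": "annotation B"},
    "rna_c": {"fingerprint": {"m1": 4, "m2": 1, "m4": 2},
              "function": "annotation C"},
}

query = {"m1": 4, "m2": 1}
for distance, name, function in search(query, database):
    print(f"{name}\t{distance:.3f}\t{function}")
```

Returning the annotation alongside each hit mirrors the way database hits are meant to guide functional assignment for a novel query.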

In contrast to traditional sequence-based approaches, XIOS is topology based, which allows it to explore the RNA structure space comprehensively and efficiently. Currently, no database provides an RNA structural topology search service that includes pseudoknot topology, and techniques for efficiently identifying structural similarity between RNAs are not well developed. This work allows us to index all RNA families (and even novel RNA sequences), specifically including those with pseudoknots, and to build an RNA topology database. A query RNA sequence can be classified into a family by fingerprint comparison, even with low or no sequence similarity, and all structurally similar structures can be retrieved with a simple pattern-matching strategy. Rfam RNA family classification and RNA structural and functional prediction would benefit greatly from this new approach.

In the pharmaceutical industry, the design of inhibitory or therapeutic RNAs often begins with randomly generated RNA sequences of specific lengths, which are then tested for specific functions. This random approach is both time consuming and costly because of the combinatorial search space. With the help of our structural-level BLAST service, we can identify motifs from a relevant biological sequence pool that are likely to perform the desired functions (based on similarity to known molecules). For researchers, this significantly narrows the search space for functional RNA molecules of therapeutic use and saves time and expense.

Overall, by exploiting molecular structure, our new application could benefit biologists in many ways. Our web service is freely accessible to the public at http://xios.genomics.purdue.edu.

5. Future Directions

Classifying RNA structural fingerprints is a complicated problem. We are currently working on it using a small-scale biological dataset (11 RNA families with and without pseudoknots, more than 2,000 sequences) and a range of feature selection/extraction and classification methods, such as PCA, SVM, cosine-distance clustering, simple machine learning methods (e.g., Naive Bayes), and semi-supervised statistical methods (e.g., POCRE). After determining the best approach to classification, we will extend this framework to larger datasets covering all types of RNA molecules and functions. One of our future directions is to find the "sequences to structure N to 1 mapping" as well as the "structures to functions N to N mapping".
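
As a point of reference for such comparisons, one very simple baseline is a nearest-centroid classifier with cosine similarity. The sketch below is only an illustration of that baseline, not the method selected for this work; the family labels and the four-element vectors are invented toy data.

```python
import math
from collections import defaultdict


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def family_centroids(training):
    """Average the fingerprints of each family, element by element."""
    grouped = defaultdict(list)
    for family, vector in training:
        grouped[family].append(vector)
    return {
        family: [sum(column) / len(vectors) for column in zip(*vectors)]
        for family, vectors in grouped.items()
    }


def classify(vector, centroids):
    """Assign the family whose centroid is most similar to the vector."""
    return max(centroids,
               key=lambda family: cosine_similarity(vector, centroids[family]))


# Toy training set: (family label, fingerprint).
training = [
    ("family_1", [4, 2, 0, 0]),
    ("family_1", [5, 1, 0, 1]),
    ("family_2", [0, 1, 6, 3]),
    ("family_2", [1, 0, 5, 4]),
]
centroids = family_centroids(training)
print(classify([3, 2, 0, 0], centroids))  # family_1
print(classify([0, 0, 4, 4], centroids))  # family_2
```

Cosine similarity compares the relative proportions of motifs rather than their absolute counts, which may matter when RNAs of different lengths are compared.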

Another future direction is to filter the RNA structural fingerprint data to remove background noise and redundant or overlapping motifs, in order to reduce the effect of such factors on classification. Solutions of this kind exist in other areas, such as remote sensing in Geographic Information Systems. We believe that, with some modifications, they could be applied to the RNA structural fingerprint system.