Analysis On Database Integration Issues Of Biological Information

ABSTRACT

Data obtained from biological systems are useful and meaningful only when they are stored and processed efficiently. Different biological data sources store these data in different formats, and many of the sources are proprietary, which complicates integration. This report describes the complexities involved in storing, manipulating, retrieving and integrating biological data across many data sources. Several data integration frameworks are discussed, and the most promising one is selected and its features analyzed. However, given the complexity involved in handling biological databases, this report suggests that the best framework is yet to come, and that finding it would be a major breakthrough in the field of Bioinformatics.

TABLE OF CONTENTS

1 INTRODUCTION

2 ORDINARY DATABASES vs BIOLOGICAL DATABASES

3 COMPOSITIONS OF BIOLOGICAL DATABASES

4 PROBLEMS WITH REPRESENTATION

4.1 INTEROPERABILITY

4.2 INFLEXIBLE SCHEMA

5 ISSUES WITH FINDING AN INTEGRATED SOLUTION

5.1 DIFFERENT SCHEMAS

5.2 DATA ACCESS AND REPRESENTATION

5.3 DATA MODELS AND STANDARDS

6 EXISTING FRAMEWORKS AND TOOLS

6.1 EnsEMBL

6.2 GENOMAX

6.3 SRS

6.4 DISCOVERY LINK

6.5 OPM

7 EFFECTIVE DATA INTEGRATION SOLUTIONS

7.1 KLEISLI - A WORTHY WAY

7.2 DATA FORMAT AND MODEL

7.3 QUERY MANIPULATION IN KLEISLI

8 CONCLUSION

1 INTRODUCTION

The digitization of biological information led to the birth of Bioinformatics as a field. The availability of this information led in turn to many tools that process it and provide precise and accurate results. Ultimately, this sped up research and made it easier to obtain publishable results. The Human Genome Project is a good example: it finished well before its deadline, mainly because of the extensive use of biological databases for sequencing and mapping.

Bioinformatics is a heterogeneous environment where multiple databases and software systems work in tandem [1]. In the beginning, biological databases were created around the immediate needs of biologists and were not extensible. Many were therefore created ad hoc, without any established standards for maintaining them. As the volume of unprocessed data grew, new databases appeared in different formats and with different structures.

The heterogeneous nature of biological data and its growth in recent years make it intrinsically hard to analyze and process. Learnability is a major barrier, since users must constantly keep up with the latest information. The choice of which database to query depends on which one holds the required data. Being able to process several databases together is therefore key, and it requires an integrated database architecture to retrieve the data [2]. The problems faced in handling these databases, and the integration solutions proposed for them, are discussed here in turn.

2 ORDINARY DATABASES vs BIOLOGICAL DATABASES

Databases have been in use for a long time, and experience with processing data through effective data models has produced many best practices. The emergence of biological data sources has had a major impact on existing data models, and the need to integrate these sources has been a topic of research in recent years.

Most genetic databases store their data in a flat file format, which allows them to process that data efficiently. However, adding new data or merging with external datasets affects the entire format of the existing database. These databases therefore lack flexibility and are not extensible. In general, biological data is both deeply nested and highly sequential, which makes it hard to fit into either a relational or an object-oriented format.

3 COMPOSITIONS OF BIOLOGICAL DATABASES

Genome databases generally fall into five categories: sequence databases, map databases, model organism databases, bibliographic databases, and databases of databases. Sequence databases contain genetic sequence data. Map databases contain information on human genetic and physical maps. Model organism databases include genetic map and sequence data for experimental organisms. Bibliographic databases contain citations from journals related to genome research [3].

Biological databases comprise several genetic and protein databases. An integrated database must draw on all of the following [4]:

DNA Databases (GenBank, EMBL, DDBJ) - Databases that store different nucleotide sequences.

Protein Databases (PIR, Swiss-Prot) - Databases for amino acid sequences of proteins.

Protein Structure Databases (PDB) - A database for tertiary structures of proteins.

Genetic Databases (GDB) - A database for physical and genetic maps.

Bibliography and Annotations (MEDLINE) - A database for bibliographic information.

The main purpose of integrating all these databases is to retrieve data for bioinformatics analyses and to process relationships between data. To improve the speed and efficiency of processing, many bioinformatics tools integrate the data locally. This increases query processing speed, but the major drawback is that the local copy has to be constantly updated from the external sources. Another approach is to fetch the data from the sources at query time. This allows queries to run on up-to-date data, but results in poor query performance [5].

In practice, the second approach is implemented with web services: each source database exposes its data as a web service, which the integration tool then calls.
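The trade-off between a fast local copy and up-to-date source data can be made concrete with a small sketch. The code below is only an illustration of one possible middle ground, a local copy that is refreshed once it is older than a chosen age. The `CachedSource` class, its parameters, and the sample data are all assumptions made for this example and do not come from any of the systems discussed in this report.

```python
class CachedSource:
    """Local copy in front of a remote source, refreshed when older than max_age."""

    def __init__(self, fetch, max_age, clock):
        self._fetch = fetch        # callable: key -> value (stand-in for a web service call)
        self._max_age = max_age    # seconds a local copy is considered fresh
        self._clock = clock        # callable returning the current time in seconds
        self._store = {}           # key -> (value, time fetched)
        self.remote_calls = 0

    def get(self, key):
        cached = self._store.get(key)
        now = self._clock()
        if cached is not None and now - cached[1] <= self._max_age:
            return cached[0]       # fast path: answer from the local copy
        self.remote_calls += 1     # slow path: go back to the source
        value = self._fetch(key)
        self._store[key] = (value, now)
        return value


# Made-up remote data and a fake clock so the example is deterministic.
remote = {"X00001": "ATGCATGC"}
current_time = [0]
source = CachedSource(remote.__getitem__, max_age=60, clock=lambda: current_time[0])

first = source.get("X00001")    # fetched from the source
second = source.get("X00001")   # served from the local copy
current_time[0] = 120           # the local copy is now older than max_age
remote["X00001"] = "ATGCATGA"   # the source has been updated in the meantime
third = source.get("X00001")    # refetched, so the update becomes visible
```

A larger `max_age` favours query speed at the cost of possibly stale results, while a `max_age` of zero behaves like fetching from the source on every query.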

4 PROBLEMS WITH REPRESENTATION

The DNA sequencing techniques developed over the last two decades have led to an explosion of complete and meaningful genetic information. This has in turn led to a surge in technologies and tools for handling sequences, but an integrated framework for complete analysis is still lacking.

Some of the major drawbacks are listed here:

4.1 Interoperability:

The size of genomes varies over a wide range between organisms. The chromosomes of all organisms combined amount to trillions of base pairs, and storing them would take at least several gigabytes of computer storage space. The generated data is subjected to extensive analysis applying principles of biophysics, biochemistry and related fields, which makes it possible to derive three-dimensional structures, function-based signatures, gene annotations and gene expression signals.

The sequence data is also analyzed heavily to predict possible genes. This generates still more data to process, which widens the gap between useful data and junk raw data. Every database has its own schema for storing data, ranging from flat files to relational databases, and its datasets are associated with metadata that describes the stored data. The volume of biological data keeps increasing, and any integrated database should be able to access it in an interoperable form. This seems to be unsolvable, because new data arrives every day and the database must remain accessible to every tool that processes it. Inconsistent data and non-interoperability are the major problems [9].

4.2 Inflexible Schema:

A major issue in integrating biological databases is the need for an extensible schema through which one database can be transformed into another. Multiple mappings for a single field make data integration complicated. Without a flexible and extensible schema, a relational database becomes very complicated to use for integrated biological information [6]. To access data that do not share a common schema, a common querying system has to be developed. There must be a uniform query interface for querying multiple databases, and this querying must be supported by efficient data management in those databases. This can be achieved only with a flexible schema shared between the databases.

Every data source is characterized by the way data is stored in it; that is, its schema defines it. To produce a comprehensive result, data must be retrieved from many data sources, and most external sources store their data under a different schema. Data may be stored as a flat file, in a relational database system, as XML, as an image, or in a custom database model. An internal schema converter and a data wrapper are needed to preserve the database schemas. Data from the external sources are converted to a single global schema, and queries are run over this global schema so that results can be retrieved quickly. The data wrapper also wraps outgoing data into the specified common schema before it is shared with an external data source [7].
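As a rough illustration of the wrapper idea described above, the sketch below converts two hypothetical source records, one in a flat-file style and one in a relational-row style, into a single global schema. The field names (`ACC`, `DE`, `SQ`, `accession_no`, and so on), the `GlobalRecord` type and both wrapper functions are assumptions made for this example and are not taken from any specific database or framework.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GlobalRecord:
    """Hypothetical global schema shared by all wrapped sources."""
    accession: str
    description: str
    sequence: str


def wrap_flat_file(text: str) -> GlobalRecord:
    """Wrap a flat-file entry made of 'TAG   value' lines (assumed layout)."""
    fields = {}
    for line in text.strip().splitlines():
        tag, _, value = line.partition(" ")
        fields[tag] = value.strip()
    return GlobalRecord(
        accession=fields["ACC"],
        description=fields["DE"],
        sequence=fields["SQ"].upper(),
    )


def wrap_relational_row(row: dict) -> GlobalRecord:
    """Wrap a row from a relational source that uses other column names (assumed)."""
    return GlobalRecord(
        accession=row["accession_no"],
        description=row["gene_desc"],
        sequence=row["dna_seq"].upper(),
    )


# Made-up sample data describing the same entry in two source formats.
flat_entry = """
ACC   X00001
DE    example gene
SQ    atgcatgc
"""
row = {"accession_no": "X00001", "gene_desc": "example gene", "dna_seq": "ATGCATGC"}

# Both sources yield the same record once wrapped into the global schema.
records = [wrap_flat_file(flat_entry), wrap_relational_row(row)]
```

Queries can then be written once against `GlobalRecord`, and supporting a new source only requires writing one more wrapper function.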

5 ISSUES WITH FINDING AN INTEGRATED SOLUTION

The exponentially growing complexity of biological databases has encouraged the building of systems that integrate public and proprietary databases. Information must flow freely between these composite databases. Integrating them requires solving the following issues [8]:

5.1 Different schemas:

Different databases use different schemas to represent the same data. For example, NCBI’s UniGene database and TIGR’s gene index database represent the same gene in entirely different formats. Even lexical syntax and case-sensitivity conventions differ between databases.
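The lexical differences mentioned above can be illustrated with a minimal sketch that reduces differently spelled symbols to one canonical key before records are matched. The normalization rule, the source names and the sample symbols are all made up for this example; a real system would most likely need curated synonym tables rather than a single rule.

```python
import re


def normalize_symbol(symbol: str) -> str:
    """Reduce a gene symbol to a canonical key: upper case, no separators."""
    return re.sub(r"[\s_\-.]+", "", symbol).upper()


def group_by_symbol(entries):
    """Group (source, symbol) pairs that share the same normalized symbol."""
    groups = {}
    for source, symbol in entries:
        groups.setdefault(normalize_symbol(symbol), []).append((source, symbol))
    return groups


# Made-up entries: the same gene spelled differently in two hypothetical sources.
entries = [
    ("source_a", "Brca-1"),
    ("source_b", "BRCA1"),
    ("source_a", "tp53"),
    ("source_b", "TP_53"),
]
groups = group_by_symbol(entries)
```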

5.2 Data Access and Representation:

Any tool that accesses data from different databases receives it in different formats, so the application must be flexible enough to process each of them. Some data are retrieved as plain text and some through APIs. To handle the retrieved data, the application must know the structure of the data it is going to store.

5.3 Data Models and Standards:

Various data models are used to represent different databases, most commonly relational and object-oriented data models. Because the models differ, the nomenclature across these databases is inconsistent, which makes validating the integrated databases a challenge.

To address these issues, a local warehouse is created that stores all the required data in a specified format. The warehouse improves both efficiency and data availability. It keeps all the databases in sync, and it allows incoming data to be cleaned and unwanted data filtered out before processing. The primary operations of a data integration system therefore revolve around moving data from different sources, cleaning and processing it, and finally storing it in data warehouses.
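The move, clean, filter and store steps described above can be sketched as a small pipeline. This is a minimal illustration that uses an in-memory SQLite database as a stand-in for the warehouse. The table layout, the function names and the sample records are assumptions made for this example.

```python
import sqlite3


def extract():
    """Stand-in for pulling records from several sources (made-up data)."""
    return [
        {"accession": "X00001", "sequence": "atgc atgc", "source": "source_a"},
        {"accession": "X00002", "sequence": "", "source": "source_a"},
        {"accession": "X00001", "sequence": "ATGCATGC", "source": "source_b"},
        {"accession": "X00003", "sequence": "ggcctta", "source": "source_b"},
    ]


def clean(record):
    """Normalize the sequence: remove whitespace and use upper case."""
    cleaned = dict(record)
    cleaned["sequence"] = "".join(record["sequence"].split()).upper()
    return cleaned


def keep(record):
    """Filter out records that carry no sequence data."""
    return bool(record["sequence"])


def load(records, conn):
    """Store records in the warehouse, ignoring duplicate accessions."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS warehouse ("
        "accession TEXT PRIMARY KEY, sequence TEXT, source TEXT)"
    )
    conn.executemany(
        "INSERT OR IGNORE INTO warehouse VALUES (:accession, :sequence, :source)",
        records,
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
load([r for r in map(clean, extract()) if keep(r)], conn)
rows = conn.execute(
    "SELECT accession, sequence, source FROM warehouse ORDER BY accession"
).fetchall()
```

In this sketch the first record seen for an accession wins. A real warehouse would need an explicit policy for reconciling conflicting copies of the same entry.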

6 EXISTING FRAMEWORKS AND TOOLS

A data integration system should not only provide data; it should provide data that are semantically meaningful to the applications that use them. The frameworks below are compared to see which would best solve the interoperability issues.

6.1 EnsEMBL - A system jointly developed by the European Bioinformatics Institute and the Sanger Institute that performs automatic gene prediction and organizes raw sequence data into its internal data format. It allows users to browse genomic data, and it performs annotation and other analyses to predict genes in those sequences. The advantage of EnsEMBL lies in its flexibility in searching for a genomic sequence. Its disadvantage is that it does not have a flexible data format for storing data and processing ad hoc queries, which leads back to one of the seemingly unsolvable problems described above. EnsEMBL is free to use and can be used effectively for the particular task of searching across sequences.

6.2 GenoMax - GenoMax is an integrated system developed by InforMax. It has a well-structured data warehouse built on an Oracle database, and it focuses especially on a point solution for analyzing protein databases. It has a broader scope than other approaches but fails to retrieve data for complex queries across multiple data sources.

6.3 SRS – SRS is one of the most widely used database query and retrieval systems in the industry. It has an easy-to-use GUI and searches internal, public and other databases. SRS uses its own query language, which serves mainly as a navigation language for finding the required information across databases. Another advantage of SRS is that it allows new data sources to be added and their data formats to be merged. Its disadvantage is that it does not allow any modification of the data that produce the results, which contributes to the seemingly unsolvable problem of data inconsistency.

6.4 Discovery Link – A product developed by IBM, Discovery Link differentiates itself from other data integration products by having an explicit data model. This data model provides a common structure for data whether it is being queried, stored or processed. Like SRS, Discovery Link stores data in a relational model, but unlike SRS it can manipulate the stored data. The disadvantage of this product is that it has to convert everything into the relational data model, and the data must be processed into a normal form before it is stored, which adds complexity when storing new data. Even though it provides general solutions, having to write queries to fetch the data steepens the learning curve.

6.5 OPM - OPM is a general data integration system developed at Lawrence Berkeley National Laboratory. OPM has a more advanced data model, based on the entity-relationship model, which makes it more efficient at storing and processing highly nested data such as biological data. The nested data can be retrieved with SQL queries, which can be optimized to retrieve data across data sources. The main disadvantage of OPM is that it is not extensible: because it depends on its schema, even a simple change in the structure of a data source forces the schema to change. Another disadvantage is that data is broken down into many entities, which causes performance problems as the data source grows. Overall, OPM shares the disadvantage of the other frameworks when it comes to handling complex data.

Apart from the established frameworks above, a set of tools and frameworks is being built on XML. Because XML structures data effectively, it allows easy and fast retrieval. Many XML-based querying techniques and data stores are therefore being researched to address the integration of biological databases [2].

7 EFFECTIVE DATA INTEGRATION SOLUTIONS

Any effective data integration framework for handling biological databases must satisfy the following requirements: [9]

It must address the disadvantages of existing frameworks.

It must improve the performance of all database operations on biological databases. Fetching the required data from multiple databases should not take more time than processing every database individually.

It must evolve constantly to provide standards and benchmarks for handling complex structured data.

It must be highly extensible.

It must be interoperable between different biological data sources.

It must accommodate new sources easily. Existing frameworks are built by integrating different databases, and adding a new data source must not introduce new integration complexities into the existing framework.

7.1 KLEISLI – A Worthy Way

Kleisli is a data integration framework that has been proposed to address the issues above. It provides better ways to handle complex data, and it was successfully applied to data integration problems in the Human Genome Project. Its key aspects and features are discussed below.

Kleisli has its own data model. This data model specifies the format in which data must be stored, provides constraints on the stored data, and governs integration with external data sources. For that integration, Kleisli has its own extensible data exchange format, through which external data is brought in and stored.

7.2 Data Format and Model

The data format of Kleisli supports complex structures. It supports both flat files and the relational model, and it supports deeply nested relational structures, which makes it well suited to integrating biological data sources. To integrate data, Kleisli first converts it into a specific format using data wrappers. A wrapper is simply a parser that converts incoming data into the format stored in Kleisli. This data exchange format makes Kleisli more useful than other data integration frameworks.

Another prominent feature of Kleisli is the way it handles query processing. It has a flexible, high-level query language for manipulating data in its data model. The query language is sSQL, an extension of standard SQL that can query all kinds of heterogeneous data sources. It takes a set of input data, processes it into a specific format and outputs the requested biological sequences.

This modularity makes Kleisli extensible, so that any kind of external database can be integrated. It can handle large volumes of complex data. Scripts in Kleisli are reusable and can be modified to handle new queries. Kleisli also has a set of templates through which queries are executed on the target data sources and clear results are produced to aid biological research. A sample of the Kleisli data exchange format is [2]:

(#uid:num, #title:string, #accession:string, #feature:{(
    #name:string, #start:num, #end:num, #anno:[(
        #anno_name:string, #descr:string)])})

Here, { and } delimit sets, [ and ] delimit lists, and #L is the label for field L of a record.
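To make the nesting in this format easier to follow, the sketch below mirrors the same structure with Python data classes: a record containing a set of feature records, each of which contains a list of annotation records. This is only an analogy written for this report. The class names and sample values are assumptions, and the code is neither Kleisli syntax nor its API.

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple


@dataclass(frozen=True)
class Annotation:
    anno_name: str
    descr: str


@dataclass(frozen=True)
class Feature:
    name: str
    start: int
    end: int
    anno: Tuple[Annotation, ...]  # [ ... ] in the exchange format: an ordered list


@dataclass(frozen=True)
class Entry:
    uid: int
    title: str
    accession: str
    feature: FrozenSet[Feature]  # { ... } in the exchange format: an unordered set


# Made-up values, for illustration only.
entry = Entry(
    uid=1,
    title="example entry",
    accession="X00001",
    feature=frozenset({
        Feature("gene", 10, 200, (Annotation("note", "hypothetical gene"),)),
        Feature("exon", 10, 80, ()),
    }),
)

# A nested query: names of the features that carry at least one annotation.
annotated = sorted(f.name for f in entry.feature if f.anno)
```

A flat relational layout would need three tables and two joins to answer the same question, which is the kind of overhead a nested data model is meant to avoid.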

7.3 Query Manipulation in Kleisli

Kleisli is an efficient alternative to other integration solutions because it supports both SQL and non-SQL querying of databases. In Bioinformatics, not all data can be fetched directly through SQL queries, so there must also be a way to fetch data that is too complex to retrieve with SQL. Kleisli does this through application programming interfaces that interact with other database systems and provide the necessary details.

Kleisli uses a suite of packages that parse data into the Kleisli exchange format, which is then converted into an internal object. These APIs also make it possible to embed Kleisli in other programming languages. The presence of such APIs provides a crucial way of integrating with other available specialist frameworks.

Compared with other frameworks, then, Kleisli provides many general solutions across a range of data integration problems. XML seems to be the main contender, but XML as a framework does not yet provide some of the flexibility that biological data sources require. In addition, Kleisli requires only simple programming skills to learn and use. In general, Kleisli is relatively easy to learn compared with other integration frameworks.

8 CONCLUSION

The various problems encountered in integrating biological data sources provide many opportunities for new systems to solve these complexities and evolve into a standard framework. There is no one-size-fits-all solution to the complexities of biological data processing. The choice of a data integration system for biological databases depends directly on its simplicity and its potential to express complex data [8]. These systems should not only have meaningful syntax but also provide semantically meaningful output, so that large-scale research operations can be completed in a reasonable time. Of all the existing integration systems, Kleisli fares well as a general integration system for biological sources, but only in certain respects; it does not solve all of the problems. Many organizations still use customized integration systems that specialize in specific areas to answer their specific problems. Any new framework must strive to answer general integration issues while remaining customizable to specific needs. This also leaves room for many existing frameworks to evolve by becoming more adaptable to heterogeneous systems.

