Data Integration in bioinformatics


Data Integration

The volume of biological data is growing rapidly. These data come in a wide variety of formats and structures, serve different functions, are reached through different access methods, and are stored in flat files and in relational or object-oriented databases.

Many important bioinformatics databases have now been created.

A central issue in bioinformatics research is how to unify these heterogeneous data sources in order to facilitate data access, management and analysis.

Data integration solutions give scientists an integrated view of heterogeneous data sources, allowing them to combine the retrieved data and work with it using advanced analysis and visualization tools [1].

Bioinformatics is a diverse field with many types of data and data analysis. Hundreds of bioinformatics databases now exist, and they differ greatly in the data they contain and in the functions they serve.

Biological data sources can be classified into six categories based on their functions [2]:

  1. Sequence databases
  2. Functional genomics databases
  3. Protein-protein interaction databases
  4. Pathway databases
  5. Structure databases
  6. Annotation databases

For instance, the major nucleic acid sequence databases are GenBank [3], the European Molecular Biology Laboratory (EMBL) database [4], and the DNA Data Bank of Japan (DDBJ) [5].

Biological data sources can also be classified by species of interest [6].

The owners of bioinformatics databases usually develop their own systems to provide query and processing services for the data in their domain. For example, NCBI developed the Entrez [7] query system, which is used with GenBank, and EMBL developed SRS [8].

The problem is that each of these databases contains a different kind of biological knowledge, so each can answer questions only within its own domain.

The key question is how to share these heterogeneous and voluminous databases and offer users a common query platform.

Data integration aims to reduce these problems by providing a uniform interface to the underlying biological data sources (DSs). A data integration system may need to find the data sources that are relevant to a user query, divide the query into sub-queries, and combine the retrieved results.

Data Integration in bioinformatics

The goal of data integration is to provide uniform access to a set of autonomous and possibly heterogeneous data sources in a particular application domain [9].

There are three major approaches to biological data integration: the data warehousing approach, the link integration approach and the view-based integration approach [10]. In addition, Goble and Stevens [11] describe two further approaches: the web service approach and the semantic approach.

  • Data Warehouse integration:

In the data warehouse approach, all accessible data are fetched from the different source databases, transformed, and imported into a local database (the data warehouse). All queries are then executed on the data held in the warehouse rather than on the original sources. This approach focuses on data translation [12].

The first step in data warehousing is to develop a unified data model that can accommodate all the data held in the disparate source databases. The next step is to develop a collection of programs that fetch the data from the source databases, convert them to the unified data model through data mapping, and then load them into the warehouse, where they are stored locally.
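The two steps above can be sketched in a few lines. This is only an illustration under stated assumptions, not the loader of any real warehouse: the two source record layouts, the `entry` table, and all function names are invented for the example, and an in-memory SQLite database stands in for the warehouse.

```python
import sqlite3

# Hypothetical records, as two differently structured sources might deliver them.
SOURCE_A = [{"acc": "A0001", "organism": "Homo sapiens", "seq": "ATGGCC"}]
SOURCE_B = [("B0001", "homo sapiens", "atgaaa"), ("B0002", "Mus musculus", "ttgaca")]


def from_source_a(record):
    """Map a source-A dictionary onto the unified (accession, species, sequence) model."""
    return (record["acc"], record["organism"].capitalize(), record["seq"].upper())


def from_source_b(record):
    """Map a source-B tuple onto the same unified model."""
    accession, species, sequence = record
    return (accession, species.capitalize(), sequence.upper())


def build_warehouse():
    """Create the local warehouse and load the transformed records into it."""
    warehouse = sqlite3.connect(":memory:")
    warehouse.execute(
        "CREATE TABLE entry (accession TEXT PRIMARY KEY, species TEXT, sequence TEXT)"
    )
    rows = [from_source_a(r) for r in SOURCE_A] + [from_source_b(r) for r in SOURCE_B]
    warehouse.executemany("INSERT INTO entry VALUES (?, ?, ?)", rows)
    warehouse.commit()
    return warehouse


def accessions_for_species(warehouse, species):
    """Answer a query from the warehouse alone; the sources are not contacted."""
    cursor = warehouse.execute(
        "SELECT accession FROM entry WHERE species = ? ORDER BY accession", (species,)
    )
    return [row[0] for row in cursor]


warehouse = build_warehouse()
print(accessions_for_species(warehouse, "Homo sapiens"))
```

Because the mapping functions normalize species names and sequence case before loading, a single query over one schema returns entries that originated in both sources.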

The data warehouse approach has several advantages. 1) The warehouse is not vulnerable to external factors such as network connectivity or maintenance of the source databases. 2) Queries can be optimized and data processed locally against a single data schema. 3) The user can filter, validate, modify and annotate the data obtained from the sources, a property that is particularly attractive in bioinformatics [13].

However, this type of data integration also poses challenges. The biggest is keeping the data warehouse up to date, and modifying the data translation is expensive. A warehousing system must also solve problems of data model transformation and integration, semantic mapping, data conversion and conflict resolution.

Representative examples of data warehousing include BioWarehouse [14], MetNetDB [15], ATLAS [16], COLUMBA [17], VINEDB [18], BIOZON [19], etc.

  • Link integration:

Link integration is probably one of the most popular and effective forms of data integration in portals and keyword indexing systems.

Different databases usually hold cross-references to one another for the same biological entity, and users can follow these hyperlinks to move from one database to another. The approach relies on ontologies and identity authorities [20].

One example of this approach is SRS (Sequence Retrieval System), which was originally designed as a keyword indexing and search system for biological databases. SRS links related data entries across databases. Annotations from different databases can cooperate to create dependable linking rules, although this is not a strict requirement. Link integration has been adopted by many biological databases because it is simple to implement and because cooperation among database maintainers is increasing [21].

Unfortunately, link integration has some problems. First, it depends on stable links. Second, links can be ambiguous and can go out of date when entries are updated. Third, it cannot provide powerful query interfaces beyond the keyword search available in the external databases.

Other representative examples of link integration include Entrez and Integr8 [22].
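Link following, and the stable-link problem noted above, can be made concrete with a small sketch. The two dictionaries stand in for independent databases; their names, identifiers and the `xrefs` field are invented for the illustration and do not reflect the record layout of SRS, Entrez or any other real system.

```python
# Two hypothetical databases, each keyed by its own identifiers.
SEQUENCE_DB = {
    "SEQ1": {"description": "kinase gene", "xrefs": [("protein", "P1")]},
    "SEQ2": {"description": "transporter gene", "xrefs": [("protein", "P9")]},
}
PROTEIN_DB = {
    "P1": {"name": "example kinase", "xrefs": []},
}
DATABASES = {"sequence": SEQUENCE_DB, "protein": PROTEIN_DB}


def follow_links(database, identifier):
    """Return (resolved, broken) cross-references for one entry."""
    entry = DATABASES[database][identifier]
    resolved, broken = [], []
    for target_db, target_id in entry["xrefs"]:
        target = DATABASES[target_db].get(target_id)
        if target is None:
            # The link points at an entry that does not exist (a stale link).
            broken.append((target_db, target_id))
        else:
            resolved.append((target_db, target_id, target))
    return resolved, broken


print(follow_links("sequence", "SEQ1"))
print(follow_links("sequence", "SEQ2"))
```

The second call shows why stable identifiers matter: the cross-reference from `SEQ2` names a protein entry that the target database does not hold, so the link cannot be followed.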

  • View integration:

A view-based data integration system (VDIS) is a framework that solves the data integration problem for structured data by integrating the sources into a single unified view. It provides a virtual environment (a view) around the databases. The data remain exclusively in the data sources and are retrieved only when the system is queried, so nothing is stored locally.

Unlike data warehousing, which focuses on data translation, the view integration approach focuses on query translation.

Typically, query answering in the view-based approach proceeds as follows. First, independently of the data in the sources, the query system takes the user request and transforms it into a set of sub-queries for the relevant wrappers and external databases. Wrappers are small programs that translate local relational queries into requests understood by specific data sources and transform the results back into relations [23]. Then, the query results are combined into a single result for the user. Representative examples of the view-based integration approach include BioMart [25], DiscoveryLink [25], TAMBIS [26], K2/Kleisli [27], etc.
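The query-answering steps described above can be sketched as a tiny mediator with two wrappers. Everything here is hypothetical: the class names, the record layouts and the `pathways_for_species` function are assumptions made for the example and are not taken from BioMart, DiscoveryLink, TAMBIS or K2/Kleisli.

```python
class GeneWrapper:
    """Hypothetical wrapper around a source that stores genes as dictionaries."""

    _data = [{"gene": "G1", "species": "yeast"}, {"gene": "G2", "species": "human"}]

    def query(self, species):
        # Translate the relational-style request into a scan of the native records.
        return [{"gene": r["gene"]} for r in self._data if r["species"] == species]


class PathwayWrapper:
    """Hypothetical wrapper around a source that stores pathways as tuples."""

    _data = [("G1", "glycolysis"), ("G2", "apoptosis"), ("G1", "fermentation")]

    def query(self, genes):
        # Return rows in a common relational shape, whatever the native format is.
        return [{"gene": g, "pathway": p} for g, p in self._data if g in genes]


def pathways_for_species(species, gene_source, pathway_source):
    """Mediator: split the request into two sub-queries and combine their results."""
    genes = {row["gene"] for row in gene_source.query(species)}
    rows = pathway_source.query(genes)
    return sorted(row["pathway"] for row in rows)


print(pathways_for_species("yeast", GeneWrapper(), PathwayWrapper()))
```

No data are copied into a local store: each call reaches the sources through their wrappers at query time, which is the property that separates this approach from warehousing.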

  • Web Service technology:

With rapid advances in technology, web services for biological data repositories have emerged as a new integration approach. Web services can be regarded as a special type of view integration. A web service is a dynamic, programmatic interface: data can be accessed programmatically through web services, with the data sources acting as service providers. This approach can therefore be described as service-oriented. The service-oriented approach enables data from multiple heterogeneous sources to be integrated through computer interoperability [28].

Unlike the data warehousing and middleware approaches, which centralize data access through data translation and query translation respectively, the web service approach is decentralized: each individual data source agrees to make its data accessible via web services (WS).

Web services are based on a collection of standard protocols, including [28]:

  • XML (Extensible Markup Language), which is used to tag the data.
  • SOAP (Simple Object Access Protocol), a protocol for transferring XML-based messages over computer networks.
  • REST (Representational State Transfer), a simple architectural style in which resources are accessed using HTTP methods.
  • UDDI (Universal Description, Discovery and Integration), which is used for listing the services that are available.
  • WSDL (Web Services Description Language), which is used for describing the services available.

Web services use common transport protocols such as HTTP and SMTP.

Some representative examples of the service-oriented integration approach include:
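To show how the SOAP and REST styles listed above differ in practice, the sketch below builds the same request both ways using only the Python standard library. Nothing is sent over the network. The service namespace, the `getSequence` operation, the `accession` parameter and the `example.org` URLs are all hypothetical; only the SOAP 1.1 envelope namespace is a real identifier.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

SOAP_ENV = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace
SERVICE_NS = "http://example.org/sequence-service"  # hypothetical service namespace


def soap_request(accession):
    """Build a SOAP envelope asking the hypothetical service for one sequence."""
    envelope = ET.Element(f"{{{SOAP_ENV}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_ENV}}}Body")
    call = ET.SubElement(body, f"{{{SERVICE_NS}}}getSequence")
    ET.SubElement(call, f"{{{SERVICE_NS}}}accession").text = accession
    return ET.tostring(envelope, encoding="unicode")


def rest_request(accession):
    """Build the equivalent REST-style GET URL for the same hypothetical service."""
    return "http://example.org/sequences?" + urlencode({"accession": accession})


print(soap_request("X123"))
print(rest_request("X123"))
```

In the SOAP style the operation and its argument travel inside an XML message, whereas in the REST style the resource and its parameters are expressed in the URL and the HTTP method.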

  • BioMoby: an open-source integration system that predefines an ontology for XML schemas [30].
  • Pathway Database System: an integrated collection of software tools for modeling, storing, analyzing, visualizing and querying biological pathway data [31].
  • BLAST (Basic Local Alignment Search Tool): a family of applications, available as web services, that allows scientists to find homologues of an input sequence in DNA and protein sequence libraries [32].
  • Distributed Annotation System (DAS): open-source software that provides access to complete genome annotations through a web interface based on HTTP and XML [33].

Web service Architecture:

A web service architecture includes three components: the service provider, the service registry and the service requestor [34].

The service provider defines a service description for the web service and publishes it to a service requestor or to the service registry. In other words, it makes the service available by registering it.

The service requestor uses a find operation to retrieve the service description, either locally or from the service registry, and then uses the description to bind to the service provider and invoke, or initiate an interaction with, the web service implementation.
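The publish, find and bind interactions between the three roles can be sketched in memory. This is a toy model under stated assumptions: the `ServiceRegistry` class, the shape of the service description and the `reverse_complement` service are invented for the example, and a plain Python function stands in for a remote endpoint.

```python
class ServiceRegistry:
    """In-memory stand-in for a service registry."""

    def __init__(self):
        self._descriptions = {}

    def publish(self, name, description):
        """Called by a provider to make a service description discoverable."""
        self._descriptions[name] = description

    def find(self, name):
        """Called by a requestor to retrieve a published description."""
        return self._descriptions[name]


def reverse_complement(sequence):
    """The operation offered by the hypothetical provider."""
    pairs = {"A": "T", "T": "A", "C": "G", "G": "C"}
    return "".join(pairs[base] for base in reversed(sequence))


# Provider: describe the service and publish the description.
registry = ServiceRegistry()
registry.publish(
    "reverse_complement",
    {"inputs": ["sequence"], "endpoint": reverse_complement},
)

# Requestor: find the description, bind to the endpoint, and invoke it.
description = registry.find("reverse_complement")
service = description["endpoint"]
result = service("ATGC")
print(result)
```

The requestor never refers to the provider directly; it learns how to reach the service only from the description it retrieves from the registry.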

The figure below shows the web service architecture [35].

[Figure: Web service architecture]

  • Semantic integration:

Most biological databases are developed for human readers, but we also need a way for computers to understand biological data sources and process them unambiguously.

The semantic web provides a machine-readable way of representing data and supporting interoperability.

The semantic web makes it possible to query and connect different databases available on the Internet. It describes data to computers so that the data can be exchanged, using several standards, including [36]:

  • RDF (Resource Description Framework).
  • RDF Schema (RDF vocabulary Description Language).
  • OWL (Web Ontology Language).
  • SPARQL (standard web query language for RDF).

RDF provides a standard format for describing data.

RDF and OWL describe data as a series of simple statements called 'triples', each in the form of a subject, a predicate and an object.

Some examples of semantic web integration are Bio2RDF [37], YeastHub [38], HCLS [39], etc.
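The triple model described above can be illustrated without any RDF library. In this sketch plain Python tuples play the role of triples, and `None` in a pattern plays the role of a query variable, loosely in the spirit of a SPARQL pattern. The identifiers, the predicate names and both functions are made up for the example and do not come from Bio2RDF, YeastHub or HCLS.

```python
# Each statement is a (subject, predicate, object) triple. All names are made up.
TRIPLES = [
    ("gene:G1", "encodes", "protein:P1"),
    ("protein:P1", "participatesIn", "pathway:glycolysis"),
    ("gene:G2", "encodes", "protein:P2"),
]


def match(pattern, triples=TRIPLES):
    """Return the triples that match a pattern; None acts as a wildcard."""
    return [
        triple
        for triple in triples
        if all(want is None or want == got for want, got in zip(pattern, triple))
    ]


def pathways_of_gene(gene):
    """Chain two patterns: gene -> encoded protein -> pathway."""
    pathways = []
    for _, _, protein in match((gene, "encodes", None)):
        for _, _, pathway in match((protein, "participatesIn", None)):
            pathways.append(pathway)
    return pathways


print(match((None, "encodes", None)))
print(pathways_of_gene("gene:G1"))
```

Because every fact has the same three-part shape, statements that originate in different databases can be pooled and queried together, which is what makes the model attractive for integration.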