Query Optimization In Heterogeneous Database System Computer Science Essay


Queries involving more than one database require global query optimization to achieve good overall system performance; MDBS query optimization is used for this purpose. The significant differences between an MDBS, which executes queries over a heterogeneous distributed database (HDDB), and a traditional distributed database system (DDBS) make query optimization in a heterogeneous DBMS more challenging than in a traditional DBMS. Several global query optimization techniques suitable for an MDBS have been suggested, such as semantic query optimization, query optimization via probing queries, parametric query optimization, and adaptive query optimization. Caching of query results is implemented in the architecture to improve performance.

[ Keywords: Query optimization, Query Execution Plan, Heterogeneous databases ]


Improvements in web technology have significantly increased the need to access thousands of heterogeneous databases on various platforms, since these can be made immediately available to many users through existing computer networks. Query processing in multidatabase systems requires particular study because of the autonomy and heterogeneity of the component databases. The effects of local autonomy are manifested in both the optimization and the execution aspects of query processing. First, local information needed for global query optimization, such as local cost formulas and database statistics, is typically not available to the multidatabase query optimizer (MQO). Second, the MDBS has neither direct control over subqueries executed at local database systems (LDBSs) nor direct access to the internal data structures of the LDBSs. Hence, the MQO cannot predict the execution time of a query submitted to an LDBS.

Consequently, commonly recognized optimization solutions for global query optimization in traditional distributed database systems are not appropriate for an MDBS. When working with an integrated schema, queries over that schema may be translated into one set of multidatabase queries or another, taking into account information about data replication and about data cached in the integrated schema. In a multidatabase environment, the rules used for semantic query optimization (SQO) can be classified into three types: global, inter-schema, and local. Local rules are constraints that hold on local component databases and are used to optimize global subqueries that can be executed in local databases; automatic acquisition of such rules has been studied extensively. Inter-schema rules identify relationships between local databases and can reduce the cost of data transformation by avoiding unnecessary data retrieval. Global rules are defined virtually on the global schema and are induced from local rules.[3] These rules must satisfy all their corresponding source constraints in the local databases and are used for global semantic query transformation.
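As an illustration, local rules can be applied to drop redundant predicates from a global subquery before it is shipped to an LDBS. The sketch below is a toy Python model, assuming rules are recorded as predicate implications; the predicate strings are hypothetical:

```python
def simplify(predicates, implications):
    """Drop query predicates already guaranteed by known integrity rules.

    predicates:   set of predicate strings in the query.
    implications: dict mapping a predicate to the set of predicates a
                  local or global rule says it implies.
    """
    implied = set()
    for p in predicates:
        implied |= implications.get(p, set())
    # Any predicate implied by another one in the query is redundant.
    return {p for p in predicates if p not in implied}

# Hypothetical local rule at DB1: every manager row satisfies salary > 20000.
rules = {"role = 'manager'": {"salary > 20000"}}
query = {"role = 'manager'", "salary > 20000", "dept = 'R&D'"}
optimized = simplify(query, rules)
```

The salary predicate is removed because the rule makes it redundant, so the LDBS evaluates one condition fewer.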

The purpose of global semantic query transformation is to identify redundancies and inconsistencies in the specification of the global query. Global rules can be expected to be small in number and simple in form, and so are easy and cost-effective to apply. To the application, the heterogeneous distributed database system appears as a single, local database: the local database server hides the distribution and heterogeneity of the data. One database server accesses the other database system using Heterogeneous Services in conjunction with an agent. If a database store is accessed through a Transparent Gateway, the agent is a system-specific application.[2]

The schemas of the LDBSs in a heterogeneous database may differ in various ways, with the same information represented differently. Global rules should satisfy all their corresponding source constraints in the LDBSs, so the induction of global rules on the global schema must consider the effect of conflicts between these schemas. To induce global rules, a three-step approach is used. In the first step, local rules are mapped into global rules according to the schema conflicts and the metadata in the dictionary. In the second step, consistency is checked and contradictory rules are eliminated. In the last step, the final global rule set is inferred by deleting redundant rules.
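The three steps can be sketched as follows; the mapping table and the pairwise contradiction test are placeholders for the real schema-conflict resolution and consistency check:

```python
def induce_global_rules(local_rules, mapping, contradicts):
    # Step 1: map local rules into the global schema's vocabulary.
    mapped = [mapping.get(r, r) for r in local_rules]
    # Step 2: eliminate rules that contradict another mapped rule.
    consistent = [r for r in mapped
                  if not any(contradicts(r, s) for s in mapped if s != r)]
    # Step 3: delete redundant (duplicate) rules, preserving order.
    final, seen = [], set()
    for r in consistent:
        if r not in seen:
            seen.add(r)
            final.append(r)
    return final

# Hypothetical example: "wage" at one site maps to the global "salary".
mapping = {"wage > 0": "salary > 0"}
conflicts = {frozenset({"salary > 0", "salary <= 0"})}
contradicts = lambda a, b: frozenset({a, b}) in conflicts
rules = induce_global_rules(
    ["wage > 0", "salary > 0", "salary <= 0", "qty >= 0"],
    mapping, contradicts)
```

In this toy run both conflicting salary rules are discarded in step 2, leaving only the uncontested constraint.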


A global query is a query on the federated schema that accesses data controlled by more than one LDBS. It is submitted to the front-end system of the MDBS, which coordinates its execution by interacting with the LDBSs; the global query can then be decomposed. More formally, a global query may be represented as a quadruple (RQ, ClQ, CgQ, AQ), where RQ is the set of relations of the federated schema involved in the query, ClQ is a set of local conditions, i.e. each involving only attributes from a single site, CgQ is a set of global conditions, i.e. each involving attributes from several sites, and AQ is the set of attributes that appear in the result of the query.[9]
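The quadruple can be modelled directly as a data structure; the relation and attribute names below are hypothetical:

```python
from dataclasses import dataclass

@dataclass
class GlobalQuery:
    relations: frozenset      # R_Q: federated-schema relations in the query
    local_conds: frozenset    # Cl_Q: conditions on attributes of one site
    global_conds: frozenset   # Cg_Q: conditions spanning several sites
    result_attrs: frozenset   # A_Q: attributes in the query result

q = GlobalQuery(
    relations=frozenset({"db1.emp", "db2.dept"}),
    local_conds=frozenset({"db1.emp.salary > 50000"}),
    global_conds=frozenset({"db1.emp.dept_id = db2.dept.id"}),
    result_attrs=frozenset({"db1.emp.name", "db2.dept.name"}),
)
```

Local conditions can be pushed down to their single site, while global conditions must be evaluated by federated joins at the front end.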


The global query optimization process is decomposed into two steps: in the first step, a global query is decomposed into a set of subqueries; in the second step, the results returned by the subqueries are combined through a set of federated join operations. At the global level, the execution process of a global query is specified by a query execution plan (QEP) that defines a partial order on the execution of the federated join operations. Thus, the global query optimization problem consists of selecting the best QEP, i.e. the best partial order of the federated join operations. More formally, the global optimization problem may be formulated as follows.

Given a global query Q, let XQ be the execution space consisting of all possible execution plans P for Q, and let C(P) be a cost function that associates an execution cost with each P. The global query optimization problem is then to find in the search space XQ the execution plan(s) with the least execution cost, i.e. the plan(s) P for which C(P) = min { C(P') : P' in XQ }.
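Under this formulation, a brute-force optimizer simply enumerates XQ and keeps the cheapest plan. The sketch below does this over the join orders of three subquery results, using an invented size-based cost function purely for illustration:

```python
from itertools import permutations

def best_plan(subquery_results, cost):
    """Exhaustively search X_Q and return the plan with least cost."""
    best, best_cost = None, float("inf")
    for plan in permutations(subquery_results):
        c = cost(plan)
        if c < best_cost:
            best, best_cost = plan, c
    return best, best_cost

# Toy cost model: sum of intermediate result sizes, joining left to right.
sizes = {"s1": 1000, "s2": 10, "s3": 100}

def cost(plan):
    total, inter = 0, sizes[plan[0]]
    for s in plan[1:]:
        inter = inter * sizes[s] // 1000   # crude join-size estimate
        total += inter
    return total

plan, c = best_plan(("s1", "s2", "s3"), cost)
```

Starting with the smallest subquery results keeps the intermediate results small, which is why the search prefers plans that join s1 last. Exhaustive search is exponential in the number of subqueries, so real optimizers prune the space heuristically.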


Multidatabase query optimization can use two different approaches:

An approach based on applying traditional distributed query optimization techniques to generate the "optimal" query execution plan;

An approach based on the idea of delegating the evaluation of the execution cost of the elementary steps of a global query execution plan to the LDBSs where the computations are performed.

Whether the first approach can be used depends on the ability to find approximate cost models for the autonomous LDBSs, since we need to compute the cost of a local query.[9]

Several techniques are available for extracting or estimating local cost models:

Calibrated cost model: A generic cost model for an LDBS is developed. A synthetically created database is used, and a set of queries is run against it to deduce the coefficients of the cost formulas for the underlying LDBS; the coefficients are mostly system-dependent factors. Instead of using cost formulas based on physical parameters, the authors developed a set of cost formulas based on logical parameters such as cardinality and selectivity. This approach has a shortcoming: it is not possible to create a calibrating database at some sites.
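As a minimal illustration of the calibration idea, assume a linear cost formula cost = c0 + c1 * cardinality and two probing runs against a synthetic database (all numbers are invented):

```python
def calibrate(observations):
    """Deduce coefficients of cost = c0 + c1 * cardinality from two
    probe queries run against a synthetic database.

    observations: [(cardinality, measured_cost), (cardinality, measured_cost)]
    """
    (n1, t1), (n2, t2) = observations
    c1 = (t2 - t1) / (n2 - n1)   # marginal cost per row (system-dependent)
    c0 = t1 - c1 * n1            # fixed overhead (system-dependent)
    return c0, c1

# Two synthetic probe runs: 1000 rows cost 12.0, 5000 rows cost 52.0.
c0, c1 = calibrate([(1000, 12.0), (5000, 52.0)])
estimate = c0 + c1 * 3000       # predicted cost for a 3000-row query
```

Real calibrating cost models use several formulas and more query classes, but the principle is the same: solve for the system-dependent coefficients from controlled runs.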

Query sampling cost model: Sample queries are used to estimate the parameters of the local cost formulas. The queries submitted to each LDBS are grouped into classes, and a sample query workload is run for each class. The costs of the sample queries are used to derive a cost formula for the queries in that class, and a statistical procedure based on multiple regression analysis is applied to estimate the coefficients of the cost formulas. The accuracy of the estimation is closely related to how queries are classified into classes.
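A least-squares fit per query class might look like the following sketch; the class name, sample sizes, and costs are all made up:

```python
def fit_class(samples):
    """Least-squares fit of cost = a + b * result_size for one query class.

    samples: list of (result_size, observed_cost) pairs from sample queries.
    """
    n = len(samples)
    sx = sum(x for x, _ in samples)
    sy = sum(y for _, y in samples)
    sxx = sum(x * x for x, _ in samples)
    sxy = sum(x * y for x, y in samples)
    b = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    a = (sy - b * sx) / n
    return a, b

# One cost formula per query class; classifying queries well is the hard
# part, since the accuracy of the estimates depends on it.
workload = {"scan": [(100, 3.0), (200, 5.0), (300, 7.0)]}
models = {cls: fit_class(s) for cls, s in workload.items()}
```

The published method uses multiple regression over several explanatory variables; a single-variable fit is shown here only to keep the sketch short.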

Extensible cost model: This approach is based on the cooperation of the LDBSs in evaluating the execution cost of a global query execution plan. The evaluation of each elementary step of a global query is distributed to the LDBSs: all LDBSs perform this evaluation in parallel and cooperate by sending each other the cost estimates of the partial execution plans. The total execution cost of a global query execution plan is thus computed incrementally through the cooperation of all sites involved in the query. Each site evaluates the cost of its part of the computation and the characteristics of the result, and eventually sends this information to the sites that may use it for further computation. All sites cooperate, but each acts autonomously. The main advantage of this approach is that it requires no assumptions about local execution models and optimization strategies and completely preserves the local autonomy of the sites.
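The incremental, cooperative costing can be sketched as a pipeline in which each site prices only its own step and forwards the running total together with the result's characteristics; the per-row costs and the fixed 0.5 selectivity below are assumptions:

```python
class Site:
    """A local site that autonomously prices its own elementary step."""

    def __init__(self, name, per_row_cost):
        self.name, self.per_row_cost = name, per_row_cost

    def evaluate_step(self, rows_in, incoming_cost):
        # The site knows only its own execution model; it adds its step
        # cost to the running total and forwards the result size.
        rows_out = rows_in // 2               # assumed 0.5 selectivity
        step_cost = rows_in * self.per_row_cost
        return rows_out, incoming_cost + step_cost

# Sites cooperate by passing the partial-plan estimate along the chain.
rows, total = 1000, 0.0
for site in [Site("db1", 0.01), Site("db2", 0.05), Site("db3", 0.02)]:
    rows, total = site.evaluate_step(rows, total)
```

No site needs the others' cost formulas, which is exactly what preserves local autonomy.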


A huge amount of data is not well structured, e.g. Web pages and other online data stores that contain textual data, pictures, video, images, and voice data. The inclusion of the Web and online data, together with structured data, as components of a multidatabase system is currently one of the most challenging research topics. Few proposals have addressed the problem of query optimization and evaluation. In traditional systems, query optimization is done at multiple levels, and one important issue concerns query optimizers that have to deal with heterogeneous federated systems of this size. A multidatabase query optimizer cannot simply construct a static optimal plan for a global query, for the following reasons.

1 Some sites may be unavailable.

2 Even the best plan can behave arbitrarily badly in the presence of unexpected data transfer delays caused by changing loads at the sites and in the network.

3 There may be several replicas at various sites, and their quality may vary; an optimizer must therefore be able to deal with this quality-of-service issue.

Another problem is deciding when objects in two different databases refer to the same real-world object (it is simply impossible to ensure uniform keys across a thousand databases). For all these reasons, the traditional static cost-based approach to global query optimization seems inappropriate, because it cannot adapt in response to changing loads, unexpected delays, or unavailability of sites. Some solutions to these problems already exist. To address the problem of unexpected data transfer delays, the query scrambling approach has been proposed. Scrambling modifies a query execution plan on the fly in response to unexpected delays that can arise from communication problems in obtaining partial results; the plan is modified when significant idling arises, so that progress can be made on other parts of the plan. Query scrambling reacts in two ways: (1) by rescheduling the query execution plan, or (2) by operator synthesis.
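The rescheduling reaction can be sketched as reordering a plan so that operators with delayed inputs run last; dependencies between operators are ignored here for brevity, and the operator names are invented:

```python
def scramble(plan, delayed):
    """Move operators whose inputs are delayed to the end of the plan,
    so the executor makes progress instead of idling on a blocked step."""
    ready = [op for op in plan if op not in delayed]
    blocked = [op for op in plan if op in delayed]
    return ready + blocked

# db2 is experiencing an unexpected transfer delay, so its scan runs last.
plan = ["scan(db1)", "scan(db2)", "scan(db3)"]
rescheduled = scramble(plan, {"scan(db2)"})
```

A real scrambler also tracks inter-operator dependencies and, failing that, synthesizes new operators (reaction 2) to keep the pipeline busy.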


1 Caching the Intermediate Results for Query Optimization:

MQO performance can be improved by caching the result data of different queries. Processing multiple queries together and storing partial results for future needs, as well as caching intermediate or final results of previous queries and using them to answer future ones, can greatly reduce response time.[10]

It is worthwhile to cache some data in the extension of the knowledge base in order to avoid accessing the underlying databases each time a user formulates a query; the communication cost of transferring intermediate results among the nodes and of the final reconstruction of the answer can then be avoided. Therefore, in a multidatabase system implemented with a client/server architecture that caches some data for the integrated schema, the query processor has to take into account both the replicated data in the underlying databases and the data cached at the integrated-schema node in order to perform optimizations such as semantic query optimization. Once the underlying databases have to be accessed, more data can be brought back than is needed to answer the query: the more data is brought and cached, the more queries can be answered from the cache, but the larger the cache memory must be. The user may be asked to cooperate with the system and indicate whether he is interested in bringing more information, depending on whether he is going to ask for that data later. The figure shows the architecture for a multidatabase query optimizer that executes the query and creates a materialized view from the result in order to keep the resulting data in the result cache.
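A minimal result cache in the spirit of this architecture; the materialized view is represented by a plain dictionary, and the fetch function stands in for the underlying LDBS access:

```python
class ResultCache:
    """Caches query results so repeated global queries skip the LDBSs."""

    def __init__(self):
        self._store = {}
        self.hits = self.misses = 0

    def get(self, query, compute):
        if query in self._store:
            self.hits += 1              # answered from the cache
            return self._store[query]
        self.misses += 1
        result = compute(query)         # go to the underlying databases
        self._store[query] = result     # materialize for future queries
        return result

cache = ResultCache()
fetch = lambda q: f"rows for {q}"       # stand-in for LDBS access
cache.get("SELECT * FROM emp", fetch)   # miss: fetched and cached
cache.get("SELECT * FROM emp", fetch)   # hit: no LDBS communication
```

A production cache would also bound its memory and invalidate entries when the underlying databases change, which this sketch omits.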

Multidatabase query optimizer with a multi-agent architecture:

It is quite evident that the performance of a DDBS is critically dependent upon the ability of the query optimization algorithm to derive efficient query processing strategies.[1] A number of query execution plan techniques exist for DDBs, such as row blocking, multi-cast optimization, multi-threaded execution, joins with horizontal partitioning, semi-joins, and top-n queries.

Layers of distributed query optimization

DDBMS query optimization algorithms attempt to reduce the quantity of data transferred; minimizing this quantity is a desirable optimization criterion. Distributed query optimization faces several problems related to the cost model, the larger set of queries, the optimization cost, and the optimization interval. The goal of DQP is to execute such queries as efficiently as possible in order to minimize the response time that users must wait for answers or the time application programs are delayed, to minimize the total communication costs associated with a query, and to improve throughput via parallel processing, sharing of data and equipment, and modular expansion of data management capacity. In addition, when redundant data is maintained, one also achieves increased data reliability and improved response time. In this work we propose a multi-agent architecture for distributed query processing. The figure shows the block diagram of a novel agent-based QEP generator for heterogeneous distributed database systems.
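Of the techniques listed, the semi-join is the classic transfer-minimizing tactic: only the join-key values are shipped between sites, and each site keeps just the rows that will actually join. A toy sketch with invented tables:

```python
def semi_join(local_rows, remote_keys, key):
    """Semi-join: the remote site ships only its join-key values, and the
    local site keeps rows whose key appears remotely, cutting transfer."""
    return [row for row in local_rows if row[key] in remote_keys]

# Hypothetical local employee table and the key set shipped from the
# remote department site.
emp = [{"id": 1, "dept": 10}, {"id": 2, "dept": 20}, {"id": 3, "dept": 10}]
dept_keys = {10}
reduced = semi_join(emp, dept_keys, "dept")
```

Only the reduced row set then travels to the joining site, instead of the whole employee table.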


Caching of query results and parallel selection of query evaluation plans in query processing over heterogeneous databases minimizes the response time and the cost of query processing, which we execute using the multidatabase query optimizer. The use of mobile agents in query optimization also reduces the communication cost of selecting the QEP for the execution of the queries.