Application Of Keyword Search On Data Graphs Computer Science Essay

Web search engines have made keyword-based searching a very popular strategy. Although most important systems implemented today have full search capabilities, they still require users to have some knowledge of intricate structured query languages, along with an understanding of the underlying schema, to search for the required information. Keyword search is carried out by treating the data as a graph in which entities are the nodes and relationships are the edges. This paper addresses strategies for keyword search on graphs that reside in memory, as well as graphs that are considerably larger than memory. Keyword search strategies such as BANKS [2], BLINKS [3], and bidirectional search [4] are based on in-memory graph search. Data graphs stored in external memory, with a portion cached in main memory, are handled using a technique known as the multi-granular graph technique [1]. The results of keyword search on these two types of data graphs are compared, and the incremental expansion search algorithm, used to search graphs larger than memory, is found to perform better than the in-memory algorithms: applying in-memory algorithms to very large graphs would incur a very substantial IO cost.


The implementation of keyword search on graphs has gained a lot of interest in recent times. Relational and XML data can be interpreted as graphs in which entities are modeled as nodes and relationships as edges. The answers obtained by executing keyword-search queries are generally represented as trees that connect nodes containing closely related keywords. Each answer tree is assigned a score depending on the weights of its nodes and edges, and the top-k answers have to be retrieved.

A large amount of research has been done on applying keyword search to structured and semi-structured data. These strategies are based on keyword search carried out on graphs that reside in memory.

The approaches in this group include BANKS [2], which makes use of backward expanding search, bidirectional search [4], and BLINKS [3]. All these algorithms assume that the graph resides in memory. Keyword search has drawn attention because it provides an interface that does not require the user to write complex queries to retrieve information, nor to be familiar with the underlying schema.

If the graph contains many millions of nodes, it may not fit in the main memory of even large servers. [1] proposes a strategy for keyword search on graphs whose size is considerably larger than memory. It presents a multi-granular graph representation, in which a reduced version of the graph resides in memory while the remaining parts of the graph reside in external memory. The supernode graph is obtained by grouping nodes of the full graph into supernodes, with superedges created between supernodes that contain connected nodes. Raw data may not be structured as a graph, but many associations exist between data items; modeling these connections as a graph enables more efficient and intuitive keyword querying. This paper deals with applying keyword search to graphs residing in internal as well as external memory.


This section briefly sums up the key graph concepts dealt with in [2], as well as the two-level graph representation defined in [5].

Graph Concepts

Nodes: Every node in the graph has a corresponding set of keywords linked with it. It also has a related property called the node weight (also known as prestige), which affects the rank of answers that include the node.

Edges: The edges have weights and directions. An edge with a higher weight corresponds to a weaker degree of connection. A directed model is used to avoid answers generated through hub nodes, which are unlikely to carry any meaning. [3] presents a method of assigning edge scores depending on the in-degree and out-degree of the nodes. The search methods considered are not influenced by the particular definition of the edge scores.

Queries: A query executed for keyword search consists of a set of terms.

Answer Trees: An answer tree is a minimal directed tree such that, for every keyword, some node in the tree contains that keyword. The overall score is computed as a function of the node score and the edge score. The node score is obtained by summing the node weights of the leaves and the root. Several models have been proposed to calculate the edge score. [2] defines the edge score as the inverse of the weighted sum of the tree's edges, whereas [4] obtains the edge score by summing the lengths of the root-to-leaf paths. [3] uses the latter model, which allows efficient answering of queries, whereas the Steiner tree model is NP-hard; it also avoids generating multiple answers with the same root.
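The scoring described above can be sketched as follows. This is a minimal illustration, not code from any of the cited systems; the particular combination of node and edge scores (a simple product) is an assumption, since the papers combine the two signals in different ways.

```python
def answer_tree_score(node_weights, edge_weights, root, leaves, tree_edges):
    """Score an answer tree in the spirit of BANKS-style models.
    node_weights: dict node -> prestige; edge_weights: dict (u, v) -> weight;
    tree_edges: edges of the answer tree, as (u, v) pairs."""
    # Node score: sum of the node weights of the root and the leaves.
    node_score = node_weights[root] + sum(node_weights[l] for l in leaves)
    # Edge score: inverse of the weighted sum of the tree's edges, as in [2].
    edge_sum = sum(edge_weights[e] for e in tree_edges)
    edge_score = 1.0 / edge_sum if edge_sum > 0 else float("inf")
    # Combining the two signals by a product is one (hypothetical) choice.
    return node_score * edge_score
```

A higher node prestige or a lighter (stronger) set of connecting edges both raise the score, matching the intuition that heavier edges denote weaker connections.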

Two Level Graph Representation

The two-level graph representation proposed in [5] is defined as follows:

Supernode: With the help of a clustering algorithm, the graph is divided into many components. Each cluster in the graph is represented by a supernode, which holds a subset of the set of vertices. The nodes in this subset are known as innernodes.

Superedge: The edges connecting supernodes are known as superedges. If there exists at least one edge from an innernode of supernode s1 to an innernode of supernode s2, a superedge is created between s1 and s2.

While constructing the supernode graph, the clustering parameters are chosen so that the supernode graph fits in main memory. Each supernode is stored on disk and contains a fixed number of innernodes. In [5] the nodes and edges do not have weights, whereas for keyword search the edges do. The weight of a superedge is defined as min{edge-weight}, computed over the edges between the innernodes of the two supernodes. Using min helps obtain an upper bound on the score of any answer that is a refinement of an answer containing the superedge.
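The superedge construction with the min-weight rule can be sketched as below. This is a simplified illustration under the assumption that the clustering has already been computed and is given as a node-to-supernode mapping; the function name and interface are hypothetical.

```python
def build_supernode_graph(edges, cluster_of):
    """Build superedge weights for a two-level graph representation.
    edges: dict (u, v) -> weight over the full graph.
    cluster_of: dict node -> supernode id (output of some clustering step).
    A superedge (s1, s2) exists iff some edge crosses from s1 to s2;
    its weight is the min over all such crossing edges."""
    super_edges = {}
    for (u, v), w in edges.items():
        s1, s2 = cluster_of[u], cluster_of[v]
        if s1 == s2:
            continue  # edge internal to one supernode; no superedge needed
        key = (s1, s2)
        super_edges[key] = min(w, super_edges.get(key, float("inf")))
    return super_edges
```

Taking the minimum (rather than, say, the sum or average) is what makes scores computed on the supernode graph optimistic, which in turn yields the upper bound used to prune the search.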

Multigranular Graph Representation

The 2-phase search algorithm requires the top level of the representation to reside in memory and brings lower-level parts into memory as and when necessary. As the entire lower level is bigger than memory, parts of it are stored in a buffer (cache) of fixed size. With the memory sizes available today, a considerable amount of memory can be set aside for this cache; if related queries have been executed previously, the necessary parts of the graph may already be present during query execution.

The multi-granular graph structure is proposed to make use of the low-level nodes that happen to reside in the cache during query execution. The multi-granular graph is a hybrid: at any given time it contains both supernodes and innernodes. A supernode can be in either an expanded or an unexpanded form. If a supernode is expanded, all of its innernodes and their adjacency lists reside in the cache; if it is unexpanded, its innernodes are not present in the cache. A multi-granular graph is illustrated in Figure 1.
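The expanded/unexpanded distinction can be captured in a small data structure like the one below. This is a hypothetical sketch, not the representation from [1]: the class name, the callback for fetching innernodes from disk, and the dictionary-based cache are all assumptions made for illustration.

```python
class MultiGranularGraph:
    """Sketch of a mixed supernode/innernode representation.
    An unexpanded supernode is known only through the supernode graph;
    expanding it loads its innernodes and adjacency lists into the cache."""

    def __init__(self, super_adj, load_innernodes):
        self.super_adj = super_adj              # supernode -> superedge list
        self.load_innernodes = load_innernodes  # callback simulating disk IO
        self.expanded = {}                      # supernode -> inner adjacency

    def expand(self, s):
        # Bring the supernode's innernodes into the in-memory cache.
        if s not in self.expanded:
            self.expanded[s] = self.load_innernodes(s)

    def is_expanded(self, s):
        return s in self.expanded
```

A search algorithm running over this structure sees innernodes wherever supernodes have been expanded, and falls back to superedges (with their optimistic min weights) everywhere else.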

In-memory graph algorithms:

Many algorithms have been proposed for keyword search that assume the graph resides in memory. [2] models tuples of the database as nodes in a graph, connected by edges induced by foreign-key relationships. Answers to queries are represented as rooted trees connecting tuples that match individual keywords in the query. The answers are ranked by a combination of proximity and prestige based on inlinks, similar to the approaches used in Web search. It presents a backward expanding search algorithm for finding and ranking query results.

The first step is to locate the nodes matching the search terms. A node is relevant to a search term if it contains the term as part of an attribute value or metadata. For instance, all tuples belonging to a relation named by a keyword would be considered relevant to that keyword. An answer to a query is a rooted directed tree containing at least one node from each such set. Note that the tree may also contain nodes that are not present in any of the sets, and it is therefore a Steiner tree. A relevance score is assigned to each answer tree, and the answer trees are presented in decreasing order of relevance. Scoring combines relevance cues from edges and nodes, which provide two separate relevance measures. The backward expanding search algorithm runs multiple copies of Dijkstra's single-source shortest path algorithm, one per keyword.
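The multiple-Dijkstra idea can be sketched as below. This is a minimal illustration of backward expansion, not the full BANKS algorithm (it omits answer-tree generation and ranking, and merges the per-keyword iterators into one priority queue for brevity); the function name and interface are assumptions.

```python
import heapq

def backward_expanding_search(adj_rev, keyword_nodes):
    """Run one Dijkstra-style single-source search per keyword, growing
    backward from keyword nodes toward candidate roots.
    adj_rev: node -> list of (predecessor, edge_weight), i.e. reversed edges.
    keyword_nodes: list of node sets, one set per query keyword.
    Returns candidate answer roots: nodes that reach every keyword,
    mapped to their distance to the nearest node of each keyword."""
    dist = {}  # node -> list of distances, one slot per keyword
    heap = []
    for k, nodes in enumerate(keyword_nodes):
        for n in nodes:
            heapq.heappush(heap, (0.0, k, n))
    while heap:
        d, k, n = heapq.heappop(heap)
        slots = dist.setdefault(n, [float("inf")] * len(keyword_nodes))
        if d >= slots[k]:
            continue  # already reached keyword k at no greater cost
        slots[k] = d
        for pred, w in adj_rev.get(n, []):
            heapq.heappush(heap, (d + w, k, pred))
    # A node is a candidate root only if it reaches every keyword.
    return {n: s for n, s in dist.items() if all(x < float("inf") for x in s)}
```

Expanding backward along reversed edges is what lets the algorithm discover roots from which directed paths lead to every keyword.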

BLINKS [3] is a two-level indexing and query processing method for generating the top-k keyword search results on graphs. To reduce the index space, it divides the data graph into blocks: the two-level index stores summary information at the block level to initiate and direct the search among blocks, and more detailed information within each block to speed up the search inside it. The approach depends on two central ideas. The first is a cost-balanced strategy for controlling expansion across clusters, with a provable bound on its worst-case performance. The second is the use of indexing to support forward jumps in the search: the index tells us whether a node can reach a keyword, and the shortest distance to it, thereby removing the uncertainty and inefficiency of a step-by-step forward expansion process.
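The kind of precomputed distance information that enables such forward jumps can be sketched as follows. This is a hedged illustration, not the actual BLINKS index structure: it precomputes hop distances on an unweighted graph via BFS, whereas the real index is block-partitioned and more compact; all names here are hypothetical.

```python
from collections import deque

def build_keyword_distance_index(adj_rev, keyword_nodes):
    """Precompute, for each keyword, the hop distance from every node that
    can reach it. A lookup then answers 'can node n reach keyword w, and
    how far is it?' without step-by-step forward expansion.
    adj_rev: node -> list of predecessors (reversed edges).
    keyword_nodes: keyword -> set of nodes containing that keyword."""
    index = {}
    for kw, sources in keyword_nodes.items():
        # Multi-source BFS from all nodes holding kw, over reversed edges.
        dist = {n: 0 for n in sources}
        queue = deque(sources)
        while queue:
            n = queue.popleft()
            for pred in adj_rev.get(n, []):
                if pred not in dist:
                    dist[pred] = dist[n] + 1
                    queue.append(pred)
        index[kw] = dist
    return index
```

During a search, a node absent from `index[kw]` can be discarded immediately as a candidate root for a query containing `kw`, which is exactly the pruning power the paper attributes to indexing.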

[4] deals with the problem of efficiently extracting a small number of the "best" answer trees from the data graph. Bidirectional search is a method that improves on the backward expanding search algorithm proposed in [2] by also allowing forward search from potential roots towards the leaves. The approach creates backward iterators only for keywords that are not "frequently occurring", and instead explores forward paths to "frequent" keywords (Figure 2). This strategy combines all the single-source shortest path iterators from the backward search algorithm and uses spreading activation to prioritize the search.

The forward phase follows forward edges starting from the nodes explored by the incoming iterators.
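The frequency-based split described above can be sketched as a simple heuristic. This is an assumed formulation for illustration only: the threshold-based rule, function name, and parameters are hypothetical, and [4] actually interleaves the two directions with spreading-activation priorities rather than making a fixed up-front split.

```python
def choose_search_directions(keyword_nodes, frequency_threshold):
    """Assign each keyword a search direction: keywords matched by few
    nodes get their own backward iterators; frequent keywords are instead
    reached by forward expansion from nodes the backward search touches.
    keyword_nodes: list of node sets, one per keyword.
    Returns (backward_keyword_indices, forward_keyword_indices)."""
    backward, forward = [], []
    for k, nodes in enumerate(keyword_nodes):
        if len(nodes) <= frequency_threshold:
            backward.append(k)  # cheap to expand backward from few sources
        else:
            forward.append(k)   # too many sources; reach it going forward
    return backward, forward
```

The intuition is that a backward iterator seeded at thousands of nodes matching a frequent keyword would flood the graph, whereas a forward probe from a handful of already-discovered nodes is cheap.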

Keyword search on graphs larger than memory:

[1] puts forward a representation, the multi-granular graph technique, which combines a condensed version of the graph, always resident in main memory, with those parts of the complete graph that are currently cached. The supernode graph is produced by grouping the nodes of the entire graph into supernodes, with superedges formed between supernodes that contain connected nodes. The multi-granular graph reflects exactly which parts of the graph currently reside in memory. [1] also proposes two ways of searching multi-granular graphs by extending existing algorithms.

The first method is the iterative expansion algorithm, a multi-stage algorithm suited to multi-granular graphs. Each iteration consists of two phases: an explore phase and an expand phase. The answers produced in an iteration may contain supernodes.

In the explore phase, an in-memory algorithm is executed on the current state of the multi-granular graph; in the expand phase, supernodes appearing in the results are expanded. The keyword-search algorithm is applied repeatedly until the top-k results contain no supernodes.
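The explore/expand loop can be sketched as below. This is a hedged outline of the iterative method, assuming a graph object with hypothetical `is_supernode`/`expand` methods and a pluggable in-memory search function; the real algorithm's stopping conditions and answer model are more involved.

```python
def iterative_expansion(mg_graph, search_fn, k, max_iterations=30):
    """Multi-stage search on a multi-granular graph.
    Explore phase: run an in-memory search (search_fn) on the current state.
    Expand phase: if the top-k answers still contain supernodes, expand them
    and repeat, so the next exploration sees a finer-grained graph.
    mg_graph must support is_supernode(n) and expand(s); search_fn returns
    answers as lists of nodes, best first."""
    for _ in range(max_iterations):
        answers = search_fn(mg_graph)  # explore on the current state
        supernodes = [n for a in answers[:k] for n in a
                      if mg_graph.is_supernode(n)]
        if not supernodes:
            return answers[:k]         # top-k contain only real nodes; done
        for s in supernodes:
            mg_graph.expand(s)         # pull innernodes into the cache
    return answers[:k]                 # heuristic cutoff, as in the paper
```

The `max_iterations=30` cutoff mirrors the heuristic mentioned in the experimental evaluation, where unbounded iteration was observed to perform poorly.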

The second method uses an incremental procedure which, like the previous method, expands supernodes, but rather than restarting the search from scratch, it updates the in-memory data structures to reflect the altered state of the multi-granular graph. Hence, search work performed previously need not be repeated unless it passes through an expanded node. It should also be possible to create incremental versions of other keyword-search algorithms, such as bidirectional search [4] or the search method implemented in BLINKS [3].

Experimental evaluation and comparison

Experiments have been performed and the results compared across the various keyword-search algorithms. The total time taken to execute a query, i.e. the time to retrieve the top-k results, is broken down into CPU time and IO time.

Figure 3 illustrates the execution time taken to answer queries by the different keyword-search algorithms. For each query, the figure shows three vertical bars representing the iterative method, the incremental method, and VM-search, where VM-search is the approach of running the in-memory algorithms directly on virtual memory. The time taken to generate 10 query answers is measured in all three cases. As performance was observed to degrade sharply as the number of iterations grew, a heuristic was introduced to stop after 30 iterations. Even with this cutoff, the iterative method performs quite poorly. The incremental approach performs considerably better than both the iterative and VM-search approaches. VM-search has notably higher IO time than the incremental approach for most queries, but generally lower CPU time.

Conclusion and Future work

This paper surveys algorithms for applying keyword search to graphs. Answers to keyword queries are usually expressed as trees. Graphs with many millions of nodes may not fit in memory even on large servers; keyword search on graphs residing outside main memory is addressed by a multi-granular representation of the data. In conclusion, the incremental expansion search algorithm significantly outperforms the alternative in-memory algorithms, because using in-memory algorithms on very large graphs would incur a very substantial IO cost. Future work mainly involves developing a bidirectional search algorithm applicable to multi-granular graphs, along with clustering techniques more efficient than existing methods. Another option for storing large graphs is to partition them across the main memory of many nodes in a parallel environment.