A PageRank Algorithm for Semantic Web Search Engines



Web search engines play an important role in retrieving information for users, but the results they return often burden users with many useless pages. To overcome this limitation, the proposed system relies on the Semantic Web. Many search engines have been developed based on the Semantic Web architecture; however, in order to rank results, most of the existing solutions need to work on the whole knowledge base. The proposed method, which combines a page rank algorithm with a Semantic Web search engine, relies on the information that is extracted from user queries and annotated resources.

Index Terms- Semantic Web, knowledge retrieval, search process, query formulation.


The Semantic Web is an extension of the World Wide Web in which the meaning (semantics) of information and services on the web is defined, making it possible for the web to "understand" and satisfy the requests of people and machines to use the web content.

The major difference between the Semantic Web and the World Wide Web is that the Semantic Web is supposed to provide machine accessible meaning for its constructs whereas in the World Wide Web this meaning is provided by external mechanisms.

Semantic search seeks to improve search accuracy by understanding searcher intent and the contextual meaning of terms as they appear in the searchable dataspace, whether on the Web or within a closed system, to generate more relevant results.

PageRank is a link analysis algorithm, used by the Google search engine that assigns a numerical weighting to each element of a hyperlinked set of documents, such as the World Wide Web, with the purpose of "measuring" its relative importance within the set. The algorithm may be applied to any collection of entities with reciprocal quotations and references.
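The link-analysis idea above can be sketched in a few lines of code. The following is a minimal power-iteration sketch of PageRank on a toy link graph; the page names, damping factor, and iteration count are illustrative assumptions, not prescribed values.

```python
# Minimal PageRank sketch via power iteration.
# links: dict mapping each page to the list of pages it links to.
def pagerank(links, damping=0.85, iterations=50):
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # every page keeps a baseline (1 - d) / n of rank
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in links.items():
            if outlinks:
                # a page shares its rank evenly among its outlinks
                share = rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += damping * share
            else:
                # dangling page: distribute its rank evenly to all pages
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank

# Toy web: A links to B and C, B links to C, C links back to A.
toy_web = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(toy_web)
```

In this toy graph, C accumulates the highest rank because it receives links from both A and B, which is exactly the "relative importance within the set" the algorithm is meant to measure.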

Rather than using ranking algorithms such as Google's PageRank to predict relevancy, Semantic Search uses semantics, or the science of meaning in language, to produce highly relevant search results. In most cases, the goal is to deliver the information queried by a user rather than have a user sort through a list of loosely related keyword results.

Query formulation is an essential part of successful information retrieval. The challenges in formulating effective queries are emphasized in web information search, because the web is used by a diverse population varying in their levels of expertise.

Knowledge Retrieval is a field of study which seeks to return information in a structured form, consistent with human cognitive processes as opposed to simple lists of data items.


The objective of this paper is to show how to make use of relations in Semantic Web page annotations with the aim of generating an ordered result set, where pages that best fit the user query are displayed first.

One of the first attempts to enhance Semantic Web search engines with ranking capabilities is reported in [1]. A similarity score is defined for measuring the distance between the systematic descriptions of both the query and the retrieved resources. Similarity is then computed as the ratio between relation instances linking concepts specified in the user query and actual multiplicities of relation instances in the semantic knowledge base. This method is applied on each property individually and requires exploring all the Semantic Web instances. Moreover, the user is requested to specify all the relations of interest. Thus, since it is predictable that the number of relations will largely exceed the number of concepts [2], its applicability in real contexts is severely compromised.

Ontology-based lexical relations like synonyms, antonyms, and homonyms between keywords (but not concepts) have been used to "expand" query results. In this case, search is targeted to the Web, rather than to the Semantic Web [3].

A slightly different methodology has been exploited in SemRank [4]. The basic idea is to rank results not based on how predictable a result might be for the user, but based on how much information is conveyed by a result. To achieve this goal, the two measures "uniqueness" and "discrepancy" are defined. An additional added value of SemRank is that in the computation of the ranking, it exploits a so-called "modulative relevance model" that takes into account the particular context/purpose in/for which a query has been submitted. However, since ranking a single page requires information related to the annotations of all the remaining pages, it is reasonable to infer that the performance of the proposed solution would hardly scale to huge Semantic Web environments.

An approach based on the context of interest to the user, rather than on specific concepts or relations, is described in [5]. In this method, "universal" and "user-defined" weights are assigned to each semantic relation. In order to get accurate results, an intensive manual configuration step has to be performed, and this step cannot be valid for answering heterogeneous queries.

A totally different solution is represented by OntoLook [6]. The basic idea is that, in a graph-based representation of a Web page annotation where concepts and relations are modeled as vertices and weighted edges respectively, it becomes possible to remove less relevant concepts from the graph. The strategy behind OntoLook only allows relations among concepts that should be less relevant with respect to the user query to be identified empirically. However, the effectiveness of the approach is strongly limited by the fact that no ranking strategy is defined. PageRank can be used in conjunction with [6] to exploit relevance feedback and postprocess the result set, but the use of the remaining techniques is not feasible, since they cannot reasonably be applied to a concept-relation-based framework where the ontology predominates over pure text.

The proposed technique is not intended to replace the ranking strategies of actual search engines. Rather, it should be understood as a preprocessing step that produces a semantic-aware ordered result set to be later treated with existing techniques, in order to achieve an increased hit ratio in user query processing.


A. Relation Based Search Engine

In the proposed system, to construct a controlled Semantic Web environment, a well-known ontology written in the OWL language is used and modified by adding new relations in order to make it more suitable for demonstrating system functionality. A knowledge base is created by either downloading or automatically generating a set of Web pages embedded with RDF semantic annotations based on the ontology above. Finally, the remaining modules of the architecture are designed, including a Web page database, a crawler application, a knowledge database, an OWL parser, and a query interface.



[Figure: architecture diagram showing the User Interface exchanging a Query and Result Set with the Search Logic, the Crawler Application feeding the Web DB, and the OWL Parser feeding the Knowledge DB.]

Fig 1. Semantic Web Architecture.

The crawler application collects annotated Web pages from the Semantic Web, including RDF metadata and the originating OWL ontology. The RDF metadata are interpreted by the OWL parser and stored in the knowledge database. A graphical user interface allows for the definition of a query, which is passed on to the relation-based search logic. The ordered result set generated is finally presented to the user.

B. Query Definition Process

In a traditional search engine, a query is specified by giving a set of keywords, possibly linked through logic operators and enriched with additional constraints. Semantic search engines are capable of exploiting concepts hidden behind each keyword together with natural language interpretation techniques to further refine the result set. The core step that consists of identifying the mapping between keywords and concepts can be performed in a (semi)automated way.
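As a rough illustration of the (semi)automated keyword-to-concept mapping step, the sketch below looks keywords up in a label table. The label table and its entries are invented for the example; a real system would derive them from ontology labels, possibly with lemmatization and user-driven disambiguation.

```python
# Illustrative keyword-to-concept mapping against an invented label table.
# Real entries would come from the labels declared in the OWL ontology.
ONTOLOGY_LABELS = {
    "hotel": "Accommodation",
    "inn": "Accommodation",
    "conference": "Event",
    "talk": "Event",
}

def map_keywords_to_concepts(keywords):
    """Return (keyword, concept) pairs plus the keywords that could not
    be mapped automatically and would need manual disambiguation."""
    mapped, unmapped = [], []
    for kw in keywords:
        concept = ONTOLOGY_LABELS.get(kw.lower())
        if concept:
            mapped.append((kw, concept))
        else:
            unmapped.append(kw)
    return mapped, unmapped
```

Keywords left in the unmapped list are the point where the "(semi)" in semi-automated comes in: the user is asked to pick a concept by hand.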

It is worth observing that the current implementation is not able to handle multiple ontologies describing the same domain. From the point of view of the search logic, this would require the integration of one of the existing techniques for mapping or merging/translating the heterogeneous ontologies, which would result in the definition of a set of mapping rules or in the creation of a novel ontology. From the point of view of user interaction, having an extended ontology would increase the need for a preprocessing step enabling automatic identification of keyword-concept pairs.

C. Graph Based Notation

In the ontology and annotation graphs, concepts and relations are translated into graph nodes and edges. In a query subgraph, nodes are represented by the concepts that have been specified within the query. Nodes/concepts are linked by an edge only if there exists at least one relation between those concepts in the ontology. The weight is represented by the actual number of relations. Similarly, a page subgraph is built based on the annotation associated with the page itself.
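The query subgraph construction just described can be sketched as follows. The ontology edge table (concept pairs and their relation lists) is a made-up example standing in for data parsed from the OWL ontology.

```python
from itertools import combinations

# Invented ontology edges: for each pair of concepts, the relations
# linking them. A real table would be filled by the OWL parser.
ONTOLOGY_EDGES = {
    frozenset({"Hotel", "City"}): ["locatedIn", "near"],
    frozenset({"Event", "City"}): ["hostedIn"],
    frozenset({"Hotel", "Event"}): ["hosts"],
}

def query_subgraph(query_concepts):
    """Return {frozenset({ci, cj}): weight} where an edge exists only if
    the ontology links ci and cj, and the weight is the relation count."""
    edges = {}
    for ci, cj in combinations(sorted(query_concepts), 2):
        key = frozenset({ci, cj})
        relations = ONTOLOGY_EDGES.get(key, [])
        if relations:
            edges[key] = len(relations)
    return edges
```

For a query mentioning Hotel, City, and Event, this yields three weighted edges, with the Hotel-City edge carrying weight 2 because two relations link those concepts.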

The methodology starts from a page subgraph computed over an annotated page and generates all the possible cycle-free combinations of the edges belonging to the subgraph itself. There could exist pages containing concepts that do not show any relation with other concepts but that could still be of interest to the user. Therefore, the methodology progressively reduces the number of edges in the page subgraph and computes the probability that each of the resulting subgraphs, obtained as a combination of the remaining edges, is the one that matches the user's intention. Edge removal could leave concepts without any relation to other concepts. Thus, several relevance classes are defined, each characterized by a certain number of connected concepts. Within each class, pages are ordered by the probability measure above and presented to the user.
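The final ordering step described above (relevance classes by number of connected concepts, then probability within each class) can be sketched as below; the page records are invented examples.

```python
# Sketch: order pages first by relevance class (number of connected
# concepts, higher is better), then by probability within each class.
def build_result_set(pages):
    """pages: list of (url, connected_concepts, probability) tuples."""
    classes = {}
    for url, connected, prob in pages:
        classes.setdefault(connected, []).append((url, prob))
    ordered = []
    for connected in sorted(classes, reverse=True):
        # within a class, higher probability comes first
        ordered.extend(url for url, prob in
                       sorted(classes[connected], key=lambda t: -t[1]))
    return ordered
```

Note that a page with fewer connected concepts is ranked below every page of a richer class regardless of its probability, which matches the class-first ordering in the text.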


A. Graph-Based Formalization

A graph-based representation can be designed where OWL classes are mapped into graph vertices and OWL relation properties are mapped into graph edges. Thus, the existing relations between couples of concepts in the domain are depicted by means of connected vertices in the graph. This is called the ontology graph G. According to graph theory, the undirected graph G can be defined as G(C, R), where C = {c1, c2, ..., cn} is the set of concepts that can be identified in the ontology, |C| = n is the total number of concepts available,

R = {Rij | i = 1, ..., n; j = 1, ..., n; j > i}

is the set of edges in the graph, and more specifically,

Rij = {r1ij, r2ij, ..., rmij}, m < n, is the set of edges between concepts i and j.

The query subgraph GQ is an undirected weighted graph derived from ontology graph G. GQ can be expressed as GQ(CQ,RQ), where

CQ = {ct | (kt, ct) ∈ Q} ⊆ C is the subset of concepts mentioned in the query, RQ = {Ŕij | 1 ≤ i ≤ n, 1 ≤ j ≤ n, j > i}, and Ŕij = {rij | ci, cj ∈ CQ, |Rij| ≥ 1}.

If an ontology graph G and a query subgraph GQ are given, it is possible to define a ranking strategy. The proposed ranking strategy assumes that given a query Q, for each page p, it is possible to build a page subgraph GQ,P using a methodology that is similar to the one used for G and GQ.

B. Page Relevance Score

Consider an ontology graph G, a query subgraph GQ, and a page p, and derive its page subgraph GQ,P. The constrained spanning forest SFQ,P(l) for a given number of edges l (1 ≤ l < |CQ,P|) is defined. Based on the constrained spanning forests, a constrained relevance score P(Q,p,l) for page p is defined.

The page relevance score psQ,p is computed as

psQ,p = P(Q, p, l*) + l*, where l* = max{ l | P(Q, p, l) ≠ 0 }.
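Under the reading that the score combines the constrained relevance of the largest forest size with a nonzero score and that size itself, the computation can be sketched as follows; the score table passed in is an invented example.

```python
# Sketch of the page relevance score: find the largest forest size l*
# with a nonzero constrained score, then return P(Q, p, l*) + l*.
def page_score(constrained_scores):
    """constrained_scores: dict mapping forest size l -> P(Q, p, l)."""
    nonzero = [l for l, p in constrained_scores.items() if p != 0]
    if not nonzero:
        return 0.0  # page has no matching forest at all
    l_star = max(nonzero)
    return constrained_scores[l_star] + l_star
```

Adding l* to the probability term makes pages whose subgraphs connect more queried concepts dominate pages with higher probability but fewer connections, consistent with the relevance-class ordering described later.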


The user starts by defining query keywords and concepts. The search engine logic accesses the Web page database, constructs the initial result set including all those pages that contain the queried keywords and concepts, and computes the query subgraph. Then, for each page in the result set, the page subgraph is computed. Starting from each subgraph, all page spanning forests are generated and used to compute the page score. Web pages are associated with relevance classes, and the final result set is constructed.

A. Spanning Forest Algorithm

Calculating the relevance score for a single page requires considering all the page forests and, for each forest, computing the constrained page relevance score. This requires finding an efficient way for both enumerating all the page forests for a given query and computing the page probability.

Two strategies can be used.

The first strategy could be to consider all the possible page spanning forests of the page subgraph and progressively remove their edges generating constrained page spanning forests by taking care to avoid duplicate configurations. But, avoiding the production of duplicate configurations is an extremely expensive task.

A second strategy could start by considering all the page forests of length one and generate all the possible page forests of increasing length by recursively adding a new edge until a page spanning forest is obtained.

The second approach has several advantages over the first. First, by properly selecting the edge to be added in the recursive process, it is possible to implicitly obtain a set of page forests without duplicates. Second, the iterative approach allows the results achieved in previous steps to be exploited in order to speed up computation: computing the probability of a forest composed of n edges simply requires multiplying the probability obtained for the page forest with n - 1 edges by the relation probability associated with the additional edge.

Label the edges in GQ,P with an index ranging from 1 to |RQ,P|
Define variables e and a to index graph edges
Set ηe = ηij  // number of relations linking concepts
              // i and j in the ontology graph (edge e ∈ RQ,P)
Set δe = δij  // number of relations linking concepts
              // i and j in the page subgraph (edge e ∈ RQ,P)
Set τe = δe / ηe  // relation probability for edge e
Mark all the edges in GQ,P as not visited
Allocate weight vector W of size |CQ,P| - 1
    // W[l] stores the accumulated constrained
    // probabilities for page forests of length l
Allocate vector Σ of size |CQ,P| - 1
    // Σ[l] stores the number of page forests
    // for a given length l
Initialize W and Σ to zero
for e = 1, e ≤ |RQ,P|, e = e + 1
    mark edge e as visited
    visit(e, e, 1, τe)
    W[1] = W[1] + τe
    Σ[1] = Σ[1] + 1

function visit(o, e, l, s)
    a = e + 1
    while a ≤ |RQ,P| and l < |CQ,P| - 1
        if a is not visited and a is safe
            // (does not introduce cycles, checked through DFS)
            mark edge a as visited
            visit(o, a, l + 1, s × τa)
            W[l + 1] = W[l + 1] + s × τa
            Σ[l + 1] = Σ[l + 1] + 1
            mark edge a as not visited
        a = a + 1

Fig 2. Pseudocode of the algorithm.
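A Python rendering of the procedure in Fig 2 might look like the sketch below. The edge list, relation probabilities, and concept count in the test case are invented inputs; duplicate forests are avoided by only ever adding edges with a higher index, and the cycle check uses union-find as a stand-in for the DFS mentioned in Fig 2.

```python
# Sketch of the forest-enumeration algorithm: for each forest size l,
# accumulate the summed probability W[l] and the forest count S[l].
def enumerate_forests(edges, tau, n_concepts):
    """edges: list of (u, v) concept pairs of the page subgraph;
    tau[e]: relation probability of edge e; n_concepts: |C_Q,P|."""
    max_l = n_concepts - 1          # a forest has at most |C| - 1 edges
    W = [0.0] * (max_l + 1)         # W[l]: accumulated probabilities
    S = [0] * (max_l + 1)           # S[l]: number of forests of size l

    def acyclic(selected):
        # union-find check that the selected edges contain no cycle
        parent = {}
        def find(x):
            parent.setdefault(x, x)
            while parent[x] != x:
                parent[x] = parent[parent[x]]  # path halving
                x = parent[x]
            return x
        for u, v in selected:
            ru, rv = find(u), find(v)
            if ru == rv:
                return False        # u and v already connected: cycle
            parent[ru] = rv
        return True

    def visit(last, chosen, l, s):
        # only consider edges after 'last' so each forest is built once
        for a in range(last + 1, len(edges)):
            if l < max_l and acyclic(chosen + [edges[a]]):
                prob = s * tau[a]   # extend forest probability by tau[a]
                W[l + 1] += prob
                S[l + 1] += 1
                visit(a, chosen + [edges[a]], l + 1, prob)

    for e in range(len(edges)):     # every single edge is a size-1 forest
        W[1] += tau[e]
        S[1] += 1
        visit(e, [edges[e]], 1, tau[e])
    return W, S
```

On a triangle of three concepts with three edges of probability 0.5 each, the sketch finds three size-1 forests and three size-2 forests (any two triangle edges are acyclic), which matches a hand count.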


The Semantic Web improves search strategies and enhances the probability of seeing the user query satisfied without requiring tiresome manual refinement. Several ranking algorithms for the Semantic Web exploiting relation-based metadata have been proposed. They mainly use page relevance criteria based on information that has to be derived from the whole knowledge base, making their application often unfeasible in huge semantic environments. A novel ranking strategy is proposed that is capable of providing a relevance score for a Web page in an annotated result set by considering only the user query, the page annotation, and the underlying ontology. The proposed method provides promising results in terms of both time complexity and accuracy. Future Semantic Web repositories based on multiple ontologies and characterized by billions of pages will require greater scalability.