Semantic Web Data


This chapter details every aspect of the proposed research, its usefulness, and its scope for future development. It describes how the objectives set out in Chapter One are fulfilled. The research work can be broadly classified into three main categories, namely:

Mining the web for semantic data.


Proposed system scheme.

Proposed solution.

3.1 Mining the Web for Semantic Data.

The task of mining the Web for semantic data essentially consists of crawling the Web and finding Semantic Web documents, which are stored in forms such as the Resource Description Framework (RDF), the Web Ontology Language (OWL), Friend of a Friend (FOAF), and Really Simple Syndication (RSS) at various locations. This leads us to the idea of designing a robust RDF crawler. The Semantic Web is a global database whose data is managed in a decentralized manner. The quality that differentiates the Semantic Web from the conventional World Wide Web is its ability to read data semantically: it adds semantics to the information and acknowledges it as well. This semantic aspect is brought into work by using Semantic Web languages such as RDF, RDF Schema (RDFS), and the Web Ontology Language (OWL). Together these technologies constitute the Semantic Web. When we talk about the management and retrieval of information from the Semantic Web, we face many challenges and problems. One major issue when retrieving data from the Semantic Web is the duplication of data. The redundancy caused by this duplicate data produces a large amount of junk data that is unwanted by the user. To reduce or completely remove such duplicate query results, many approaches have been described so far, not only to detect this kind of data but also to remove it. As described earlier, the purpose of my research is to remove duplicate query results from the Semantic Web when a query is placed on it. I have worked on removing duplicate query results in the form of metadata, which may be used during integration.

One of the advantages of the SQL-based scheme lies in the use of the RDF_MATCH table function. The RDF_MATCH table function invocation is rewritten as a SQL query, thereby avoiding run-time table function procedural overheads. It also enables optimization of the rewritten query in conjunction with the rest of the query. Along with the benefit of providing RDF querying capability within SQL, which enables the database system to support a wider range of semantically rich applications, it also carries the problems inherited from SQL, such as duplicate data in query results on the Semantic Web. In the Semantic Web, just as in the conventional World Wide Web, data is not located at a single source; it is stored in very large databases which are scattered geographically. When a user asks a search engine to find some data, it is extracted from scattered resources of which the user is unaware. Similarly, in the Semantic Web, the semantics of the data, which are stored as metadata in RDF and play an important part during the integration process, are collected for each web resource from semantically distributed data sources scattered worldwide.

During the query process a user may ask for many types of complex searches, for which results are produced and sent back to the user. The results obtained for a user query may come from different data sources. The results from a single data source may not contain any duplicate or repeated entries, but as the data on the Semantic Web does not come from a single source, there is a great chance of redundancy. Currently, a lot of work has been done on finding dirty or duplicate data on the Web, and many standards have been developed. Most of them, however, are designed for finding similar text in different documents based on some features or criteria. There has been a lack of metadata search techniques that work efficiently to find and integrate data from different resources on the Semantic Web. The algorithm proposed in this thesis fulfills the need for query results that are free of duplicates when querying with the Structured Query Language (SQL). The scenario given later in this chapter provides an improved method, as it reduces the disk storage needed to keep the query results and thus increases performance by improving throughput.

The main advantages of using the Knuth hashing algorithm along with SQL are:

1) Using SQL for querying RDF data from the Semantic Web is itself efficient, as it eliminates the requirement of mapping from another language to SQL if some other language, e.g. SPARQL, is used to query RDF data.

2) The second advantage is utilizing the Knuth hashing algorithm with it, implying a "pass-through" method/architecture, in which a result is sent to the user only if it is not similar to an already delivered result.

So the overall overhead is reduced, first by using SQL and avoiding the complexities caused by the formats specified for representing RDF, and second by reducing the memory requirement, since only unique query results are stored. The purpose of the proposed steps is to provide a hashing technique that removes duplicate query results. The main objective of my scheme is to enhance the efficiency and performance of the existing scheme by the use of the given algorithm.

3.2 Methodology

I have developed a case study for my proposed scheme. For this purpose I obtained the RDF data (metadata) about web pages from the two web resources 'Google' and 'Yahoo' using a web spider tool, or metadata crawler, and then obtained hash values for the URLs using the Knuth algorithm to identify and remove duplicates.

3.3 Proposed System Scheme.

The proposed technique provides a method in which each result is either passed to the user or ignored. In this approach data is gathered from multiple sources and checked for duplication. If a result obtained is similar to a result already sent to the user, it is discarded and not passed on. Duplication is checked on the basis of hashing techniques. For the user query, metadata is collected against each web page, and an attribute is taken as the key value for which a hash value is calculated using the Knuth hashing algorithm. After the hash value is calculated, it is stored along with a pointer in a hash table, and the result is sent to the user for display. Similarly, more results are obtained, but when a new result is received, its hash value is stored in the table only if it does not match a key value already sent to the user. If the key values match, the data behind the keys is checked to determine whether an actual collision exists. If the data behind the key values is the same, the new result is discarded and not sent to the user. In this way the scheme eliminates the need to store every result returned from the data sources, which would otherwise increase the memory requirement and in turn slow down the system.
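The pass-through scheme described above can be sketched in Java. This is an illustrative sketch, not the thesis implementation itself: the class name DuplicateFilter and the method passThrough are invented for this example, the "pointer" is simplified to the stored result string, and only the DEK hash function (given later in Figure 3.6 a) is taken from the source.

```java
import java.util.HashMap;
import java.util.Map;

// Minimal sketch of the pass-through duplicate filter described above.
// The class and method names are illustrative, not from the thesis.
public class DuplicateFilter {

    // Hash table: DEK hash index -> the delivered result data. The "pointer"
    // from the scheme is simplified here to the stored result itself.
    private final Map<Long, String> hashTable = new HashMap<>();

    // Donald Knuth's DEK hash, as given in Figure 3.6 a).
    static long dekHash(String str) {
        long hash = str.length();
        for (int i = 0; i < str.length(); i++) {
            hash = ((hash << 5) ^ (hash >> 27)) ^ str.charAt(i);
        }
        return hash;
    }

    // Returns true if the result is new (and should be displayed to the
    // user); returns false if it duplicates an already-delivered result.
    public boolean passThrough(String result) {
        long index = dekHash(result);
        String stored = hashTable.get(index);
        if (stored != null && stored.equals(result)) {
            return false; // genuine duplicate: discard, do not re-display
        }
        // Either no index collision, or a collision with different data;
        // a full implementation would chain colliding entries, this sketch
        // simply stores the newest result under the index.
        hashTable.put(index, result);
        return true; // unique result: pass to the user
    }
}
```

A duplicate metadata record arriving from a second source is rejected by passThrough, while a first-time record is stored in the hash table and displayed.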

3.4.1 Proposed Algorithm

The processing steps for this approach are written in an algorithmic form below, and then described in more detail in the following paragraph.

ALGORITHM: Building a process for detecting and eliminating the duplicated metadata from the query results.

INPUT: Query Results of RDF data from a semantic data source.

OUTPUT: Query Results free of duplicate metadata.

STEP1: /* Displaying the Query Results to User After Receiving the Query. */

1.1: READ the user query / problem statement.

1.2: WRITE Query to Data Source.

1.3: READ Query Results

1.4: COMPUTE Hash Index (using Knuth Algorithm)

1.5: SAVE the Hash Index and Pointer In Hash Table.

1.6: DISPLAY results to the user.

STEP2: /* Displaying the Query Results to User after Removing Duplicate Data. */

2.1: WHILE not end of query results DO

2.2: READ the New Query Result

2.3: COMPUTE Hash Index (using Knuth Algorithm)

2.4: IF the Hash Index computed in STEP 2.3 IS EQUAL to a Hash Index already stored in the Hash Table THEN

2.5: IF the New Query Result IS EQUAL to the stored Query Result THEN

DISCARD the result.

2.6: ELSE SAVE the Hash Index and Pointer in the Hash Table and DISPLAY the result to the user.






Figure 3.1 Proposed Algorithm

Figure 3.1 describes the steps of the proposed algorithm.

A detailed description of the process given above in algorithmic form is as follows:

In this process queries are sent to the metadata sources one at a time, allowing the results from each source to be processed before the results from the next source are received. When the results from the first semantic web source are received, they are sent to the user, and a hash index is computed using the Knuth hashing algorithm and stored in the hash table with a pointer to the first semantic web source. The same process is carried out for the second semantic web source, but before its hash index is stored in the hash table, it is compared with the index calculated for the first data source to check for an index collision. If both index values match, another check is made to see whether the actual query results gathered from both sources are the same. If they are the same, the second query result is discarded without storing its index and pointer in the hash table. If the indexes do not match, the second index, with its pointer to the second source, is stored in the hash table and the result is sent to the user.
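The source-by-source flow just described can be sketched as a small driver loop. This is an illustrative sketch under simplifying assumptions: each metadata source is reduced to a list of result strings, the class name SequentialDedup is invented, and the shared hash table carries over from one source to the next exactly as the text describes.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only: processes query results one source at a time,
// keeping a hash-table entry for each unique result already delivered.
public class SequentialDedup {

    // Donald Knuth's DEK hash, as given in Figure 3.6 a).
    static long dekHash(String str) {
        long hash = str.length();
        for (int i = 0; i < str.length(); i++) {
            hash = ((hash << 5) ^ (hash >> 27)) ^ str.charAt(i);
        }
        return hash;
    }

    // Returns the results that would be delivered to the user, in order.
    public static List<String> process(List<List<String>> sources) {
        Map<Long, String> hashTable = new HashMap<>(); // index -> result data
        List<String> delivered = new ArrayList<>();
        for (List<String> source : sources) {          // one source at a time
            for (String result : source) {
                long index = dekHash(result);
                String stored = hashTable.get(index);
                if (stored != null && stored.equals(result)) {
                    continue;                          // duplicate: discard
                }
                hashTable.put(index, result);
                delivered.add(result);                 // unique: send to user
            }
        }
        return delivered;
    }
}
```

With two sources that overlap in one record, the overlapping record from the second source is filtered out while every first occurrence is delivered.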

3.4.2 Flow Chart for the Duplication Removal Algorithm

Proposed Solution: Building a process for detecting and eliminating the duplicated metadata from the query results.

Input: - Query Results of RDF data from a semantic data source.

Output: - Query Results free of duplicate metadata

The whole system for removing duplicates using SQL and the Knuth algorithm is a combination of different steps, which are shown in Figure 3.2. The whole process of removing duplicates is described as a procedure which takes the query results from the semantic web as input and gives out results free of duplication.

Procedure Name:

Input: Query Results of RDF data from a semantic data source.

Output: Query Results free of duplicate metadata.

Process: This process accepts the query results obtained when the user sends a query to a data source on the semantic web with the help of a search engine enriched with the meanings of the information. These meanings are obtained in the form of metadata, and the response to the user is given in the form of the result and its metadata. Each result is further processed by calculating its hash value and storing it in the hash table with a pointer referring to the storage location. Duplication is detected on the basis of the hash values: the hash value of every new result is compared with the already calculated hash values. If they are the same, the actual data is checked. If the data is also the same, the result is a duplicate and is ignored; otherwise the result is passed to the user.











[Flow chart: Receive User Query for RDF → Issue Query using RDF_Match() function to semantic search engine → Receive Query Results → Compute Hash Index using Knuth multiplicative scheme → Store Hash Index and Pointer in Hash Table → Pass Results for Display to User → while more results remain from the current (Google/Yahoo) data source: Receive New Result for RDF → Compute Hash Index using Knuth multiplicative scheme → Does the new Hash Index match any Hash Index in the Hash Table? If yes and both results are similar, throw away/reject the New Result; otherwise pass the Result for display to the user.]







Figure 3.2 Flow Chart of Duplication Removal Procedure.

Figure 3.2 is a flow chart which shows that duplication removal involves the following steps:

1) Receiving user query

2) Issuing query using RDF_Match ( ) function to first web data source through semantic search engine.

3) Receiving first query result

4) Calculating the hash value for the first query result

5) Sending the first query result to the user.

6) Receiving new query result, if any.

7) Calculating the hash value and index for second query result

8) Comparing the second hash index with the stored first index to check for an index collision.

9) Discarding the second query result if the first web data source contains similar data; otherwise,

10) Passing second/new query result to the user.

Queries are issued to one data source at a time, allowing the results from each web data source to be processed before the results from the next web data source are received.

Block Diagram of the Proposed System

To understand the basic functionality of the proposed method a block diagram is given below.


[Block diagram: the user queries RDF data from the Semantic Web; each query result to be indexed is passed through the Knuth hashing algorithm, which outputs results without duplication.]


Figure 3.3 Block Diagram of the Proposed System

Figure 3.3 is the block diagram of the proposed system, which helps in understanding its working. It gives an overview of the system for removing duplicates using the Knuth hashing algorithm.

Context Level Data Flow Diagram for Removing Duplicate from Semantic Web


[Data flow: the user's data request goes to the SQL-Based RDF Query Scheme, which queries RDF data from the semantic web source; the query results pass through the Knuth Hashing Algorithm before being returned to the user.]

Figure 3.4 Context Level Data Flow Diagram for Removing Duplicate from Semantic Web

Figure 3.4 is a context-level data flow diagram, which gives a functional overview of the system. When the user requests data from the semantic web, the request is sent to the Oracle database, where the SQL-based RDF querying scheme is used to query the data. As the data is queried, the Knuth hashing algorithm is applied to it, and results are returned to the user only if they are not duplicates. Figure 3.5 is a detailed data flow diagram for the system.

Detailed Data Flow Diagram for Removing Duplicates from Semantic Web

[Detailed data flow: RDF data (metadata) stored in the Oracle database is queried through the RDF_Match() function via the RDF/SQL mapping; the hash value of each result is calculated using the Knuth hashing algorithm; the resulting hash index is matched against the index values stored in the hash table; on a match, the data in the data source is compared, and duplicate data is discarded.]

Figure 3.5 Detailed Data Flow Diagram for Removing Duplicate from Semantic Web

Figure 3.5 gives the detailed flow of the data and the relations between the related components. When the user queries the data, the request is sent to the data source through the semantic web engine. After the result is obtained, its hash value is calculated (using the Knuth hashing algorithm) and stored in the hash table along with a pointer to the data source, and the result is sent to the user (through the application software). Then a new query result is obtained and the same process is repeated, but before the result is given to the user it is checked for duplicates using the index values; depending on the outcome, the result is either sent to the user or discarded.

3.4.3 Knuth Algorithms


a) Donald Knuth Hashing Function.


public long DEKHash(String str)
{
    long hash = str.length();

    for (int i = 0; i < str.length(); i++)
    {
        hash = ((hash << 5) ^ (hash >> 27)) ^ str.charAt(i);
    }

    return hash;
} /* End Of DEK Hash Function */ [Partow Arash, 2002]

Figure 3.6 a) Donald Knuth Hashing Function.
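To make the shift and XOR operations concrete, the function in Figure 3.6 a) can be traced by hand for a short key. The one-character key "A" is chosen purely for illustration, and the class name DekTrace is invented for this example:

```java
// Illustrative trace of the DEK hash for the one-character key "A".
// The class name is invented for this example.
public class DekTrace {

    // Same function as in Figure 3.6 a), reproduced so the trace is
    // self-contained.
    static long dekHash(String str) {
        long hash = str.length();
        for (int i = 0; i < str.length(); i++) {
            hash = ((hash << 5) ^ (hash >> 27)) ^ str.charAt(i);
        }
        return hash;
    }

    public static void main(String[] args) {
        // hash starts at the key length: 1
        // step 1: (1 << 5) = 32, (1 >> 27) = 0, 32 ^ 0 = 32
        //         32 ^ 'A' = 32 ^ 65 = 97
        System.out.println(dekHash("A")); // prints 97
    }
}
```

The trace shows how the left shift spreads the running hash across higher bits before each character is folded in with XOR.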

b) Detailed Working of Donald Knuth Hashing Function.


import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

public class GeneralHashTest
{
    public static void main(String args[]) throws IOException
    {
        GeneralHashFunctionLibrary ghl = new GeneralHashFunctionLibrary();

        String key = "ABCD";

        System.out.println("Key: " + key);
        System.out.println("DEK-Hash Function Value: " + ghl.DEKHash(key));

        System.out.println("Press 'ENTER' to exit...");
        BufferedReader stdin = new BufferedReader(new InputStreamReader(System.in));
        stdin.readLine();
    }
} [Arash Partow, 2002]

Figure 3.6 b) Detailed Working of Donald Knuth Hashing Function.

The above hashing function has been implemented with the SQL-based scheme to remove duplicates when the user puts a query on the semantic web. The hashing function is applied to the URLs to find the key values. The following main steps are performed when the user queries data on the semantic web:

The URLs are given as input to the system.

Each URL is passed to the hashing function and its length is calculated.

Then the hash value is calculated using the Knuth hashing multiplicative scheme, which uses shift operators on the significant bits.

After the hash value is calculated, it is returned.

The whole process is repeated for all the URLs queried.
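The steps above can be illustrated by applying the DEK function from Figure 3.6 a) to a small list of URLs. The URLs here are invented for the example, and the class name UrlHashDemo is hypothetical; the point is that identical URLs always produce the same key value, which is exactly what the duplicate check relies on:

```java
// Illustrative only: applying the DEK hash of Figure 3.6 a) to URL strings.
// The URLs and the class name are invented for this example.
public class UrlHashDemo {

    static long dekHash(String str) {
        long hash = str.length();
        for (int i = 0; i < str.length(); i++) {
            hash = ((hash << 5) ^ (hash >> 27)) ^ str.charAt(i);
        }
        return hash;
    }

    public static void main(String[] args) {
        String[] urls = {
            "http://example.org/page1",   // hypothetical URLs
            "http://example.org/page2",
            "http://example.org/page1"    // duplicate of the first
        };
        // Identical URLs yield identical key values; the third line
        // therefore repeats the first line's key.
        for (String url : urls) {
            System.out.println(url + " -> " + dekHash(url));
        }
    }
}
```

Note that equal key values only signal a possible duplicate: as the scheme above specifies, the actual data must still be compared before a result is discarded, because distinct strings can in principle collide on the same hash index.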

The algorithm proposed for removing duplicate query results from the semantic web, as described above, is very efficient. The working of this algorithm on RDF data (metadata) is shown in the next chapter, which explains in detail how the algorithm can be used to remove duplicate query results, if any exist, when data is retrieved from the semantic web.