World Wide Web

The World Wide Web is the largest single information resource that humanity has ever produced. The Internet and the World Wide Web have brought great change to the lives of people in many fields and have revolutionized information technology. The Web was originally all about documents. By clicking a link in a Web browser, we move to another web page, which is displayed after our request triggers a Web server to send a document. Most current web content is designed and structured for use by people and is not easily understood by computers. Much effort has gone towards achieving a machine-readable World Wide Web, called "the Semantic Web". Berners-Lee (1998) set out the vision of developing expressive languages to describe information in a form understandable by machines, which can be an efficient way of representing data on the World Wide Web, or as a globally linked database. Managing data on the Semantic Web is not an easy task, as it involves very complex technologies; extracting data from the Semantic Web is similarly challenging.

Methods for query processing are an essential part of database and information systems [Molina et al., 2002].

This chapter is divided into six main sections. The first two sections give the background of the research: the first discusses the need for and importance of the Semantic Web, while the second describes searching and querying it. The next two sections present the motivation and then the objectives of the research. The research problem is then stated. The last section details some terms related to the Semantic Web, such as the Resource Description Framework (RDF), and gives a note on Structured Query Language (SQL).

1.1 Background of the Research

The following subsections describe some history of the World Wide Web, its problems, and the need for the Semantic Web.

The World Wide Web is a very large resource of global information, made available to all computer users through the Internet. The Internet has been around for over a quarter of a century. The development of the World Wide Web in the 1990s promoted the concept and usage of the Internet, and it is now no longer limited to the world of computer science but has entered the world as a whole. The success of the Internet and the World Wide Web (WWW) has been astonishing.

The questions some computer scientists have been asking are: i) What is the next step? ii) How can the WWW be made even more useful? [Grimshaw, Undated]

1.1.2 What is the Problem?

Information overload is the main problem. Because of the huge size of the WWW, finding relevant information on the web has become very difficult. For the most part, information on the WWW consists of text. Search engines usually work on the basis of pattern matching to find the desired information in the form of documents. The search is syntax based. To the search engines (and other computer programs) the text has no meaning; it is just strings of characters. Unlike people, these machines do not understand the meaning of the text. For the machines, these texts have no semantics. [Grimshaw, Undated]

In answer to the question of what the next step for the World Wide Web is, many people think that semantic information understandable to computers must be used on the web. Research programs are under way all around the world to fulfil this aim. [Grimshaw, Undated]

As for how the World Wide Web can be made more useful, the answer is to add semantics to search engines so that they can understand the meaning of the queries put by users and find the most relevant answers.

1.1.4 Benefits of the Semantic Web to the World Wide Web

Although the World Wide Web is the biggest repository of information ever created, and is growing in content across many languages and fields of knowledge, in the long run it is extremely difficult to make sense of this content. Search engines might help you find content containing specific words, but that content might not be exactly what you want.

Imagine this scenario. A business analyst receives a new project that requires intellectual property (IP) protection and involves a list of intellectual property protection licenses. To become familiar with the term IP, he searches for it using a search engine such as Yahoo or Google. The results are hardly helpful, because most of them relate to the Internet Protocol. Only after reading through a lot of search listings is he able to find information about the intended meaning, intellectual property. The irrelevant results arise from the different semantics of the word “IP” on web pages, so a lot of effort is needed to get the required information.

In a Semantic Web environment, however, semantic web agents are available to search the web for “IP” where IP is an intangible asset of organizations. Using semantic web agents, we find the relevant results. A Semantic Web agent is also capable of discovering previous IP-related research on the Internet.

Based on the semantic information available on the web for IP, these semantic agents can also present a list of related technologies, which helps users to understand the related concepts better. In our example, the business analyst may also realize that more research is needed before beginning the project. With the information obtained by the Semantic Web agent, he reads the IP-related material and sends emails to the colleagues who have made IP-related materials available on the network, asking for their input before starting the new project.

The main problem is the lack of understanding of meaning, which makes it difficult to find an exact match for the required information, because the search is based on the contents of pages and not on the semantic meaning of those contents or on information about the page.

Once the Semantic Web exists, it can make use of RDF and ontologies to provide: i) the ability to tag all content on the Web, ii) to describe what each piece of information is about and give semantic meaning to the content item. Thus, search engines become more effective than they are now, and users can find the precise information they are hunting. Organizations that provide various services can tag those services with meaning (e.g. storing the term "SOAP" with its meaning in the context of either a web service or a detergent); using Web-based software agents, you can dynamically find these services on the fly and use them to your benefit or in collaboration with other services. [Balani, 2005]
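The difference between keyword matching and meaning-based matching in the IP scenario can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three pages, their URLs, and the `ip_sense` tag are hypothetical stand-ins for real web content and for the semantic annotations that RDF and ontologies would supply.

```python
# Hypothetical mini-corpus. The "ip_sense" field stands in for a semantic
# tag that RDF/ontology annotations would provide; it is an assumption.
PAGES = [
    {"url": "http://example.org/a",
     "text": "Configure the IP address of a router",
     "ip_sense": "internet_protocol"},
    {"url": "http://example.org/b",
     "text": "IP licenses protect inventions",
     "ip_sense": "intellectual_property"},
    {"url": "http://example.org/c",
     "text": "IP routing tables explained",
     "ip_sense": "internet_protocol"},
]


def keyword_search(pages, keyword):
    """Syntax-based search: the keyword is matched as a plain word."""
    wanted = keyword.lower()
    return [p["url"] for p in pages if wanted in p["text"].lower().split()]


def semantic_search(pages, sense):
    """Meaning-based search: only pages tagged with the given sense match."""
    return [p["url"] for p in pages if p["ip_sense"] == sense]


keyword_hits = keyword_search(PAGES, "IP")
semantic_hits = semantic_search(PAGES, "intellectual_property")
```

Here `keyword_hits` contains all three pages, because the string "IP" occurs in each, while `semantic_hits` contains only the page about intellectual property.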

1.1.5 Semantic Web Technology Overview

Semantic Web technologies can be considered in terms of layers, each layer resting on and extending the functionality of the layers beneath it. Although the Semantic Web is often talked about as if it were a separate entity, it is an extension and enhancement of the existing Web rather than a replacement of it.

Figure 1.1. The Semantic Web Technology Stack. [Rob, 2008]

As shown in Figure 1.1, the base layer of the Semantic Web is HTTP and Uniform Resource Identifiers (URIs). These are commonly considered 'Web' rather than 'Semantic Web', but every proposed Semantic Web technology rests upon these Web fundamentals. URIs are the nouns of the Semantic Web; the Hypertext Transfer Protocol (HTTP) supplies the verbs. The Resource Description Framework (RDF) is the workhorse of the Semantic Web [Rob, 2008]. It is about metadata for Web resources, where by resources we mean any object that can be found on the Web [Heery, 1998]. After RDF has been used to describe resources, and ontologies to determine the relationships between them, the next step is to find a way to get useful information out of them. [Rob, 2008]

Metadata is used to fully utilize the information resources available on corporate intranets or the Internet. RDF makes it possible to create and exchange metadata like any other Web data. The growing number of available information resources ultimately produces large volumes of RDF metadata. [Nilsson, 2001]

Semantic Web technologies have been very successful for data integration in fields such as Bio-Informatics, Life Sciences, GIS (Geographic Information Systems) and Material Sciences. [Newman, 2006]

Raw media, in the form of text, HTML, images or video streams, contains meta-information that is mostly inaccessible to computers. Making this information available to computers in order to enhance their usefulness was the driving vision that created the Semantic Web project. [Nilsson, 2001]

Most traditional meta-data approaches view meta-data mostly as a digital indexing scheme for use in cataloging and digital libraries. Two important things distinguish the Semantic Web from these approaches to meta-data:

1.1.6 Defining Semantics and Relationships

Implementing the Semantic Web requires adding semantic metadata, or data that describes data, to information resources. Semantically organized data allows machines to process the data effectively, based on the semantic information that describes it. Linking data through their semantics helps in finding related data. So when semantic information is associated with data, computers can make inferences about the data, i.e., understand what a data resource is and how it relates to other data.

XML (eXtensible Markup Language) adds semantics to data by using metadata tags, whose meaning is understandable only by humans. Two different parties can therefore use this data only if they share an understanding of the tag meanings. Semantic Web technologies address this problem so that the information carried by these tags can be understood by both humans and machines.

The first step required for machines to understand data is to get that data into a uniform format. This type of functionality can be found today on Web sites that use forms to let users enter information and run a query, e.g. airline Web sites that allow visitors to search for and book flights based on a variety of criteria. However, because of the large amount and variety of data available from different sources today, this approach may raise a scalability problem.

The next step towards the Semantic Web requires the classification of data from multiple domains based on its properties and its relationship with other data. This is where Semantic Web technologies such as RDF, RDF Schema (RDFS), and OWL originate. [Altova, 2008]

The Internet and the World Wide Web (WWW or just “the web”) today consist to a large extent of a distributed collection of Hypertext Markup Language (HTML) pages. The WWW was developed in the early 1990s at the European Organization for Nuclear Research (CERN) in Geneva, Switzerland, as an effort to ease the sharing of information among their many different computer systems. The basis for the WWW was HTML, which was meant to be a simple format language that structured the information in a document into logical components such as headings, paragraphs, and links. [Eriksson, 2003]

The following subsections describe the type of data that can be found on the Semantic Web and then the approaches that are effective in querying that information in different ways.

With the growth of the web the amount of available information has become too large to browse manually. To find information about a particular topic various search engines have to be used. A major problem for these search engines is that in order to find relevant results they have to scan HTML documents to find words or phrases that match keywords used to describe the topic.

The first problem with this approach is that it is difficult to describe a topic with keywords that match the content of all pages relevant to that topic without also matching a lot of unrelated pages. A keyword describing a topic might occur in pages unrelated to the topic and, conversely, relevant pages do not necessarily contain the chosen keyword. The task of matching keywords with page content without generating “false positives” is further complicated by the fact that current web pages intermingle content with layout information.

Another even bigger problem with this approach is that the content of some topics is not text at all. When searching for any type of media files, programs, or other non-textual content, a search engine is totally dependent on the author of the content pages to label the content appropriately.

The problem with intermingled content and layout information can be solved by using a separate layout/style language such as Cascading Style Sheets (CSS).

The problem with non-textual content is trickier. The only really good solution is adequate documentation of the content. This also addresses the problem with search accuracy. Searching information about content instead of the content itself is much easier, provided the information about the content is accurate. [Eriksson, 2003]

1.2.2 Internet Metadata

Metadata means data about data. In this context it is used to denote the information used to describe web content. Analogous to the benefits of separating layout information from content, there is an even greater benefit to be had from separating information about content from the content itself.

An early form of metadata was the use of the <meta> tag in HTML. This can be used to give information about a web page, such as the author of the page or a list of keywords.

For textual content this information can be described in the content of the page itself, but it is easier for a search engine to know that the text string “John Smith” represents the author of the page when it is labeled within a <meta> tag as “author” than when found in the content as “Hi my name is John Smith. I wrote this page!”
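As a minimal sketch of why labeled metadata is easier for a program to use than free text, the following Python snippet reads <meta> label-value pairs with the standard library's `html.parser`. The sample page and the `extract_metadata` helper are hypothetical and serve only as illustration.

```python
from html.parser import HTMLParser


class MetaExtractor(HTMLParser):
    """Collects name/content pairs from <meta> tags."""

    def __init__(self):
        super().__init__()
        self.metadata = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        content = attributes.get("content")
        # Ignore <meta> tags that do not carry a name/content pair.
        if name is not None and content is not None:
            self.metadata[name.lower()] = content


# Hypothetical sample page, used only for this illustration.
PAGE = """<html><head>
<meta name="author" content="John Smith">
<meta name="keywords" content="semantic web, RDF">
</head><body>
<p>Hi my name is John Smith. I wrote this page!</p>
</body></html>"""


def extract_metadata(html):
    """Return the <meta> label-value pairs of an HTML document as a dict."""
    parser = MetaExtractor()
    parser.feed(html)
    return parser.metadata


metadata = extract_metadata(PAGE)
```

The author is available directly as `metadata["author"]`, whereas recovering the same fact from the sentence in the page body would require natural-language understanding.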

The <meta> tag is used by the Platform for Internet Content Selection (PICS), which gives web page authors a way to label and categorize their pages with regard to their content.

The main purpose of PICS is to provide a way to label pages so that parents can filter out content unsuitable for children.

The HTML <meta> tag approach is a common, general way of presenting metadata: as label-value pairs, e.g. “author” = “John Smith”. For search engines to understand this, the labels must be well known. If, for example, everyone agreed upon including “author”, “creation date”, and “keywords” as metadata about all web pages, then it would be of no use to a search engine if someone included their own metadata label “last edited”, or even a variant or specialization of the established labels, such as “creator” or “co-author”.

The problem thus becomes which metadata properties to support. Regardless of how cleverly chosen, the set of metadata properties will always be insufficient for some applications.

It is impossible to anticipate the needs of everyone. What is needed is a way to extend the set of known metadata labels without breaking backward compatibility.

1.2.3 RDF

The Resource Description Framework (RDF) was first presented in 1997 as an alternative way of representing information in general and metadata in particular. It is an information description language that addresses many of the problems with metadata presentation. RDF is based on simple principles.

1.2.4 Querying the Semantic Web Using Relational Databases

SQL-based databases, while hugely successful, work best in predictable environments, since schema evolution can be expensive and disruptive. Future enhancements and changes are possible only if an extremely general Relational Database Management System (RDBMS) schema is used. For example, in SQL systems application developers need to know all the specifications and requirements in advance, i.e. they must know the storage structure of the database and the kinds of inter-relations that hold between the entities represented in it. It is therefore not easy to make frequent changes after a system has been developed. [Brickley and Miller, 2000]
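One example of an extremely general schema is a single table of <subject, property, object> triples. The sketch below uses Python's built-in `sqlite3` module; the table name, column names, and sample rows are assumptions made for illustration and are not taken from any of the cited systems.

```python
import sqlite3


def build_triple_store():
    """Create an in-memory database holding one generic triples table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE triples ("
        "subject TEXT NOT NULL, property TEXT NOT NULL, object TEXT NOT NULL)"
    )
    conn.executemany(
        "INSERT INTO triples VALUES (?, ?, ?)",
        [
            ("doc1", "author", "John Smith"),
            ("doc1", "title", "Querying RDF"),
            ("doc2", "author", "John Smith"),
            # A property nobody planned for needs no schema change.
            ("doc2", "lastEdited", "2008-01-15"),
        ],
    )
    return conn


def titles_by_author(conn, author):
    """Self-join: titles of the documents written by the given author."""
    rows = conn.execute(
        "SELECT t.object FROM triples AS a "
        "JOIN triples AS t ON a.subject = t.subject "
        "WHERE a.property = 'author' AND a.object = ? "
        "AND t.property = 'title' ORDER BY t.object",
        (author,),
    )
    return [row[0] for row in rows]


conn = build_triple_store()
titles = titles_by_author(conn, "John Smith")
```

The trade-off of this design is visible in `titles_by_author`: the schema never changes, but even a simple question needs a self-join on the triples table.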

1.3 Research Motivation

The popularity of the Semantic Web is increasing day by day, and RDF is an important part of it, managing semantics in the form of metadata. Devising a scheme for efficient and scalable querying of RDF data has been an active area of current research. The query languages designed so far to query the Semantic Web have some problems: 1) their integration with SQL is not easy, and 2) they incur overhead in transforming data from SQL to the corresponding language's data format. [Chong et al., 2005]

The research motivation lies in two things: first, utilizing the benefits of SQL compared to other query languages for RDF; second, enhancing its capabilities for removing duplicates when it is used to query data in the Semantic Web environment.

1.3.1 SQL-based RDF Querying Scheme

One of the advantages of the SQL-based scheme lies in the use of the RDF_MATCH table function. A call to the RDF_MATCH table function is rewritten as a SQL query, thereby avoiding run-time overheads. This also enables the rewritten query to be optimized in conjunction with the rest of the query. [Chong et al., 2005]

Providing RDF querying capability within SQL enables database systems to support a wider range of semantically rich applications, but it also carries the problems inherited from SQL, such as duplicate data in query results from the Semantic Web. So the main objective is to remove duplicate query results using SQL.

In the Semantic Web environment, a user can get information from more than one data source with the help of semantic web engines. The data queried by the user may reside on one or several data sources. Although the results (RDF data) returned from one particular data source may not contain duplicates, there are often many duplicates in the overall set of results returned from multiple Semantic Web resources. Duplicate removal has been a hot issue. Many techniques have been developed for this purpose, but none of them makes use of the metadata available on the Semantic Web to make the process more efficient and reliable. A further purpose of my research is to reduce the storage the user needs for the results returned in answer to a query, by using SQL and a pass-through hashing algorithm.
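The general idea of hash-based duplicate removal across several result sets can be sketched as follows. This is a generic illustration and not the pass-through algorithm proposed in this thesis; the function names, the SHA-256 digest, and the sample triples are all assumptions.

```python
import hashlib


def triple_digest(triple):
    """Hash a (subject, property, object) triple to a fixed-size key."""
    # Assumption: the unit separator character does not occur in the values.
    key = "\x1f".join(triple).encode("utf-8")
    return hashlib.sha256(key).hexdigest()


def merge_without_duplicates(*result_sets):
    """Merge result sets in order, keeping the first copy of each triple."""
    seen = set()
    merged = []
    for results in result_sets:
        for triple in results:
            digest = triple_digest(triple)
            if digest not in seen:
                seen.add(digest)
                merged.append(triple)
    return merged


# Hypothetical results returned by two different data sources.
source_a = [
    ("doc1", "author", "John Smith"),
    ("doc1", "title", "Querying RDF"),
]
source_b = [
    ("doc1", "author", "John Smith"),  # duplicate of a triple in source_a
    ("doc2", "author", "Jane Doe"),
]

merged = merge_without_duplicates(source_a, source_b)
```

Each source is free of duplicates on its own, yet the combined results contain one, which the digest lookup filters out in a single pass while storing only fixed-size keys in `seen`.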

1.4 Research Objectives

The purpose of the steps given below is to provide a "pass through architecture via hash techniques" to remove duplicate query results. The main objective of my scheme is to enhance the efficiency and performance of the existing scheme by using the given algorithm.

A brief introduction to the new approach is given below:

1.5 Research Problem

RDF data in the Semantic Web, which is represented as a collection of <subject, property, object> triples, can easily be stored in a relational database, but there are a number of issues in querying such RDF data efficiently. Devising a scheme for efficient and scalable querying of RDF data has been an active area of current research. However, most approaches, e.g. RDQL [Andy Seaborne], RQL [Gregory Karvounarakis, Sofia Alexaki, Vassilis Christophides, Dimitris Plexousakis, Michel Scholl], SPARQL [Eric Prud'hommeaux, Andy Seaborne], and SquishQL [Libby Miller, Andy Seaborne, Alberto Reggiori], define new languages for querying RDF data, which in turn issue SQL to process user requests. These schemes suffer from the following shortcomings: 1) they are difficult to integrate with the SQL queries used in database applications, and 2) they incur inefficiency because data has to be transformed from SQL to the corresponding language's data format. [Chong et al., 2005]

However, besides the problems described above, there are a number of issues in querying RDF data with SQL that have not yet been addressed. The problems that occur when querying RDF data with SQL, and that this research will attempt to address, are listed below:

5) A different storage representation for the RDF data that is queried using the SQL table function RDF_MATCH may be considered. The purpose of using this different storage representation will be to:

In the following sections a detailed overview of the RDF and SQL is given.

1.6 Concepts Related to Research

In the following sections, some concepts that I have used in my research-based thesis are described.

1.6.1 Resource Description Framework (RDF)

RDF is the backbone of the W3C's Semantic Web activity; the Semantic Web idea was based not just on content management but also on adding meaning to content. For example, the Semantic Web makes it possible to differentiate between the concept of "python," a kind of snake, and "python," a computer programming language, while describing Web resources.

http://www.ibm.com/developerworks/xml/standards/x-rdfspec.html

The World Wide Web often gives access to unnecessary and redundant information. Metadata about resources, i.e. RDF, makes use of the semantics stored with them, which helps in using information found on the web effectively; this requires common conventions about semantics, syntax, and structure. Individual resource description communities define the semantics, or meaning, of metadata that address their particular needs. [Miller, 1998]

1.6.1.1 Background of RDF

The history of metadata at the World Wide Web Consortium (W3C) began in 1995 with the Platform for Internet Content Selection (PICS) mechanism. [Miller, 1998]

RDF was developed to provide a general and flexible architecture for supporting metadata on the web. It is the result of a collaborative design effort by several W3C member companies, which contributed their intellectual resources. [Miller, 1998]

1.6.1.2 The RDF Data Model

Using the Resource Description Framework (RDF) model, we can make use of statements to describe a Web resource. These statements take the form of triples, each of which has a subject, which is a Uniform Resource Identifier (URI); a predicate, which is also a URI; and an object, which is a URI or a literal data value.

Resources have properties (attributes or characteristics), which are identified by property-types, and these property-types have corresponding values. Property-types express the relationships between resources and their values. In RDF, these values can either be atomic in nature (text strings, numbers, etc.) or other resources, which in turn can have their own properties. A collection of properties that refer to the same resource is called a description.

Figure 1.2 RDF Description [Miller, 1998]

Figure 1.2 above illustrates a generic RDF description.

The application and use of the RDF data model can be illustrated by examples. Consider the following statements:

Although the above statements seem to have the same meaning to us, to machines they are completely different strings. RDF attempts to provide an unambiguous way of expressing semantics in a machine-readable form.

RDF provides a way to relate properties to resources. For example, the data model for the statement "the author of Document 1 is John Smith" has a single resource, "Document 1", a property-type of "author", and a corresponding value of "John Smith". In this example, resources are identified as nodes (Document 1), property-types (Author) are defined as directed labeled arcs, and string values (John Smith) are quoted. [Miller, 1998]

Figure 1.3 RDF Properties [Miller, 1998]

The data model corresponding to the statement described above is expressed graphically in Figure 1.3, which shows that "Author" is an RDF property.
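The statement "the author of Document 1 is John Smith" can also be written down as a triple in code. The sketch below is only illustrative: the `Triple` class, the `values_of` helper, and the `example.org` URIs are hypothetical and do not come from [Miller, 1998].

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One RDF statement: a resource, a property-type, and a value."""
    subject: str    # URI of the resource being described
    predicate: str  # URI of the property-type
    obj: str        # another URI or a literal value


# Hypothetical URIs standing in for "Document 1" and "author".
DOCUMENT_1 = "http://example.org/Document1"
AUTHOR = "http://example.org/terms/author"

statements = [Triple(DOCUMENT_1, AUTHOR, "John Smith")]


def values_of(triples, subject, predicate):
    """Return every value recorded for a property-type of a resource."""
    return [
        t.obj
        for t in triples
        if t.subject == subject and t.predicate == predicate
    ]


authors = values_of(statements, DOCUMENT_1, AUTHOR)
```

Because the resource and the property-type are identified by URIs rather than by free text, a program can look up the author without interpreting the wording of a sentence.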

1.6.1.3 The RDF Syntax and Schema

RDF defines a very powerful model for describing resources in a quite simple way. XML is used as the syntax to store instances of the RDF model in machine-readable files and to communicate them between applications.

RDF gives resource description communities the ability to define semantics. Whether the property-type "author" has a broader or narrower meaning depends on the needs of each community, so there must be a way to identify the semantics and the authority governing the vocabulary. The XML namespace mechanism (NS) is used for this purpose. For example, in the Dublin Core Initiative the property-type "author" is defined as the "person or organization responsible for the creation of the intellectual content of the resource" and is specified by the Dublin Core CREATOR element (DCES).

Figure 1.5 Dublin Core Schema [Miller, 1998]

Figure 1.5 shows the data model representing the above example and the Dublin Core RDF Schema. DC in Figure 1.5 is the abbreviation of Dublin Core.

In short, the purpose of RDF Schema is to encourage distinct information communities to exchange data and to allow metadata vocabularies to be extended among them. [Miller, 1998]
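How a namespace prefix ties a property-type to the vocabulary that governs it can be sketched by expanding a prefixed name into a full URI. The `expand` helper is hypothetical, and the namespace URI used is that of the Dublin Core Metadata Element Set 1.1, which is an assumption here and may differ from the URI shown in the original figure.

```python
# Assumption: the Dublin Core Metadata Element Set 1.1 namespace.
NAMESPACES = {"dc": "http://purl.org/dc/elements/1.1/"}


def expand(qualified_name, namespaces=NAMESPACES):
    """Expand a prefixed name such as 'dc:creator' into a full URI."""
    prefix, _, local_name = qualified_name.partition(":")
    if not local_name:
        raise ValueError(f"not a prefixed name: {qualified_name!r}")
    if prefix not in namespaces:
        raise KeyError(f"unknown namespace prefix: {prefix!r}")
    return namespaces[prefix] + local_name


creator_uri = expand("dc:creator")
```

Two communities can both use a local name such as "creator" without ambiguity, because each expands to a different URI under its own namespace.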

1.6.1.6 Relational Database

When we talk about a relational database, the first query language that comes to mind is Structured Query Language (SQL). SQL is used to extract data from relational databases, which store data in the form of tables. A database does not present information directly to a user; when a user tries to access data, the request is sent to application software, which retrieves the data and presents it to the user in an understandable form.

Data can be stored effectively in relational databases, where it is held in tables that are further normalized so that users can access it easily. [Microsoft Corporation, 2008]

With the growth of the web, the amount of available information has become too large to browse manually. Various search engines have to be used to find information about a particular topic. A major problem for these search engines is that in order to find relevant results they have to scan HTML documents to find words or phrases that match keywords used to describe the topic. [Eriksson, 2003]

With the evolving idea of the Semantic Web, however, SQL has been used to query the Semantic Web, which provides a number of benefits. It can not only query the Semantic Web effectively but also avoid the drawbacks and issues that arise when other approaches and query languages are used. This is why I am going to embed my algorithm in the SQL-based approach for querying RDF data and removing duplicates.

1.7 Thesis Roadmap

Chapter 2 reviews what has been published on the topic by accredited scholars and researchers. It gives a descriptive and critical overview of prior research in the same area. The purpose of this chapter is to:

Chapter 3 contains the proposed idea for removing duplicates from the Semantic Web.

Chapter 4 justifies my research regarding the idea defined in Chapter 3.

Chapter 5 concludes the thesis with my concluding remarks and gives directions for further work.
