Conceptual Framework for Ontology-Based Information Retrieval


Over the years, the volume of information available through the World Wide Web has grown continuously; unfortunately, the unstructured nature and sheer volume of the information accessible over networks have made it increasingly difficult to find relevant information. The information retrieval techniques in common use are keyword-based: the supplied keyword list considers neither the semantic relationships between keywords nor the meaning of the words and phrases themselves. With such systems, users frequently have trouble expressing their information needs and translating those needs into requests.

To overcome the limitations of keyword-based information retrieval, conceptual knowledge must be introduced into the retrieval process to help users formulate their requests. The semantic knowledge attached to information is unified by means of ontologies. Mapping the concepts in information onto conceptual models, i.e. ontologies, appears to be a useful method for moving from keyword-based to concept-based information retrieval.


We have surveyed various approaches and models of ontology-based information retrieval, built on techniques by which resources may be retrieved based on associations, semantic similarity, ranking algorithms, annotation, and weighting algorithms. The survey below lists several models of ontology-based IR and focuses on the techniques used for retrieval efficiency, semantic similarity, ranking, weighting, and annotation.

INDEX

Introduction

Background

Information retrieval system

Information retrieval models

Conceptual framework for ontology based information retrieval

Ontology

The origin of ontology

What is ontology

Reasons for developing ontologies

Types of ontology

Benefits of ontology

Applications of ontology

Ontology languages

Ontology-based information retrieval system

Literature survey

Ontology approaches of an IR

3.2.1 Association

3.2.2 Semantic Similarity

3.2.3 Relevance

3.2.4 Semantic Indexing

3.2.5 Semantic Annotation

4. Analysis

4.1 Analysis of different information retrieval models

4.2 Analysis of different approaches of ontology

5. Conclusion

References

Chapter-01

Introduction

1.1 Background:

Until recently, information retrieval systems have been keyword-based. Keyword-based information retrieval systems have been used to find information and to provide access to large amounts of it. For example, search engines accept keywords as input and return as output a list of links to documents containing those keywords. However, keyword-based search engines have some fundamental drawbacks. Search engines do not "understand" the semantic meaning of the words the user types into them, so they may return an enormous number of false hits, and users are less able to find related information.

One way to solve this problem is to move information retrieval from the traditional, keyword-based approach to a knowledge- or concept-based approach, using conceptual knowledge to help users formulate their requests. The semantic knowledge attached to information is unified by means of ontologies.

Ontologies appear to be a useful means of moving from keyword-based to concept-based information retrieval.

Ontologies can be general or domain specific, they can be created automatically or manually, and they can differ in their forms of representation and ways of constructing relationships between the concepts, but they all serve as an explicit specification of a conceptualization.

1.2 Information retrieval system[1]:

Information retrieval deals with access to information as well as its representation, storage, and organization. The overall goal of an information retrieval process is to retrieve the information relevant to a given request. The criterion for complete success is the retrieval of all the relevant information items stored in a given system and the rejection of all the non-relevant ones. The results of a given request usually contain a subset of all the relevant items plus a subset of irrelevant items, but the aim remains, of course, to meet the ideal criterion for success.

Figure 1 shows the basic concept of an information retrieval system: the representation is defined as the stored information, the matching function as a search strategy for finding the stored information, and queries as the requests to the system for certain specific information.

Query

Matching function

Representation

Figure 1: A simple model for information retrieval


The representation comprises an abstract description of the documents in the system. Nowadays, more and more documents are full-text documents, whereas previously the representation was usually built on references to documents rather than the documents themselves, similar to bibliographic records in library systems. References to documents are normally semi-structured information with predefined slots for different kinds of information, e.g. title, abstract, classification, while full-text documents are typically unstructured, except for the syntax of the natural language.

The matching function in an information retrieval system models the system's notion of similarity between documents and queries, hence defining how to compare requests to the stored descriptions in the representation. Each model has its advantages and disadvantages, with no single strategy being superior to the others.

1.3 Information Retrieval Models[2]:

A model of information retrieval predicts and explains what a user will find relevant given the user query.

A retrieval model specifies the three basic entities of retrieval:

- Representation r of information resources R,

- Representation q (called query) of users' information needs Q, and,

- Retrieval function M, assigning a set of resources r to each information need q.

The following major models have been developed to retrieve information: the Boolean model, the Statistical model, which includes the vector space and the probabilistic retrieval model, and the Linguistic and Knowledge-based models. The first model is often referred to as the "exact match" model; the latter ones as the "best match" models.

Queries generally are less than perfect in two respects: First, they retrieve some irrelevant documents. Second, they do not retrieve all the relevant documents. The following two measures are usually used to evaluate the effectiveness of a retrieval method. The first one, called the precision rate, is equal to the proportion of the retrieved documents that are actually relevant. The second one, called the recall rate, is equal to the proportion of all relevant documents that are actually retrieved. If searchers want to raise precision, then they have to narrow their queries. If searchers want to raise recall, then they broaden their query. In general, there is an inverse relationship between precision and recall.
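The two measures can be computed directly from the retrieved and relevant document sets; the document IDs in the following sketch are invented for illustration.

```python
# Precision and recall for a single query, as defined above.
# The document IDs here are hypothetical illustration data.

def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall

# 4 docs retrieved, 3 of them relevant; 6 relevant docs exist in total.
p, r = precision_recall({"d1", "d2", "d3", "d7"},
                        {"d1", "d2", "d3", "d4", "d5", "d6"})
print(p, r)  # 0.75 0.5
```

Narrowing the query shrinks the retrieved set (raising precision at the cost of recall); broadening it does the opposite, which is the inverse relationship described above.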

1.3.1 Exact match models

Two models of information retrieval provide exact matching, i.e. documents are either retrieved or not, and the retrieved documents are not ranked.

The Boolean model:

The Boolean model is based on set theory and Boolean algebra. Queries are specified as Boolean expressions. The retrieval strategy in the Boolean model is based on a binary decision criterion that marks each document as either relevant or not relevant to a given query.

In the Boolean model, documents and queries are represented as Boolean expressions of keywords, connected by AND, OR, and NOT, including the use of brackets to indicate scope [3].

Ex. Q = ("car" ∨ "auto" ∨ "automobile") ∧ ("holiday" ∨ "vacation")
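Under the set-theoretic reading, OR corresponds to union and AND to intersection of postings sets. The sketch below evaluates the example query over a tiny invented collection (the documents and index are illustrative, not from the source).

```python
# Boolean retrieval over an inverted index, evaluating the example query
# ("car" OR "auto" OR "automobile") AND ("holiday" OR "vacation").
# The toy document collection is invented for illustration.

docs = {
    1: "cheap car rental for your holiday",
    2: "auto repair manual",
    3: "automobile vacation packages",
    4: "holiday cooking recipes",
}

# Build the inverted index: term -> set of document ids.
index = {}
for doc_id, text in docs.items():
    for term in text.split():
        index.setdefault(term, set()).add(doc_id)

def postings(term):
    return index.get(term, set())

# OR corresponds to set union, AND to set intersection.
result = (postings("car") | postings("auto") | postings("automobile")) & \
         (postings("holiday") | postings("vacation"))
print(sorted(result))  # [1, 3]
```

Note that documents 1 and 3 are simply "in" the result set; the model gives no way to say that one matches better than the other, which is exactly the ranking problem listed among the disadvantages below.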

Advantages:

It gives (expert) users a sense of control over the system. It is immediately clear why a document has been retrieved given a query, and if the resulting document set is too small or too big, it is directly clear which operators will produce a bigger or smaller set, respectively.

Precise, if you know the right strategies and have an idea of what you're looking for

Easy to implement

Efficient for the computer

Disadvantages:

It provides no ranking of retrieved documents and no weighting of index or query terms. The Boolean model is an exact-match model because of its binary decision criterion: a document is judged either relevant or not relevant, with nothing in between.

Users must learn Boolean logic

Boolean logic insufficient to capture the richness of language

No control over size of result set: either too many documents or none

When do you stop reading? All documents in the result set are considered "equally good"

What about partial matches? Documents that "don't quite match" the query may be useful also

Difficult to express complex user requests.

Difficult to control the number of documents retrieved.


All matched documents will be returned.

Difficult to rank output.

All matched documents logically satisfy the query.

Difficult to perform relevance feedback.

1.3.1.2 Region models:

Region models are extensions of the Boolean model that reason about arbitrary parts of textual data, called segments, extents or regions. Region models model a document collection as a linearized string of words. Any sequence of consecutive words is called a region. Regions are identified by a start position and an end position.

The main disadvantage of the Boolean model and the region models is their inability to rank documents.

Statistical Model

The vector space and probabilistic models are the two major examples of the statistical retrieval approach. Both models use statistical information in the form of term frequencies to determine the relevance of documents with respect to a query. Although they differ in the way they use the term frequencies, both produce as their output a list of documents ranked by their estimated relevance. The statistical retrieval models address some of the problems of the Boolean retrieval methods, but they have disadvantages of their own. Table 2.4 provides a summary of the key features of the vector space and probabilistic approaches. We will also describe Latent Semantic Indexing and clustering approaches that build on statistical retrieval, but whose objective is to respond to what the user's query did not say, could not say, but somehow made manifest [12].

1.3.2.1 Vector space model[7]:

The vector space model represents the documents and queries as vectors in a multidimensional space, whose dimensions are the terms used to build an index to represent the documents [4]. The creation of an index involves lexical scanning to identify the significant terms, where morphological analysis reduces different word forms to common "stems", and the occurrence of those stems is computed. Query and document surrogates are compared by comparing their vectors, using, for example, the cosine similarity measure. In this model, the terms of a query surrogate can be weighted to take into account their importance, and the weights are computed using the statistical distributions of the terms in the collection and in the documents [4]. The vector space model can assign a high ranking score to a document that contains only a few of the query terms if these terms occur infrequently in the collection but frequently in the document. The vector space model makes the following assumptions: 1) the more similar a document vector is to a query vector, the more likely it is that the document is relevant to that query; 2) the words used to define the dimensions of the space are orthogonal or independent. While it is a reasonable first approximation, the assumption that words are pairwise independent is not realistic.
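As a rough illustration, the following sketch builds term vectors with a simple tf-idf weighting (raw term frequency times log inverse document frequency) and ranks a toy collection by cosine similarity. The documents, query, and the particular weighting formula are invented for illustration; real systems use many tf-idf variants.

```python
# A minimal vector-space sketch: tf-idf weighting plus cosine similarity,
# using only the standard library. The toy collection is invented.
import math
from collections import Counter

docs = ["ontology based information retrieval",
        "keyword based search engines",
        "ontology languages and tools"]
query = "ontology retrieval"

vocab = sorted({t for d in docs for t in d.split()})
N = len(docs)
# df(t) = number of documents containing term t; idf(t) = log(N / df(t))
df = {t: sum(t in d.split() for d in docs) for t in vocab}

def tfidf(text):
    tf = Counter(text.split())
    return [tf[t] * math.log(N / df[t]) for t in vocab]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

qv = tfidf(query)
ranked = sorted(range(N), key=lambda i: cosine(tfidf(docs[i]), qv), reverse=True)
print(ranked)  # [0, 2, 1]: doc 0 shares both query terms, doc 1 shares none
```

Document 2 still gets a partial, non-zero score from the single shared term "ontology", which is exactly the partial matching and ranking the Boolean model cannot provide.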

Advantages:

Provides term weighting scheme

Simple, mathematically based approach.

Considers both local (tf) and global (idf) word occurrence frequencies.

Provides partial matching and ranked results.

Tends to work quite well in practice despite obvious weaknesses.

Allows efficient implementation for large document collections

Disadvantages:

Missing semantic information (e.g. word sense).

Missing syntactic information (e.g. phrase structure, word order, proximity information).

Assumption of term independence (e.g. ignores synonymy).

Lacks the control of a Boolean model (e.g., requiring a term to appear in a document).

Probabilistic model:

The probabilistic retrieval model is based on the Probability Ranking Principle, which states that an information retrieval system is supposed to rank the documents based on their probability of relevance to the query, given all the evidence available [4]. The principle takes into account that there is uncertainty in the representation of the information need and the documents. There can be a variety of sources of evidence that are used by the probabilistic retrieval methods, and the most common one is the statistical distribution of the terms in both the relevant and non-relevant documents.

We will now describe the state-of-the-art system developed by Turtle and Croft (1991) that uses Bayesian inference networks to rank documents by using multiple sources of evidence to compute the conditional probability P(Info.need|document) that an information need is satisfied by a given document. An inference network consists of a directed acyclic dependency graph, where edges represent conditional dependency or causal relations between propositions represented by the nodes. The inference network consists of a document network, a concept representation network that represents indexing vocabulary, and a query network representing the information need. The concept representation network is the interface between documents and queries. To compute the rank of a document, the inference network is instantiated and the resulting probabilities are propagated through the network to derive a probability associated with the node representing the information need. These probabilities are used to rank documents.

1.3.3 Latent Semantic Indexing

In LSI the associations among terms and documents are calculated and exploited in the retrieval process. The assumption is that there is some "latent" structure in the pattern of word usage across documents and that statistical techniques can be used to estimate this latent structure. An advantage of this approach is that queries can retrieve documents even if they have no words in common. The LSI technique captures deeper associative structure than simple term-to-term correlations and is completely automatic. The only difference between LSI and vector space methods is that LSI represents terms and documents in a reduced dimensional space of the derived indexing dimensions. As with the vector space method, differential term weighting and relevance feedback can improve LSI performance substantially.
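The reduced space can be sketched with a truncated SVD of a toy term-document matrix (assuming NumPy is available; the matrix and the dimensionality k are invented). Documents 0 and 1 share no terms, yet receive a high similarity in the latent space because both co-occur with the terms of document 2.

```python
# Latent Semantic Indexing sketch: reduce a term-document matrix to k
# dimensions with a truncated SVD, then compare documents in the reduced
# space. The matrix below is a toy example.
import numpy as np

# Rows = terms (car, auto, vehicle, cooking), columns = documents.
# Doc 0 contains only "car", doc 1 only "auto", doc 2 all three
# vehicle-related terms, doc 3 only "cooking".
A = np.array([[1, 0, 1, 0],
              [0, 1, 1, 0],
              [0, 0, 1, 0],
              [0, 0, 0, 2]], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2                                   # keep the 2 strongest latent dimensions
docs_k = (np.diag(s[:k]) @ Vt[:k]).T    # documents as rows in latent space

def cos(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Docs 0 and 1 have no words in common, yet LSI maps them onto the same
# latent "vehicle" dimension, so their similarity is high.
print(round(cos(docs_k[0], docs_k[1]), 2))  # 1.0
```

In the raw term space the cosine between documents 0 and 1 is 0; the latent structure recovered by the SVD is what lets a query retrieve documents with no words in common, as claimed above.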

Linguistic and Knowledge-based Approaches

In the simplest form of automatic text retrieval, users enter a string of keywords that is used to search the inverted indexes of the document keywords. This approach retrieves documents based solely on the presence or absence of exact single-word strings, as specified by the logical representation of the query. Clearly this approach will miss many relevant documents because it does not capture the complete or deep meaning of the user's query. The Smart Boolean approach and the statistical retrieval approaches, each in their specific way, try to address this problem. Linguistic and knowledge-based approaches have also been developed to address it by performing morphological, syntactic and semantic analysis to retrieve documents more effectively [Lancaster and Warner 1993]. In a morphological analysis, roots and affixes are analysed to determine the part of speech (noun, verb, adjective, etc.) of the words. Next, complete phrases have to be parsed using some form of syntactic analysis. Finally, the linguistic methods have to resolve word ambiguities and/or generate relevant synonyms or quasi-synonyms based on the semantic relationships between words. The development of a sophisticated linguistic retrieval system is difficult, as it requires complex knowledge bases of semantic information and retrieval heuristics. Hence these systems often require techniques that are commonly referred to as artificial intelligence or expert systems techniques.

1.4 Conceptual framework for ontology-based information retrieval system:

Fig. Conceptual Framework for ontology based information retrieval system

The steps involved in these layers are

Query parsing

To retrieve the information the user needs, obtain the query from the user, split it into meaningful words, and apply the word stemming process.

Word stemming

Linguistically, words follow morphological rules that allow a person to derive variants of a same idea to evoke an action (verb), an object or concept (noun) or the property of something (adjective). For instance, the following words are derived from the same stem and share an abstract meaning of action and movement.

Activate -> Activates, Activated

The word "Activate" represents the words "Activates" and "Activated". Stemming does the reverse: it deduces the stem from a fully suffixed word according to its morphological rules. These rules concern derivational and inflectional suffixes; the former usually change the lexical category of a word, whereas the latter indicate plural and gender. The process also removes unwanted words such as "a", "an", "the", etc. (Porter stemmer [6]).

For example, a list of stop words:

Stop_words = ("the", "and", "a", "to", "of", "in", "i", "is", "that", "it", "on", "you", "this", "for", "but", "with", "are", "have", "be", "at", "or", "as", "was", "so", "if", "out", "not");

Ontology Matching

After the query has been split into meaningful words, each word is checked against the ontology. All combinations of words are taken for processing. A specific domain ontology is used to verify whether a word is present in that ontology; if so, the relationships of the words are taken into consideration.

Weight Assignment

A weight is assigned to each word with respect to every other word according to their relationship in the ontology (superclass, immediate subclass, subclass, etc.), based on an improved matching algorithm [4].

Criterion 1: If neither of the two stemmed words is present in the ontology, or only one of them is, the weight assigned is 0.

Criterion 2: If the root word is a direct superclass of the other word, the weight assigned is 1.

Criterion 3: If the root word is a direct subclass of the other word, the weight assigned is 0.5.

Criterion 4: If the root word is a (non-direct) subclass of the other word, the weight assigned is 1/level of relationship.

Criterion 5: If the root word is a (non-direct) superclass of the other word, the weight assigned is 1/2 + (1/level of relationship).
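The five criteria can be sketched as follows. The child-to-parent map is a hypothetical fragment of the academics ontology, and `level` counts is-a edges between two classes; both names are illustrative, not from the source.

```python
# A sketch of the five weighting criteria above. The ontology is modelled
# as a child -> parent map (a toy fragment of the academics ontology).

ONTOLOGY_PARENT = {
    "administration": "thing", "departments": "thing",
    "engineering": "departments", "cse": "engineering",
}

def level(ancestor, descendant):
    """Number of is-a edges from descendant up to ancestor, or None."""
    steps, node = 0, descendant
    while node is not None:
        if node == ancestor:
            return steps
        node = ONTOLOGY_PARENT.get(node)
        steps += 1
    return None

def weight(root, other):
    classes = set(ONTOLOGY_PARENT) | set(ONTOLOGY_PARENT.values())
    if root not in classes or other not in classes:
        return 0.0                    # criterion 1: word(s) not in ontology
    down = level(root, other)         # root is a superclass of other
    up = level(other, root)           # root is a subclass of other
    if down == 1:
        return 1.0                    # criterion 2: direct superclass
    if up == 1:
        return 0.5                    # criterion 3: direct subclass
    if up:
        return 1.0 / up               # criterion 4: 1 / level
    if down:
        return 0.5 + 1.0 / down       # criterion 5: 1/2 + 1 / level
    return 0.0

print(weight("engineering", "cse"), weight("cse", "engineering"))  # 1.0 0.5
```

For example, "departments" is a superclass of "cse" two levels up, so criterion 5 gives 1/2 + 1/2 = 1.0, while "cse" as a subclass of "departments" gets 1/2 = 0.5 under criterion 4.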

Rank Calculation and Information retrieval

The cumulative weight is calculated for each combination of words based on the improved matching algorithm. The best document gets the minimum score, and the documents are arranged in ascending order of their cumulative score.
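The ranking step can be sketched as follows; the aggregate scores for six word combinations are hypothetical, and the choice to drop zero-scoring combinations (those with no ontology match) is an assumption based on the worked example later in this section.

```python
# Rank calculation sketch: each word combination has a cumulative weight;
# combinations are sorted in ascending order of that weight, the minimum
# non-zero score being treated as the best match. Scores are invented.

aggregate = {1: 0, 2: 0, 3: 1, 4: 1.5, 5: 0, 6: 2}

# Combinations with score 0 matched nothing in the ontology; the rest are
# ranked in ascending order of cumulative weight.
ranked = sorted((c for c, s in aggregate.items() if s > 0), key=aggregate.get)
print(ranked)  # [3, 4, 6]
```

Documents associated with combination 3 (score 1) would be retrieved first, then those for combination 4 (score 1.5), and so on.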

For example, Ontology for Academic service

An academic institution has large numbers of documents to maintain. Once document maintenance is automated, it is very easy for academicians to retrieve the relevant documents. Any academic institution has to maintain documents such as (i) admission details, (ii) course details, (iii) department details, (iv) the list of programmes conducted by each department, (v) student details, (vi) staff details, (vii) accounts, (viii) conferences and workshops organized, (ix) placement details, and (x) examination details.

The figure shows the ontology for academic services [8]. The ontology is created with the root node "thing", followed by categories such as Administration, Controller of examination, Academic departments, and Placement. Each subcategory has many other nodes related to it.

[Figure: the academics ontology. The root node "Thing" has subcategories Administration (account section, office), Controller of examination, Departments (engineering: CSE, EE, EXTC; polytechnic) and Placement, with further nodes such as circulars, class material, students, staff, meetings, tests and conferences.]

Sample Query Processing

This section shows how the conceptual framework described above helps in the efficient retrieval of documents for the query "I want to know the cse department details of psgtech".

QUERY PROCESSING:
I WANT TO KNOW THE CSE DEPARTMENT DETAILS OF PG

WORD STEMMING:
WANT, KNOW, CSE, DEPARTMENT, DETAILS, PG

ONTOLOGY MATCHING → WEIGHT ASSIGNMENT → RANK CALCULATION

AGGREGATE RESULT (combination : cumulative score):

1 : 0
2 : 0
3 : 1
4 : 1.5
5 : 0
6 : 2

INFORMATION RETRIEVAL

The minimum score is 1, so the details present in the CSE node (a subclass of Department) are retrieved. The next minimum score is 1.5, for which all the department details are retrieved.

[Figure: diagrammatic representation of the proposed framework for the example query, showing the pairwise weight matrices over the terms WANT, KNOW, CSE, DEPARTMENT, DETAILS and PG at each phase; most pairs receive weight 0, with non-zero weights (1, 0.5) for the pairs related through the ontology.]

The figure above shows the diagrammatic representation of the proposed framework, along with the output of each phase. In the weight assignment phase, a score is calculated for each combination of words according to the criteria specified above. In ranking, the aggregated weight is calculated for each combination and the combinations are sorted in ascending order. In our example the minimum score is 1, so the document under department → cse is retrieved, which is exactly what the user needs.

This method provides a new way of searching content on the domain/web. It finds the relevant documents for the user's query using word stemming, ontology matching, weight assignment, rank calculation, etc. In the ontology matching phase, an improved matching algorithm is used to improve the relevancy of retrieval. The query parsing and word stemming phases can be further extended with a query expansion technique, and the remaining phases can be improved by adding social annotations and phrase-level matching.

Chapter-02

Ontology

2.1 The origin of Ontology:

The term "ontology" has been used for a number of years by the Artificial Intelligence and knowledge representation communities but is now becoming part of the standard terminology of a much wider community, including information system modelling.

The term is borrowed from philosophy, where ontology means a systematic account of existence.

The term ontology has been applied in many different ways, but the core meaning is a model for describing the world that consists of a set of types, properties, and relationship types (Garshol, 2004). In the context of knowledge sharing, Gruber (1993a) uses the term ontology to mean a specification of a conceptualization. Gruber (1993b) defines conceptualization as "an abstract, simplified view of the world that we wish to represent for some purpose." Every knowledge management system is committed to some conceptualization, explicitly or implicitly (Gruber, 1993b). Ontology can help define the relationships among resources and find related resources. It is important to emphasize that there are multiple relationships between specific words and concepts. This means that in practice: 1) different words may refer to the same concept, and 2) a word may refer to several concepts.

2.2 What is Ontology?

An ontology is "the specification of conceptualization, used to help programs and humans share knowledge".

An ontology is a set of concepts, such as things, events, and relations, that are specified in some way in order to create an agreed-upon vocabulary for exchanging information.

In information management areas and knowledge sharing areas, ontology can be defined as follows:

An ontology is a vocabulary of concepts and relations rich enough to enable us to express knowledge and intension without semantic ambiguity.

Ontology describes domain knowledge and provides an agreed-upon understanding of a domain.

Ontologies are collections of statements, written in a language such as RDF, that define the relations between concepts and specify logical rules for reasoning about them.

Main Definition of ontology:

"An ontology is a formal, explicit specification of a shared conceptualization."

"explicit" means that "the types of concepts used and the constraints on their use are explicitly defined";

"formal" refers to the fact that "it should be machine readable";

"shared" refers to the fact that "the knowledge represented in the ontology is agreed upon and accepted by a group";

"conceptualization" refers to an abstract model that comprises the relevant concepts and the relationships that exist in a certain situation.

The conceptualization consists of,

The identified concepts (objects, events, etc.)

For ex: Concepts: disease, symptoms, therapy

The conceptual relationships that are assumed to exist and to be relevant

For ex: Relationships: "disease causes symptoms", "therapy treats disease"

2.3 Reasons for developing ontologies

An ontology defines a common vocabulary for researchers who need to share information in a domain. It includes machine-interpretable definitions of basic concepts in the domain and relations among them.

Why would someone want to develop ontology? Some of the reasons are:

• To share common understanding of the structure of information among people or software agents

• To enable reuse of domain knowledge

• To make domain assumptions explicit

• To separate domain knowledge from the operational knowledge

2.4 Types of ontologies

(1) Top-level ontology,

(2) Domain ontology,

(3) Task ontology, and

(4) Application ontology.

First, top-level ontology describes very general concepts like space, time, and events, which are independent of particular problems or domains. Second, domain ontology describes the vocabulary related to a generic domain by specializing the concepts introduced in the top-level ontology. Third, task ontology describes the vocabulary related to a generic task or activity, likewise by specializing the top-level ontology.

Finally, application ontology is the most specific of ontologies. Concepts in application ontologies often correspond to roles played by domain entities while performing a certain activity.

Ontologies can vary in complexity depending on the range of tasks to which they are put, ranging from simple taxonomies to highly tangled networks including constraints associated with concepts and relations.

Light Weight Ontology

Concepts

'is-a' hierarchy among concepts

Relations between concepts

Heavy Weight ontology

Cardinality constraints

Taxonomy of relations

Axioms(restrictions)

In practical terms, developing an ontology includes:

defining classes in the ontology,

arranging the classes in a taxonomic (subclass-superclass) hierarchy,

defining slots and describing allowed values for these slots,

filling in the values for slots for instances.

We can then create a knowledge base by defining individual instances of these classes, filling in specific slot values and additional slot restrictions.
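The four development steps above can be sketched in plain Python dataclasses, standing in for a real ontology editor; the academic-domain class names, slots, and instance values are illustrative, not from the source.

```python
# Sketch of the four steps: define classes, arrange a subclass-superclass
# hierarchy, define slots with allowed values, and fill in slot values
# for individual instances.
from dataclasses import dataclass, field

@dataclass
class Thing:                      # step 1: a class; root of the taxonomy
    name: str

@dataclass
class Department(Thing):          # step 2: subclass-superclass arrangement
    programmes: list = field(default_factory=list)   # step 3: a slot

@dataclass
class EngineeringDept(Department):
    accredited: bool = True       # slot with a constrained (boolean) value

# Step 4: the knowledge base = instances with slot values filled in.
cse = EngineeringDept(name="CSE", programmes=["B.E.", "M.E."])
print(cse.name, cse.programmes)  # CSE ['B.E.', 'M.E.']
```

Because `EngineeringDept` is a subclass of `Department`, the `cse` instance inherits the `programmes` slot, mirroring how an is-a hierarchy propagates properties in an ontology.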

2.5 Benefits of ontology:

To facilitate communication among people and organisations: aid to human communication and shared understanding by specifying meaning

To facilitate communication among systems without semantic ambiguity: i.e. to achieve inter-operability

To provide foundations to build other ontologies (reuse)

To save time and effort in building similar knowledge systems (sharing)

To reuse domain knowledge.

To make domain assumptions explicit: ontological analysis clarifies the structure of knowledge and allows a domain to be explicitly defined and described.

2.6 Application areas of ontologies:

Information Retrieval:

As a tool for intelligent search through inference mechanism instead of keyword matching

Easy retrieval of information without using complicated Boolean logic

Cross language information retrieval

Improve recall by query expansion through synonymy relation.

Improve precision through word sense disambiguation (identification of the relevant meaning of a word, given its context, among all its possible meanings)

Natural language processing:

Better machine translation.

Queries using natural language.

Knowledge management:

As a knowledge management tool for selective semantic access (meaning-oriented access).

2.7 Ontology languages:

RDF:

Resource Description Framework : RDF is a framework for describing Web resources, such as the title, author, modification date, content, and copyright information of a Web page. RDF was designed to provide a common way to describe information so it can be read and understood by computer applications.

JENA:

Jena is a Java framework for building Semantic Web applications. It provides a collection of tools and Java libraries to help you develop Semantic Web and linked-data applications, tools and servers.

The Jena Framework includes:

an API for reading, processing and writing RDF data in XML, N-triples and Turtle formats;

an ontology API for handling OWL and RDFS ontologies;

a rule-based inference engine for reasoning with RDF and OWL data sources;

stores to allow large numbers of RDF triples to be efficiently stored on disk;

a query engine compliant with the latest SPARQL specification;

servers to allow RDF data to be published to other applications using a variety of protocols, including SPARQL.

2.8 An approach for Ontology-based Information Retrieval[20]

Logic-based IR provides a sound platform for reasoning about the meaning of an information resource's content in the retrieval process, i.e. about the relevance of that meaning to the user's information need. In this way, the user can find resources that are relevant to his query even if there are no syntactic similarities between them. Clearly, the quality of the retrieval depends on the quantity and quality of the domain knowledge available to the reasoning process: a logical system can retrieve a document about cars for a query about vehicles if and only if there is a formally described statement that a car is a type of vehicle.

Therefore, in order to enable retrieval of all semantically relevant resources for a query, the knowledge about the domain has to be systematically acquired and described in the form of a domain theory. Moreover, in order to resolve the "prediction problem," the domain theory has to be commonly shared, i.e. a kind of common agreement about the vocabulary used should exist. Since ontologies represent explicit and formal specifications of the conceptualisation of a domain of interest, they seem very suitable for extending logic-based IR systems in the way described above.

Fig.2.8.1 ontology description[20]

Ontology-based Information Retrieval Model[20]

The Retrieval Model

The ontology-based model for information retrieval redefines the task of IR as the extraction, from a given repository of information resources, of those resources r that, given query q, make the formula O ⊢ r → q valid, where r and q are formulae of the chosen logic, "→" denotes the brand of logical implication formalized by the logic in question, and O is a set of logical sentences called domain knowledge (the ontology). A derivability relationship ⊢ holds between a set of formulae and a formula if there exists a finite sequence of applications of the inference rules that leads from the set of formulae to that formula.

Fig.2.8.2 ontology based retrieval model[20]

For the ontology-based IR, we have the following interpretation of the basic retrieval model presented in section 1.2:

- LRes = KB(O), i.e. a resource is modelled as a set of relation instances (facts) from the corresponding knowledge base. This set can be treated as a set of instance assertions; the relations (concepts) of which a fact is asserted to be an instance then altogether constitute the description of the resource;

- LQuery = Ω(O), i.e. a query is modelled as an ontology-based query Q(O); the intuitive meaning of this choice is that all resources represented by facts retrieved for query Q(O), i.e. the set of facts F(Q(O)), should be retrieved;

- IR = I(O) ⊆ LRes , i.e. a repository (collection) of information resources represents a set of all concept instantiations;

- M(I(O), Q(O)), the matching function between the repository and the given query, is implemented through logical inference defined by the logical language used for representing ontology O.

In this model a bottom-up fixpoint evaluation procedure is used. This means that some mappings in M are defined implicitly through the axioms from the set A(O). They can allow the specification of lexical, "thesaural" knowledge as well, i.e. they contribute to the specification of the meaning of the terms used in both document representation and query formulation. In the inference process this kind of knowledge is brought to bear as "background knowledge" according to which queries are to be interpreted. The positive effect is that axioms are in fact a recall-enhancing mechanism, because they support the discovery of resources relevant to the query that would have otherwise gone undetected.

Given a retrieval model, the interaction with a simple ontology-based retrieval system may be described as follows (see Figure 2). The set of information resources and their properties is represented as a set of instances in the knowledge base KB(O). A user's information need is conceptualised in an ontology-based query Q(O). This query is matched against the set of information resources, M(I(O), Q(O)), and the set of answers F(Q(O)) is returned to the user.

Chapter-03

Ontology-based approaches

Ontology-based approaches are characterized by the use of highly detailed conceptualizations in the form of ontologies and KBs. They provide formal descriptions of the meanings involved in user needs and contents. Therefore, these models have a better chance of achieving the so-called semantic search paradigm.

4.1 Semantic Association Analysis in Ontology-based Information Retrieval

This system is based on the Semantic Web. RDF and SPARQL languages do not adequately provide a query mechanism to discover the complex and implicit relationships between resources. Such complex relationships are called semantic associations [8]. The process of discovering semantic associations is also referred to as semantic analytics.

4.1.1.Semantic Association Analysis

The conventional and semantically supported search approaches typically respond to user queries by returning a collection of links to various resources. Users have to verify each document to find the information they need; in most cases the answer is a combination of information from different resources. Relations are at the heart of the Semantic Web [10]. Focusing on Semantic Web technologies, the emphasis of search will shift from searching for documents to finding facts and practical knowledge. Relation searching is a special class of search methods which is concerned with representing, discovering and interpreting complex relationships or connections between resources.

Sheth et al. (2005) discuss an algorithm developed to process different kinds of semantic associations using graph traversal algorithms at the ontology level. The relationships between two entities in the results of a semantic query could be established through one or more semantic associations. In this case the semantic associations could be represented by a graph which shows the connections between entities. It is also important to process and prioritize the semantic associations based on user preferences and the context of search. There are also ranking algorithms proposed based on different metrics to grade the semantic associations [10].

The semantic association analysis consists of several key processes and components: ontology development, data set construction, semantic association discovery, semantic association ranking, results presentation, and performance evaluation. However, there are also other important issues such as entity disambiguation, data set maintenance and so on. Here, ontology development is carried out using the Protégé tool.

4.1.2. Data Set Construction:

The data should be selected from highly reliable Web sites which provide data in structured, semi-structured, or parseable unstructured form, or with a database backend. Structured data is preferred (e.g. RDF or OWL). Semi-structured or parseable unstructured data (e.g. XML) can be transformed to structured data using XPath or XSLT. Data with rich metadata and relations is preferred. For example, for a "Computer Scientist" class, the source also provides "address" and "country" attributes as well as some relations with other classes such as "Research Area", "Publication", "Organization". The data set should have rich relations and a large number of instances which are highly connected.
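As a sketch of such a transformation, the hypothetical XML fragment below is reduced to structured facts with Python's standard-library ElementTree (a lightweight stand-in for a full XPath/XSLT pipeline; all element and attribute names are invented for illustration):

```python
import xml.etree.ElementTree as ET

# A hypothetical semi-structured record about a computer scientist.
xml_data = """
<scientist name="Ada Example" country="UK">
    <researchArea>Information Retrieval</researchArea>
    <publication>Ontology-based Search</publication>
</scientist>
"""

root = ET.fromstring(xml_data)

# Extract attributes and child-element text into a flat fact record,
# ready to be asserted as instance data in a knowledge base.
facts = {
    "name": root.get("name"),
    "country": root.get("country"),
    "researchArea": root.findtext("researchArea"),
    "publication": root.findtext("publication"),
}
print(facts)
```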

4.1.3. Semantic Association Discovery Algorithms

Semantic association discovery can be seen as a special class of semantic search aiming to find complex relationships between entities. The problem can be generalized as enumerating all possible paths between any two nodes in a semantic graph. The search is performed using ontologies and semantic data sets. The structure of the ontology constrains the possible paths that one can take from one node to another. Typically the structure of the ontology, i.e. the relations between classes, is simple; however, the relations between instances in the knowledge base might be very complicated depending on the connectedness of the graph.
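The path-enumeration step can be sketched as a depth-first traversal over a small, hypothetical RDF-like graph (the entity names and relation labels below are invented for illustration):

```python
def all_paths(graph, start, goal, path=None):
    """Depth-first enumeration of all simple paths from start to goal."""
    path = (path or []) + [start]
    if start == goal:
        return [path]
    paths = []
    for _relation, neighbour in graph.get(start, []):
        if neighbour not in path:  # avoid cycles: simple paths only
            paths.extend(all_paths(graph, neighbour, goal, path))
    return paths

# Entities connected by named relations (labels kept for readability).
graph = {
    "Alice": [("worksFor", "UniA"), ("coauthor", "Bob")],
    "Bob":   [("worksFor", "UniA")],
    "UniA":  [("locatedIn", "Oslo")],
}

# Two semantic associations: a direct one, and one through the coauthor Bob.
print(all_paths(graph, "Alice", "UniA"))
```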

4.1.4 Semantic Association Ranking

The ranking mechanism is an important part of a search engine. A ranking algorithm reflects the cognitive thought of human beings towards the ranking of real-world objects according to their perceived importance. The PageRank algorithm contributes to Google's success and is one of the most important reasons that most people prefer to use it. Most of the current search engines rank documents based on the vector space model. In semantic association analysis, an important task is identifying the most meaningful associations among all detected relations. However, new ranking algorithms need to be developed in order to utilize the advantages of Semantic Web technologies.
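As a minimal illustration of the idea (the actual metrics proposed in [10] are considerably more sophisticated), one naive strategy ranks discovered associations by path length, placing more direct associations first:

```python
def rank_associations(paths):
    """Naive ranking: shorter (more direct) association paths rank higher."""
    return sorted(paths, key=len)

paths = [["Alice", "Bob", "UniA"], ["Alice", "UniA"]]
print(rank_associations(paths))  # the direct association comes first
```

Real ranking schemes also weigh relation types, user context, and node importance; path length alone is only a crude proxy for association strength.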

4.1.5 Presentation:

The identified semantic associations could be presented to users in a meaningful way which helps them understand the meaning of entities. We have implemented an automated hypermedia presentation generation engine, called MANA, to construct hypermedia presentations based on documents relating to entities in semantic associations.

Different components in a semantic-enhanced information search and retrieval system [17]

The entities are hyperlinked to those documents which are able to provide external explanations that help users to explore relevant information regarding a submitted query. The figure shows the different components and layers in ontology-based information search, retrieval and presentation.

Finally, we conclude that the semantic analytics area, and in particular the discovery and interpretation of complex relations between entities in a knowledge base, demonstrates significant importance in various application domains: in the Semantic Web, semantic analytics enables search mechanisms to discover and process meaningful relations between information resources.

4.2 Relevance Information Retrieval based on ontology: (ONTOBROKER SYSTEM)[20]

4.2.1 Relevance

Relevance is one of the most important concepts in the theory of IR. The concept arises from the consideration that if the user of an IR system has an information need, then some information stored in some resources in the information repository may be "relevant" to this need. In other words, the information to be considered relevant to a user's information need is the information that might help the user to satisfy his information need. Any information that is not considered relevant to a user's information need is to be considered "irrelevant" to that information need. This is a consequence of accepting the concept of relevance.

Therefore, given a set of information resources and a query, the task of the retrieval process is to retrieve those resources, and only those, whose information content is relevant to the information content of the query. Relevance is thus central to the quality of ranked retrieval, and its importance is the main reason why the logical formalisation of information retrieval is a non-trivial problem:

first, in determining the relevance of a resource to a query, the success or failure of an implication relating the two is not enough. It is necessary to take into account the uncertainty inherent in such an implication,

second, the introduction of uncertainty can also be motivated by the consideration that a collection of resources cannot be considered as a consistent and complete set of statements. In fact, resources in the collection could and often do contradict each other in any particular logic and not all the necessary knowledge is available, and,

finally, what is relevant is decided by the user from session to session and from time to time, and is then heavily dependent on judgments where highly subjective and scarcely reproducible factors are brought to bear.

Due to their conceptual nature, ontologies provide an ideal abstraction level, on top of conditional reasoning, for defining this flexible notion of relevance, which we will call conceptual relevance. This relevance is explained in detail in the next section.


4.2.2 Conceptual Relevance

There are different interpretations of probability that can be used for calculating relevance [163]. Traditionally, one can understand probability from the frequency point of view. That is, probability is a statistical notion, concerning itself with the statistical laws of chance. On the other hand, probability can be interpreted as the degree of belief, the epistemological view. This view concerns the assignment of beliefs in propositions. Different interpretations of the theory of probability lead to different approaches for modelling relevance in information retrieval.

Therefore, in the traditional relevance models, relevancies are obtained simply by counting the number of resources containing a particular descriptor or index term. The argument for using the statistical notion of relevance is that probabilities should be viewed as a measure of chance at the implementation level. However, the neglect of other explanations of relevance at the conceptual level is perhaps the source of difficulties in the conventional relevance model. Since ontologies represent a conceptual model of a domain, they seem to be an ideal source for defining this epistemological view on relevance. Moreover, the conceptual notion of relevance is one of the fundamental characteristics (advantages) of ontology-based information retrieval.

Considering the conceptual level, there are two views on the relevance of a resource r for an information need expressed in a query q, which we define in the following two definitions.

Collection relevance:

It represents the relevance of the resource with regard to the given information repository (so-called collection relevance, in the notation ColRel):

ColRel: Π(O) × KB(O) → R

Explanation relevance

It represents the relevance of the retrieval process M (see previous section) in which a resource (i.e. a result of a query) is retrieved (so-called explanation relevance, in the notation ExpRel):

ExpRel: M × Π(O) → R

Fig.4.2.2 The ranking process in the Ontobroker retrieval [20]

We presented the formal model for ontology-based information retrieval as an extension of the existing logic-based IR models, especially in defining the notion of relevance. We proposed a comprehensive model for relevance, the so-called conceptual relevance, which models not only whether an information resource is relevant for a query but also why (and consequently, how strongly) a resource is relevant for a query. In that way, by combining the relevance regarding how (i.e. why) an information resource is retrieved in a retrieval system (the so-called explanation relevance) with how semantically this information is related to other relevant information (the so-called collection relevance), we tried to mimic the relevance reasoning found in human beings.

4.3.Ontology based Information Retrieval by Semantic Similarity (SSRM)[18]

This approach investigates measures of semantic similarity and relatedness for use in ontology-based information retrieval. The underlying hypothesis is that by extending the classical information retrieval models to include the knowledge contained in ontologies covering the domain of the information base, we obtain means for producing better answers to user queries.

Better answers are, in this context, primarily a more fine-grained ranking of information base objects, which is obtained by exploiting better methods for computing the similarity or relatedness between a query and objects from the information base.

Semantic similarity between concepts is typically calculated using only the information available in a concept inclusion hierarchy. However, semantic relatedness between concepts can be viewed as the aggregate of the overall interconnection between the concepts in question, considering a wider number of semantic relations.

In the retrieval task, a user poses a query representing an information need to the system. The information retrieval system must satisfy the user's information need by analysing both the query and the documents and then presenting a list of documents to the user that are found relevant to that particular query. This list of documents is the result of a matching process that compares each document with the query. The main function of the analysis of the query is to derive a representation that can be matched with the document representation. One way of including the knowledge contained in the ontology is to choose a representation formalism where queries and objects are described using a concept language and where the expressions can be directly mapped into the ontology. We can then calculate the similarity between the description of the query and the descriptions of the objects, based on the nearness principle derived from the ontology.

One approach to a nearness principle is reasoning over the ontology. The approach here is based on a relatedness measure between concepts, derived from the structure and relations of the ontology, which is then used to perform query expansion. By doing so, we can replace semantic matching from direct reasoning over the ontology with numerical similarity calculation by means of a general aggregation principle. This has at least two advantages in an information retrieval context. The first is that it allows for partial matching of queries, and the second is that it is less time consuming.

4.3.1 Representation of ontologies:

The main concept is measures of similarity derived from the structure and relations of an ontology for use in information retrieval. The aim is therefore to identify the type of ontology formalism, as well as the main ontology components needed to form the basis for deriving and calculating similarity.

Ontologies have typically been represented using frames [11], conceptual graphs [12], first-order logic or description logics [13]. Dominant in the last five years are new representation schemes based on description logic languages, such as OIL and OWL [14].

4.3.2 Introduction to Description Logic:

Description logic is a knowledge representation formalism that represents the knowledge of an application domain by defining the relevant concepts and roles of the domain, and then using these concepts and roles to specify properties of objects and individuals occurring in that domain. Concepts are sets of individuals and roles are binary relationships between individuals. The atomic concepts and roles can, by means of concept constructors, be combined into complex descriptions.

Apart from the representation formalism, description logic offers reasoning capabilities that allow for the inference of implicit knowledge, means for limited querying, and support for the identification of contradictory concepts. A knowledge base system based on description logic consists of two components, a TBox and an ABox.

The TBox describes the structure of a domain in terms of classes (concepts) and properties (roles). The description consists of a set of terminological axioms, which are statements about how the concepts and roles are related to each other. This means that in description logic, concepts are defined intensionally in terms of descriptions that specify what properties objects must have to belong to a certain class.

The ABox consists of assertions about named individuals, using the concepts and roles defined in the TBox.
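A toy rendering of the TBox/ABox split, assuming invented concepts and individuals, might look like this:

```python
# TBox: terminological axioms (here only concept inclusion, child -> parents).
TBOX = {
    "Poodle": ["Dog"],
    "Dog": ["Animal"],
}

# ABox: assertions about named individuals, using TBox concepts.
ABOX = {"fido": "Poodle"}

def instance_of(individual, concept):
    """Check whether the ABox plus TBox entail `individual : concept`."""
    frontier = [ABOX[individual]]
    while frontier:
        c = frontier.pop()
        if c == concept:
            return True
        frontier.extend(TBOX.get(c, []))  # follow concept inclusion upwards
    return False

print(instance_of("fido", "Animal"))  # True: Poodle is subsumed by Dog, Dog by Animal
```

Real description logic reasoners handle far richer constructors (intersection, value restrictions, negation); this sketch shows only the basic reasoning pattern of inferring implicit knowledge from asserted facts and terminological axioms.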

4.3.3 The description language:

The name of a description logic denotes the concept constructors available. AL is the Attribute Language, as introduced by Schmidt-Schauß and Smolka [15]. Concept descriptions in AL are formed according to the following syntax rules, where A denotes atomic concepts, C and D denote complex concepts, and R denotes atomic roles.

Syntax rules for AL are:

C, D → A (atomic concept) | ⊤ (universal concept), etc.

4.3.4 Semantic Similarity Measures:

A way of measuring semantic similarity in a semantic network is to evaluate the distance between the concepts being compared, where a shorter distance means higher similarity (the shortest path length approach). The semantic similarity measures are:

Weighted Shortest Path

Depth-relative Scaling Approaches

Information Content

Hierarchical Concept Graphs

4.3.4.1 Weighted Shortest Path:

Another simple edge-counting approach was presented in Bulskov et al. [16] for use in information retrieval. We argued that concept inclusion (ISA) intuitively implies strong similarity in the opposite direction of the inclusion (specialization). In addition, the direction of the inclusion (generalization) must contribute some degree of similarity. With reference to the following ontology, the atomic concept dog has high similarity to the concepts poodle and alsatian.

The measure respects the ontology in the sense that every concept subsumed by the concept dog by definition bears the relation ISA to dog. The intuition is that, to a query on dog, an answer including instances of poodle is satisfactory (a specific answer to a general query). Because the ISA relation obviously is transitive, we can by the same argument include further specializations, e.g. include poodle in the extension of animal. However, similarity exploiting the taxonomy should also, as was the case in Rada's approach, reflect "distance" in the relation. Intuitively, greater distance (a longer path in the relation graph) corresponds to smaller similarity.

Specialization Property

Concept inclusion implies strong similarity in the opposite direction of the inclusion. Furthermore, generalization should contribute to similarity. Of course, it is not strictly correct, but because all dogs are animals, animals are to some degree similar to dogs. Thus, the property of generalization similarity should be exploited. However, for the same reasons as in the case of specializations, transitive generalizations should contribute a decreased degree of similarity.

Generalization Property

Concept inclusion implies reduced similarity in the direction of the inclusion. A concept inclusion relation can be mapped into a similarity function in accordance with the two properties described above and the minimal distance property as follows. Assume an ontology given as a domain knowledge relation.

The above fig. can be viewed as such an example. To make "distance" influence similarity, we assume the ISA relation to be transitively reduced.

An example ontology with the relation ISA covering pets [18]

Similarity reflecting "distance" can then be measured from the path length in the graph corresponding to the ISA relation. A similarity function "sim" based on the distance dist(x, y) in ISA should have the following properties:

1. sim: U × U → [0, 1], where U is the universe of concepts

2. sim(x, y) = 1 only if x = y

3. sim(x, y) < sim(x, z) if dist(x, y) > dist(x, z)

Properties 2 and 3 correspond to the Identity Property and the Minimal Distance Property respectively. By parameterizing with two factors σ and γ, expressing the similarity of immediate specialization and generalization respectively, we can define a simple similarity function as follows. Consider a path between nodes (concepts) x and y using the ISA relation:

P = (p1, …, pn)

where pi ISA pi+1 or pi+1 ISA pi for each i, with x = p1 and y = pn.

Given a path P = (p1, …, pn), set s(P) to the number of specializations and g(P) to the number of generalizations along the path P, as follows:

s(P) = |{i | pi ISA pi+1}|

and

g(P) = |{i | pi+1 ISA pi}|

If P1, …, Pm are all paths connecting x and y, then the degree to which y is similar to x can be defined as follows:

simWSP(x, y) = max_j { σ^s(Pj) · γ^g(Pj) }

We denote this measure simWSP(x, y) (Weighted Shortest Path), as the similarity between two concepts x and y is calculated as the maximal product of weights along the paths between x and y. This similarity can be considered as derived from the ontology by transforming the ontology into a directional weighted graph, with σ as the downwards and γ as the upwards weights, and with similarity derived as the product of the weights on the paths. Figure 4.3.4.1 shows the corresponding graph.

As such, the measure is in accordance with the Specialization Property, the Generalization Property, and the Identity Property, because there is an edge with weight 1 from every concept to itself. Furthermore, it conforms with the Minimal Distance Property, where "minimal" is interpreted as the maximal product of weights along all possible paths between two concepts. A widely acknowledged problem with the shortest-path approaches is that they typically rely on the notion of uniform distance in the taxonomy. This implies, as mentioned previously, that not all edges (links) denote the same distance and therefore not the same similarity. There have therefore been various attempts at scaling the network by incorporating the position in the taxonomy of the concepts being compared.

The ontology transformed into a directed weighted graph, with the immediate specialization and generalization similarity values σ = 0.9 and γ = 0.4 as weights. Similarity is derived as the maximal (multiplicative) weighted path length, and thus sim(poodle, alsatian) = 0.4 × 0.9 = 0.36 [18].
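Under these assumptions (σ = 0.9, γ = 0.4, and a transitively reduced pets ontology), the WSP measure can be sketched as follows. For a tree-shaped ISA relation, every path between two concepts goes up from one concept to a common ancestor and down to the other, so the maximum can be taken over common ancestors:

```python
SIGMA, GAMMA = 0.9, 0.4  # specialization (down) and generalization (up) weights

# ISA edges, transitively reduced: child -> parent.
ISA = {"poodle": "dog", "alsatian": "dog", "dog": "animal", "cat": "animal"}

def _ancestors(c):
    """Yield (steps_up, ancestor) pairs, including (0, c) for the concept itself."""
    steps = 0
    while c is not None:
        yield steps, c
        c = ISA.get(c)
        steps += 1

def sim_wsp(x, y):
    """max over paths of SIGMA^s(P) * GAMMA^g(P), for a tree-shaped ISA relation."""
    best = 0.0
    for up_x, anc_x in _ancestors(x):
        for down_y, anc_y in _ancestors(y):
            if anc_x == anc_y:
                # up_x generalization steps from x, down_y specialization steps to y
                best = max(best, GAMMA ** up_x * SIGMA ** down_y)
    return best

print(round(sim_wsp("poodle", "alsatian"), 2))  # 0.36, i.e. 0.4 * 0.9
print(sim_wsp("dog", "dog"))                    # 1.0 (Identity Property)
```

For ontologies with multiple inheritance, a full implementation would enumerate all paths rather than only ancestor pairs, but the weighting scheme stays the same.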

4.4 Semantic knowledge annotation

All the meanings and information conveyed by content in unstructured form (such as text or audiovisual content) cannot in general be fully translated to a clear and formal semantic representation, for both pragmatic (cost) and intrinsic (problems for the formalization of the world) reasons. However, it is possible to formally describe parts of the conveyed information, albeit to an incomplete extent, as metadata. Metadata is data about other data (e.g., the ISBN number and the author's name are metadata about a book). For the same reason that it is generally useful to keep both parts of information (data and metadata) in the system, it is also relevant to have a link that connects the two of them, commonly known as annotation.

Different syntactic supports and standards have been proposed for the representation of metadata and annotations. Markup languages like HTML and XML are widespread nowadays, but they have limitations in their expressiveness and shareability (Passin, 2004). Ontology-based technologies have been developed in the last few years to address and overcome these limitations. For example, imagine a document that contains the keyword "jaguar". This keyword is ambiguous because it might refer to the animal or to the car. An ontology-based annotation can relate the word "jaguar", appearing in the document, to an ontology concept that defines "jaguar" as the abstract concept "animal", thus removing any ambiguity.
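A minimal sketch of such an annotation, with invented identifiers throughout, is simply a link from a text span in the document to a concept in the ontology:

```python
# Two ontology concepts sharing the ambiguous surface label "jaguar".
ONTOLOGY_CONCEPTS = {
    "Jaguar_Animal": {"type": "Animal", "label": "jaguar"},
    "Jaguar_Car":    {"type": "Car",    "label": "jaguar"},
}

# The annotation is the link between data (a span in a document)
# and metadata (an ontology concept identifier).
annotation = {
    "document": "doc-42",
    "span": "jaguar",
    "concept": "Jaguar_Animal",  # disambiguated: the animal, not the car
}

concept = ONTOLOGY_CONCEPTS[annotation["concept"]]
print(concept["type"])  # Animal: the keyword's ambiguity is resolved
```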

A survey of ontology-based technologies for semantic annotation is reported in (Uren, et al.,2006). This work proposes a document centric model for ontology-based semantic annotation that manages three elements: ontologies (metadata), documents (data, or content in unstructured form) and annotations (links between the data and the metadata). They identify seven requirements for ontology-based semantic annotation systems:

• Standard formats: using standard formats is preferred whenever possible, because the investment in marking up resources is considerable, and standardization builds in future-proofing as new tools, services, etc. emerge.

• User-centered/collaborative design: in the case of manual annotation tools, it is crucial to provide users with easy-to-use interfaces that simplify the annotation process and place it in the context of their everyday work.

• Multiple Ontology support: annotation tools need to be able to support multiple ontologies. For example, in a medical context, there may be one ontology for general metadata about a patient and other technical ontologies that deal with diagnosis and treatment.

• Support of heterogeneous document formats: standards for annotation tend to assume that the documents being annotated are in Web-native formats such as HTML and XML. However, with the emergence of new multimedia content in the Web, documents will be in many different formats (audio, video, etc).

• Document evolution: ontologies and documents change continuously, which means that the annotation process should not be fixed.

• Annotation storage: the ontology-based semantic annotation model assumes that annotations will be stored separately from the original documents. However, many tools store the annotations as an integral part of the documents and therefore do not decouple data and metadata.

• Automation: an important aspect of easing the knowledge acquisition bottleneck is the provision of facilities for automatic mark up of document collections. To achieve this, the integration of knowledge acquisition technologies into the annotation environment is vital.

The work in (Uren, et al., 2006) also analyzes different annotation tools considering these seven annotation requirements. Fig 3.8 shows a comparison considering the first six requirements, while Fig 3.9 represents just the automation requirement. As we can see in Fig 3.9, many systems have some kind of automatic or semi-automatic support for annotations.

4.4.1 TAP

TAP is proposed as a Web-based search system where documents and concepts alike are nodes in a semantic network []. This work views the Semantic Web as a big network containing resources corresponding not just to media objects (such as Web pages, images, audio clips, etc.) as the current Web does, but also to domain objects like people, places, organizations, and events. This vision is complemented with multiple relations between resources rather than just one kind (hyperlinks). These resources and their relationships are described in an RDF representation, where explicit, embedded annotations are added to link resources with the corresponding documents where they appear. The system relies on the Google Web search engine to carry out keyword-based searches. To extract the semantic information related to search results, the ontology can be queried by three different methods:

• GetData: a semi-structured query allowing Semantic Web applications to consume this semantic information. In this method, concepts and relations are expressed as: GetData(<resource>, <property>) => value.

• Search: a string is taken as input, and all the resources that contain the string in their "title property" are returned.

• Reflection: similar to the reflection methods provided by object-oriented languages, returns a list of incoming and outgoing arcs of a node.

Once