Semantic Information Retrieval Using Fuzzy Computer Science Essay




Several approaches have been introduced in the field of information retrieval. Although these approaches are effective, they sometimes fail to provide accurate information to the user. This paper presents an ontology-based approach to information retrieval that uses a fuzzy set of documents for a specific domain. An algorithm for fuzzy classification of web documents is proposed to create a semantic index. The proposed algorithm differs from others in that it uses both the K-means clustering algorithm, to find semantically similar terms, and a domain ontology. The retrieved results remain semantically relevant because they are limited to documents whose fuzzy value falls within a threshold range of the classification.

Keywords: Information Retrieval, Ontology, Semantic search, Query Expansion, K-means clustering, Fuzzy Sets.


With the rapid growth of information technology, information retrieval on the Internet is gaining importance day by day. The web comprises a huge amount of data, and search engines help users navigate it and find relevant information. In practice, however, the World Wide Web has proven less efficient at returning relevant information for a user's query. Although search engines seem at first glance to provide accurate information, most of the time they fail to obtain the information most relevant to the user's query. Effort should therefore go into methods that improve the quality of results rather than merely sorting a mass of documents, which contributes little to result quality. The biggest challenge in this process is finding the data most relevant to the user's interest. At present, the relevancy of a web document is evaluated by matching each keyword of the query against information on the web [14]. To overcome this problem, semantic approaches such as the Semantic Web establish semantic relationships among documents [15]. In the Semantic Web, information is stored in a conceptual hierarchy, referred to as an ontology, developed in the Web Ontology Language (OWL). Similarity can therefore be evaluated using the semantics of the concepts in the ontology. Information is represented in RDF (Resource Description Framework), which gives each concept a meaning the user can understand. There is a need to extract information more intelligently so that it meets the user's requirements to a much greater extent, and knowledge-base repositories such as ontologies are used for this purpose. An ontology defines concepts and relationships that describe a document in domain-specific terms. Using an ontology for information retrieval therefore allows information to be extracted on the basis of semantic associations (links) rather than by keyword matching alone.

Fuzzy logic can be quite beneficial in complex situations where people have difficulty making decisions and find complex mathematical models hard to use. One such field is information retrieval, where an IR system finds it difficult to decide which information is accurate for the user.

A fuzzy set A in a universe of discourse U is characterized by a membership function $\mu_A(x)$ that takes values in the interval [0, 1] [16]. Fuzzy set theory was proposed by Zadeh [16] in 1965 and is also referred to as 'approximate reasoning'. Each element of a fuzzy set has a membership value between 0 and 1, and the membership function defines the degree of membership of each element in the fuzzy set. The membership function can be defined in two ways to represent a fuzzy set: 1. Discrete membership functions. 2. Continuous membership functions.
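The two kinds of membership function can be illustrated with a minimal Python sketch. The document names, the grades and the linear ramp between 0 and 100 are assumptions chosen for illustration; they are not taken from the paper.

```python
# Discrete membership function: an explicit table from element to grade.
# The element names and grades are illustrative assumptions.
discrete_mu = {"doc_a": 0.0, "doc_b": 0.4, "doc_c": 1.0}


def continuous_mu(x, low=0.0, high=100.0):
    """Continuous (linear ramp) membership function mapping x to [0, 1]."""
    if x <= low:
        return 0.0
    if x >= high:
        return 1.0
    return (x - low) / (high - low)


# Every grade of a fuzzy set must lie in the interval [0, 1].
assert all(0.0 <= grade <= 1.0 for grade in discrete_mu.values())
print(continuous_mu(40.0))
```

A ramp is only one possible continuous shape; triangular or sigmoid functions would serve equally well in this role.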

This paper presents an approach for finding relevant documents using semantic similarity, fuzzy concepts and a query expansion technique. Semantic similarity between concepts is determined by applying k-means clustering. The proposed technique is based on fuzzy searching and on extracting precise information from a user-defined knowledge base.

The rest of the paper is organized as follows: Section 2 reviews related work; Section 3 presents the proposed system, which consists of calculating semantic similarity, applying the fuzzy concept and performing query expansion, along with an example. Finally, Section 4 concludes the paper.


Many approaches to information retrieval have been proposed by a number of researchers. In these approaches, the interpretation, searching and collection of information can be handled semantically on the basis of an ontology. S. Robertsone et al. defined an ontology-based information retrieval model in which semantic search over a large repository of documents is supported by a domain-specific knowledge base [2]. A. Kiryakov produced an architecture for indexing, semantic annotation and extraction of documents with respect to semantic repositories [2]. Xing Jiang et al. proposed user ontology, an ontology-based model that provides personalized information services on the Semantic Web. In this model, the concepts, taxonomic relationships and non-taxonomic relationships of a given domain ontology are used to infer the user's interest [4]. A survey of existing research shows various applications of query expansion in information retrieval, such as query expansion in a graph-based approach to multi-document summarization [5], a syntactically based technique of query reformulation [6], and automatic query formulation through a semi-supervised incremental algorithm [7]. The use of concepts for finding information is also an active area of research; ontology-based [8], semantic [9] and concept-based query expansion are some examples. Search can be improved by using query expansion techniques. There are two methods of query expansion: local analysis and global analysis [10]. In local analysis, a new query is generated on the basis of a few retrieved documents. In global analysis, new terms from external resources such as WordNet or a thesaurus are added to the original query.
There are many query expansion methods [11]. Although they help improve search quality, they have some problems: relevance feedback is not appropriate in many cases, and finding relevant documents becomes difficult when the query is ambiguous [12], in which case the user has no way to convey his or her interest to the system.
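Global analysis can be sketched as a lookup in an external resource. In the minimal Python sketch below, the `THESAURUS` dictionary is a hypothetical stand-in for a resource such as WordNet, and `expand_query` is an illustrative helper name rather than part of any cited method.

```python
# Hypothetical thesaurus standing in for an external resource such as WordNet.
THESAURUS = {
    "technical": ["B.Tech", "M.Tech", "MCA", "BCA"],
    "management": ["MBA", "BBA", "PGDM"],
}


def expand_query(query, thesaurus=THESAURUS):
    """Return the original query terms followed by related thesaurus terms."""
    expanded = []
    for term in query.split():
        for candidate in [term] + thesaurus.get(term.lower(), []):
            if candidate not in expanded:  # keep order, drop duplicates
                expanded.append(candidate)
    return expanded


print(expand_query("management courses"))
```

Terms without a thesaurus entry pass through unchanged, so the expanded query always contains the original one.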

There are different models for calculating semantic similarity between concepts in an ontology: lexical and syntactical, structural, information-theoretical and feature-based models. Among them, feature-based models have proven to be the most efficient similarity technique [13]. We are motivated by the feature-based model because it has been shown to be very close to human judgment.


In this paper we propose a technique for returning fewer but much more relevant results for a user query. The main objective of this approach is to avoid returning results that are not relevant from the user's point of view.

Overall Architecture

We propose an architecture (shown in Fig. 1) for semantic information retrieval based on document classification. When a user enters a query through the user interface, query expansion is performed first. The expanded query is used for information retrieval through the semantic IR module. The index manager receives the expanded query and retrieves the corresponding results through the semantic index. The semantic index is created with the help of the proposed Document Classification Algorithm, which is based on fuzzy sets, a domain ontology and the K-means clustering algorithm. Document classification is done by the document manager, which receives the web documents fetched by the web crawler (for a specific domain) and applies the Document Classification Algorithm.



Fig. 1: Overall architecture of the proposed system. Components shown: User Interface, Query Expansion, Expanded Query, Index Manager, Semantic Index, Semantic IR, Doc. Manager, Document Classification, Web Crawler, K-means algorithm, Semantic terms.

The steps performed in semantic IR process are given as follows:

1) Create an ontology or reuse an existing one.

2) The web crawler fetches web documents containing terms taken from the ontology (classes).

3) Semantic terms are extracted by applying the k-means clustering algorithm to the classes found in the fetched documents and in the ontology.

4) The documents containing semantic terms are sent to the document manager.

5) The proposed algorithm is used for document classification, and a semantic index is created consisting of a fuzzy set of documents with their attributes.

6) When a user enters a query, the query is expanded and the results within a fuzzy threshold are retrieved through the semantic index with the help of the index manager.

Extraction of semantic terms:

The k-means clustering algorithm [17] is applied to the relevant classes found in the documents to find the terms that are semantically similar to those classes. Each cluster created by the algorithm consists of a set of semantically related terms. The algorithm follows two steps:

1) Initially, the terms that act as centroids are chosen randomly.

2) Terms that are semantically similar to a centroid are assigned to the same cluster.

These steps are repeated until a stopping condition is met.

For example:

Initial classes chosen randomly: Management, Technical, AppliedScience etc.

Semantically similar terms: Management = {MBA, BBA, PGDM}

Technical = {B.Tech, M.Tech, MCA, BCA}

AppliedScience = {B.Sc, M.Sc}
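The clustering step can be sketched in Python as follows. The paper does not say how terms are represented or how similarity is measured, so the 2-D coordinates in `TERM_VECTORS` and the use of Euclidean distance are assumptions. The initial centroids are passed in explicitly (here the three class terms) instead of being drawn at random, purely to keep the sketch reproducible.

```python
import math

# Hypothetical 2-D feature vectors for terms; the coordinates are illustrative
# assumptions, since the source does not specify a term representation.
TERM_VECTORS = {
    "Management": (1.0, 1.0), "MBA": (1.1, 0.9), "BBA": (0.9, 1.2), "PGDM": (1.2, 1.1),
    "Technical": (5.0, 5.0), "B.Tech": (5.1, 4.9), "M.Tech": (4.9, 5.2),
    "MCA": (5.2, 5.1), "BCA": (4.8, 4.8),
    "AppliedScience": (9.0, 1.0), "B.Sc": (9.1, 0.9), "M.Sc": (8.9, 1.2),
}


def kmeans(vectors, initial_centroids, max_iter=100):
    """Plain k-means: assign each term to its nearest centroid, then re-average."""
    centroids = list(initial_centroids)
    assignment = {}
    for _ in range(max_iter):
        new_assignment = {
            term: min(range(len(centroids)), key=lambda c: math.dist(vec, centroids[c]))
            for term, vec in vectors.items()
        }
        if new_assignment == assignment:  # stopping condition: no term moved
            break
        assignment = new_assignment
        for c in range(len(centroids)):
            members = [vectors[t] for t, a in assignment.items() if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    clusters = [[] for _ in centroids]
    for term, c in assignment.items():
        clusters[c].append(term)
    return clusters


seeds = [TERM_VECTORS[name] for name in ("Management", "Technical", "AppliedScience")]
for cluster in kmeans(TERM_VECTORS, seeds):
    print(cluster)
```

With these made-up coordinates each class term ends up grouped with its degree names, mirroring the example above; real term vectors would need a similarity model that the paper leaves open.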





Fig. 2: Terms clustering. Labels shown: B.Tech, BS, BE, ME, MS, M.Tech, PGDCA; Applied Science.

The Algorithm for Document Classification:

Let $S(D)$ be the set of documents, where each $D$ is a web document. $SK$ denotes the set of semantic keywords, and $SK_j(D_i)$ is the jth term in the ith document.

Input: Set of documents fetched by crawler

Output: Classified documents within a boundary

Step 1: A set of documents is fetched by the web crawler for the concepts taken from the ontology.

Step 2: Compute $SK$ by applying the K-means clustering algorithm to the ontology.

Step 3: Find the number of semantic terms present in each document, $|SK_j|$, where $SK_j$ is the set of semantic terms found in $D_i$.

Step 4: Find the percentage of keyword match $P_i(x)$ for each document $D_i$:

$$P_i(x) = \frac{|SK_j|}{|SK|} \times 100$$

Step 5: Compute the membership function $\mu_d : X \to [0, 1]$ for each document:

$$\mu_{d_i}(x) \leftarrow P_i(x)$$

with the percentage expressed as a fraction in [0, 1] (for example, 40% gives 0.4).

Step 6: Classify the documents by creating a fuzzy set of documents, setting a boundary on the fuzzy value through the membership function:

$$K[i] = \begin{cases} 0 & \text{if } \mu_{d_i}(x) = 0 \\ 1 & \text{if } \mu_{d_i}(x) = 1 \\ \mu_{d_i}(x),\ 0 < \mu_{d_i}(x) < 1 & \text{otherwise} \end{cases}$$

Step 7: Associate $SK_j$ with $D_i$ along with its URL and its value of K, and create the semantic index. The documents in the result set, RSet, are given by

$$RSet = \int_U K(x) / x$$

(Here the integral sign does not denote integration; it denotes the collection of all points $x \in U$ with their associated value of K.)
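The classification steps above can be sketched in Python as follows. The function name `classify_documents`, the dictionary layout of the index entries, the whitespace tokenisation and the `example.org` URLs are all assumptions made for illustration; the paper does not prescribe an implementation.

```python
def classify_documents(documents, semantic_keywords):
    """Build a semantic index: one entry per document with its fuzzy value K.

    `documents` maps a URL to the document text. The data layout and the
    simple whitespace tokenisation are assumptions of this sketch.
    """
    sk = {keyword.lower() for keyword in semantic_keywords}
    if not sk:
        raise ValueError("the set of semantic keywords must not be empty")
    index = []
    for url, text in documents.items():
        tokens = {token.strip(",;:()").lower() for token in text.split()}
        matched = sk & tokens        # semantic terms found in this document
        k = len(matched) / len(sk)   # matched share as a grade in [0, 1]
        index.append({"url": url, "terms": sorted(matched), "K": k})
    # Documents with the highest fuzzy value come first.
    index.sort(key=lambda entry: entry["K"], reverse=True)
    return index


# Hypothetical documents and keywords.
documents = {
    "http://example.org/d1": "MBA and BBA admissions open",
    "http://example.org/d2": "Weather report",
    "http://example.org/d3": "MBA BBA PGDM B.Tech M.Tech",
}
semantic_keywords = ["MBA", "BBA", "PGDM", "B.Tech", "M.Tech"]
for entry in classify_documents(documents, semantic_keywords):
    print(entry["url"], entry["K"])
```

Because the boundary cases 0 and 1 coincide with the membership value itself, the sketch stores the matched fraction directly as K.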

First, a set of classes is extracted from the ontology and used to find a set of relevant documents, which are fetched by the web crawler (the "web documents"). The K-means clustering algorithm is applied to the ontology to find the terms semantically related to these classes. Then $P_i(x)$ is calculated from the number of semantic terms present in each document (found in Step 3). Based on the value of $P_i(x)$ calculated for each document, a membership value is assigned, i.e. $\mu_{d_i}(x) = P_i(x)$ expressed as a fraction between 0 and 1. After membership values have been assigned, the documents are classified as close to the user query or far from the user query. On the basis of these values a semantic index is created in which the documents are stored along with their values. The documents are ordered in decreasing order of their value and presented to the user as the final response.

A threshold value is then applied to filter the results again: within the fuzzy set F, a higher value represents higher relevance. When a user enters a query through the user interface, the query is expanded and the expanded terms are submitted to the index manager, which fetches through the semantic index the results whose value y lies between the threshold and 1 (Threshold ≤ y ≤ 1).







Fig. 2: Membership function for web documents ($\mu_D$) based on the matched semantic terms of each document.

Fig. 2 shows the representation of the fuzzy set through its membership function, by which documents are classified on the basis of two parameters: close to the selected classes of the domain-specific ontology, and far from the selected classes. One set is the set of all web documents fetched by the web crawler, and the other is the set of highly relevant documents; in this set a higher value means higher relevance. The final results fall in the range from the threshold to 1.


Using an ontology to find information makes it feasible to expand the search from a given item to its related items. For example, we take the classes management, technical and applied sciences from our ontology and submit them to the web crawler. The crawler fetches URLs from the web, as shown in Table 1. Only those URLs that contain the given classes (management, technical and applied sciences) are selected, as shown in Table 2. Table 3 describes the clusters of semantically similar terms, from which the final set of semantic keywords is generated. These semantic terms are matched in each document, and a fuzzy value is assigned to each selected page on the basis of its keyword hits; a semantic index is then created. To calculate the fuzzy value of each document, let SK be the total number of semantic keywords and SKj the number of matched terms, so that the percentage K of matched terms can be calculated for each document in the set. For example, if D1 has 40% matched terms, its fuzzy value is 0.4; every document is assigned a fuzzy value in this way.

TABLE 1. URLs fetched by the crawler

Web Pages










TABLE 2. Set of Cluster having semantic terms




M.Tech, B.Tech, MCA, BCA

Applied Science

M.Sc, B.Sc

M.Tech, B.Tech, MCA, BCA, …

FT, PT, integrated, …

TABLE 3. Set of Cluster having semantic terms

$$SK = \bigcup_{i} C_i$$

where $SK$ is the set of semantic keywords, $C_i$ is the ith cluster and $\bigcup$ represents the union of the clusters.

$$K = \frac{SK_j}{SK} \times 100$$

For example, in document D1:

Keywords matched, M = $SK_j$ = 8

Total semantic keywords, $SK$ = 20

K = (8/20) × 100 = 40%

Hence the degree of membership of D1 is 0.4.
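The arithmetic of this example, together with the union of clusters that produces the keyword set, can be checked with a short Python sketch. The three clusters listed here are hypothetical and smaller than the 20-keyword set of the example; only the figures 8 and 20 come from the text.

```python
# Hypothetical clusters; the keyword set is the union of all clusters.
clusters = [
    {"MBA", "BBA", "PGDM"},
    {"B.Tech", "M.Tech", "MCA", "BCA"},
    {"B.Sc", "M.Sc"},
]
sk = set().union(*clusters)


def fuzzy_value(matched_count, total_keywords):
    """Return (percentage, membership grade) for one document."""
    grade = matched_count / total_keywords
    return grade * 100, grade


# Figures from the D1 example: 8 matched keywords out of 20.
percentage, grade = fuzzy_value(8, 20)
print(len(sk), round(percentage), grade)
```

Computing the grade first and deriving the percentage from it avoids rounding noise when the value is later compared with a threshold.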

The limit is set from the threshold to 1, and the final results are extracted within this limit. Suppose the threshold value is set to Th = 0.75. When a user enters a query, query expansion is performed, the expanded query terms are submitted to the semantic index, and the results whose fuzzy value lies within this range are retrieved. This reduces the number of results and at the same time reduces the mismatch between the user query and the results obtained.
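Threshold-based retrieval can be sketched in Python as follows. The function name `retrieve`, the layout of the index entries and the `example.org` URLs are hypothetical; only the threshold Th = 0.75 and the range from the threshold to 1 come from the text.

```python
def retrieve(semantic_index, expanded_terms, threshold=0.75):
    """Return entries matching any expanded term with threshold <= K <= 1."""
    wanted = {term.lower() for term in expanded_terms}
    hits = [
        entry for entry in semantic_index
        if threshold <= entry["K"] <= 1.0
        and wanted & {term.lower() for term in entry["terms"]}
    ]
    # Present the most relevant documents first.
    return sorted(hits, key=lambda entry: entry["K"], reverse=True)


# Hypothetical semantic index entries: URL, matched semantic terms, fuzzy value K.
semantic_index = [
    {"url": "http://example.org/d1", "terms": ["MBA", "BBA"], "K": 0.4},
    {"url": "http://example.org/d2", "terms": ["MBA", "PGDM", "BBA"], "K": 0.8},
    {"url": "http://example.org/d3", "terms": ["B.Tech", "M.Tech"], "K": 0.9},
]
results = retrieve(semantic_index, ["management", "MBA", "BBA", "PGDM"])
print([entry["url"] for entry in results])
```

In this made-up index only the second document is returned: the first falls below the threshold and the third shares no term with the expanded query.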

Conclusion and future work

In this paper we have presented a fuzzy approach to document classification for use in semantic information retrieval. We have shown that the required information can be extracted much more precisely by using a semantic index with an ontology-based approach. A threshold is set to limit the results, which makes the technique considerably more effective, as explained in the given example.

The technique proposed in this paper is domain specific. In future, the approach may be generalized to various domains, along with mapping, for collaborative search.