Automatic Annotation Of Webpages For Semantic Computer Science Essay


Searching the vast, distributed structure of the web requires efficient search schemes. Semantic annotation associates meaningful tags with a document so that semantic search can be performed. This paper puts forward an automatic approach for annotating web documents. The proposed algorithm for semantic annotation comprises five ontology-based rules and provides semantic tags along with the degree of correlation between each tag and the corresponding web document. Because annotation is performed automatically against a domain ontology rather than entered by hand, the results retrieved for a query are expected to be more relevant.

Keywords: ontology, semantic information retrieval, semantic annotation, Ranking, semantic index


In recent years, the World Wide Web has become the most significant medium for publishing and retrieving personal, social and commercial information. Finding relevant information on the web, which has a vast and distributed structure, requires efficient search schemes. Semantic search provides more relevant information by considering the user's actual intent; for this, the underlying information should be in a structured form such as RDF (Resource Description Framework). Semantic Web documents are distinguished by semantic annotation and by meaningful relations to other documents. Semantic annotation is the process of adding meta-tags that are related to the corresponding web document, so that if any of these terms occurs in a user query, the associated web document is retrieved. Tagging is a method by which anyone can associate terms with content such as documents, web pages, pictures and videos in order to describe, find and organize that content. As conventional search engines do not offer these features, a customized semantic search engine is needed. Obtaining metadata for web documents would allow various semantic web applications to emerge and gain broad acceptance. Such applications would offer new information retrieval techniques based on the linked metadata. Presently, there exist various information extraction approaches that identify meaningful terms within the document text, together with the relations between them, which are obtained with the help of an ontology.

An ontology is an explicit specification of a conceptualization [4]. An ontology defines the entities and events of a particular domain, and the relationships among them, in the form of classes, individuals and properties. The existing approaches for semantic annotation [1] [2] are mostly manual or semi-automatic [7] and are based on mapping semantic terms between the documents and the corresponding domain ontology. In some manual approaches, irrelevant terms may be associated with the documents, so that results are also retrieved for dissimilar and unrelated queries [3]. To overcome this pitfall, annotation should be performed automatically [6].

This paper puts forward an algorithm for annotating web documents, supported by five rules. As the rules are based on the frequency of a particular term in a document as well as on mapping the term to the domain ontology, the tags obtained are semantic and related to the corresponding documents.

The rest of the paper is organized as follows. Section II presents the related work on semantic information retrieval, semantic annotation and semantic indexing. Section III presents the proposed architecture of semantic information retrieval and describes the technique of semantic annotation and the proposed rules. Section IV draws conclusions and outlines some future directions for the proposed scheme.


This paper is motivated by the need to add metadata to existing web pages in an efficient and flexible manner that exploits the advantages offered by RDF (Resource Description Framework) and ontologies [4]. In recent years, information extraction and annotation on the web has been an active research area. Several researchers have presented their own techniques and systems for annotating web pages and performing semantic information retrieval.

Yiyao Lu (2011) [1] presented an automatic annotation approach that first aligns the data units into several groups having the same semantic data and then for each group annotation is performed from different aspects and the different annotations are aggregated. An automatic annotation wrapper is constructed for the search sites.

David Sánchez (2011) [2] introduced a technique to moderately annotate textual web content in an automatic and unsupervised way that uses various learning techniques and heuristics to ascertain relevant terms in text and to correlate them to the classes of ontology through linguistic patterns.

In [3], various problems affecting precision in information retrieval are analysed using experimental data, such as image retrieval with different retrieval terms, and a search engine is constructed. To improve the precision of search systems, several recommendations are made, such as correcting keywords, building and enhancing knowledge documentation, defining feature vectors reasonably, information matching and filtering, and making the spider, indexer and searcher more intelligent.

Thomas R. Gruber [4] gave a formal description of ontology, described the role of ontologies in information-sharing activities, and presented guidelines for developing ontologies. Engineering mathematics and bibliographic data were taken into consideration for developing and testing the ontologies. In [6], an approach to text categorization over a corpus of newspaper articles is presented, followed by annotation. A combination of lemmatization, Support Vector Machines (SVM), ontologies and heuristics is applied to deduce the semantic tags for the annotation.

A semi-automatic annotation system is proposed in [7] that comprises an automatic annotator with a manual annotator. The manual annotator annotates the textual web data using the Knuth-Morris-Pratt (KMP) algorithm, and the automatic annotator allows a user to use the terms to annotate metaphors with high conception.

The technique proposed in [8] combines the lexical and semantic relationship to analyze user's query. A modulative method is proposed for result ranking based on the predictability of the results for users.

In [9] a technique for semantic annotation of web documents with individuals of ontology is proposed that recognizes the related individuals and marks them as role instances within OWL (Web Ontology Language) ontology by considering the tree structure of a web page and the semantics of the information it contains.

The FF-ICF algorithm is modified in [10] for ranking and scoring semantic document annotations based on document richness. The modified algorithm is applied in a retrieval engine, PicoDoc, to measure its performance in ranking and scoring document annotations.

The OntoGram-approach proposed in [11] performs indexing on texts by their conceptual content through ontologies along with syntactic grammars and lexico-syntactic information that is transformed into concept feature structures and mapped into concepts in a generative ontology.

In [12] a framework for Cognitive Linguistics theories is proposed that is based on Construction Grammar (CxG). RDF (Resource Description Framework) has been used in the domain ontology to build constructions and a set of rules based on linguistic typology have been presented to deduce the semantics and syntax of the constructions.


In this paper, an architecture for semantic information retrieval and a technique for annotating and ranking web pages are proposed. Because manual annotation may attach irrelevant tags to a document, automatic annotation is adopted.

Overall Architecture

The overall architecture of the proposed system is given in Fig. 1. In this architecture, a semantic index is created from the semantic tags in the annotated web pages and the degree of correlation between these tags and the documents.

[Figure: block diagram with the components Web Host, Web Crawler, Query Analyzer, Semantic Index, Semantic Annotator, Document Manager, Ontology Toolbox, Semantic Index Manager and Rule Store]

Fig. 1: Overall Architecture of the proposed System

The web crawler collects web documents from the web and submits them to the document manager. The document manager runs the proposed annotation algorithm by interacting with the Rule Store and the Ontology Knowledge Base (OKB). The Rule Store contains the rules proposed for annotation and ranking, which are described in a later section.

An ontology [4] describes a particular domain in a structured form. The ontology knowledge base (OKB) is organized hierarchically in the form of classes (concepts), individuals (instances) and the properties (relationships) among them. The ontology for a specific domain is created with an ontology toolbox; for this work, an ontology for the e-shopping (electronic shopping) domain was created in Protégé. Fig. 2 shows the class hierarchy for the e-shopping domain and its graphical representation in OWLViz within Protégé.
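As a rough illustration of how such a knowledge base might be held in memory, the sketch below models classes, individuals and properties with plain Python data classes. The class, property and individual names (`Product`, `Laptop`, `hasBrand`, `inspiron` and so on) are hypothetical examples and are not taken from the ontology of Fig. 2; a real system would more likely load an OWL file produced by Protégé.

```python
from dataclasses import dataclass, field


@dataclass
class Individual:
    """An instance of an ontology class, with its property values."""
    name: str
    cls: str
    # Data properties: property name -> literal value (hypothetical examples).
    data_properties: dict = field(default_factory=dict)
    # Object properties: property name -> name of the related individual.
    object_properties: dict = field(default_factory=dict)


@dataclass
class Ontology:
    """A tiny in-memory stand-in for the ontology knowledge base (OKB)."""
    # Class hierarchy: class name -> parent class name (None for a root).
    classes: dict = field(default_factory=dict)
    individuals: dict = field(default_factory=dict)

    def add_class(self, name, parent=None):
        self.classes[name] = parent

    def add_individual(self, individual):
        if individual.cls not in self.classes:
            raise ValueError(f"unknown class: {individual.cls}")
        self.individuals[individual.name] = individual

    def ancestors(self, cls):
        """Return the chain of superclasses of `cls`, nearest first."""
        chain = []
        parent = self.classes.get(cls)
        while parent is not None:
            chain.append(parent)
            parent = self.classes.get(parent)
        return chain


okb = Ontology()
okb.add_class("Product")
okb.add_class("Electronics", parent="Product")
okb.add_class("Laptop", parent="Electronics")
okb.add_class("Brand")
okb.add_individual(Individual("dell", "Brand"))
okb.add_individual(Individual(
    "inspiron", "Laptop",
    data_properties={"screen_size": "15 inch"},
    object_properties={"hasBrand": "dell"},
))
```

Keeping object properties separate from data properties mirrors the distinction that rules R3 and R5 appear to draw between the properties and the object properties of an instance.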


Fig. 2: Ontology constructed for e-shopping domain in Protégé

The document manager stores the semantic tags in XML (Extensible Markup Language) form together with the documents, creating the semantic document base (SDB), which contains the semantically enhanced documents. Using this document base, the semantic tags of each document and the Rule Store, the semantic index manager creates the semantic index. The index contains an entry for each document, listing its semantic tags with their corresponding degrees of correlation.
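One plausible layout for such an index is an inverted index that maps each semantic tag to the documents carrying it, ordered by degree of correlation. The sketch below assumes that the annotated documents are available as a mapping from document id to `{tag: DCR}`; this input shape and the function name `build_semantic_index` are assumptions made for illustration, not part of the proposed system.

```python
from collections import defaultdict


def build_semantic_index(annotated_docs):
    """Build an inverted index: tag -> [(doc_id, dcr), ...], best DCR first.

    `annotated_docs` maps a document id to a dict of {tag: dcr}; this
    input shape is an assumption made for the sketch.
    """
    index = defaultdict(list)
    for doc_id, tags in annotated_docs.items():
        for tag, dcr in tags.items():
            index[tag].append((doc_id, dcr))
    for postings in index.values():
        # Sort by descending DCR, then by document id for a stable order.
        postings.sort(key=lambda p: (-p[1], p[0]))
    return dict(index)


docs = {
    "d1": {"laptop": 1.0, "battery": 0.4},
    "d2": {"laptop": 0.6},
}
index = build_semantic_index(docs)
```

Sorting the postings once at build time means that a query for a single tag can return documents in rank order without further work.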

Automatic annotation of the Documents and Query Flow

When the user submits a query through the user interface, it is passed to the Query Analyzer. The Query Analyzer pre-processes the query by extracting the meaningful query keywords, which are submitted to the semantic index manager. The index manager looks up these keywords in its index. If the index holds information about documents for these keywords, it returns the corresponding documents by interacting with the document manager. Otherwise, it submits the query keywords to the document manager, which asks the web crawler to collect web documents related to these words. Annotation is then performed using the proposed algorithm and rules. The corresponding domain ontology [4, 5] is populated [5] to satisfy the rules and the algorithm. In this way, semantic annotation is performed automatically.

The documents are then submitted to the ranker, and the ranked results are sent to the user. Thus, the retrieved results are only those documents that are related to the query keywords. As an ontology is used for annotation, the user's actual intention is taken into account before the results are sent.
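The query flow described above can be sketched as follows. This is a minimal illustration under several assumptions: the stop-word list is a small hand-picked sample, the index maps a tag to `(doc_id, DCR)` pairs, and a document matched by several keywords is scored by its highest DCR (the text does not say how multiple keywords are combined). Keywords missing from the index are returned separately, standing in for the hand-off to the document manager and web crawler.

```python
STOP_WORDS = {"a", "an", "the", "is", "are", "for", "of", "in",
              "what", "how", "why"}


def extract_keywords(query):
    """Query Analyzer: lowercase the query and drop stop words."""
    return [w for w in query.lower().split() if w not in STOP_WORDS]


def search(query, index):
    """Return (ranked document ids, keywords unknown to the index).

    Scoring a document by its highest DCR over the matched keywords is
    an assumption of this sketch, not a rule from the paper.
    """
    scores = {}
    missing = []
    for keyword in extract_keywords(query):
        postings = index.get(keyword)
        if not postings:
            # Would be handed to the document manager / web crawler.
            missing.append(keyword)
            continue
        for doc_id, dcr in postings:
            scores[doc_id] = max(scores.get(doc_id, 0.0), dcr)
    ranked = sorted(scores, key=lambda d: (-scores[d], d))
    return ranked, missing


index = {"laptop": [("d1", 1.0), ("d2", 0.6)], "battery": [("d2", 0.9)]}
ranked, missing = search("What is the laptop battery price", index)
```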

The Approach for Annotating and Ranking the Web Documents

This paper puts forward a technique for annotating web documents. The approach consists of two parts: the first is a set of five rules based on document attributes and the ontology (an explicit specification of a conceptualization [4]), and the second is an algorithm for semantic annotation and ranking that uses the rules defined in the first part.

Rules for Annotation and Ranking

To annotate the web pages or documents, a set of five rules is proposed. Each rule computes the correlation (CR) and the degree of correlation (DCR). The value of DCR lies between 0 and 1; if a computed value exceeds 1, it is set to 1, i.e. closely related.

Rules R1 and R2 are based on the frequency with which certain keywords/instances (I) occur in the document. In R1, a threshold is used: the rule applies only when the instance occurs more often than the threshold. The correlation (CR) and the degree of correlation (DCR) between that instance (I) and the corresponding document (D) are computed accordingly. In R2, the attributes of that instance are also considered.

In R3 and R5, the ontology [1] is used to obtain a more effective correlation between the instance and the document by considering the properties of that instance.

Rule R4 is the ideal case: if an instance appears in the title of the document, then it is closely related to that document, with the degree of correlation (DCR) equal to 1.

R1: if (freq(Ii) > Th) then

CR(Ii, Di) = TRUE

DCR(Ii, Di) = … (1)

where freq(Ii) is the frequency of the instance Ii, Th is the threshold, CR(Ii, Di) is the correlation of Ii with Di, and DCR(Ii, Di) is the degree of correlation between Ii and Di.

If (DCR > 1) then set DCR = 1
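A minimal sketch of R1 is given below. Since the expression for Eq. (1) is not available here, the DCR formula is passed in as a parameter; `placeholder_formula` is a hypothetical stand-in used only to make the example run and should not be read as the formula of the paper. Returning a DCR of 0 when the rule does not fire is likewise a convention of this sketch.

```python
def clamp_dcr(value):
    """Cap the degree of correlation at 1, as the rules require."""
    return min(value, 1.0)


def rule_r1(term, tokens, threshold, dcr_formula):
    """Apply R1 to one term and return (CR, DCR).

    CR is True only when the term occurs more than `threshold` times.
    `dcr_formula(freq, threshold)` is supplied by the caller because
    the expression for Eq. (1) is not known here.
    """
    freq = tokens.count(term)
    if freq <= threshold:
        return False, 0.0
    return True, clamp_dcr(dcr_formula(freq, threshold))


# Hypothetical stand-in for Eq. (1): NOT the formula from the paper.
def placeholder_formula(freq, threshold):
    return freq / (4 * threshold)


tokens = "laptop price laptop battery laptop laptop review".split()
cr, dcr = rule_r1("laptop", tokens, threshold=2,
                  dcr_formula=placeholder_formula)
```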

R2: if (freq(Ii + …) > Th) then

CR(Ii, Di) = TRUE

and [CR(Ii, Di) ← R2] > [CR(Ii, Di) ← R1]

DCR(Ii, Di) = … (2)

where At[Ii] is an attribute of Ii and m is the number of attributes of Ii present in the document Di.

If (DCR > 1) then set DCR = 1

R3: if ((Ii ∈ O AND Ii ∈ Di) AND (Pj(Ii) ∈ Di)) then

CR(Ii, Di) = TRUE

and [CR(Ii, Di) ← R3] > [CR(Ii, Di) ← R1, CR(Ii, Di) ← R2]

DCR(Ii, Di) = … (3)

where O is the domain ontology, Pj is a property of Ii and n is the number of properties of Ii present in the document Di.

If (DCR > 1) then set DCR = 1

R4: if (Ii ∈ Title(Di)) then

CR(Ii, Di) = TRUE

and [CR(Ii, Di) ← R4] = Highest

DCR(Ii, Di) = 1 (4)
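R4 is fully specified by the text and can be sketched directly. The tokenization of the title with a regular expression and the case-insensitive comparison are assumptions of this sketch, as is the convention of returning a DCR of 0 when the rule does not fire.

```python
import re


def rule_r4(instance, title):
    """R4: an instance found in the document title is closely related (DCR = 1)."""
    title_words = re.findall(r"\w+", title.lower())
    if instance.lower() in title_words:
        return True, 1.0
    # Returning DCR 0.0 when the rule does not fire is a convention of this sketch.
    return False, 0.0


cr, dcr = rule_r4("Laptop", "Best laptop deals: a buyer's guide")
```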

R5: if ((Ii ∈ O) AND (ObjPj(Ii) ∈ Di)) then

CR(Ii, Di) = TRUE

and [CR(Ii, Di) ← R5] > [CR(Ii, Di) ← R1, CR(Ii, Di) ← R2]

DCR(Ii, Di) = … (5)

where ObjPj(Ii) is an object property of Ii and n is the number of object properties of Ii present in the document Di.

If (DCR > 1) then set DCR = 1
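The condition of R5 can be illustrated with a small membership check. The sketch below assumes that the ontology is available as a mapping from an individual to its object properties (`{property name: related individual}`); this layout, the function name `rule_r5_match` and the sample names are hypothetical. The function returns the correlation flag and the count n of object-property values found in the document, and deliberately leaves out the DCR because the expression for Eq. (5) is not available.

```python
def rule_r5_match(instance, ontology, doc_tokens):
    """Check the R5 condition for one instance and return (CR, n).

    `ontology` maps an individual to its object properties
    ({property name: related individual}); this layout is an assumption.
    n is the number of object-property values of the instance that occur
    in the document. DCR is not computed, since Eq. (5) is not known here.
    """
    if instance not in ontology:
        return False, 0
    present = set(doc_tokens)
    n = sum(1 for target in ontology[instance].values() if target in present)
    return n > 0, n


ontology = {
    "inspiron": {"hasBrand": "dell", "soldBy": "amazon"},
}
doc_tokens = "the dell inspiron is a popular laptop".split()
cr, n = rule_r5_match("inspiron", ontology, doc_tokens)
```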

The Procedure for Algorithm

Web pages/documents contain meaningful information along with extra links, advertisements, etc. These unnecessary elements need to be removed so that purified web pages are obtained. The purified web pages (PWP) are the input to the algorithm. Its output is a set of semantically enhanced web pages carrying semantic tags with their corresponding degrees of correlation.
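The purification step is not detailed further; one simple way to approximate it is to keep only the visible text outside elements that rarely carry content. The sketch below uses the `HTMLParser` class from the Python standard library. The choice of tags treated as noise (`SKIP`) is an assumption, and real pages would need more careful boilerplate detection.

```python
from html.parser import HTMLParser


class PageCleaner(HTMLParser):
    """Collect visible text, skipping tags that rarely carry content."""

    # Which tags count as noise is an assumption made for this sketch.
    SKIP = {"script", "style", "nav", "aside", "footer", "a"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def purify(html):
    """Return the text of a page with links, scripts and side content removed."""
    cleaner = PageCleaner()
    cleaner.feed(html)
    cleaner.close()
    return " ".join(cleaner.parts)


page = (
    "<html><body><nav><a href='/'>Home</a></nav>"
    "<h1>Laptop review</h1><p>The battery lasts ten hours.</p>"
    "<aside>Buy now!</aside><script>track()</script></body></html>"
)
text = purify(page)
```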


Input: Purified web pages (PWP)
Output: Semantically enhanced documents/web pages

foreach web page wi in PWP do
    find Did(i)[wi], Dt(i)[wi], Ti[wi], URLi[wi]
    AT[wi] = Did(i)[wi] + Dt(i)[wi] + Ti[wi] + URLi[wi]
    store AT[wi] in XML form
    store all the text content of the web page wi in XML form in the tag <text>
    perform stemming on the text to obtain semantic keywords:
    SKW[wi] = [STEM ← Text(wi)]
    foreach instance Ij of SKW[wi] do
        apply Rn from the set of rules
        if (CR(Ij, Di) = TRUE) then
            IN = IN + Ij
        foreach instance Ik in O do
            if (Ij = Ik) then
                ST = ST + Ij
        end for
    end for
    SMT = IN + ST
    store the keywords of SMT in XML form as semantic tags for annotating wi
    apply Rn from the set of rules
    add the degree of correlation to wi for each semantic tag according to DCR(Ii, Di)
end for
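The procedure above can be sketched in Python as follows. This is an illustration under stated assumptions, not the implementation of the system: a page is assumed to be a dict with the keys `id`, `date`, `title`, `url` and `text`; the five rules are represented by a single callable `rule(term, tokens)` that returns `(CR, DCR)`; stop-word removal stands in for the STEM step; and the XML element names are hypothetical. The `frequency_rule` used in the example is a placeholder, since the DCR formulas are not available.

```python
import xml.etree.ElementTree as ET

STOP_WORDS = {"a", "an", "the", "is", "are", "and", "of", "with", "has"}


def annotate(page, ontology_instances, rule):
    """Annotate one purified page and return it as an XML element.

    `page` is a dict with keys id, date, title, url, text (assumed layout).
    `rule(term, tokens)` returns (CR, DCR) and stands in for rules R1-R5.
    """
    doc = ET.Element("document")
    for key in ("id", "date", "title", "url"):          # AT[wi]
        ET.SubElement(doc, key).text = page[key]
    ET.SubElement(doc, "text").text = page["text"]

    # SKW[wi]: keywords left after stop-word removal (no real stemmer here).
    tokens = [w for w in page["text"].lower().split() if w not in STOP_WORDS]

    in_set, st_set, dcr_of = [], [], {}
    for term in dict.fromkeys(tokens):                  # unique, in order
        cr, dcr = rule(term, tokens)
        dcr_of[term] = min(dcr, 1.0)                    # cap DCR at 1
        if cr:
            in_set.append(term)                         # set IN
        if term in ontology_instances:
            st_set.append(term)                         # set ST
    smt = list(dict.fromkeys(in_set + st_set))          # SMT = IN + ST

    tags = ET.SubElement(doc, "semantic_tags")
    for term in smt:
        ET.SubElement(tags, "tag", dcr=f"{dcr_of[term]:.2f}").text = term
    return doc


def frequency_rule(term, tokens, threshold=1):
    # Hypothetical rule: fires above the threshold; the DCR is a placeholder.
    freq = tokens.count(term)
    return freq > threshold, freq / len(tokens)


page = {
    "id": "d1", "date": "2012-01-01", "title": "Laptop review",
    "url": "http://example.com/laptop",
    "text": "The laptop has a battery and the laptop is light",
}
xml_doc = annotate(page, {"battery"}, frequency_rule)
tags = {t.text: t.get("dcr") for t in xml_doc.find("semantic_tags")}
```

In this example the term "laptop" enters the tag set through the rule (set IN) and "battery" through the ontology (set ST), while "light" satisfies neither and is not used as a tag.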

The algorithm proceeds through the steps listed above. First, the attributes of a web page (w), such as the document id (Did), title (T), time of post (Dt) and URL (Uniform Resource Locator), are extracted and stored in XML (Extensible Markup Language) form. The text content of the web page is stored in the <text> tag of the XML document. The text is then pre-processed (STEM): stop words such as is, are, they, what, how and why are removed, and the remaining words are reduced to their stems, which yields the set of meaningful keywords SKW (semantic keywords). From SKW, semantic tags are obtained in two ways: first, by applying the rules, which gives the set IN (instances), and second, by matching against the domain ontology (O), which gives the set ST (semantic tags). Finally, these two sets are combined into SMT, the final set of semantic tags, which is stored in XML form to annotate the web page. The rules are also applied to compute the degree of correlation (DCR) of each tag for ranking purposes. If a user enters a query containing any of these keywords, the web document is ranked according to the associated DCR value and then returned to the user. Thus, the output of the algorithm is a set of semantically rich web pages annotated with degrees of correlation.
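The pre-processing step that produces SKW can be illustrated with a short sketch. The stop-word list and the suffix list below are small illustrative samples, and `naive_stem` is a deliberately crude suffix stripper rather than an established stemming algorithm; both are assumptions made for this example.

```python
STOP_WORDS = {"is", "are", "they", "what", "how", "why",
              "the", "a", "an", "and", "of"}
# Illustrative only, far cruder than a real stemmer.
SUFFIXES = ("ing", "ed", "es", "s")


def naive_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def semantic_keywords(text):
    """Remove stop words, then stem what is left (the SKW set)."""
    words = [w.strip(".,;:!?").lower() for w in text.split()]
    return [naive_stem(w) for w in words if w and w not in STOP_WORDS]


skw = semantic_keywords("They are selling laptops and printed manuals.")
```

Removing stop words before stemming keeps short function words from being altered by the suffix rules.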


The proposed architecture supports the vision of the semantic web. An ontology is used to represent the domain knowledge. In manual annotation, users may add irrelevant tags; documents carrying such tags can be ranked highly and displayed at the top for unrelated queries, which lowers precision and recall. The proposed approach of automatic, ontology-based semantic annotation is intended to overcome this limitation by ensuring that the annotated tags are semantically related to the document, which should yield more accurate results.

The proposed technique is conceptual, though some parts of the system have been implemented. In future work, the approach will be fully implemented and integrated with an information retrieval system.