On the internet, information retrieval is not very efficient because most web content is not understandable by computers. GA enabled ontology for platform-free dynamic semantic web aims at enhancing web content with semantic structure in order to make it more meaningful. The extracted tokens are analyzed in a manner similar to bootstrapping ontologies: TF/IDF is used for word ranking, a Genetic Algorithm (GA) is used to find the promising region in which relevant words are identified, and fuzzy logic is used for ontology mapping. Using fuzzy logic, the page with maximum optimization (best fit) is ranked and grouped in index order, based on the background knowledge relevant to the reference word; other possible links are displayed in the adjacent frame of a splitter window so that the user can toggle between the splitter windows. Thus an ordinary web page is converted into a semantic web page dynamically, within the space allocated to the user, and as the user clicks on other references these grouped contents are likewise converted into semantic structure dynamically.
Index Terms - Ontology, semantic web, information retrieval, bootstrapping ontologies, term frequency/inverse document frequency, genetic algorithm.
The semantic web provides metadata for web content so that relevant information can be processed by computers. Relevant information has to be extracted and mapped accordingly. Information extraction techniques aim at retrieving specific pieces of information from relevant documents instead of showing the user long lists of document links. Content of the invisible web cannot be retrieved in the result pages of general web search engines: it consists of thousands of searchable databases whose contents are delivered to the user as web pages generated within the space allocated to that particular user (dynamic pages). Such pages are often not stored anywhere, because it is easier and cheaper to generate the answer page for each query dynamically than to store all possible pages containing all possible answers to all possible queries. Information retrieval systems on the internet are not very efficient because most web content is designed to be presented to humans and is not understandable by computers. An automatic semantic web for information retrieval aims at enhancing web content with semantic structure in order to make it meaningful for both systems and humans; a static page is thereby converted into a dynamic semantic page. The concept of bootstrapping ontology is used for evaluating keywords by integrating two methods, Term Frequency/Inverse Document Frequency (TF/IDF) and web context generation, and makes use of the service free-text descriptor to validate the concepts. Genetic algorithm based ontology mapping is used to optimize ontology alignments. The Semantic Web project provides support for adding semantic annotations (metadata that describes their content) to web pages and multimedia objects.
To express this metadata there is a need for standardized vocabularies and constructs that are explicitly and formally defined in domain ontologies.
The performance of current search engines and IR systems suffers because of the ambiguity of natural language: words in documents and queries have multiple meanings, and the retrieval results often include the wrong meanings in addition to the desired ones. Better results would be achieved if web pages contained precise semantic annotations, enabling search agents to navigate, collect, and utilize information on the web in trustworthy ways. Semantic-based categorization is performed offline at the Universal Description, Discovery and Integration (UDDI) registry, and efficient matching of the enhanced service request with the retrieved service descriptions is achieved using Latent Semantic Indexing (LSI). Ontology provides a common vocabulary to support the sharing and reuse of knowledge. An ontology mapping model covers several aspects, such as concept similarity computing; semantic heterogeneity is the most important part, since heterogeneity is a major bottleneck of ontology application, and ontology mapping is considered the basis for integrating heterogeneous ontologies.
Semantic Web ontology learning technology was introduced to reduce the overall time of ontology construction. Figure 1 shows the basic flow of the ontology learning process. Ontology learning (OL) is defined as an approach to building ontologies from knowledge sources using a set of machine learning techniques and knowledge acquisition methods. A single data source cannot cover all concepts of a target domain of knowledge. Since the web is a rich textual source, it can be treated as a learning corpus from which domain ontologies are extracted for use in semantic search systems. Our main objective is to make the semantic search engine more flexible and autonomous, constructing domain ontologies from relevant documents incrementally by combining ontology learning from text with semantic search technology. First, methods and techniques are developed that reduce the effort necessary for the knowledge acquisition process; second, because building ontologies requires much time and many resources, methods are developed to ease their construction. Different ontology learning approaches are distinguished by the type of input used for learning. Whenever text is supplied for a search, the result depends on how well the text has been interpreted by the machine. There may be a variety of text phrases with similar meanings, and it is ontologies that play the important role in matching the search to content whose semantics agree with the phrase. The grammar used in the text phrase also plays an important role in the ontology learning process.
Figure 1 Ontology learning process
The Semantic Web ontology is a fuzzy rough concept, known as an imprecise concept partial ordering set, namely a binary pair composed of a fuzzy rough concept set and a kind of partial ordering relation. The Ontology Model and the Semantic Link Network (SLN) are two important semantic data models in the Semantic Web. Semantic web ontology and rules are used for user personalization, and Semantic Web mining can be done using ontology learning and grammatical rule inference techniques. Ontology alignment achieves adequate interoperability between people or computers by using overlapping ontologies to represent common knowledge, finding the semantic relations between various ontologies. Ontology matching is a process for selecting a good alignment across entities of two or more ontologies. It can be viewed as a two-phase process: first, a similarity measure is applied to find the correspondence of each pair of entities from the two ontologies; second, an optimal or near-optimal mapping is extracted.
Figure 2 Detailed view of the architecture of an IR system.
The information retrieval system architecture is shown in figure 2. Text operations are used to pre-process the document collection and to extract index words. The index module constructs the inverted index from words to document pointers; based on the inverted index, the searching module retrieves documents that contain the given query words. The user interface manages interaction with the user, and query operations transform the query in order to improve retrieval. Text operations are applied to the document text, based on the description of the user's information need, to transform it into the simplified form needed for computation. The documents are indexed, and the index is used to execute the search. Based on the retrieved ranked documents, the user provides feedback, which can be used to refine the query and restart the search for improved results.
Latent Semantic Indexing (LSI) is an extension of the vector space retrieval method that can retrieve relevant documents even when they do not share any words with the query. Keywords are replaced by concepts, so that if only a synonym of the keyword is present in a document, the document will still be found relevant. The idea behind LSI is to transform the matrix of documents by terms into a more concentrated matrix by reducing the dimension of the vector space: the number of dimensions becomes much lower because there is no longer a dimension for each term, only a dimension for each latent concept or group of synonyms. The advantages of LSI are its strong formal framework, which can be applied to text collections in any language, and its capacity to retrieve many relevant documents.
The F-measure combines precision and recall by taking their harmonic mean:

F = 2 * P * R / (P + R)

The F-measure is high only when both precision and recall are high. Precision (P) measures the ability to retrieve top-ranked documents that are mostly relevant; recall (R) measures the ability of the search to find all of the relevant items in the corpus:

R = (number of retrieved documents that are relevant) / (total number of relevant documents)

P = (number of relevant documents retrieved) / (total number of documents retrieved)
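The definitions above can be sketched directly in code (class and method names are illustrative, not from the paper):

```java
// Sketch of the standard IR effectiveness measures described above.
public class IrMetrics {

    // P = relevant documents retrieved / total documents retrieved
    public static double precision(int relevantRetrieved, int totalRetrieved) {
        return (double) relevantRetrieved / totalRetrieved;
    }

    // R = relevant documents retrieved / total relevant documents in the corpus
    public static double recall(int relevantRetrieved, int totalRelevant) {
        return (double) relevantRetrieved / totalRelevant;
    }

    // F = 2PR / (P + R), the harmonic mean of precision and recall
    public static double fMeasure(double p, double r) {
        return (p + r == 0.0) ? 0.0 : 2.0 * p * r / (p + r);
    }
}
```

For example, retrieving 40 documents of which 30 are relevant, out of 60 relevant documents in the corpus, gives P = 0.75, R = 0.5, and F = 0.6.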
A generalization of the F-measure is the E-measure, which allows emphasis on precision over recall or vice versa. The novelty ratio is the proportion of retrieved documents judged relevant by the user of which the user was previously unaware; it measures the ability to find new information on a topic. The coverage ratio is the proportion of relevant items retrieved out of the total relevant documents known to the user prior to the search. User effort is the amount of work required from the user in query formulation, conducting the search, and screening the output. Response time is the interval between the reception of a user query and the presentation of the system's responses. Form of presentation is the influence of the search output format on the user's ability to utilize the retrieved materials. Collection coverage measures the extent to which all relevant items are included in the document collection. TF/IDF is a common mechanism in IR for generating a robust set of representative keywords from a corpus of documents. Concept evocation identifies the descriptors that appear in both the TF/IDF method and the web context method; the possible concept names that could be utilized by the ontology evolution are identified using these descriptors, and the context descriptors are also used in the convergence process of the relations between concepts. A genetic algorithm based approach determines how to aggregate different similarity metrics into a single measure: starting from an initial population in which each individual represents a specific combination of measures, the algorithm finds the combination that provides the best alignment quality. The major steps in the genetic algorithm process are application description, objective function calculation, problem representation, problem initialization, fitness calculation, and the genetic operations.
Bootstrapping an ontology addresses the problem of multiple, largely unrelated concepts. The advantage is that web services usually consist of both WSDL and free-text descriptors, which are evaluated using Term Frequency/Inverse Document Frequency (TF/IDF) and web context generation. The bootstrapping process integrates the results of both methods and uses the service free-text descriptor to validate the concepts. The process analyzes a web service with three methods: the TF/IDF method analyzes the web service from an internal point of view, i.e., which concept in the text best describes the content of the WSDL document; the Web Context Extraction method describes the WSDL document from an external point of view, i.e., the most common concept representing the answers to web search queries based on the WSDL content; and the Free Text Description Verification method resolves inconsistencies with the current ontology. The token extraction process extracts tokens representing relevant information from a WSDL document: it extracts all the label names, parses the tokens, and performs preliminary filtering. The extracted WSDL tokens are then analyzed in parallel. TF/IDF analysis measures which terms appear most frequently in each web service document while appearing less frequently in other documents. Web context extraction uses sets of tokens as queries to a search engine, clusters the results according to textual descriptors, and classifies which set of descriptors identifies the context of the web service. Concept evocation identifies the descriptors that appear in both the TF/IDF method and the web context method; the possible concept names that could be utilized by the ontology evolution are identified by these descriptors. The context descriptors also assist in the convergence process of the relations between concepts.
Finally, the ontology evolution expands the ontology as required according to the newly identified concepts and modifies the relations between them. For TF/IDF, freq(ti, Di) is defined as the number of occurrences of the token ti within the document descriptor Di. The term frequency of each token is defined as:

tf(ti) = freq(ti, Di) / |Di|

where |Di| is the total number of tokens in Di.
The inverse document frequency is the ratio between the total number of documents and the number of documents that contain the term:

idf(ti) = log( |D| / |{Dj : ti in Dj}| )
D is defined as a specific text descriptor. The TF/IDF weight of a token, annotated as w(ti), is given by:

w(ti) = tf(ti) * idf(ti)^2

While the common implementation of TF/IDF gives equal weight to the term frequency and the inverse document frequency (i.e., w = tf * idf), we chose to give higher weight to the idf value by squaring it. The reason for this modification is to normalize the inherent bias of the tf measure in short documents. The ranking is used to filter the tokens according to a threshold that filters out words whose weight w lies more than two standard deviations above the average token weight.
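The modified weighting can be sketched as follows (a minimal illustration, assuming each document is given as a token list; the class and method names are not from the paper):

```java
import java.util.*;

// Sketch of the modified TF/IDF weighting described above: idf is squared
// to give it higher weight than tf. Assumes the queried token appears in
// at least one document of the corpus.
public class TfIdf {

    // tf(ti) = freq(ti, Di) / |Di|
    public static double tf(String token, List<String> doc) {
        long freq = doc.stream().filter(token::equals).count();
        return (double) freq / doc.size();
    }

    // idf(ti) = log(|D| / |{Dj : ti in Dj}|)
    public static double idf(String token, List<List<String>> corpus) {
        long containing = corpus.stream().filter(d -> d.contains(token)).count();
        return Math.log((double) corpus.size() / containing);
    }

    // w(ti) = tf(ti) * idf(ti)^2 -- idf emphasized, per the choice above
    public static double weight(String token, List<String> doc,
                                List<List<String>> corpus) {
        double i = idf(token, corpus);
        return tf(token, doc) * i * i;
    }
}
```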
The implementation of GA for the ontology matching problem incorporates three basic steps that formulate the algorithm for this specific application: first, the problem representation; second, the presentation of an individual, that is, the encoding mechanism of the problem; and third, the formulation of the fitness function that gives each individual a measure of performance. Ontology matching is represented as an optimization problem over mappings between two ontologies, where every ontology has its associated feature set. For a given mapping taken as the optimizing object, an objective (fitness) function is defined as a global similarity measure between the two ontologies based on their feature sets; a set of experiments is then conducted to analyze and evaluate the performance of the genetic algorithm in solving the ontology matching problem. The input to the system consists of the feature sets F_O1 and F_O2 of the two compared ontologies O1 and O2, and the parameters of the genetic algorithm (population size pop_size, crossover rate pc, mutation rate pm, and maximal generation MaxGen); n1 and n2 are the numbers of concepts in O1 and O2 respectively. The expected output is the best mapping between the compared ontologies as optimized by the genetic algorithm. The genetic algorithm proceeds as follows. First, an initial population of pop_size individuals is generated randomly; each individual is a one-dimensional integer array with n1 elements taking values from 1 to n2, which corresponds to a mapping of concepts from O1 to O2.
While a termination criterion is not met (for example, the number of generations is less than MaxGen, or the fitness of the best individual is not yet close to the optimum of 1), the following operations are repeated. Crossover is carried out on individuals according to the crossover probability pc to form new individuals. Mutation is carried out on the individuals formed after crossover with mutation probability pm. Every individual is evaluated and its fitness computed; according to their fitness and a selection probability ps, some individuals are selected to generate the next population. The best individual in the current population is reserved into the next population: this elitist strategy is a mechanism whereby the best individual is safeguarded without any change.
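The loop above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the fitness function is a stand-in (a real system would compute a feature-based global similarity between O1 and O2), and selection is simplified to random parent choice plus elitism.

```java
import java.util.*;

// Minimal GA sketch for ontology matching: an individual is an int array of
// n1 elements, each in [1, n2], mapping concepts of O1 to concepts of O2.
public class OntologyGa {
    static final Random RND = new Random(42);

    // Stand-in fitness: fraction of positions mapped to an assumed target mapping.
    static double fitness(int[] ind, int[] target) {
        int hits = 0;
        for (int i = 0; i < ind.length; i++) if (ind[i] == target[i]) hits++;
        return (double) hits / ind.length;
    }

    static int[] evolve(int n1, int n2, int[] target, int popSize, int maxGen,
                        double pc, double pm) {
        int[][] pop = new int[popSize][n1];
        for (int[] ind : pop)
            for (int i = 0; i < n1; i++) ind[i] = 1 + RND.nextInt(n2);

        int[] best = pop[0].clone();
        for (int gen = 0; gen < maxGen && fitness(best, target) < 1.0; gen++) {
            for (int[] ind : pop)                        // track the elite individual
                if (fitness(ind, target) > fitness(best, target)) best = ind.clone();
            int[][] next = new int[popSize][];
            next[0] = best.clone();                      // elitism: keep best unchanged
            for (int k = 1; k < popSize; k++) {
                int[] a = pop[RND.nextInt(popSize)].clone();
                int[] b = pop[RND.nextInt(popSize)];
                if (RND.nextDouble() < pc) {             // single-point crossover
                    int point = RND.nextInt(n1);
                    for (int i = point; i < n1; i++) a[i] = b[i];
                }
                if (RND.nextDouble() < pm)               // mutation: reassign one gene
                    a[RND.nextInt(n1)] = 1 + RND.nextInt(n2);
                next[k] = a;
            }
            pop = next;
        }
        for (int[] ind : pop)
            if (fitness(ind, target) > fitness(best, target)) best = ind.clone();
        return best;
    }
}
```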
Semantic-based web service discovery involves semantic-based service categorization and semantic enhancement of the service request. The objective is to achieve functional-level service categorization based on an ontology framework. Web services are classified using clustering based on service functionality, and the semantic-based categorization is performed offline at the Universal Description, Discovery and Integration (UDDI) registry, involving semantics-augmented classification of web services into functional categories. Service selection is carried out as parameter-based service refinement followed by similarity-based semantic matching. The web service input and output parameters carry functional knowledge that is extracted to improve service discovery. Parameter-based service refinement exploits a combination of service descriptions, inputs, and outputs to narrow the set of appropriate services that match the service request, combining the semantic and syntactic characteristics of a WSDL document. The refined set of web services is then matched against an enhanced service request as part of semantic similarity-based matching; the service request is enhanced by adding relevant ontology concepts, which improves the matching of the service request.
The Semantic Web-based ontology mapping model builds on the conception of ontology mapping and proposes a semantic web-based concept similarity method. The basic idea of the method is that the two factors of concept semantics and concept instances are used together to calculate the similarity between concepts. The semantic calculation employs the collaborative filtering technology used in personalized recommendation, applied after concept similarity calculation. These methods are combined into a mapping algorithm.
The Semantic Web ontology is a fuzzy rough concept partial ordering set, namely a binary pair composed of a fuzzy rough concept set and a kind of partial ordering relation. It has been represented in two practical forms: the fuzzy rough concept list and the fuzzy rough concept lattice. Imprecise data is given clear linguistic-based values, thus enabling fuzziness. Based on the elements of the imprecise Semantic Web ontology, a set-theoretic model of the imprecise Semantic Web ontology is set up, and from the concepts and partial ordering relations a two-dimensional table and its corresponding lattice graph are drawn.
Information retrieval requires several pre-processing steps to prepare the document collection for the IR task. The first step is to filter out unwanted characters and mark-up. The text is then broken into tokens (keywords), using white space and punctuation characters as delimiters. The keywords can either be used as such or transformed into a base form, e.g., nouns in the singular, verbs in the infinitive. A common approach is to stem the tokens to stem forms. Stemming the terms before building the inverted index has the advantage that it reduces the size of the index and allows retrieval of web pages containing various inflected forms of a word. Stemming is easier than computing base forms, because stemmers remove suffixes without needing a full dictionary of the words in a language; a popular and fast stemmer is Porter's stemmer. Another useful pre-processing step is to remove very frequent words that appear in most documents and do not carry meaningful content, called stop words. Important phrases composed of two or more words can also be detected and used as keywords (possibly using a domain-specific dictionary, or a statistical method that analyses the text collection to detect sequences of words that frequently appear together). The inverted index built over the text stores, for each keyword, the list of documents that contain it, allowing fast access during the retrieval step. The main components of a search engine are the web crawler, which has the task of collecting web pages, and the information retrieval system, which has the task of retrieving text documents that answer a user query. Practical considerations include information about existing information retrieval systems and a detailed example of a large-scale search engine, including page ranking.
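The pre-processing steps described above can be sketched as follows. This is a minimal illustration: the stop-word list is a tiny sample, and the suffix stripping is a crude stand-in for Porter's stemmer, not the real algorithm.

```java
import java.util.*;

// Sketch of IR pre-processing: tokenization on whitespace/punctuation,
// stop-word removal, and naive suffix stripping (NOT the Porter algorithm).
public class Preprocess {
    static final Set<String> STOP_WORDS =
        new HashSet<>(Arrays.asList("the", "a", "an", "is", "of", "and", "to", "in"));

    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase().split("[\\s\\p{Punct}]+"))
            if (!t.isEmpty() && !STOP_WORDS.contains(t))
                tokens.add(stem(t));
        return tokens;
    }

    // Naive suffix stripping for illustration only.
    static String stem(String w) {
        if (w.endsWith("ing") && w.length() > 5) return w.substring(0, w.length() - 3);
        if (w.endsWith("s") && w.length() > 3) return w.substring(0, w.length() - 1);
        return w;
    }
}
```

For example, "The indexing of documents" is reduced to the keyword list [index, document], which would then feed the inverted index construction.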
The part of the web that is not indexed by search engines, called the invisible web, is also described, along with other types of information retrieval systems such as distributed information retrieval systems, and a discussion of the Semantic Web and future trends in visualizing search results and inputting queries in natural language.
GA ENABLED ONTOLOGY FOR SEMANTIC WEB
Figure 3 Architecture of semantic web conversion using GA enabled ontology.
Figure 3 shows the architecture of semantic web conversion using GA enabled ontology. The initial input provided to the system is the URL of the web page whose contents are to be made semantic; the content of this URL is presented as input to a JSP page. The web content at the specified URL is downloaded, and an HTML DOM parser is used in the servlet Java code to extract the text content from the URL link, which is saved as a text document. A stop-list file consisting of verbs, pronouns, and commonly used terms is used as a supporting file: it is compared with the downloaded text content to remove the stop words. After removing the stop words from the downloaded file, the remaining words are saved in a keyword file. The extracted keywords stored in the keyword text file are verified using the dictionary API, and the meaningful words are shortlisted in order to calculate the term frequency and inverse document frequency. As before, freq(ti, Di) is the number of occurrences of the token ti within the document descriptor Di; the term frequency of each token, the inverse document frequency (the ratio between the total number of documents and the number of documents containing the term), and the TF/IDF weight w(ti) of a token are calculated using the equations discussed earlier.
While the common implementation of TF/IDF gives equal weight to the term frequency and the inverse document frequency (i.e., w = tf * idf), we chose to give higher weight to the idf value; the reason for this modification is to normalize the inherent bias of the tf measure in short documents. The ranking is used to filter the tokens according to a threshold that filters out words whose weight lies more than two standard deviations above the average token weight w. To perform the spreadsheet operations, the jxl.jar library is used, so that the term frequency and inverse document frequency of each keyword are saved in an Excel file in which the first column holds the keyword while the second and third hold the term frequency and inverse document frequency respectively. The word with the highest weight calculated from the term frequency is considered to be the subject of the document; based on the frequency measure, the predicate, subject, and object are predicted using the domain knowledge of the source from which the content was extracted.
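The two-standard-deviation threshold described above can be sketched as follows (an illustration under the paper's description; the class name and map-based interface are assumptions):

```java
import java.util.*;

// Sketch of the ranking threshold: tokens whose weight lies more than two
// standard deviations above the average weight are filtered out.
public class WeightFilter {

    public static Map<String, Double> filter(Map<String, Double> weights) {
        double mean = weights.values().stream()
                .mapToDouble(Double::doubleValue).average().orElse(0);
        double var = weights.values().stream()
                .mapToDouble(w -> (w - mean) * (w - mean)).average().orElse(0);
        double threshold = mean + 2 * Math.sqrt(var);  // second standard deviation
        Map<String, Double> kept = new LinkedHashMap<>();
        for (Map.Entry<String, Double> e : weights.entrySet())
            if (e.getValue() <= threshold) kept.put(e.getKey(), e.getValue());
        return kept;
    }
}
```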
A genetic algorithm based approach for aggregating similarity metrics into a single measure is presented. From an initial population, individuals are chosen in which each one represents a specific combination of measures, and the algorithm finds the combination with the best alignment quality.
The steps involved in the process of Genetic Algorithm are as follows:
DESCRIPTION OF APPLICATION
The problem of ontology matching is represented as an optimization problem in which the word with maximum similarity to the subject word has to be found and sorted.
Since the words with maximum similarity to the subject word have to be found, the objective function is of the maximization type:

f = max sim(word, subject word), taken over all keywords

where x = the subject word of the current document, y = the total number of keywords, and for each iteration n a new candidate word x is compared against the subject word.
The comparison of a keyword with the subject word is carried out under the condition that a word already compared is not compared again for that particular document, and the search stays within the total number of keywords.
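The objective above can be sketched as follows. This is an illustration only: the similarity function here is a simple character-overlap stand-in, whereas the paper uses a technical dictionary API returning values in [0, 1].

```java
import java.util.*;

// Sketch of the objective: among the not-yet-compared keywords, find the one
// with maximum similarity to the subject word, never comparing a word twice.
public class Objective {

    // Stand-in similarity: Jaccard overlap of character sets (illustrative only).
    static double similarity(String word, String subject) {
        Set<Character> a = new HashSet<>(), b = new HashSet<>();
        for (char c : word.toCharArray()) a.add(c);
        for (char c : subject.toCharArray()) b.add(c);
        Set<Character> shared = new HashSet<>(a);
        shared.retainAll(b);
        Set<Character> union = new HashSet<>(a);
        union.addAll(b);
        return union.isEmpty() ? 0.0 : (double) shared.size() / union.size();
    }

    // Returns the best-matching word, marking compared words as visited.
    public static String bestMatch(List<String> keywords, String subject) {
        Set<String> compared = new HashSet<>();
        String best = null;
        double bestVal = -1;
        for (String w : keywords) {
            if (!compared.add(w)) continue;   // skip words already compared
            double v = similarity(w, subject);
            if (v > bestVal) { bestVal = v; best = w; }
        }
        return best;
    }
}
```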
The problem is represented in binary form. Initially a random population of 10 chromosomes, each a binary string of length 10, is generated as shown below:
First Population: [Chromosome: 0010111101, Chromosome: 1000001010, Chromosome: 0110011000, Chromosome: 0100000000, Chromosome: 1010111001, Chromosome: 1100001011, Chromosome: 1000000011, Chromosome: 1010100001, Chromosome: 1101111100, Chromosome: 0001000101]
Each chromosome's binary representation is converted to its corresponding decimal value as shown in table 1.
Table 1 Initial population generated by Random numbers and their equivalent binary numbers.
Unsigned integer conversion is used in the Java code so that all the decimal values are converted into positive integers.
The fitness function is calculated for the words in the Excel sheet corresponding to the positive decimal equivalents of the initially generated chromosomes. If the random number exceeds the length of the keyword list, the check-and-divide technique is applied, as shown in the following code, and the fitness is calculated.
for (Object s : population) {
    // decode the chromosome's bit string into a positive integer index
    int g = Integer.parseInt(((Chromosome) s).getBitString(), 2);
    if (g > totalnumberofkeywords) {
        g = g % totalnumberofkeywords; // check-and-divide keeps the index within the list
    }
    technicalAPI ta = new technicalAPI();
    val1 = ta.max(keyword, subjectword); // fitness: similarity with the subject word
}
FINDING BEST VALUE
The values of the max comparison are sorted, and the maximum value is considered to be the best-fit value. For the Java URL example, whose subject is predicted as "java", the calculated maximum frequency value is mapped to the keywords as shown in table 2.
Table 2 Sorted list of population after Fitness calculation.
7. GENETIC OPERATIONS
The binary equivalents of the maximum-similarity words are selected; for example, the chromosomes shown in table 3 are taken into consideration.
Table 3 Chromosomes with max value selected for genetic operations
7.2 CROSS OVER
Single-point crossover is carried out between two chromosomes. For the selected chromosomes, with the crossover point after the fifth bit, the crossover is carried out as follows:

0100000000 -> 01000 | 00000
0010111101 -> 00101 | 11101

Swapping the tails yields the new chromosomes after crossover: 0100011101 and 0010100000.
For these new chromosomes the fitness is calculated as shown in table 4.
Table 4 Fitness calculation for new chromosomes after cross over.
For the new population the fitness is calculated again, and the process is iterated for the specified number of generations.
Mutation is carried out when values are repeated, or when continuous zeros are obtained across more than one set of the new populations generated after crossover.
For example, the chromosome 100011101 shown in table 5, whose similarity value with the subject word is repeated, is selected for mutation.

Table 5 Selected sample chromosome for mutation.

The chromosome is mutated by flipping its last bit:

10001110|1 -> after mutation: 100011100

The decimal equivalent of 100011100 is 284, and the word at that index value is compared with the subject word.
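The crossover and mutation operations illustrated above can be sketched directly on bit strings (class and method names are illustrative):

```java
// Sketch of single-point crossover and single-bit mutation on bit-string
// chromosomes, matching the worked examples above.
public class BitOps {

    // Single-point crossover: swap the tails of the two parents at the given point.
    public static String[] crossover(String p1, String p2, int point) {
        String c1 = p1.substring(0, point) + p2.substring(point);
        String c2 = p2.substring(0, point) + p1.substring(point);
        return new String[]{c1, c2};
    }

    // Bit-flip mutation at a single position.
    public static String mutate(String chromosome, int pos) {
        char flipped = chromosome.charAt(pos) == '0' ? '1' : '0';
        return chromosome.substring(0, pos) + flipped + chromosome.substring(pos + 1);
    }
}
```

Crossing 0100000000 and 0010111101 at point 5 yields 0100011101 and 0010100000; flipping the last bit of 100011101 yields 100011100, whose decimal equivalent is 284.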
As discussed above, ontology matching is represented as an optimization problem for which the GA operations are carried out, including the description of the application, the objective function, problem representation, problem initialization, fitness calculation, and the genetic operations of selection, crossover, and mutation for ontology mapping.
The floating-point values obtained after the specified number of generations are taken as input for the fuzzy if-then rules.
The words with the best-fit values are given as input to the spiders; the crawled links are mapped to the corresponding words, and the actual content of the original web page is presented to the user as a semantic web page in one part of the server-space web page. One splitter window displays, in semantic form, the other possible meaningful words previously calculated by the GA, while another splitter window holds the information of web pages traversed earlier.
The similarity measure was carried out for different URLs. If a relation exists between each pair under test, consisting of a word and the subject word, then TF/IDF and bootstrapping return the value true, else they return zero, as shown in figure 5. The GA enabled ontology returns a floating-point value, calculated with the help of the technical Java API that measures the relationship between words; this was extended to TF/IDF and bootstrapping ontology, so that all three result in floating-point values ranging between 0 and 1, as shown in figure 11. Figures 6 and 7 show the word frequency calculated using TF/IDF and bootstrapping, while figure 8 shows the comparison of TF/IDF and bootstrapping.
Figure 5 Boolean based TF/IDF, Bootstrapping and GA based mapping.
Figure 6 TF/IDF frequency
Figure 7 Bootstrapping ontology frequency
Figure 8 Comparison of TF/IDF and bootstrapping.
Figure 9 shows the comparison of the relations calculated for boolean TF/IDF, bootstrapping, and GA ontology. It is noted that the GA ontology returns a value for every possible combination, even with lower similarity, whereas under its boolean policy TF/IDF returns false even if the similarity of a word is 0.99, and returns true if and only if the relation is exact, i.e., 1.
Figure 9 Relation comparison for values less than 0.99
Figure 10 shows the relation between TF/IDF and bootstrapping, whose relations were calculated based on a floating-point similarity measure.
Figure 10 Floating point based calculation for TF/IDF and Bootstrapping.
Figure 11 Comparison of GA based ontology with TF/IDF and bootstrapping ontology based on floating-point values.
Figure 11 shows the similarity comparison between TF/IDF, bootstrapping, and GA enabled ontology, calculated using the technical API that returns floating-point values for all possible relations.
The results show that the number of words in the GA enabled ontology is smaller than in TF/IDF and bootstrapping. It is also noted that the maximum best value of the GA enabled ontology is higher than the average of the TF/IDF and bootstrapping ontologies.
In this paper, we presented a WordNet-dependent TF/IDF that helps in finding the subject closest to the text content of the web document extracted from the given URL, and a GA enabled ontology that identifies the similarity between words and yields best-fit values, which are floating-point values in the range 0 to 1. The values in the higher range are chosen and provided as input to the web crawler, based on which metadata for the selected word is provided and the non-semantic web content is converted into a semantic page within the user space.
It will be interesting in the future to take web services into consideration, build a GA enabled ontology for tokens extracted from WSDL files, and further create an OWL file dynamically with the help of Java for the resulting relations, without using tools like Protégé where entries are manual.