In this paper, the authors propose a Web mining approach for the Semantic Web. The approach uses a search engine and the traditional Web as sources of information to produce semantically rich information. The paper investigates the problem of extracting knowledge from a large number of web documents in order to develop ontologies, and introduces web usage patterns as a novel source of semantics in ontology learning. The proposed methodology combines web content mining with web usage mining in the knowledge extraction process. As a result, both the web user's and the web author's perspectives on the web content are captured, which leads to the extraction of a more realistic set of conceptual relationships. The evaluation results support the effectiveness of the proposed methodology. The solution is intended for transforming large web corpora into the Semantic Web, and it could also be used to develop cross-domain ontologies.
Though the Web is rich with information, gathering and making sense of this data is difficult because the documents on the Web are largely unorganized. The biggest challenge for the next several decades is how to effectively and efficiently extract a machine-understandable and queryable layer of information and knowledge, called the Semantic Web, from unorganized, human-readable Web data.
The Semantic Web is Sir Tim Berners-Lee's vision of the Web. It is an extension of the current Web 2.0 that describes things in a way computers can understand. If computers can understand the meaning behind information, they can learn what we are interested in and help us find what we want. In the Semantic Web, information is given well-defined meaning, and converting Web content into machine-understandable form would improve the quality and intelligence of the Web.
The Semantic Web focuses mainly on data and information. Data in the Semantic Web is well defined and linked in a way that supports more effective discovery and automation. Most data on the Web is so unstructured that it can only be understood by humans, yet the amount of data is so large that it can only be processed efficiently by machines. Ontology is the backbone of the Semantic Web. An ontology is a formal representation of a collection of concepts and their relationships. This formal representation makes web documents more understandable to machines as well as to humans. An ontology vocabulary is used to describe the various types of resources and the relationships between them.
Web pages need not be directly linked with each other. Broadly considered, the same web content therefore has two perspectives: the web author's and the web user's. The hidden knowledge and interests behind both perspectives have to be utilized by the Semantic Web. Web documents can be grouped into three categories based on their structure: unstructured, semi-structured, and fully structured. The Semantic Web plays a vital role in fully structured web pages.
Another emerging area is web mining, the application of data mining techniques to extract knowledge from Web data. Web mining is seen as a helpful tool for transforming human-understandable content into machine-understandable semantics. It is divided into three types: web content mining, web structure mining, and web usage mining.
Web content mining is the process of extracting useful information from the contents of web documents. It examines the content of web pages as well as the results of web searches. Content data corresponds to the collection of facts a web page was designed to convey to its users.
Web structure mining is the process of discovering structure information from the Web, and is mostly concerned with the hyperlinks of web pages. It is used to improve the structure of web pages and, based on the hyperlinks, to categorize web pages and related information at the inter-domain level.
Web usage mining is the process of extracting useful information from server logs, i.e. users' history and browsing behavior. It performs mining on web usage data or web logs. The logs can be examined from the client perspective or the server perspective.
This paper attempts to combine web content mining and web usage mining in order to automate the processing of semi-structured web documents. Currently, combinations of web mining approaches are used in Semantic Web personalization and search engine optimization. In the future, a combination of these web mining approaches is expected to be used for extracting semantics from web resources.
The related work focused on the following techniques:
Natural language processing techniques
Fully-structured web documents
Semi-structured web documents
Natural language processing techniques treat unstructured web documents as free text and mostly use statistical approaches and simple text mining in the ontology learning process.
Work on fully structured web documents takes the structure of the web documents into account. These attempts have benefited from standardized syntax such as XML. Because of this standardization, less effort is needed to extract semantics from fully structured web documents than from other types of web documents.
Approaches for semi-structured web documents extract the plain text from the documents in the pre-processing stage and then apply simple text mining techniques to the extracted free text. When the plain text is extracted, valuable information stored in the HTML markup is lost before it can be used in the conceptual relationship extraction process.
Approaches that do use the semi-structured nature of web pages in the ontology development process have become specific to special types of web sites, such as template-driven web sites. In template-driven web sites, the syntactic structure is used to infer semantic relationships: concepts that share the same syntactic structure are assumed to be semantically related. However, these approaches give valid results only for template-driven web sites, and their general applicability is questionable.
Other related work defines an ontology-based information retrieval (IR) model, oriented to the exploitation of domain knowledge bases to support semantic search capabilities in large document repositories.
The proposed work combines two web mining techniques, web content mining and web usage mining, in the process of extracting conceptual relationships. The extracted concepts and conceptual relationships give rise to a semantic network, which is a form of ontology.
This work has three main stages:
Concept and conceptual relationship extraction through web content mining
Conceptual relationships identification through web usage mining
Refining/Merging the conceptual relationships obtained through the web content and web usage pattern information, to derive the final conceptual relationships or the web of conceptual relationships.
Weighted frequency is used when extracting concepts from web content. For each word in the web content, the weighted frequency is calculated; it depends on the word's frequency of occurrence. After the weights of each word sequence in the web content have been identified, their conceptual meaning (sense) needs to be identified. To identify the sense, the sentence structure has to be analyzed in detail. Part-of-speech (POS) tagging, a natural language processing (NLP) technique in which each word is annotated with its grammatical category, is used to identify the sentence structure and obtain well-formed terms.
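The weighted frequency step can be illustrated with a minimal sketch. The text above does not say how individual occurrences are weighted, so the weights in SECTION_WEIGHTS, the (kind, text) input shape, and the function name weighted_frequency are all assumptions made for illustration. POS tagging is not shown.

```python
import re
from collections import Counter

# Hypothetical weights: the source does not specify a weighting scheme.
# Here an occurrence in a title or heading counts more than one in body text.
SECTION_WEIGHTS = {"title": 3.0, "heading": 2.0, "body": 1.0}


def weighted_frequency(segments):
    """Return a word -> weighted frequency mapping.

    segments is an iterable of (kind, text) pairs, where kind is a key of
    SECTION_WEIGHTS; unknown kinds fall back to a weight of 1.0.
    """
    scores = Counter()
    for kind, text in segments:
        weight = SECTION_WEIGHTS.get(kind, 1.0)
        # Lowercase and keep alphabetic tokens only (a deliberately crude tokenizer).
        for word in re.findall(r"[a-z]+", text.lower()):
            scores[word] += weight
    return scores


if __name__ == "__main__":
    page = [("title", "Semantic Web"), ("body", "The web is large")]
    for word, score in weighted_frequency(page).most_common(3):
        print(word, score)
```

Words with the highest scores would then be passed to the sense identification step as candidate concepts.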
Conceptual Relationships Extraction:
To identify the structure of a web document, the concept of a content section is used. A content section is based on how the web author creates the content, which is organized under headings. The web author designs the web documents so that information stored lower in the hierarchy is more specific, which produces a hierarchical tree structure. If concepts frequently occur together, this hints at the existence of a conceptual relationship between them.
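One way to picture the content section hierarchy is to build a tree from the HTML heading tags of a page. The sketch below assumes that headings h1 to h6 mark the content sections, which is an interpretation rather than something the text states; the class HeadingTree and the helper parent_child_pairs are hypothetical names.

```python
from html.parser import HTMLParser


class HeadingTree(HTMLParser):
    """Collect h1..h6 headings into a nested tree of content sections."""

    def __init__(self):
        super().__init__()
        self.root = {"title": "ROOT", "level": 0, "children": []}
        self._stack = [self.root]  # path from the root to the current section
        self._level = None         # level of the heading being read, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if len(tag) == 2 and tag[0] == "h" and tag[1] in "123456":
            self._level = int(tag[1])
            self._text = []

    def handle_data(self, data):
        if self._level is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if self._level is not None and tag == f"h{self._level}":
            node = {"title": "".join(self._text).strip(),
                    "level": self._level, "children": []}
            # Close every open section at the same or a deeper level.
            while self._stack[-1]["level"] >= self._level:
                self._stack.pop()
            self._stack[-1]["children"].append(node)
            self._stack.append(node)
            self._level = None


def parent_child_pairs(node):
    """List (general heading, more specific heading) pairs, depth first."""
    pairs = []
    for child in node["children"]:
        if node["level"] > 0:  # skip the artificial root
            pairs.append((node["title"], child["title"]))
        pairs.extend(parent_child_pairs(child))
    return pairs


if __name__ == "__main__":
    parser = HeadingTree()
    parser.feed("<h1>Mining</h1><h2>Content</h2><h2>Usage</h2><h3>Logs</h3>")
    print(parent_child_pairs(parser.root))
```

Each parent and child pair is a candidate conceptual relationship in which the child is the more specific concept.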
Apriori is a widely used algorithm for extracting these kinds of relationships. The algorithm uses two measures, support and confidence. Support is a measure of usefulness and confidence is a measure of certainty. To indicate how closely two concepts are related to each other, a new measure is introduced, named the relationship strength measure. The generated logical tree structure is traversed and values for the three measures (support, confidence, and relationship strength) are obtained. The relationship value of each conceptual relationship is then calculated by combining these three measures. Unwanted conceptual relationships are filtered out using externally specified threshold values.
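Support and confidence can be illustrated with a small sketch restricted to pairs of concepts. This is a pair-only simplification, not the full level-wise Apriori algorithm, and it does not compute the relationship strength or the combined relationship value, because their formulas are not given in the text. The function name pair_rules and the idea of treating each content section as one transaction are assumptions.

```python
from itertools import combinations


def pair_rules(transactions, min_support, min_confidence):
    """Return (x, y, support, confidence) for concept pairs x -> y.

    support    = fraction of transactions containing both x and y
    confidence = fraction of transactions containing x that also contain y
    """
    sets = [set(t) for t in transactions]
    n = len(sets)
    items = sorted(set().union(*sets))
    count = {item: sum(item in s for s in sets) for item in items}

    rules = []
    for a, b in combinations(items, 2):
        both = sum(a in s and b in s for s in sets)
        support = both / n
        if support < min_support:
            continue  # pair is too rare to be useful
        for x, y in ((a, b), (b, a)):
            confidence = both / count[x]
            if confidence >= min_confidence:
                rules.append((x, y, support, confidence))
    return rules


if __name__ == "__main__":
    # Each inner list stands for the concepts found in one content section.
    sections = [
        ["ontology", "web"],
        ["ontology", "web", "mining"],
        ["web", "mining"],
        ["ontology", "web"],
    ]
    for rule in pair_rules(sections, min_support=0.5, min_confidence=0.9):
        print(rule)
```

The two threshold arguments play the role of the externally specified threshold values used to filter out unwanted relationships.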
3. Mining the web usage patterns to extract conceptual relationships:
Web usage mining makes it possible to identify user navigation patterns. Users navigate a web site according to their aims and according to how they relate the content of its web pages, so relationships between web pages can be extracted from the user's perspective. If the web site changes rapidly and its content is updated frequently, identifying user navigation patterns becomes difficult and may not reveal adequate information. Therefore, web usage mining information is used only in the conceptual relationship refining stage, to give suggestions to the ontology developer.
The web server automatically adds an entry to the log file for each request, so it is important to understand the entries of each user. Log files generally contain the IP address of the requester, the user name of the user who generated the request (if applicable), the date and time of the request, the method of the request (GET or POST), the name of the file requested, the result of the request, and the size of the data sent back. There are two main limitations in identifying user sessions based on the user's IP address: 1) one user could have several IP addresses even within the same session, and 2) several users could share the same IP address because of network address translation. Cookies can be used for better session identification.
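The fields listed above match the widely used Common Log Format, so session identification can be sketched as follows. The 30-minute inactivity timeout is a common heuristic and an assumption here, not a value taken from the text, and the names LOG_PATTERN, parse_line and sessionize are hypothetical. Sessions are keyed by IP address only, so the sketch inherits both limitations described above.

```python
import re
from datetime import datetime, timedelta

# Common Log Format: ip ident user [time] "method path protocol" status size
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) (?P<size>\S+)'
)


def parse_line(line):
    """Parse one log line into a dict, or return None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    entry["time"] = datetime.strptime(entry["time"], "%d/%b/%Y:%H:%M:%S %z")
    return entry


def sessionize(entries, timeout=timedelta(minutes=30)):
    """Group requests into (ip, [paths]) sessions split by inactivity gaps."""
    sessions = []
    last_seen = {}  # ip -> (time of last request, path list of the open session)
    for entry in sorted(entries, key=lambda e: e["time"]):
        ip = entry["ip"]
        previous = last_seen.get(ip)
        if previous is None or entry["time"] - previous[0] > timeout:
            paths = []  # start a new session for this IP address
            sessions.append((ip, paths))
        else:
            paths = previous[1]
        paths.append(entry["path"])
        last_seen[ip] = (entry["time"], paths)
    return sessions


if __name__ == "__main__":
    lines = [
        '10.0.0.1 - - [10/Oct/2000:13:55:36 -0700] "GET /a.html HTTP/1.0" 200 2326',
        '10.0.0.1 - - [10/Oct/2000:13:58:00 -0700] "GET /b.html HTTP/1.0" 200 100',
        '10.0.0.1 - - [10/Oct/2000:15:00:00 -0700] "GET /c.html HTTP/1.0" 200 50',
    ]
    print(sessionize([parse_line(line) for line in lines]))
```

Where cookies are available, the cookie value could replace the IP address as the session key.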
The extracted user sessions are used to generate usage clusters, which establish groups of users with similar navigation patterns. The K-means clustering algorithm was used to partition the user sessions into a set of clusters based on the Euclidean distance function. Each cluster therefore represents a group of web transactions that are similar in terms of the co-occurrence patterns of their URLs. Web pages that are frequently accessed together suggest that the concepts residing in those pages are related to each other. The threshold value used in web content mining is lowered by 25%; this is called the negative border. The ontology developer can refine the extracted concepts and conceptual relationships in the refinement stage.
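The clustering step can be sketched with a small K-means implementation over binary session vectors, where each dimension records whether a URL was visited in the session. The binary encoding, the naive initialization from the first k vectors, and the fixed iteration count are assumptions made to keep the example short and deterministic; the text only states that K-means with Euclidean distance was used.

```python
import math


def to_vector(session_paths, urls):
    """Encode a session as a 0/1 vector over a fixed list of URLs."""
    visited = set(session_paths)
    return [1.0 if url in visited else 0.0 for url in urls]


def kmeans(vectors, k, iterations=20):
    """Plain K-means with Euclidean distance; returns (labels, centroids)."""
    # Naive initialization (assumption): use the first k vectors as centroids.
    centroids = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: attach each vector to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(v, centroids[c]))
            for v in vectors
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


if __name__ == "__main__":
    urls = ["/a", "/b", "/c", "/d"]
    sessions = [["/a", "/b"], ["/c", "/d"], ["/a"], ["/d"]]
    vectors = [to_vector(s, urls) for s in sessions]
    labels, centroids = kmeans(vectors, k=2)
    print(labels)
```

URLs with high centroid values in the same cluster are pages that tend to be accessed together, which is the co-occurrence signal used to suggest related concepts.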
This paper shows the combination of two fast-developing research areas, the Semantic Web and Web Mining. The authors discussed how Semantic Web Mining can improve the results of Web Mining by exploiting the new semantic structures in the Web. The central idea of this paper is to use both the web author's and the web user's perspectives in the ontology learning process, which the authors present as a well-suited solution for the growing Semantic Web. This work could be extended to large collections of documents, which would considerably reduce the cost, in time and money, of developing Semantic Web applications. The research findings could also be used for search engine optimization, making the task of web crawlers more effective and helping pages achieve a higher rank in web search. Furthermore, many types of sites can profit from reorganization as Semantic Web sites, and such sites could benefit from these findings.