Abstract: This paper introduces data mining and web mining, defines the approaches to web mining, and surveys the data mining techniques applied in the web domain. It explains the scope of data mining and shows how web mining can be categorized according to which part of the web is mined; web usage mining itself can be classified further by the kind of usage data considered. Finally, it discusses how to apply data mining to our web sites.
Data mining is "the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data" (Fayyad). The most commonly used techniques in data mining are artificial neural networks, decision trees, genetic algorithms, the nearest-neighbour method, and rule induction.
The web contains a collection of pages that includes countless hyperlinks and huge volumes of access and usage information. Because of the ever-increasing amount of information in cyberspace, knowledge discovery and web mining are becoming critical for successfully conducting business in the cyber world. Web mining is the discovery and analysis of useful information from the web: the use of data mining techniques to automatically discover and extract information from web documents and services (content, structure, and usage).
II. APPROACHES OF WEB MINING
Two different approaches were taken in initially defining web mining: i. the process-centric view, which sees web mining as a sequence of tasks; and ii. the data-centric view, which sees web mining in terms of the web data used in the mining process. The important data mining techniques applied in the web domain include association rules, sequential pattern discovery, clustering, path analysis, classification, and outlier discovery.
1. Association Rule Mining: predicts associations and correlations among sets of items, "where the presence of one set of items in a transaction implies (with a certain degree of confidence) the presence of other items". That is, it 1) discovers the correlations between pages that are most often referenced together in a single server session/user session.
2) Provides information such as: i. Which sets of pages are frequently accessed together by web users? ii. Which page will be fetched next? iii. Which paths are frequently followed by web users?
3) Associations and correlations can be derived from: i. usage data (user sessions, user transactions); ii. content data (similarity based on content analysis); iii. structure (link connectivity between pages).
A) Guides web site restructuring, by adding links that interconnect pages often viewed together.
B) Improves system performance by prefetching web data.
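As a sketch of how page associations can be mined from usage data, the support/confidence computation described above can be illustrated in Python. The session data and page names here are hypothetical, chosen only to make the counting concrete:

```python
from itertools import combinations

# Hypothetical server sessions: each is the set of pages one user viewed.
sessions = [
    {"home", "products", "cart"},
    {"home", "products"},
    {"home", "about"},
    {"products", "cart"},
]

def support(itemset, sessions):
    """Fraction of sessions that contain every page in the itemset."""
    return sum(itemset <= s for s in sessions) / len(sessions)

def confidence(antecedent, consequent, sessions):
    """Of the sessions containing the antecedent pages, the share that
    also contain the consequent pages."""
    return support(antecedent | consequent, sessions) / support(antecedent, sessions)

# Page pairs viewed together in at least half of all sessions.
pages = sorted(set().union(*sessions))
frequent_pairs = [
    frozenset(pair)
    for pair in combinations(pages, 2)
    if support(frozenset(pair), sessions) >= 0.5
]
```

A rule such as products → cart would then be reported with its confidence, guiding which interconnecting links to add.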
2. Sequential Pattern Discovery: applied to web server access logs. The purpose is to discover sequential patterns that indicate user visit behaviour over a certain period, that is, the order in which URLs tend to be accessed. Advantages: a) useful user trends can be discovered; b) predictions concerning visit patterns can be made; c) website navigation can be improved; d) advertisements can be personalized; e) the link structure can be dynamically reorganized and web site contents adapted to individual client requirements, or clients can be provided with automatic recommendations that best suit their profiles.
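In its simplest form, the order in which URLs are accessed can be captured by counting consecutive page pairs across sessions; frequent pairs are then candidates for prefetching or recommendation. The click streams below are invented for illustration:

```python
from collections import Counter

# Hypothetical ordered click streams, one list of URLs per session.
click_streams = [
    ["home", "products", "cart", "checkout"],
    ["home", "products", "cart"],
    ["home", "about", "products"],
]

def frequent_bigrams(streams, min_count=2):
    """Count consecutive URL pairs to surface common navigation steps."""
    counts = Counter()
    for stream in streams:
        for a, b in zip(stream, stream[1:]):
            counts[(a, b)] += 1
    return {pair: n for pair, n in counts.items() if n >= min_count}
```

Real sequential-pattern algorithms also handle gaps and time windows; consecutive pairs are the degenerate but easiest case.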
3. Clustering: groups together items (users, pages, etc.) that have similar characteristics. a) Page clusters: groups of pages that seem to be conceptually related according to users' perception. b) User clusters: groups of users that seem to behave similarly when navigating through a web site.
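A minimal illustration of user clustering, assuming each user is represented by the set of pages visited and using Jaccard similarity with a greedy single-pass assignment. The users, pages, and threshold are all hypothetical choices, not a standard algorithm:

```python
def jaccard(a, b):
    """Overlap between two page sets (1.0 = identical, 0.0 = disjoint)."""
    return len(a & b) / len(a | b)

# Hypothetical page sets per user.
users = {
    "u1": {"home", "products", "cart"},
    "u2": {"home", "products"},
    "u3": {"blog", "about"},
    "u4": {"about", "blog", "contact"},
}

def greedy_clusters(users, threshold=0.5):
    """Assign each user to the first cluster whose seed is similar enough."""
    clusters = []  # each cluster: (seed page set, [user ids])
    for uid, pages in users.items():
        for seed, members in clusters:
            if jaccard(pages, seed) >= threshold:
                members.append(uid)
                break
        else:
            clusters.append((pages, [uid]))  # no close cluster: start one
    return [members for _, members in clusters]
```

Production systems would use k-means or hierarchical clustering over feature vectors, but the grouping idea is the same.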
4. Classification: maps a data item into one of several predetermined classes, for example describing each user's category using profiles. Common classification algorithms are decision trees, the naïve Bayesian classifier, and neural networks.
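The naïve Bayesian classifier mentioned above can be sketched over visited-page features. The training data, page names, and category labels below are invented for illustration only:

```python
from collections import Counter, defaultdict
from math import log

# Hypothetical training data: pages a user visited -> user category.
train = [
    (["specs", "reviews", "compare"], "researcher"),
    (["cart", "checkout", "payment"], "buyer"),
    (["specs", "compare", "reviews"], "researcher"),
    (["cart", "payment"], "buyer"),
]

class_counts = Counter(label for _, label in train)
page_counts = defaultdict(Counter)
for pages, label in train:
    page_counts[label].update(pages)
vocab = {p for pages, _ in train for p in pages}

def classify(pages):
    """Pick the class with the highest log naive Bayes score."""
    def score(label):
        total = sum(page_counts[label].values())
        s = log(class_counts[label] / len(train))  # class prior
        for p in pages:
            # Laplace-smoothed likelihood of visiting page p in this class
            s += log((page_counts[label][p] + 1) / (total + len(vocab)))
        return s
    return max(class_counts, key=score)
```

Given a new session's pages, the classifier returns the most probable user category under the (naive) assumption that page visits are independent.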
5. Path Analysis: a technique that involves the generation of some form of graph that represents relations defined on web pages. This can be the physical layout of a web site, in which the web pages are nodes and the links between these pages are directed edges. Most such graphs are used to determine frequent traversal patterns, i.e., the more frequently visited paths in a web site. Example: what paths do users traverse before they go to a particular URL?
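A simple form of path analysis answers exactly that example question by counting the sequences of pages users traverse before reaching a given URL. The sessions and page names here are hypothetical:

```python
from collections import Counter

# Hypothetical ordered sessions (lists of pages in visit order).
sessions = [
    ["home", "products", "cart", "checkout"],
    ["home", "search", "products", "cart", "checkout"],
    ["home", "products", "cart", "checkout"],
]

def paths_to(target, sessions):
    """Count the page sequences users follow before first reaching target."""
    counts = Counter()
    for s in sessions:
        if target in s:
            prefix = tuple(s[: s.index(target)])  # pages before the target
            counts[prefix] += 1
    return counts
```

Here the path home, products, cart precedes checkout twice, so it is the dominant traversal pattern toward that URL.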
III. SCOPE OF DATA MINING
The scope of data mining covers: i. automated prediction of trends and behaviors; ii. automated discovery of previously unknown patterns.
Web mining searches for: i. web access patterns, ii. web structure, and iii. regularity and dynamics of web contents. Web mining research is a converging area drawing on several research communities, such as the database, information retrieval, and AI communities, especially machine learning and natural language processing. The World Wide Web is a popular and interactive medium for gathering information today, and it provides every Internet citizen with access to an abundance of information. However, users encounter several problems when interacting with the web.
Finding relevant information (information overload): only a small portion of web pages contains truly relevant or useful information.
Low precision (the abundance problem: 99% of the information is of no interest to 99% of the people), due to the irrelevance of many of the search results. This makes it difficult to find the relevant information.
Low recall (limited coverage of the web: Internet sources hidden behind search interfaces), due to the inability to index all the information available on the web. This makes it difficult to find relevant information that is unindexed.
Discovery of existing but "hidden" knowledge: search engines retrieve only about a third of the indexable web.
Personalization of the information (type and presentation of information): customization to individual users is limited.
Learning about customers/individual users.
Lack of feedback on human activities.
Lack of multidimensional analysis and data mining support.
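The low-precision and low-recall problems above can be made concrete with the standard definitions: precision is the share of retrieved pages that are relevant, and recall is the share of relevant pages that are retrieved. The result sets below are invented purely to illustrate the arithmetic:

```python
# Hypothetical search evaluation: which retrieved pages are truly relevant.
retrieved = {"p1", "p2", "p3", "p4"}          # what the engine returned
relevant = {"p2", "p4", "p7", "p9", "p10"}    # what the user actually needed

hits = retrieved & relevant                   # correctly retrieved pages
precision = len(hits) / len(retrieved)        # 2/4 = 0.5: half the results are noise
recall = len(hits) / len(relevant)            # 2/5 = 0.4: most relevant pages missed
```

Low precision corresponds to a small `precision` value (irrelevant results), low recall to a small `recall` value (relevant but unindexed or unranked pages).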
The web constitutes a highly dynamic information source. Not only does the web continue to grow rapidly, the information it holds also receives constant updates. News, stock market, service centre, and corporate sites revise their web pages regularly, and linkage information and access records also undergo frequent updates.
The web serves a broad spectrum of user communities. The Internet's rapidly expanding user community connects millions of workstations with widely varying usage purposes. Many users lack good knowledge of the information network's structure, are unaware of a particular search's heavy cost, frequently get lost within the web's ocean of information, and face lengthy waits to retrieve search results.
Web page complexity far exceeds the complexity of any traditional text document collection. Although the web functions as a huge digital library, the pages themselves lack a uniform structure and contain far more variation in authoring style and content than any set of books or traditional text-based documents. Moreover, searching the web is extremely difficult.
IV. WEB MINING TASKS
Web mining tasks include: i. mining web search engine data; ii. analyzing the web's link structures; iii. classifying web documents automatically; iv. mining web page semantic structure and page contents; v. mining web dynamics; vi. personalization.
Thus, web mining refers to the overall process of discovering potentially useful and previously unknown information or knowledge from web data. It aims at finding and extracting relevant information that is hidden in web-related data, in particular in text documents published on the web. Like data mining, it is a multi-disciplinary effort that draws techniques from fields such as information retrieval, statistics, machine learning, and natural language processing. Web mining can be a promising tool for addressing the shortcomings of ineffective search engines: incomplete indexing, retrieval of irrelevant information, and unverified reliability of retrieved information. It is essential to have a system that helps the user find relevant and reliable information easily and quickly on the web. Web mining not only discovers information from mounds of data on the WWW, but also monitors and predicts user visit patterns, which gives designers more reliable information for structuring and designing a web site.
Given the rate of growth of the web, the scalability of search engines is a key issue, as the amount of hardware and network resources needed is large and expensive. In addition, search engines are popular tools, so they operate under heavy constraints on query answer time. The efficient use of resources can improve both scalability and answer time, and one tool for achieving these goals is web mining.
V. WEB MINING CATEGORIZATION
Web mining can be categorized into three areas of interest based on which part of the web to mine (Web mining research lines):
1. Web Content Mining: the discovery of useful information from web contents/data/documents, or the application of data mining techniques to content published on the web. The web contains many kinds and types of data. Basically, web content consists of several types of data: plain text (unstructured), HTML (semi-structured), XML (structured), images, audio, video, metadata, dynamic documents, and multimedia documents. Recent research on mining multiple types of data is termed multimedia data mining, so we can consider multimedia data mining an instance of web content mining. The research on applying data mining techniques to unstructured text is termed knowledge discovery in texts, text data mining, or text mining; hence we can consider text mining an instance of web content mining as well. Research issues addressed in text mining include topic discovery, extracting association patterns, clustering of web documents, and classification of web pages.
Issues in Web content Mining:
developing intelligent tools for information retrieval
finding keywords and key phrases
discovering grammatical rules in collections
extracting key phrases from text documents
learning extraction rules
Web content mining approaches: agent-based and database approaches.
Agent-based approaches: involve AI systems that can "act autonomously or semi-autonomously on behalf of a particular user, to discover and organize web-based information". Agent-based approaches focus on intelligent and autonomous web mining tools built on agent technology. i. Some intelligent web agents can use a user profile to search for relevant information, then organize and interpret the discovered information (example: Harvest). ii. Some use various information retrieval techniques and the characteristics of open hypertext documents to organize and filter retrieved information (example: HyPursuit). iii. Others learn user preferences and use those preferences to discover information sources for a particular user (example: XpertRule Miner).
Database approaches: focus on "integrating and organizing the heterogeneous and semi-structured data on the web into more structured and high-level collections of resources". These metadata or generalizations are then organized into structured collections that can be accessed and analysed.
2. Web Structure Mining: operates on the web's hyperlink structure. This graph structure can provide information about page ranking or authoritativeness and enhance search results through filtering; that is, it tries to discover the model underlying the web's link structures. This model is used to analyse the similarity and relationships between different web sites, using the hyperlink structure of the web as an additional information source. This type of mining can be further divided into two kinds based on the structural data used. a) Hyperlinks: a hyperlink is a structural unit that connects a web page to a different location, either within the same web page (intra-document hyperlink) or in a different web page (inter-document hyperlink). b) Document structure: the content within a web page can also be organized in a tree-structured format, based on the various HTML and XML tags within the page. Mining efforts here have focused on automatically extracting document object model (DOM) structures from documents.
Web link analysis is used for:
ordering documents matching a user query (ranking)
deciding what pages to add to a collection
finding related pages
finding duplicated web sites and measuring the similarity between them.
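Page ranking from link structure, one of the uses listed above, can be sketched as a power-iteration PageRank over a toy hyperlink graph. The graph, damping factor, and iteration count follow the common textbook setup, not any particular search engine:

```python
# Toy hyperlink graph: page -> list of pages it links to.
links = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
}

def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank: a page's rank is its share of the ranks
    of the pages linking to it, mixed with a uniform teleport term."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1 / n for p in pages}
    for _ in range(iterations):
        new = {p: (1 - damping) / n for p in pages}
        for p, outs in links.items():
            share = rank[p] / len(outs)  # p splits its rank over its out-links
            for q in outs:
                new[q] += damping * share
        rank = new
    return rank
```

In this graph C is linked to by both A and B, so it ends up with the highest authority, while B, which only A links to, ranks lowest.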
3. Web Usage Mining: the application of data mining techniques to discover interesting usage patterns from web data, in order to understand and better serve the needs of web-based applications. It tries to make sense of the data generated by web surfers' sessions and behaviors. While web content and structure mining utilize the primary data on the web, web usage mining mines the secondary data derived from users' interactions with the web. Web usage data includes data from web server logs, proxy server logs, browser logs, and user profiles, as well as registration data, user sessions, cookies, user queries, mouse clicks, and any other data resulting from interactions. The usage data can be split into three kinds on the basis of where it is collected: on the server side (an aggregate picture of the usage of a service by all users), on the client side (a complete picture of the usage of all services by a particular client), and on the proxy side (somewhere in the middle). Web usage mining, also known as web log mining, analyzes the results of user interactions with a web server, including web logs, click streams, and database transactions at a web site or a group of related sites. The web usage mining process can be regarded as a three-phase process consisting of:
Preprocessing/data preparation: web log data are preprocessed in order to clean the data (removing log entries that are not needed for the mining process), integrate data, and identify users, sessions, and so on.
Pattern discovery: statistical methods as well as data mining methods (path analysis, association rules, sequential patterns, clustering, and classification rules) are applied in order to detect interesting patterns.
Pattern analysis: the discovered patterns are analyzed using OLAP tools, knowledge query management mechanisms, and intelligent agents to filter out the uninteresting rules/patterns.
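The preprocessing phase can be sketched as follows: parse raw log lines, drop entries not needed for mining (images, stylesheets, failed requests), and split each visitor's clicks into sessions using a timeout. The log lines, asset list, and 30-minute timeout are illustrative assumptions; only the Common Log Format itself is standard:

```python
import re
from datetime import datetime, timedelta

# Hypothetical raw entries in Common Log Format.
raw_log = [
    '1.2.3.4 - - [10/Oct/2023:13:55:36 +0000] "GET /home HTTP/1.1" 200 512',
    '1.2.3.4 - - [10/Oct/2023:13:56:10 +0000] "GET /logo.png HTTP/1.1" 200 90',
    '1.2.3.4 - - [10/Oct/2023:14:40:00 +0000] "GET /products HTTP/1.1" 200 734',
    '5.6.7.8 - - [10/Oct/2023:13:57:00 +0000] "GET /home HTTP/1.1" 200 512',
]

LINE = re.compile(r'(\S+) \S+ \S+ \[([^\]]+)\] "GET (\S+) [^"]+" (\d+)')
ASSETS = (".png", ".jpg", ".gif", ".css", ".js")

def sessionize(lines, timeout=timedelta(minutes=30)):
    """Clean the log, then split each visitor's clicks into sessions."""
    current = {}   # ip -> (timestamp of last click, pages of open session)
    sessions = []  # (ip, [pages]) in order of session start
    for line in lines:
        m = LINE.match(line)
        if not m:
            continue
        ip, ts, path, status = m.groups()
        if path.endswith(ASSETS) or status != "200":
            continue  # cleaning: drop images, stylesheets, failed requests
        t = datetime.strptime(ts, "%d/%b/%Y:%H:%M:%S %z")
        if ip not in current or t - current[ip][0] > timeout:
            pages = []                 # gap too long: start a new session
            sessions.append((ip, pages))
        else:
            pages = current[ip][1]
        pages.append(path)
        current[ip] = (t, pages)
    return sessions
```

The resulting sessions are exactly the input the pattern discovery phase (association rules, sequential patterns, clustering) operates on. Real preprocessing must also handle user identification behind shared IPs, which this sketch ignores.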
Many companies wanting an on-line presence believe that all they have to do is build a web site, sit back, and reap the benefits. In most cases this has been a fruitless exercise, and companies will be unable to improve the situation without first gaining a basic understanding of the visitors to their web site. Web mining puts e-tailers in the unprecedented position of being able to understand and predict the behaviour of their customers. Companies can now optimise their e-business sites for maximum commercial impact and personalise the on-line content of their web sites.
It is those companies who adopt a web mining strategy now, to learn about their customers, who will gain the competitive edge in the new 'digital economy'.