An Efficient Web Content Outlier Mining Algorithm Computer Science Essay


Outliers, commonly referred to as exceptional cases, exist in many real-world databases. Detection of such outliers is very important for numerous applications, such as detecting criminal activities in E-commerce. The web now serves as a huge, widely distributed, global information service. Web outlier mining can be defined as identifying diverse patterns in web data. Studying the extraordinary behavior of web outliers helps uncover the valuable knowledge hidden behind them, which in turn helps decision makers improve the service and quality of the web. This paper describes web outlier mining as the discovery and analysis of rare and interesting patterns from the web, and presents a framework for mining web outliers that introduces a new method for detecting outliers while simultaneously scaling up performance evaluation.

Keywords: Data Mining, Web Mining, Web Outliers, N-Grams, Text Categorization, Web Contents, Content Specific Algorithm, Dissimilarity Measure


1. Introduction

Outliers are data objects with different characteristics compared to other data objects. A formal definition of outliers is given by [5]: "An outlier is an observation that deviates so much from other observations as to arouse suspicion that it was generated by a different mechanism". Detection of such outliers is important for numerous applications, such as detecting criminal activities in E-commerce, video surveillance, weather prediction, intrusion detection, and pharmaceutical research. Exploring the diverse and dynamic web data for outliers is more interesting than finding outliers in numeric data sets. Interestingly, existing web mining algorithms have concentrated on finding frequent patterns while discarding the less frequent ones, which are likely to contain the outlying data. Web outliers are data objects that show significantly different characteristics from other web data, and identifying such diverse patterns in web data is known as web outlier mining.

Exponential growth of the web makes it a popular and fertile place for research. The huge, diverse, dynamic, and unstructured nature of the web calls for automated tools for tracking and analyzing web data and their usage patterns. This has given rise to the deployment of many server-side and client-side intelligent systems for mining information on the web.

This paper broadly describes web outlier mining as the discovery and analysis of rare and interesting patterns from the web. The differences in the information contents of web pages and servers make web outlier mining more challenging than traditional outlier mining. Unlike traditional outlier mining algorithms designed solely for numeric data sets, web outlier mining algorithms should be applicable to data of varying types, including text, hypertext, video, etc. It is therefore impossible to design a single algorithm for mining web outliers. Thus, web outlier mining is categorized into three components, depending on the source and data types involved in the mining process: web content outlier mining, web structure outlier mining, and web usage outlier mining.

Web usage outliers are those present in web usage data. Web structure outlier mining is the discovery of interesting patterns in the hyperlink structure of the web. A web content outlier is a page (or pages) with contents completely different from similar pages within the same category.

This paper is organized as follows. Section 2 discusses related work in the field of outlier mining. Section 3 presents the motivation for and applications of web outlier mining and explores its subfields; an efficient web content outlier mining algorithm is also proposed in this section. Section 4 presents the experimental work and result analysis. Section 5 concludes the paper and outlines areas of future work. Section 6 lists the references used in this paper.

2. Related Work

An outlier is an observation that deviates considerably from other observations. Traditional outlier mining techniques include those from statistics and data mining. Studies on outlier detection are numerous and can be grouped into several general categories.

Distribution Based methods deploy a standard distribution model (e.g., normal) and flag as outliers those points that deviate from the model [11].

Depth Based is a category of outlier mining in which mining is based on some definition of depth. Each data object is represented as a point in a k-d space and is assigned a depth; outliers are more likely to be data objects with smaller depths.

Distance Based methods were proposed by Knorr and Ng [7]. A distance-based outlier in a data set D is an object for which at least pct% of the objects in D lie at a distance of more than dmin from it.
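This definition can be sketched directly in code. The following is a minimal, brute-force illustration of the DB(pct, dmin) criterion described above (the function name, sample data, and parameter values are illustrative, not from the original paper):

```python
import math

def db_outliers(points, pct, dmin):
    """Return DB(pct, dmin)-outliers: points for which at least a fraction
    pct of the other points lie farther than dmin away (Knorr & Ng)."""
    outliers = []
    for p in points:
        # Count how many other points are more than dmin away from p.
        far = sum(1 for q in points if q is not p and math.dist(p, q) > dmin)
        if far / (len(points) - 1) >= pct:
            outliers.append(p)
    return outliers

data = [(0, 0), (1, 0), (0, 1), (1, 1), (10, 10)]
print(db_outliers(data, pct=0.9, dmin=5))  # → [(10, 10)]
```

The naive pairwise scan is O(n^2); the original work discusses cell-based structures to scale this up.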

Deviation based outlier detection does not use statistical tests or distance-based measures to identify exceptional objects. Instead, it identifies outliers by examining the main characteristics of objects in a group; objects that "deviate" from this description are considered outliers.

Density Based approach was proposed by Breunig et al. [10]. It relies on the Local Outlier Factor (LOF) of each point, which depends on the local density of its neighbors. An effective algorithm for mining local outliers is also proposed by W. Jin et al. [12].
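As a quick illustration of the density-based idea, the LOF score can be computed with a standard library implementation; the sketch below uses scikit-learn's LocalOutlierFactor (assumed available; the data points are illustrative), where a predicted label of -1 marks a local outlier:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# Four tightly clustered points and one isolated point.
X = np.array([[0, 0], [0.1, 0], [0, 0.1], [0.1, 0.1], [5, 5]])

lof = LocalOutlierFactor(n_neighbors=3)
labels = lof.fit_predict(X)  # -1 = outlier, 1 = inlier
print(labels)
```

The isolated point at (5, 5) has a much lower local density than its neighbors and is flagged as the outlier.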

All the outlier detection methods specified above are mainly based on statistical approaches, and they do not provide effective methods for identifying outliers in web data. Major work in this field is presented by Agyemang et al. [1], who provide a detailed description of a web content outlier mining approach.

3. Proposed Architecture

A great deal of techniques exist for classifying web documents into categories. Interestingly, almost none of the existing algorithms consider documents whose contents vary from the rest of the documents taken from the same domain (category), called web content outliers.

This paper proposes a web content outlier mining algorithm using N-Gram techniques that analyzes the contents of the Meta data of web pages of a related category and identifies the pages whose content differs significantly from the other pages.

The algorithm exploits the HTML structure of web documents, using the data captured in the Meta data field; only the description part of the Meta field is used. Applying the algorithm to Meta data produces results similar to those obtained when it is applied to the Body of the web document [2].

This paper explores the advantages of using N-Grams [3][6] to determine the similarity of strings and expands this to include pages containing similar strings. The paper establishes a dissimilarity measure and uses it to determine documents having different contents from similar pages within the given category. We also use a domain dictionary containing words belonging to the category of interest.

Proposed Framework is as shown below:

The major phases of our proposed architecture are:

Extraction of Resources

Preprocessing

Web Content Outlier Detection

Analysis of Outliers

3.1 Phase 1: Extraction of Resources

Resource extraction is the process of retrieving the desired web pages belonging to the category of interest. This can be achieved using any of the existing web search engines or web crawlers.

This phase of the architecture extracts web page information, including the Meta data fields of web pages of the same category. Obtaining the desired web page information through a web crawler or web search is the main task of the resource extraction phase.

3.2 Phase 2: Preprocessing

This phase of the proposed architecture (2) transforms the data extracted in the previous phase into a structured form to be used by the outlier detection algorithm. This step applies content-based filtering to the extracted data.

The first step is removing unwanted fields of the extracted data; these may include hyperlinks, sound, pictures, etc., or fields removed by content-based filtering methods.

The second step of preprocessing is the removal of stop words. Stop words are words whose frequency exceeds some user-specified threshold; special care is taken so that important words that occur frequently are not removed. Stop-word removal is done with the aid of a publicly available list of stop words [8]. Using a public list of stop words is category independent and ensures that important words within a category that occur frequently are not removed. The disadvantage is that there are many different public lists of stop words, not all of which are the same; nevertheless, a number of the lists can be compared and an appropriate one chosen.

The next step in the preprocessing phase is the removal of duplicate words from the contents of the Meta field.

The result of the preprocessing phase is the Meta field description of each web page after stop words and duplicate words have been removed.
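The preprocessing steps above can be sketched in a few lines of Python. This is a minimal illustration only: the stop-word list here is a tiny stand-in for the public list cited in the text, and the function name is hypothetical.

```python
# A tiny stand-in for the public stop-word list [8].
STOP_WORDS = {"a", "an", "the", "and", "or", "of", "in", "to", "is"}

def preprocess(description: str) -> list[str]:
    """Tokenize a Meta description, dropping stop words and duplicates
    while preserving first-occurrence order."""
    words, seen = [], set()
    for w in description.lower().split():
        w = w.strip(".,;:!?\"'()")  # light punctuation cleanup
        if w and w not in STOP_WORDS and w not in seen:
            seen.add(w)
            words.append(w)
    return words

print(preprocess("The resume of a Java developer and the resume writer"))
# → ['resume', 'java', 'developer', 'writer']
```

The output word list is exactly what the next phase's weighting function consumes.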

3.3 Phase 3: Web Content Outlier Detection

The goal of the outlier detection phase (3) is to discover rare patterns existing in the web contents. The main inputs are the preprocessed data and a domain dictionary. The algorithm assigns weights to the words of each web page based on whether the words, or their n-grams, are present in the domain dictionary. The weight of the words on each page is computed and compared with a user-defined weight for every page in the domain. The details of the steps in this phase are given below:

Domain Dictionary

The domain dictionary contains the important words of the category of interest. It should be very efficient and chosen carefully.

Weight Assignment

In this phase the algorithm assigns weights to the words depending upon the presence of each word in the dictionary.

The main steps in this phase are:

1. Generate n-grams for each word that does not match the dictionary contents. N-grams of greater length are used because higher-order n-grams capture similarities between different but related words better than n-grams of shorter length [6].

The n-gram frequency profile is computed as follows:

Generate all possible n-grams having length greater than 4; tokens that do not reach the full n-gram length are removed.

Finally, the output of the n-gram generation step serves as input to the weighting function.
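The n-gram generation described above can be sketched as follows. This is an assumed interpretation of the text: all character n-grams of length greater than 4 are produced, and tokens shorter than the minimum n-gram length yield no n-grams at all (i.e., they are removed).

```python
def ngrams(word: str, min_len: int = 5) -> list[str]:
    """All character n-grams of `word` with length >= min_len
    (i.e., length greater than 4, per the text)."""
    grams = []
    for n in range(min_len, len(word) + 1):
        grams.extend(word[i:i + n] for i in range(len(word) - n + 1))
    return grams

print(ngrams("resume"))  # → ['resum', 'esume', 'resume']
print(ngrams("word"))    # → [] (token shorter than the n-gram length)
```

These n-grams are what the weighting function matches against the domain dictionary when a whole-word match fails.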

2. Compute Dissimilarity Measures

The goal is to compute the dissimilarity for determining the differences among pages within the same category. In this phase, weights are assigned to the words of each page based on the presence of n-grams or whole words in the dictionary. The algorithm first checks whether each whole word of the web document is present in the domain dictionary; if so, it assigns a weight to the word and moves to the next word in the document, otherwise it matches the n-grams of that word against the dictionary. The advantages of using N-Grams for weight assignment are:

N-grams are more efficient in determining similarity between different but related words in text processing.

N-grams support partial matching of strings with errors.

The next step in the web content outlier detection phase compares the weight of each page with a user-defined threshold weight. Outlying documents are those with weights less than the user-defined threshold. The threshold criterion used in our algorithm has the advantage that the threshold value varies according to the percentage of unique words in the web document and is not affected by the length of the unique word list of the web pages.

3.4 Phase 4: Analysis of Outliers

The outlier analysis phase (4) presents the resultant outlier pages in the given domain. This phase also provides visualization of the outlier pages.

By visualizing the statistics of the outlier pages, we obtain information about the whole document set and can perform outlier analysis as per the given requirements.

3.5 Proposed Algorithm


Input: Dictionary, documents Di

Output: Outlier pages

Other variables: Total weight of document WDi, threshold weight WDmin

1. Read the contents of the documents (Di) and the dictionary
2. For (int i = 0; i < NoOfDoc; i++) { // beginning of the outer loop
3.   For (int j = 0; j < NoOfWords; j++) { // beginning of the first inner loop
4.     IF (word exists in the dictionary) { // beginning of the outer IF-ELSE
5.       Increase weight of the doc.
6.     }
7.     Else {
8.       Generate n-grams for the word
9.       For (int n = 0; n < NoOfNgrams; n++) { // beginning of the second inner loop
10.        IF (n-gram exists in the dictionary) { // beginning of the inner IF-ELSE
11.          Increase weight of the doc.
12.        } Else
13.          Weight retained as it is
14.        } // end of inner IF-ELSE
15.      } // end of second inner loop
16.    } // end of outer IF-ELSE
17.  } // end of first inner loop
18.  WDi = total weight of the doc.
19.  Pages with WDi < WDmin are outliers
20. } // end of outer loop
21. End of algorithm
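The pseudocode above translates naturally into Python. The following is a minimal sketch, not the authors' implementation: the sample documents, dictionary contents, and threshold value are illustrative stand-ins, and each dictionary hit adds a unit weight (the paper does not fix the weight increment).

```python
def ngrams(word, min_len=5):
    """All character n-grams of `word` with length >= min_len."""
    return [word[i:i + n]
            for n in range(min_len, len(word) + 1)
            for i in range(len(word) - n + 1)]

def find_outliers(documents, dictionary, wd_min):
    """Return ids of pages whose total weight WDi falls below WDmin."""
    outliers = []
    for doc_id, words in documents.items():
        wd = 0
        for word in words:
            if word in dictionary:            # full-word match (step 4)
                wd += 1
            else:                             # n-gram fallback (steps 8-15)
                wd += sum(g in dictionary for g in ngrams(word))
        if wd < wd_min:                       # threshold test (step 19)
            outliers.append(doc_id)
    return outliers

docs = {
    "page1": ["resume", "career", "employer"],
    "page2": ["recipe", "cooking", "kitchen"],  # off-topic page
}
dictionary = {"resume", "career", "employer", "salary"}
print(find_outliers(docs, dictionary, wd_min=2))  # → ['page2']
```

Here page2 accumulates no weight because neither its words nor their n-grams appear in the resume-domain dictionary, so it falls below the threshold and is reported as the content outlier.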

4. Experimental Work and Result Analysis

This section presents the analysis of the experimental results of the n-gram-based algorithm for mining web content outliers.

For the experimental test, a framework developed in Java based on the proposed algorithm takes a Meta tag file of resume-related web pages, generated by Win Web Crawler. The format of the file is URL, Base, Domain, Title, Description, Keyword, Body Text, Last Modified, and Content Length. This file is then used to analyze the content of the pages of the given category and to find outliers among them.

The experimental file contains the information for 100 resume-related web pages (mainly pages from resume-related service providers).

The contents enclosed in the Meta tag fields were retrieved and preprocessed. Weights were then assigned based on the presence of words in the domain dictionary, and the weight of each web page was compared with the threshold weight.

The algorithm generates a resultant outlier page information file identifying seven outlier pages with less relevant contents. The resultant file looks as follows:

Figure 2 Resultant Outlier Pages

The result shows the number of outlier pages as a fraction of the whole document set.

The results generated by our algorithm indicate that the proposed algorithm is capable of identifying web content outliers efficiently, and that the N-Gram technique performs word matching very effectively.

The results generated by the algorithm also compare its performance with and without the N-Gram technique.

The resultant comparison graphs are shown below:

Figure 3 Comparison in terms of weights

Figure 4 Comparison in terms of number of outlier pages

5. Conclusion

This paper provides a realistic meaning of web outliers. In addition, it provides an algorithm for mining web content outliers. Our web content outlier mining algorithm retrieves outlier pages very efficiently. The N-Gram technique used improves the accuracy of the algorithm while keeping the size of the domain dictionary under control. The domain dictionary and threshold value should be selected carefully to yield efficient algorithmic results.

Areas of future research include the experimental evaluation of full-word-match algorithms and n-gram-based algorithms in terms of response time. Future scope also includes different weighting methods by which the efficiency and performance of the algorithm can be improved. Future work may also look into the development of the algorithm without using a domain dictionary.

Finally, benchmark data needs to be established for evaluating the performance of web outlier mining algorithms.