# Fuzzy Clustering In Web Text Mining English Language Essay

Published:

This essay has been submitted by a student. This is not an example of the work written by our professional essay writers.

This paper presents a overview of methods for fuzzy clustering and states desired properties for an optimal fuzzy document clustering algorithm. Clustering is useful technique in the field of textual data mining. Cluster analysis divides objects into meaningful groups based on similarity between objects. Copious material is available from the World Wide Web (WWW) in response to any user-provided query. It becomes tedious for the user to manually extract real required information from this material. Large document collections, such as those delivered by Internet search engines, are difficult and time-consuming for users to read and analyze. The detection of common and distinctive topics within a document set, together with the generation of multi-document summaries, can greatly ease the burden of information management. This paper focus on this problem of mining the useful information from the collected web documents using fuzzy clustering of the text collected from the downloaded web documents.

Keywords -- hard c -means, fuzzy c -means, possibilistic c-means , hyper-spherical fuzzy c-means, text mining

INTRODUCTION

Clustering is an unsupervised classification of objects (data instances) into different groups. In particular we are talking about the partitioning of a dataset into subsets (clusters), so that the data in each subset (ideally) share some common property. This property is usually defined as proximity according to some predefined distance measure. The goal is to divide the dataset in such a way that objects belonging to the same cluster are as similar as possible, whereas objects belonging to different clusters are as dissimilar as possible. The computational task of classifying the data set into k clusters is often referred to as k-clustering. Although estimating the actual number of clusters (k) is an important issue we leave it untouched in this work. Fuzzy clustering [2, 3] in contrast to the usual (crisp) methods does not provide hard clusters, but returns a degree of membership of each object to all the clusters. The interpretation of these degrees is then left to the user that can apply some kind of a thresholding to generate hard clusters or use these soft degrees directly. With more than two billion pages created by millions of Web page authors and organizations, the World Wide Web is a tremendously rich knowledge base. The knowledge comes not only from the content of the pages themselves, but also from the unique characteristics of the Web, such as its hyperlink structure and its diversity of content and languages. A considerably large portion of information present on the World Wide Web (WWW) today is in the form of unstructured or semi-structured text data bases. The WWW instantaneously delivers huge number of these documents in response to a user query. However, due to lack of structure, the users are at a loss to manage the information contained in these documents efficiently. The WWW continues to grow at an amazing rate as an information gateway and as a medium for conducting business. Web mining is the extraction of interesting and useful knowledge and implicit information from atrifacts or activity related to the WWW.

In this context, the importance of data/text mining and knowledge discovery is increasing in different areas like: telecommunication, credit card services, sales and marketing etc [1]. Text mining is used to gather meaningful information from text and includes tasks like Text Categorization, Text Clustering, Text Analysis and Document Summarization. Text Mining examines unstructured textual information in an attempt to discover structure and implicit meanings within the text.

One main problem in this area of research is regarding organization of document data. This can be achieved by developing nomenclature or topics to identify different documents. However, assigning topics to documents in a large collection manually can prove to be an arduous task. documents into the related topics. Clustering is the proven technique for document grouping and categorization based on the similarity between these documents [4]. Documents within one cluster have high similarity with each another, but low similarity with documents in other clusters.

## .

FUZZY CLUSTERING ALGORITHMS

In this section, we present some of the fuzzy clustering Algorithms mainly based on the descriptions in [5]. We devote the majority of space to the hard c-means, fuzzy c-means ,H-FCM and possibilistic c-means. All algorithms described here are based on objective functions, which are mathematical criteria that quantify the quality of cluster models. The goal of each clustering algorithm is the minimization of its objective function. The following syntax will be used in the equations, algorithms and their explanations:

… .objective function

X={X1 ,….,Xn } .. …dataset of all objects (data instances)

C = {C1,….,CC} ……set of cluster prototypes (centroid

vectors)

dij = ||X1-C1||…..distance between object X1 and centre C1

µij ……….weight of assignment of object xj to cluster i.

µj= (µij ,… µcj) T………membership vector of object xj

U=( µij )=( µ1,… ,µn)…partition matrix of size c x n

2.1 HARD C -MEANS (HCM)

Hard c-means is better known as k-means and in general this is not a fuzzy algorithm. However, its overall structure is the basis for all the others methods. Therefore we call it hard c-means in order to emphasize that it serves as a starting point for the fuzzy extensions.

The objective function of HCM can be written as follows:

(2.1)

As mentioned HCM is a crisp algorithm, therefore:

µij ïƒŽ{0,1} also required that each object belongs to exactly one cluster

Before outlining the algorithm, we m ut s know how to Calculate new membership weights

And based on the weights, how to derive new cluster centres.

(2.3)

The algorithm can now be stated very simply as shown below.

INPUT:

A set of learning objects to be clustered and the number of desired clusters c

OUTPUT:

Partition of learning examples into c clusters and membership values for each example Xi and cluster i.

ALGORITHM (2.1) The hard c-means algorithm:

(randomly) generate clusters centres

repeat

for each object recalculate membership weights

using equation (2.2)

recomputed the new centres using equation (2.3)

until

no change in C can be observed .

The HCM algorithm has a tendency to get stuck in a local minimum, which makes it necessary to conduct several runs of the algorithm with different initializations. Then the best result out of many clusterings can be chosen based on the objective function value.

2.2 Fuzzy C -MEANS (FCM)

In hard clustering, data is divided into distinct clusters, where each data element belongs to exactly one cluster. In fuzzy clustering, data elements can belong to more than one cluster, and associated with each element is a set of membership levels. These indicate the strength of the association between that data element and a particular cluster. Topics that characterize a given knowledge domain are somehow associated with each other. Those topics may also be related to topics of other domains. Hence, documents may contain information that is relevant to different domains to some degree. With fuzzy clustering methods documents are attributed to several clusters simultaneously and thus, useful relationships between domains may be uncovered, which would otherwise be neglected by hard clustering methods.

Probabilistic fuzzy cluster analysis [2,3] relaxes the requirement µij ïƒŽ{0,1}, which now becomes µij ïƒŽ [0,1].

However still holds. FCM optimizes the following objective function:

(2.4)

Parameter m, m>1, is called the fuzzyfier or the weighting

exponent. The actual value of m determines the 'fuzziness' of the classification. It has been shown [4] that for the case m=1, becomes identical to and thus FCM becomes identical to hard c-means.

The transformation from the hard c-means to the FCM is very straightforward; we must just change the equation for calculating memberships (2.2) with:

(2.5)

and function for recomputing clusters (2.3) with

(2.6)

Equation (2.5) clearly shows the relative character of the probabilistic membership degree. It depends not only on the distance of the object Xj to the cluster Ci , but also on the distances between this object and other clusters. Although the algorithm stays the same as in HCM , we get probabilistic outputs if we apply above changes. The (probabilistic) fuzzy c-means algorithm is known as a stable and robust classification method. Compared with the hard c-means it is quite insensitive to its initialization and it is not likely to get stuck in an undesired local minimum of its objective function in practice. Due to its simplicity and low computational demands, the probabilistic FCM is a widely used initializer for other more sophisticated clustering methods.

2.3 Possibilistic c-means (PCM)

Although often desirable, the relative property of the probabilistic membership degrees can be misleading. High values for the membership of object in more than one cluster can lead to the impression that the object is typical for the clusters, but this is not always the case. Consider, for example, the simple case of two clusters shown in figure 2.1. Object X1 has the same distance to both clusters and thus it is assigned a membership degree of about 0.5. This is plausible. However, the same degrees of membership are assigned to object X2 even though this object is further away from both clusters and should be considered less typical. Because of the normalization the sum of the memberships has to be 1. Consequently X2 receives fairly high membership degrees to both clusters. For a correct interpretation of these memberships one has to keep in mind that they are rather degrees of sharing than of typicality, since the constant weight of 1, given to an object, must be distributed over the clusters.

Figure 2.1:Example of misleading interpretation of the

FCM membership degree

Therefore PCM, besides relaxing the condition for to µij ïƒŽ{0,1} as in case of FCM, also drops the normalization requirement: The probabilistic objective function that just minimizes squared distances would be inappropriate because with dropping of the normalization constraint a trivial solution exists for µij = 0, for all iïƒŽ { 1,…c}, and j ïƒŽ{ 1,…n} , , i.e., all clusters are empty. In order to avoid this solution, penalty a term is introduced that forces the memberships away from zero.

Objective function is modified to:

(2.7)

Where ï¨i> 0 for all i ïƒŽ {1,…., c}.

In the PCM algorithm, the equation for calculating cluster centres stays the same as in FCM (2.6). But the equation for

recalculating membership degree changes from (2.5) to:

(2.8)

This also slightly changes the original procedure since we must recompute ï¨I using the equation (2.9) before calculating the weight .

(2.9)

Properties of PCM [5] are the following:

Cluster Coincidence: since PCM is not forced to partition

data exhaustively it can lead to solutions where two or more clusters occupy the same space (same objects with the same membership weighting).

Cluster Repulsion: objective function is, in general, fully optimized only if all clustered centres are identical. Because of that, other, not optimal solutions are found just as a side effect of getting stuck in a local optimum.

2.4 Other reviewed algorithm

During the review of fuzzy clustering algorithms we considered also the following algorithms. We will not precisely describe them in this paper, since we decided that they are not the best choice .An interesting reader can find their descriptions in [6] or [5].

• Gustafson-Kessel Algorithm: while FCM and PCM can only detect spherical clusters GKA can identify also clusters of different forms and sizes. It is more sensitive to initialization and has higher computational costs.

• Fuzzy Shell Clustering: can, in contrast to all the algorithms above, identify also non-convex shaped clusters. They are especially useful in the area of image recognition. We think that this property in not needed in text clustering.

• Kernel-based Fuzzy Clustering: are variants of fuzzy clustering algorithms that modify the distance function to handle non-vectorial data, such as sequences, trees or graphs, without the requirement to completely modify the algorithm itself. In text clustering we are dealing with vectors so there is no need for such an advanced method.

2.5 HYPER-SPHERICAL FUZZY C-MEANS (H-FCM)

Recently the Fuzzy c-Means (FCM) algorithm is modified for clustering text documents based on the cosine similarity coefficient rather than on the Euclidean distance. The modified algorithm works with normalized k-dimensional data vectors that lie in hyper-sphere of unit radius and hence has been named Hyper-spherical Fuzzy c-Means (H-FCM). The H-FCM algorithm for document clustering has shown that it outperforms the original FCM algorithm as well as the hard k-Means algorithm.

The objective function the H-FCM minimizes is similar to the FCM one, the difference being the replacement of the squared norm by a dissimilarity function Diα:

(2.10)

The cosine coefficient ranges in the unit interval and when data vectors are normalized to unit length it is equivalent to the inner product. The dissimilarity function Diα in equation (1) consists of a simple transformation of the cosine similarity coefficient, i.e. Diα = 1- Siα.

(2.11)

The update expression for the membership of data element xi in cluster α, denoted as uαi and shown in equation (2.11), is also similar to the original FCM expression since the calculation of Diα does not depend explicitly on uαi. However, a new update expression for the cluster centroid vα, shown in equation (2.12), had to be developed. Like the original algorithm, H-FCM runs iteratively until a local minimum of the objective function is found or the maximum number of iterations is reached.

2.5.1 Finding the optimum number of clusters

The H-FCM algorithm requires the selection of the number of clusters c. However, in most clustering applications the optimum c is not known a priori. A typical approach to find the best c is to run the clustering algorithm for a range of c values and then apply validity measures to determine which c leads to the best partition of the data set. The validity of individual clusters is usually evaluated based on their compactness and density. In low-dimensional spaces it is acceptable to assume that valid clusters are compact, dense and well separated from each other. However, text documents are typically represented as high-dimensional sparse vectors. In such problem space, the similarity between documents and cluster centroids is generally low and hence, compact clusters are not expected. Therefore, the approach mentioned above for finding the optimum c is inappropriate. A question that arises is how the H-FCM algorithm is able to discover meaningful document clusters considering such low similarity patterns. As observed for the hard k-Means algorithm, the good performance of the H-FCM is justified by the fact that documents within a given cluster are always more similar to the corresponding centroid than documents outside that cluster, regardless of the number of clusters that has been selected. It is believe that in the high-dimensional document space the issue of finding the optimum number of clusters is not so relevant. The choice of c should rather address the desired granularity level, since the higher the number of clusters the more specific will be the topics covered by the documents in those clusters.

3 CONCLUSION AND FUTURE WORK

This paper presents an overview of fuzzy clustering algorithms that could be potentially suitable for document

clustering, we have surveyed HARD C -MEANS (HCM), Fuzzy C -MEANS (FCM), Possibilistic c-means (PCM),and HYPER-SPHERICAL FUZZY C-MEANS (H-FCM) clustering algorithms. The HCM algorithm has a tendency to get stuck in a local minimum, which makes it necessary to conduct several runs of the algorithm with different initializations. In hard clustering, data is divided into distinct clusters, where each data element belongs to exactly one cluster. In fuzzy clustering, data elements can belong to more than one cluster, and associated with each element is a set of membership levels. These indicate the strength of the association between that data element and a particular cluster. PCM is not forced to partition data exhaustively it can lead to solutions where two or more clusters occupy the same space ie.same objects with the same membership weighting. The H-FCM generates clusters with a higher level of granularity and that the resulting clusters hierarchy successfully links clusters of the same topic.

There are many areas in text mining; where one may carry the work to enhance various areas. Out of these, the labeling of the clusters is a very daunting challenge of this time. No remarkable effort has been made in this regard to get good result. That is why automatic labeling of the clusters is not so much accurate. A keen and concerted work has been done to remove this hurdle. It will certainly serve as a lime length for future researchers.