In data mining, clustering is an important technique for distinguishing groups, classes or clusters within a set of objects. Data objects that are similar to each other are grouped together in the same cluster, while objects that differ from one another are placed in different groups or clusters. Faber has defined clustering as "the method to divide a given set of data points into distinct and non-overlapping groups or clusters, in a way that the data points placed in the same cluster are more similar to each other than the data points placed in other clusters. In clustering, 'more similar' points are data points that are closer or more identical to one another by certain measures of proximity" (Faber, 1994).
According to Frahling et al., "A computational method of dividing a given input, or a group of objects, into subsets of the same or identical characteristics is referred to as clustering. The subsets obtained through this division process are called clusters, and each cluster comprises similar objects, while objects that are dissimilar are placed in other clusters. Clusters obtained from datasets in this way can be used as a coarse representation of the data. Using clustering, simplification of huge data is achieved quite easily, but on the other hand the accuracy of the original datasets might be affected" (Frahling and Sohler, 2006).
Clustering techniques are applied in various application domains of computer science, for example computational biology, image segmentation, machine learning, data mining and pattern recognition (Frahling and Sohler, 2006). Clustering is also applicable in software architecture recovery, pattern recognition, economic science (e.g. market research), spatial data analysis, image processing and document classification (Jain et al., 1999).
This chapter describes the application of clustering to problems in program comprehension. Software comprehension and modularization techniques are used to split system components or modules into "clusters" (subsystems) so as to minimize inter-connectivity (connections between the components of two different clusters) and maximize intra-connectivity (connections between the components of the same cluster) (Davey and Burd, 2000; Anquetil and Lethbridge, 1999; Mancoridis et al, 1999). This chapter describes two major types of clustering, hierarchical clustering and partitional clustering. The types of partitional clustering are also discussed, along with a detailed description of clustering applications in different areas of computer science. Since we have used the k-means algorithm in our experiments, we also discuss the applications of the k-means algorithm in different fields, as well as its significance and role.
This chapter is divided into the following sections:
Section 3.1 illustrates different techniques of clustering.
Section 3.2 presents the motivation for using the k-means algorithm.
Section 3.3 describes a simple k-means clustering algorithm.
Section 3.4 presents an overview of the applications of the k-means algorithm in different domains.
3.1 Clustering Techniques
According to Frahling and Sohler "There are various clustering algorithms proposed by the researchers over the years, still there is no general clustering algorithm so far" (Frahling and Sohler, 2006).
Clustering algorithms can be classified into hierarchical and partitional. In hierarchical clustering, a tree of objects is created, where the branches of the tree structure are merged on the basis of higher similarity among the components. Partitional clustering creates a single-level partitioning of the set of objects such that each object belongs to exactly one group (Law, 2006).
The taxonomy of clustering is shown in Figure 3.1 (Jain et al, 1999).
Figure 3. 1: Taxonomy of Clustering
Hierarchical clustering algorithms are classified into agglomerative and divisive methods. The agglomerative method starts with each object in its own cluster, so that initially every cluster includes exactly one object; a series of merge operations is then performed on these clusters to put similar objects into the same group. Divisive clustering works the opposite way, starting from a single cluster containing all objects and repeatedly splitting it (Rui Xu et al, 2005).
Single linkage and complete linkage are the most popular agglomerative clustering methods in use so far. In the single linkage method, the distance between two clusters is measured as the distance between the two closest objects across the clusters; this method is also called the nearest neighbor method. The complete linkage method works the opposite way: it measures the inter-cluster distance as the distance between the two farthest objects across the clusters (Xu and Wunsch, 2005).
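As an illustration of the two criteria (a minimal sketch, not taken from the cited papers; the point sets are invented for the example), single and complete linkage can be computed as follows:

```python
import math

def single_linkage(cluster_a, cluster_b):
    # Nearest neighbor criterion: distance between the two
    # closest objects across the two clusters.
    return min(math.dist(a, b) for a in cluster_a for b in cluster_b)

def complete_linkage(cluster_a, cluster_b):
    # Farthest pair criterion: distance between the two
    # farthest objects across the two clusters.
    return max(math.dist(a, b) for a in cluster_a for b in cluster_b)

a = [(0.0, 0.0), (1.0, 0.0)]
b = [(3.0, 0.0), (5.0, 0.0)]
print(single_linkage(a, b))    # 2.0, between (1, 0) and (3, 0)
print(complete_linkage(a, b))  # 5.0, between (0, 0) and (5, 0)
```

These same inter-cluster distances drive the merge decisions of an agglomerative algorithm: at each step, the two clusters with the smallest linkage distance are merged.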
Partitional clustering is further divided into four types: square error, graph theoretic, mixture resolving and mode seeking. The most commonly used partitional technique is the square error method, which works well for compact, isolated clusters (Abonyi and Feil, 2007). The algorithm most commonly applied in the square error method is the k-means algorithm (Jain et al., 1999). Here we present a description of the roles and implementations of the k-means clustering algorithm, since this is the algorithm we have used in our experiments.
3.2 Motivation for Using k-means Algorithm
The k-means algorithm is mostly preferred by researchers because of the following limitations of hierarchical clustering.
The time complexity of hierarchical clustering is O(n²), where n is the total number of objects (Wang et al, 2009). Because of this time complexity, hierarchical clustering is difficult to scale to large datasets.
In certain cases, even when the number of desired clusters is known, we cannot obtain the required number of clusters directly using hierarchical clustering.
In hierarchical clustering, we cannot obtain distinct clusters without specifying cutoffs. The selection of cutoffs (merge/split points) requires considerable technical expertise, since a selection, once made, cannot be reversed.
Hierarchical clustering yields a solution in the form of a hierarchy, which is not always required, especially when each instance belongs to only one cluster.
In their paper published in 2007, Arai and Barakbah refer to the k-means algorithm as the most popular method for clustering, citing the work of Ralambondrainy published in 1995. According to them, "The k-means algorithm (developed by MacQueen in 1967) is the most commonly used method in clustering because of its simplicity. Since the k-means algorithm is capable of catering for huge amounts of data in an efficient manner, most researchers prefer the k-means algorithm for clustering. For numerical or conceptual clustering, the k-means algorithm provides many possibilities in terms of distance as well as prototype choice" (Ralambondrainy, 1995).
Law has referred to the k-means algorithm as the best clustering algorithm (Law, 2006). According to Kanungo et al., senior members of the IEEE, "the k-means algorithm is the most commonly used method among all the clustering techniques formulated on the basis of minimization of objective functions" (Kanungo et al, 2002).
Jain and Dubes, in 1988, referred to the k-means algorithm as the most popular and best-known algorithm. The computational complexity of the algorithm increases only linearly with the number of data objects, so the algorithm works efficiently for clustering huge data sets (Bottou and Bengio, 1995; Pham et al, 2007).
According to Davidson and Ravi, the k-means algorithm is the most widely used clustering technique in the domain of data mining because of its simplicity (Davidson and Ravi, 2005).
In a comparative case study performed in 2004, Costa et al. conducted a data-driven comparison of five clustering techniques taken from the literature of gene expression analysis: agglomerative hierarchical clustering, CLICK, dynamical clustering, k-means and self-organizing maps. The highest accuracies in their experiments were obtained by k-means, dynamical clustering and SOM, so the k-means algorithm was among the methods showing the highest accuracy in these experiments.
To determine the efficiency of clustering techniques, Hobbs et al. performed a comparison of the two most popular clustering methods, hierarchical clustering and the k-means algorithm, experimenting on gene expression with respect to variability within clusters. The outcome of their comparative study showed that k-means clustering performs better in terms of within-cluster gene variability, and that as cluster size increases, k-means is able to reorganize genes into clusters that are less variable with respect to gene expression (Law, 2006).
Wei has referred to k-means clustering as the most insightful, powerful, efficient and famous algorithm among all the clustering algorithms developed so far (Wei, 2005).
Although the k-means algorithm is widely used in different fields because of its vast applicability, it has not been studied or experimented with much in the domain of program comprehension.
3.3 k-means Clustering Algorithm
These are the steps to apply the k-means clustering algorithm:
1. Select k entities from the set of entities to be clustered, and name them the k centroids.
2. Assign each of the remaining entities to its closest centroid on the basis of Euclidean distance. This assignment gives the initial clustering.
3. Until convergence, do:
3.1 Compute the centroids of the clusters once again. The values obtained for the new centroids may not be the same as those of the original entity set.
3.2 Reassign all entities to the closest centroid and update the clusters.
Although the precise definition of convergence depends on the implementation of the experiment, here convergence means that the clusters remain unchanged, or that any changes that do occur make no considerable difference to the clusters' definition.
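The steps above can be sketched as a minimal Python implementation (a sketch only: the data points, the random choice of initial centroids and the empty-cluster handling are illustrative assumptions, not part of the description above):

```python
import math
import random

def kmeans(points, k, max_iter=100, seed=0):
    # Step 1: select k entities from the set and name them the centroids.
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    assignment = None
    for _ in range(max_iter):
        # Step 2: assign each entity to its closest centroid
        # on the basis of Euclidean distance.
        new_assignment = [min(range(k), key=lambda j: math.dist(p, centroids[j]))
                          for p in points]
        # Convergence: the clusters remain unchanged.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Step 3.1: recompute each centroid as the mean of its cluster;
        # the new centroids need not coincide with original entities.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # keep the old centroid if a cluster empties
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, assignment

pts = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
cents, labels = kmeans(pts, 2)
```

On this toy data the algorithm separates the two obvious groups, with the final centroids at the cluster means (0, 0.5) and (10, 10.5).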
3.3.1 Time Complexity of k-means Clustering Algorithm
If t_distance is the time to calculate the distance between two entities, or between an entity and the centroid of a cluster, then the time complexity of each iteration is O(n × k × t_distance), where:
n = number of objects
k = number of clusters
since in each iteration every one of the n objects must be compared against each of the k centroids.
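To ground this bound, the following sketch counts the distance computations performed in a single assignment step (the point set and centroids are invented for illustration):

```python
# Count distance computations in one k-means assignment step.
calls = 0

def dist(p, q):
    # Squared Euclidean distance; each invocation is counted.
    global calls
    calls += 1
    return sum((a - b) ** 2 for a, b in zip(p, q))

points = [(float(i), 0.0) for i in range(12)]       # n = 12 objects
centroids = [(0.0, 0.0), (5.0, 0.0), (11.0, 0.0)]   # k = 3 centroids

# Assignment step: each object is compared against every centroid.
labels = [min(range(len(centroids)), key=lambda j: dist(p, centroids[j]))
          for p in points]

print(calls)  # 36 distance computations, i.e. n * k = 12 * 3
```

Each full iteration performs this assignment step plus a centroid update, so the n × k distance computations dominate the per-iteration cost.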
3.4 Application of k-means Algorithm in Different Areas
Kanungo et al. presented their application of Lloyd's k-means algorithm in 2000, naming it the filtering algorithm. They not only analyzed this algorithm theoretically but also evaluated it practically through experiments, conducting a data-sensitive analysis on real data sets as well as on synthetically generated data. Their implementation of the k-means algorithm showed its efficiency on both grounds (Kanungo et al, 2000).
Davidson has proposed nine points for analyzing the implementation of the k-means algorithm for clustering, and has given an in-depth view of how these behaviors of the k-means algorithm affect data representation, the formulation of the problem and the interpretation of the resultant clusters (Davidson and Ravi, 2002).
An implementation of the k-means clustering method is presented by Vaidya and Clifton in their paper published in 2003. They demonstrate how different sites, each possessing different attributes for a common set of entities, can maintain non-disclosure of data: each site learns the cluster of each entity while learning nothing about the attributes held at the other sites. In this way their method provides the capability of non-disclosure of data (Vaidya and Clifton, 2003).
A parallel implementation of the k-means algorithm is presented by Kantabutra, offering a parallel solution for clusters of relatively cheap workstations and low-priced hard disks. He showed that a much faster speed can be obtained with this parallel algorithm, and k-means proved to be the most efficient algorithm in this implementation (Kantabutra et al., 2003).
Ordonez proposed a k-means implementation in a relational DBMS, presenting two implementations of the k-means algorithm in SQL. According to him, k-means is capable of clustering huge data sets with linear scalability. The implementations presented by Ordonez perform clustering of huge datasets inside a relational database management system without any requirement to access or export the data outside the database system (Ordonez, 2004).
To cluster large sparse text data, Jing et al. introduced a new subspace clustering algorithm that automatically calculates the feature weights in k-means clustering. With the help of these feature weights, clusters are discovered from subspaces of the vector space of the text, and the feature weights also identify the items that represent the semantics of the clusters. Their experiments showed that the algorithm converges towards an optimal solution and that it scales to various numbers of documents, items and clusters (Jing et al, 2005).
Another conceptual implementation of the k-means algorithm was presented by Martínez et al., based on the idea of using similarity functions, which are able to deal with objects described by mixed features. Their proposed algorithm, which combines the k-means algorithm with similarity functions, is named KMSF. In KMSF, a similarity function defines comparison functions for the features and the method for evaluating the values of these feature comparisons, in order to accomplish the problem-solving task; no transformation of the features is required (Martínez et al., 2005).
Frahling and Sohler presented a combination of Lloyd-steps and random swaps in order to implement the k-means algorithm more efficiently. In their paper, they propose the use of coresets to make the algorithm run quickly, where a coreset is a small set of weighted points that approximates the actual set of points with respect to the particular problem under discussion. The most effective and powerful part of their algorithm is that it speeds up the clustering of a single point set for different values of k. The idea is that, since the proper value of k is unknown in advance, clustering carried out for various values of k makes it easy to select an appropriate value; clustering done independently of k thus leads to reliable measures (Frahling and Sohler, 2006).
In order to apply clustering to categorical data, Lei et al. proposed a k-means-like algorithm in 2006. They introduced the idea of "cluster centers" and presented ways to apply these "cluster centers" to categorical datasets of objects (Lei et al., 2006).
Lingras proposed some modifications to clustering techniques in 2007, suggesting them for genetic algorithms, the k-means algorithm and Kohonen Self-Organizing Maps (SOM). He presented the idea of treating clusters as rough sets, where a rough set is a set of objects with no information about its boundary. Lingras demonstrated the practical applicability of his proposed approaches for grouping web users, highway sections and supermarket customers, and presented the useful impact of rough clustering to support his proposed concepts (Lingras, 2007).
In order to reduce the heavy computations and increase the learning capabilities of multi-layered neural networks, Faraoun and Boukelif presented a technique using the k-means algorithm. They adopted the methodology of automatically selecting an optimal set of samples by applying the k-means algorithm to the required datasets, and used this technique to minimize the number of samples used in the neural network. When they applied the technique to the KDD99 dataset, their experiments showed higher accuracy rates and less computation time compared with the existing learning scheme that used the full dataset (Faraoun & Boukelif, 2007).
Arai and Barakbah presented a hierarchical approach to the k-means algorithm, naming it the hierarchical k-means algorithm. In their paper, they proposed a concept for optimizing the initial centroids used in the k-means algorithm. Since it combines both concepts, the hierarchical k-means algorithm provides the speed of the k-means algorithm as well as the accuracy of hierarchical clustering (Arai and Barakbah, 2007).
Brunzel conducted a comparative performance study of flat k-means clustering versus hierarchical Bi-Secting-k-means clustering. He showed that clustering performed with the hierarchical Bi-Secting-k-means method provides a hierarchy, but does not work as efficiently as the k-means algorithm. The advantage of obtaining a hierarchy may nevertheless be considerable for automatic ontology learning (Brunzel, 2007).
Ubeyli and Dogdu demonstrated a k-means implementation for an automated disease detection system, presenting the concept of optimum classification for this purpose. The accuracy rate of this classification was 94.22% when using the k-means algorithm for clustering (Ubeyli & Dogdu, 2008).
For the comparison of various classification schemes, Artur and Moens proposed a methodology in their paper published in 2008. They proposed comparing the clusters obtained from the k-means algorithm with the classification schemes, and showed experimentally that the quality difference between two classification schemes can be determined using their proposed method (Artur et al, 2008).
To apply the k-means clustering technique to LIDAR data, Nesrine et al. presented a hierarchical way to filter the data, where LIDAR (Light Detection and Ranging) is an optical remote sensing technology that measures the properties of scattered light to find the range of, or other information about, a distant target. They found k-means clustering to be a better methodology than the other filtering methods used for such experiments (Nesrine et al, 2008).
In 2006, Kanellopoulos and Dimopulos described comprehension methods for object-oriented systems, aimed at gathering information from source code. They proposed a model, along with associated processes, to extract system elements, where the elements are entities (classes, data members, methods, etc.), attributes (such as class name, parent class and method name) and metrics. Since their purpose was the extraction of information from the source code, they extracted the data and produced a system overview on the basis of the information gathered; they did not, however, draw any conclusion about the performance of k-means clustering (Kanellopoulos and Dimopulos, 2006).
We have discussed clustering in this chapter, along with its applications in various fields and its different types. The main focus of this chapter was the k-means clustering algorithm, since we have used this algorithm in our work, which is why we have highlighted it in particular. We have also discussed the motivations and reasons for using the k-means algorithm in our work, and presented a detailed overview of the applications of this algorithm in various domains.