Clustering is a data mining technique used to distinguish groups within a given set of objects. Clustering is performed on the basis of similarity or dissimilarity among the objects: data objects that are similar or identical to one another are placed in the same cluster, while data objects that differ from one another are placed in different clusters. Faber has described clustering as a technique to divide a set of data points or data items into non-overlapping groups or clusters. According to Faber, data points are placed in the same cluster if they are "more similar" to each other, where "more similar" refers to the closeness of the data points (Faber, 1994).
Frahling and Sohler, in their paper, refer to clustering as the computational task of dividing a given input, i.e. a group of objects, into subsets possessing similar traits and properties. They name these subsets "clusters": similar objects must be placed within the same cluster, while dissimilar objects are placed in other clusters. Under this approach, clusters serve as a "coarse representation of data" (Frahling and Sohler, 2006).
Hence, the major task of clustering is the simplification of huge datasets. However, clustering may cause some loss of the original information in the data.
The clustering technique is applied in various fields of computer science, for example computational biology, image segmentation, machine learning, data mining and pattern recognition. Clustering is also useful in software architecture recovery, economics, data analysis, image processing and document classification (Jain et al., 1999).
The main objective of program comprehension and modularization activities is to divide system components into clusters or subsystems. When these subsystems are reorganized, this approach reduces the inter-connectivity between components of different subsystems and increases the intra-connectivity between components of the same cluster (Davey and Burd, 2000; Anquetil and Lethbridge, 1999; Mancoridis et al., 1999).
This chapter aims to discuss the following:
A description of clustering and an overview of its practical applications in program comprehension.
Applications of clustering in different domains.
An overview of the two main types of clustering.
The role of the k-means algorithm in various domains.
We describe the two main types of clustering, that is, hierarchical clustering and partitional clustering, along with the main types of partitional clustering. We also discuss the application of the k-means algorithm in our experiment and present the motives for using the k-means algorithm.
This chapter is organized as follows:
Section 3.1 illustrates different techniques of clustering.
Section 3.2 presents the motivation for using the k-means algorithm.
Section 3.3 describes a simple k-means clustering algorithm.
Section 3.4 presents an overview of applications of the k-means algorithm in different domains.
3.1 Clustering Techniques
According to Frahling and Sohler, despite the development of so many different clustering algorithms, no standardized, general clustering algorithm has yet emerged (Frahling and Sohler, 2006).
Clustering algorithms are classified into hierarchical and partitional approaches. Hierarchical clustering develops a hierarchy of objects in the form of a tree-like structure, whose branches are merged on the basis of higher similarity among the components. Partitional clustering, in contrast, partitions the set of objects on a horizontal level such that each object belongs to exactly one group (Law, 2006).
The taxonomy of clustering is shown in Figure 3.1 (Jain et al., 1999).
Figure 3.1: Taxonomy of Clustering
Hierarchical clustering algorithms are classified as agglomerative or divisive:
In the agglomerative method, clustering begins with each cluster containing exactly one object. A series of merge operations is then performed on these clusters to put similar objects into the same group. Agglomerative clustering algorithms devised so far use the single linkage and complete linkage methods. In the single linkage method, the distance between the two closest objects across different clusters is measured; this method is also called the nearest neighbor method. The complete linkage method works in the opposite way: the distance between two clusters is measured as the farthest distance between two objects across them (Xu and Wunsch, 2005). Divisive clustering works in the opposite direction to agglomerative clustering (Xu and Wunsch, 2005).
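The difference between the two linkage methods can be sketched in a few lines. This is an illustrative example on hypothetical 2-D points, not any particular published implementation:

```python
from math import dist  # Euclidean distance (Python 3.8+)

# Two hypothetical clusters of 2-D points, for illustration only.
cluster_a = [(0.0, 0.0), (1.0, 0.0)]
cluster_b = [(4.0, 0.0), (6.0, 0.0)]

# All pairwise distances between objects in different clusters.
pairs = [dist(p, q) for p in cluster_a for q in cluster_b]

# Single linkage: distance between the two closest cross-cluster objects.
single = min(pairs)    # 3.0, between (1, 0) and (4, 0)

# Complete linkage: distance between the two farthest cross-cluster objects.
complete = max(pairs)  # 6.0, between (0, 0) and (6, 0)
```

An agglomerative algorithm would repeatedly merge the pair of clusters with the smallest such inter-cluster distance.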
Partitional clustering is further divided into the following four types:
The most frequently used technique in partitional clustering is the square error technique, which works well on compact clusters (Abonyi and Feil, 2007). The algorithm most commonly applied with the square error criterion is the k-means algorithm, which Jain et al. have suggested is the simplest one to apply (Jain et al., 1999).
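The square error criterion mentioned above can be written down concisely: it is the sum, over all clusters, of the squared distances from each object to its cluster centroid. A minimal sketch, using hypothetical points and centroids chosen only for illustration:

```python
from math import dist

def square_error(clusters, centroids):
    """Square error criterion: sum over all clusters of the squared
    Euclidean distance from each object to its cluster's centroid."""
    return sum(
        dist(obj, centroids[i]) ** 2
        for i, cluster in enumerate(clusters)
        for obj in cluster
    )

# Hypothetical example: two clusters and their centroids.
clusters = [[(0.0, 0.0), (0.0, 2.0)], [(5.0, 5.0)]]
centroids = [(0.0, 1.0), (5.0, 5.0)]
# First cluster's points are each 1 away from (0, 1); the second
# cluster's single point coincides with its centroid: 1 + 1 + 0 = 2.
```

k-means can be viewed as a heuristic that iteratively reduces exactly this quantity.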
As we have used the k-means algorithm in our experiment, we give here a description of the role of the k-means clustering algorithm.
3.2 Motivation for Using k-means Algorithm
Researchers mostly prefer the k-means algorithm over hierarchical clustering for the following reasons:
Time complexity is an issue in hierarchical clustering: it does not scale well, and its time complexity is at least O(n²), where n is the total number of instances (Wang et al., 2009).
Hierarchical clustering does not make it easy to derive a desired number of clusters, which is required in some domains where the number of clusters is known.
Hierarchical clustering does not give distinct clusters unless we specifically define them with cutoffs. The selection of cutoffs (merge/split points) requires a lot of experience, as once a merge or split is decided, it cannot be reversed.
Hierarchical clustering always produces a hierarchical solution, which is not required in all cases, especially where each item belongs to just one cluster.
Arai and Barakbah have presented their views on the k-means algorithm in their paper published in 2007. They refer to the k-means algorithm as the most well known clustering method, and strongly acknowledge the simplicity and popularity of the algorithm, developed by MacQueen in 1967. According to them, the k-means algorithm is very capable of clustering huge data in an efficient manner, and it provides a framework for the development of numerical clustering systems (Ralambondrainy, 1995).
According to Kanungo et al., "Among clustering formulations that are based on minimizing a formal objective function, perhaps the most widely used and studied is k-means clustering" (Kanungo et al., 2002). Law has mentioned the k-means algorithm as the best known clustering algorithm (Law, 2006).
Jain and Dubes (1988) have referred to the k-means algorithm as the best known and most popular algorithm because of its efficiency at clustering huge data sets. According to them, its computational complexity is affected only by the number of data points (Bottou and Bengio, 1995; Pham et al., 2007).
According to Davidson and Ravi, the k-means clustering algorithm provides one of the simplest and easiest techniques for clustering in data mining (Davidson and Ravi, 2005).
Cost et al. conducted an experimental study in 2004 comparing five clustering methods used in the analysis of gene expression time series: agglomerative hierarchical clustering, CLICK, dynamical clustering, k-means and self-organizing maps. They found that SOM (self-organizing maps), k-means and dynamical clustering achieved the highest accuracy among the clustering methods used in the analysis of gene expression time courses.
Hobbs et al. performed an experiment to test the performance of two popular methods, hierarchical clustering and k-means clustering. Their study compared the efficiency of the two methods in terms of within-cluster variability of gene expression. They found that k-means clustering performed better than hierarchical clustering for within-cluster gene variability, and that the k-means method was better at re-organizing genes into less variable clusters when the number of clusters was increased (Law, 2006).
Wei has referred k-means algorithm as the most insightful, useful and the most popular algorithm among all the clustering algorithms (Wei, 2005).
Although the k-means clustering algorithm is widely used because of its simplicity and applicability in many domains, its application in program comprehension has not been explored much.
3.3 k-means Clustering Algorithm
The steps of the k-means clustering algorithm are given below:
1. Choose k entities among the set of entities to be clustered; call them the k centroids.
2. Assign each entity not selected to its closest centroid on the basis of Euclidean distance. These assignments give the initial clustering.
3. Until the clustering "converges", do:
   a. Recompute the centroids of the clusters. The new centroids may not be entities of the original entity set.
   b. Reassign all entities to the closest centroid, updating the clusters.
Convergence may be defined differently depending upon the implementation, but it normally means that either the clusters remain unchanged or the changes do not make a material difference in the definition of the clusters.
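The steps above can be sketched in a few lines of Python. This is a minimal illustration, not a production implementation: for simplicity it takes the first k entities as the initial centroids (real implementations usually choose them at random), and it uses the "clusters unchanged" convergence criterion:

```python
from math import dist
from statistics import mean

def k_means(entities, k, max_iter=100):
    """Minimal k-means sketch. entities: list of equal-length numeric
    tuples. The first k entities serve as the initial centroids."""
    centroids = [list(e) for e in entities[:k]]
    for _ in range(max_iter):
        # Assignment step: each entity goes to its closest centroid,
        # measured by Euclidean distance.
        clusters = [[] for _ in range(k)]
        for e in entities:
            i = min(range(k), key=lambda j: dist(e, centroids[j]))
            clusters[i].append(e)
        # Update step: recompute each centroid as the mean of its
        # cluster; new centroids need not be original entities.
        new_centroids = [
            [mean(coords) for coords in zip(*c)] if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:  # convergence: clusters unchanged
            break
        centroids = new_centroids
    return centroids, clusters
```

For example, clustering the four points (0, 0), (0, 1), (10, 10) and (10, 11) with k = 2 converges to the centroids (0, 0.5) and (10, 10.5).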
3.3.1 Time Complexity of k-means Clustering Algorithm
Let tdist be the time to calculate the distance between two entities, or between an entity and the centroid of a cluster. Each iteration computes the distance from every object to every centroid, so the time complexity of each iteration is O(n · k · tdist), where:
n = number of objects
k = number of clusters
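The n · k factor can be checked directly by counting distance evaluations in a single assignment pass. The sizes and points below are arbitrary, chosen only to illustrate the count:

```python
# Count distance evaluations in one assignment step:
# n objects, each compared against all k centroids.
n, k = 1000, 5
evaluations = 0

def counted_dist(p, q):
    """Squared Euclidean distance, instrumented to count calls."""
    global evaluations
    evaluations += 1
    return sum((a - b) ** 2 for a, b in zip(p, q))

objects = [(float(i), 0.0) for i in range(n)]
centroids = [(float(j * 200), 0.0) for j in range(k)]

# One assignment pass over all objects.
for obj in objects:
    min(range(k), key=lambda j: counted_dist(obj, centroids[j]))

# evaluations is now n * k = 5000, matching O(n * k * tdist) per iteration.
```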
3.4 Application of k-means Algorithm in Different Areas
Kanungo et al. in 2000 presented an efficient implementation of Lloyd's k-means clustering algorithm (very similar to the one shown in Section 3.3), which they called the filtering algorithm. They demonstrated the practical efficiency of this algorithm both theoretically, through a data-sensitive analysis, and empirically, through experiments on both synthetically generated and real data sets (Kanungo et al., 2000).
Davidson, in his paper published in 2002, has described nine important points for applying the k-means algorithm in clustering. He presents a detailed overview of how these nine factors affect data representation and problem formulation in clustering, and how the resultant clusters can be obtained (Davidson and Ravi, 2002).
An application of the k-means algorithm is presented by Vaidya and Clifton in their 2003 paper. They present a technique for k-means clustering when different sites hold different attributes for a common set of entities. According to them, each site learns the cluster of each entity, but the attribute values held at the other sites remain hidden from it. They emphasize the goal of achieving valid results while preserving nondisclosure of data throughout the clustering (Vaidya and Clifton, 2003).
Kantabutra et al. have presented the concept of a parallel implementation of the k-means algorithm in their paper. According to their approach, k-means clustering can be run on inexpensive workstations with low-priced hard disks. Their experiments verified that a parallel implementation of the k-means algorithm achieves much more efficient and speedy clustering (Kantabutra et al., 2003).
An application of the k-means algorithm to databases is presented by Ordonez in his paper. He implemented two versions of k-means in SQL, suggesting that implementing k-means inside the database allows huge data sets to be clustered with linear scalability. These implementations are also applicable to huge data sets in a relational DBMS. According to him, implementing the k-means algorithm in the database eliminates the cumbersome requirement that data be exported or accessed from outside the DBMS (Ordonez, 2004).
A study of a subspace clustering algorithm was conducted by Jing et al., who presented a performance study on large sparse text data. Their algorithm calculates attribute weights during the k-means clustering process. These attribute weights determine the clusters from subspaces of the text vector space and are also used to recognize the terms that represent the semantics of the clusters. Jing et al. show experimentally that the algorithm converges toward an optimal solution and works efficiently irrespective of the number of documents and the number of clusters.
Martínez et al. have presented an application of the k-means algorithm with similarity functions. Their implementation permits dealing with objects of mixed features. Their algorithm combines the k-means algorithm with a similarity function (KMSF); the similarity function is used for comparing features and defines the way feature values are compared in KMSF. Their proposed algorithm does not require any transformation of features (Martínez et al., 2005).
Frahling and Sohler have proposed an implementation of the k-means clustering algorithm as a combination of Lloyd-steps and random swaps, presenting a new and efficient approach to k-means clustering. Their algorithm uses coresets to make the implementation of the k-means algorithm much more efficient. A coreset is a small set of weighted points used to approximate the original point set with respect to the problem under discussion. The main characteristic of their approach is the capability to determine clusterings of the same point set for different values of k. According to Frahling and Sohler, knowing a good value for k is essential and helpful when applying the k-means algorithm, and their approach provides an opportunity to choose an appropriate value for k, so that clustering can be carried out without worrying about the choice of k (Frahling and Sohler, 2006).
Lei et al. presented an experimental approach that applies the notion of a cluster center when the dataset consists of categorical objects. Their algorithm, proposed in 2006, was developed to provide a solution for clustering categorical data and is referred to as a k-means-like algorithm (Lei et al., 2006).
In 2007, Lingras proposed modifications to clustering-based algorithms that make it possible for these algorithms to represent clusters as rough sets. Rough set theory provides a way to represent sets for which no exact boundary information is available. In his paper, Lingras describes these modifications for genetic algorithms, the k-means algorithm and Self-Organizing Maps (SOM). He presents applications of these approaches to grouping highway sections, internet users and supermarket customers in order to exhibit rough clustering (Lingras, 2007).
A new application of the k-means algorithm in the domain of neural networks was presented by Faraoun and Boukelif. The main objective of their implementation was to increase the learning capabilities and reduce the computational load of a competitive-learning multi-layered neural network by using k-means clustering. Faraoun and Boukelif applied the k-means algorithm to the dataset to minimize the number of samples presented to the neural network; according to their approach, this is achieved automatically by selecting an optimal set of samples. In their experiments on the KDD99 dataset, compared to a standard learning scheme that used the full dataset, they found that this technique performed very well in terms of accuracy and computation time (Faraoun and Boukelif, 2007).
An optimization of the initial centroids for the k-means algorithm was proposed by Arai and Barakbah in their hierarchical k-means algorithm. The hierarchical k-means approach retains the speed of the k-means algorithm, while the hierarchical component contributes exactness and accuracy; hence, their implementation provides the advantages of both techniques (Arai and Barakbah, 2007).
Brunzel, on the other hand, conducted research comparing flat k-means clustering with hierarchical Bi-Secting-k-means clustering. He did not find the results very encouraging, but found that the hierarchy obtained from this approach is supportive for human ontology engineering in the area of semi-automatic ontology construction (Brunzel, 2007).
Ubeyli and Dogdu conducted research on the implementation of the k-means algorithm in the field of medicine. In their paper they presented k-means clustering for the automated detection of diseases, determining an optimum classification scheme for a disease. They found that the classification accuracy of k-means clustering for the disease was 94.22% (Ubeyli and Dogdu, 2008).
A comparative study of different classification schemes was conducted by Artur et al. They developed a methodology to compare classification schemes with the clusters produced by the k-means algorithm, and showed experimentally that their comparison method can detect differences in the separation quality of two classification schemes (Artur et al., 2008).
Nesrine et al. have implemented k-means clustering using a hierarchical approach to filter LIDAR data, where LIDAR (Light Detection and Ranging) is an optical remote sensing technology that measures properties of scattered light to find the range and other information about a distant target. They found this technique far better than the prevailing techniques used for filtering LIDAR data, thereby showing the applicability of the k-means algorithm in the domain of remote sensing (Nesrine et al., 2008).
In 2006, Kanellopoulos and Dimopulos proposed a methodology for extracting knowledge from source code for program comprehension of object-oriented systems, and assessed its performance. They presented a model, along with an associated process, to determine the elements in the system, where elements are defined as entities (i.e. classes, member functions, etc.) and attributes (i.e. class names, superclass and method names, and metrics). They applied the k-means clustering algorithm to the data extracted in this way to produce a system overview. However, they did not evaluate the performance of k-means itself, since the objective of their research was knowledge extraction for object-oriented systems (Kanellopoulos and Dimopulos, 2006).
In this chapter we have discussed clustering, its applications in different domains, and the different types of clustering. Since we have used the k-means clustering algorithm in our work, this chapter has focused on the k-means algorithm; we have also described the motives behind our use of k-means clustering and given an overview of its applications in different domains.