# K Means Clustering Algorithm Cultural Studies Essay

Published:

This essay has been submitted by a student. This is not an example of the work written by our professional essay writers.

Abstract: Clustering is a process of organizing the objects into groups based on its attributes, either keywords or content. Clustering techniques for image retrieval are presented to group similar images into clusters and to compute the cluster centers, so that during retrieval, the query image need not be compared exhaustively with all the images in the database. This paper is aimed to analyse and compare the strengths and weaknesses between hierarchical clustering algorithm and k-means clustering algorithm besides introducing both algorithms. Analysis is made by comparing the strengths and weaknesses of both algorithms.

## 1. Introduction

Image retrieval is the process of browsing, searching and retrieving images from a large database of digital images (.S.V.S. Murthy, E.Vamsidhar, J.N.V.R. Swarup Kumar, P.Sankara Rao, 2010). Image collections are growing at an exponential rate and becoming more diverse. Thus, solutions are needed to manage vast databases of images are hence highly sought after. Retrieving images from such huge collections is a challenging task.

Most of the image retrieval systems present today is done through a query-by text approach, exploiting manual annotation of the stored data. However, there are several drawbacks in this approach. First, manual annotation is very time consuming. Second, it might lead to a rather subjective result, as rich features present in an image cannot be described by keywords completely. Third, the result of the query depends highly on the quality of the annotations as visual content that has not been transcribed into the meta- data can therefore not be retrieved afterwards. Fourth, The result of a query can be manipulated by the type and number of tags associated with the visual data [2].

Some methods for image search result clustering were proposed in literatures. In traditional Content-Based Image Retrieval (CBIR), image clustering techniques are used to extract feature vectors to group perceptually similar images together. One of the main tasks for CBIR systems is similarity comparison, extracting feature signatures of every image based on its pixel values and defining rules for comparing images. These features become the image representation for measuring similarity with other images in the database. Images are compared by calculating the difference of its feature components to other image descriptors [3].

Clustering is a process of organizing the objects into groups based on its attributes. A cluster is therefore a collection of objects which are "similar" between them and are "dissimilar" to the objects belonging to other clusters. An image can be grouped based on keyword or its content [4]. Clustering techniques for image retrieval are presented to group similar images into clusters and to compute the cluster centers, so that during retrieval, the query image need not be compared exhaustively with all the images in the database. To retrieve similar images for a given query, the query image is first compared with all the cluster centers, then a subset of clusters that have the largest similarity to the query image is chosen and all the images in these clusters are compared with the query image[5]. The aim of this paper is to analyse and compare the strengths and weaknesses between hierarchical clustering algorithm and k-means clustering algorithm besides introducing both algorithms.

## 2. Hierarchical Clustering Algorithm

Hierarchical clustering algorithms[6] are divided into agglomerative and divisive methods. Agglomerative, which refers to bottom-up clustering, begins with treating each individual sample as an individual cluster. Clusters are merged with their most similar neighbours using a form of similarity, and this process is repeated until a pre-defined number of clusters remain. These clusters then form the top layer of the generated tree. On the other hand, divisive, or known as top-down clustering, begins with all samples starting as a single large cluster which is then iteratively split into smaller clusters until a termination criterion is met.

The similarity between two images is measured by calculating the similarity between the histograms of the corresponding rectangular regions. Then a single measure of similarity between the two images is calculated by adding the individual similarities between the corresponding regions. We have used the histogram intersection measure to compare the individual histograms (probabilities). Given two normalized histograms, p= {p1,p2, ..., pm}, q={q1,q2, ..., qm}, the similarity measure is defined by

Let n be the number of images in the database, the similarity between all pairs of images is precomputed. The hierarchical clustering is performed as follows:

1. The n images in the database are placed in n distinct clusters. These clusters are indexed by {C1, C2,..., Cn}. For the kth cluster, the set Ek contains all the images contained in that cluster and Nk denote the number of images in the cluster.

Ek={k} and Nk =1 for k=1,2,...,n

2. Two clusters Ck and Cl are picked such that the similarity measure Sk,l is the largest. (The similarity measure between two clusters is defined in the following subsection). These two clusters are merged into a new cluster Cn+1. This reduces the total number of unmerged clusters by one. En+1 is updated to En+1={EkUEl} and Nn+1 is updated to Nn+1=Nk+Nl. Out of the two children of Cn+1, one is referred to as the right child RCn+1=k and the other as the left child LCn+1=l. The similarity measures between the new cluster Cn+1 and all the other unmerged clusters are computed as discussed below.

3. Steps 2 and 3 are repeated until the number of clusters has reduced a required number or the largest similarity measure between clusters has dropped to some lower threshold.

Figure 1 shows a simple example of hierarchical clustering with 8 images. The clustering is stopped after reaching two clusters. For this example, N14=5, N12=3, E14={1,2,3,4,5} and E12={6,7,8}. Also, RC14=5 and LC14=13. Each cluster has anassociated tree. The clustering algorithm presented here does not directly depend on the nature of similarity measure.

Figure 1 : A sample of cluster merging process with hierarchical clustering

The similarity measure between two images is defined in the previous section. The measure of similarity, Sk,l, between two clusters, Ck and Cl, is defined in terms of the similarity measures of the images that are contained in those clusters as follows:

(1)

where si,j is the similarity measure between images i and j, Ek is the set of images present in the cluster Ck and Nk is the number of images in cluster Ck. Pn is the number of pairs of images in a cluster with n images:

In other words, Sk,l is defined to be the average similarity between all pairs of images that will be present in the cluster obtained by merging Ck and Cl. Since the similarity between clusters is defined in terms of the similarity measures between the images in the clusters, there is no need to compute the cluster centers every time two clusters are merged. When two clusters, Ck and Cl, are merged to form a new cluster Cm, then it is necessary to compute the similarity of this cluster with all other clusters as given in Eq. (1). This computation is cumbersome as shown below. For any cluster Ct, the similarity measure between Cm and Ct is:

(2)

Sm,m is set equal to Sk,l.

A simple recursive method to achieve the same can be obtained using the fact that the first term in Eq. (2) is equal to , the second term is equal to , and the third term is equal to. Thus the similarity Sm,t can be obtained from, Sl,k, Sl,t, Sk,t, St,t, Sl,l, and Sk,k. The following equation requires far less computation compared to the one above.

(3)

In Eq. (2), Sm,t is computed by summing up the similarity measures of all pairs of images in Cm and Ct, and hence the computation grows as the square of number of images present in the two clusters. The computation in Eq. (3) is independent of the number of images in the clusters. At the beginning of clustering, for all the clusters Si,j is set equal to si,j and Si,i is set equal to zero[5].

## 3. K-means algorithm

K-means algorithm is the simplest and most commonly used algorithm that employ a squared error criterion. This algorithm partitions the data into K clusters (C1,C2, .... ,CK), represented by their centers or means. The center of each cluster is calculated as the mean of all the instances belonging to that cluster. Figure 2 presents the pseudo-code of the K-means algorithm. The algorithm starts with an initial set of cluster centers, selected at random or according to some heuristic procedure. In each iteration, each instance is assigned to its nearest cluster center according to the Euclidean distance between the two. Then the cluster centers are re-calculated. The center of each cluster is calculated as the mean of all the instances belonging to that cluster:

where Nk is the number of instances belonging to cluster k and Âµk is the mean of the cluster k. A number of convergence conditions are possible. For example, the search may stop when the partitioning error is not reduced by the relocation of the centers. This indicates that the present partition is locally optimal. Other stopping criteria can be used also such as exceeding a pre-defined number of iterations.

Input: S (instance set), K (number of cluster)

Output: clusters

1: Initialize K cluster centers.

2: while termination condition is not satisfied do

3: Assign instances to the closest cluster center.

4: Update cluster centers based on the assignment.

5: end while

Figure 2: K-means Algorithm

The K-means algorithm may be viewed as a gradient-decent procedure, which begins with an initial set of K cluster-centers and iteratively updates it so as to decrease the error function[7].

In this section, Lloyd's k-means algorithm, also known as the filtering algorithm[8], is introduced. The algorithm is based on storing the multidimensional data points in a kd-tree. The basic elements of this data structure is summarized.

Define a box to be an axis-aligned hyper rectangle. The bounding box of a point set is the smallest box containing all the points. A kd-tree is a binary tree, which represents a hierarchical subdivision of the point set's bounding box using axis aligned splitting hyperplanes. Each node of the kd-tree is associated with a closed box, called a cell. The root's cell is the bounding box of the point set. If the cell contains at most one point (or, more generally, fewer than some small constant), then it is declared to be a leaf. Otherwise, it is split into two hyperrectangles by an axis-orthogonal hyperplane. The points of the cell are then partitioned to one side or the other of this hyperplane. (Points lying on the hyperplane can be placed on either side.) The resulting subcells are the children of the original cell, thus leading to a binary tree structure. There are a number of ways to select the splitting hyperplane. One simple way is to split orthogonally to the longest side of the cell through the median coordinate of the associated points [7]. Given n points, this produces a tree with O(n) nodes and O (log n) depth. First, compute a kd-tree for the given data points. For each internal node u in the tree, compute the number of associated data points u:count and weighted centroid u:wgtCent, which is defined to be the vector sum of all the associated points. The actual centroid is just u:wgtCent/u:count. It is easy to modify the kd-tree construction to compute this additional information in the same space and time bounds given above. The initial centers can be chosen by any method desired. A common method is to sample the centers at random from the data points. Recall that, for each stage of Lloyd's algorithm, for each of the k centers, the centroid of the set of data points for which this center is closest needs to be computed. This center is then moved to the computed centroid and proceed to the next stage. For each node of the kd-tree, a set of candidate centers is maintained. This is defined to be a subset of center points that might serve as the nearest neighbor for some point lying within the associated cell. The candidate centers for the root consist of all k centers. Candidates are then propagated down the tree as follows: For each node u, let C denote its cell and let Z denote its candidate set. First, compute the candidate z* Ïµ Z that is closest to the midpoint of C. Then, for each of the remaining candidates z* Ïµ Z\{ z* }, if no part of C is closer to z than it is to z*, it could be infered that z is not the nearest center to any data point associated with u and, hence, we can prune, or "filter", z from from the list of candidates. If u is associated with a single candidate (which must be z*) then z*is the nearest neighbour of all its data points. We can assign them to z* by adding the associated weighted centroid and counts to z*. Otherwise, if u is an internal node, we recurse on its children. If u is a leaf node, we compute the distances from its associated data point to all the candidates in Z and assign the data point to its nearest center. (See Fig. 1.) It remains to describe how to determine whether there is any part of cell C that is closer to candidate z than to z*. Let H be the hyperplane bisecting the line segment . (See Fig. 2.) H defines two halfspaces; one that is closer to z and the other to z*. If C lies entirely to one side of H, then it must lie on the side that is closer to z* (since C's midpoint is closer to z*) and so z may be pruned. To determine which is the case, consider the vector , directed from z* to z. Let denote the vertex of C that is extreme in this direction, that is, the vertex of C that maximizes the dot product . Thus, z is pruned if and only if dist(z, â‰¥ dist(z*, ). (Squared distances may be used to avoid taking square roots.) To compute , let, ] denote the projection of C onto the ith coordinate axis. The ith coordinate of to be if the ith coordinate of is negative and ] otherwise. This computation is implemented by a procedure z.isF arther(z*, C), which returns true if every part of C is farther from z than to z*. The initial call is to Filter ( r,), where r is the root of the tree and is the current set of centers. On termination, center z is moved to the centroid of its associated points, that is, .

## 4. Methodology and Results

## 4.1 Comparison between Strengths and Weaknesses of Hierarchical Clustering

## and k-means algorithm

By analysing both k-means and hierarchical clustering algorithm, comparison could be made between both algorithm in terms of their strengths and weaknesses. Table 1 shows the analyse made in this paper to compare Strengths and Weaknesses of Hierarchical Clustering and k-means algorithm.

k-means algorithm

Hierarchical Clustering

Strength

The advantages of the k-means algorithms :

1. Its time complexity is

O(mkl), where

m is the number of instances;

k is the number of clusters; and

l is the number of iterations taken by the algorithm to converge.

This algorithm is computationally applicable to large number of instances.

2. Its space complexity is O(k+m). It requires additional space to store the data matrix. It is possible to store the data matrix in a secondary memory and access each pattern based on need.

3. It is order-independent. For a given initial seed set of cluster centers, it generates the same partition of the data irrespective of the order in which the patterns are presented to the algorithm.

Other advantages are its ease of interpretation, simplicity of implementation, speed of convergence and adaptability to sparse data (Dhillon and Modha, 2001).

The advantages of the hierarchical clustering algorithms :

1. Hierarchical Clustering possesses embedded flexibility regarding a level of granularity.

2. Ease of handling of any forms of similarity or distance.

3. It is consequently applicability to any attributes types.

4. The algorithm is more versatile.

Weakness

Yet, they have the following disadvantages:

1. However, K-means algorithm is sensitive to noisy data and outliers as a single outlier can increase the squared error dramatically. It is applicable only when mean is defined and it requires the number of clusters in advance, which is not trivial when no prior knowledge is available.

2. Moreover, K-means algorithm is sensitive to initial seed selection and even in the best case, it can produce only hyperspherical clusters and works well only on data sets having isotropic clusters. Hierarchical algorithms are more versatile.

3. As it is possible to store the data matrix in a secondary memory and access each pattern based on need, it requires a huge access time because of the iterative nature of the algorithm. Hence, processing time takes long.

Yet, they have the following disadvantages:

1. The time complexity of hierarchical agglomerative algorithms is O(*log m).

2. The space complexity of agglomerative algorithms is O(). This is because a similarity matrix of size has to be stored. It is possible to compute the entries of this matrix based on need instead of storing them.

Table 1: Comparison between Hierarchical Clustering and k-means algorithm

## 4.2 Analysis and Comparison

From the table of comparison, it could be analysed that the time complexity and space complexity's of both the algorithms differ from each others. K-means algorithm involves linear complexity while hierarchical clustering undergoes non-linear but a quadratic one. This could be shown by the m (number of vertices) in their time and space complexity. K-means algorithm uses m while used in hierarchical clustering. Besides that, hierarchical clustering is much more versatile and flexible than k-means algorithm which sensitive to noisy data and initial seed selection.

The analysis shows that both algorithms are complementary to each other. The weaknesses of one algorithm is the strength of the another. Hence, both algorithms are suit to be combined to improve the performance.

By analysing paper written by Li and Wang (2000) and Zhang (2002), that used k-means algorithm where the number of k is chosen by slowly increasing k until stopping criterion are met. As the initial cluster assignment is random for k-means algorithm, different runs of the k-means clustering algorithm do not give the same final clustering solution. To cope with this a good starting points are needed for the initial cluster assignment. Hence, hierarchical clustering algorithm is proposed by Aulia (2005) to provide the number of k and the initial cluster centers. This proved that hierarchical clustering algorithm can help in overcoming weakness in k-means algorithm.

Hierarchical Clustering and k-means algorithm are compared according to the following four factors[9]:

The size of the dataset

Number of the clusters

Type of dataset

Type of software

Table 2 explains how this two algorithms are compared. The total number of times the algorithms have been executed is 16. For each 4-runs group, the results of the executions are studied and compared.

Table 2 : The factors according to which the algorithms are compared

## 4.3 Analysis from Tables of Results

K-Means clustering algorithm require setting k in advance. To make comparisons easier, k is chosen to be equal to 8,16,32 and 64. The hierarchical clustering is cut at two different levels to obtain corresponding numbers of clusters(8,16,32,and64). As a result, performance of k-means algorithm is better than hierarchical clustering.

Number of Clusters (K)

Performance

HCA

K-Means

8

65

63

16

74

71

32

87

84

64

92

89

Table 3: Relationship between number of clusters and performance of algorithms

For accuracy, as the number of k increases, the accuracy of hierarchical clustering becomes better. k-means algorithm has less accuracy than hierarchical clustering.

Number of Clusters (K)

Performance

HCA

k-Means

8

1090

1112

16

960

1089

32

850

910

64

760

840

Table 4: The relationships between number of clusters and the quality of the algorithms

For size of dataset, a huge dataset consisting of 600 rows and 60 columns and a small dataset that uses 200 rows and 20 columns is used. The small dataset is extracted as a subset of the huge dataset. The quality of k-means algorithm is good when using huge dataset, while hierarchical clustering algorithm shows good result in small dataset.

K=32

Data Size

HCA

k-Means

36000

850

910

4000

91

95

Table 5: The affect of the data size on the algorithms

According to the type of dataset, a random dataset that extracted from internet and used for different jobs is used. An ideal dataset is used which is part of the software (LNKnet and Cluster and TreeView). It is ideal because it is designed to be suitable for testing and training the software itself and having less noisy data which leads to ambiguity. Hierarchical clustering gives better result than k-means when using random dataset. This indicates that k-means is very sensitive for noise in the dataset. The noise makes the algorithm difficult to include an object in a certain cluster.

K=32

Data Type

HCA

k-Means

Random

850

910

Ideal

829

810

Table 6: The affect of the data type on the algorithms

According to the type of software, LNKnet and cluster and TreeView is used to run the algorithms. Nevertheless, both of them yields almost the same result even when changing any of the other 3 factors : size of dataset, number of clusters and type of dataset.

## 5. Conclusion

After analyzing the results and efficiency of both hierarchical clustering and k-means algorithm by testing them under four factors, it is to be found that the performance of k-means is better than hierarchical clustering algorithm in terms of comparing number of clusters. However, in terms of accuracy comparison, the accuracy of hierarchical clustering is better than k-means algorithm. Besides that, hierarchical clustering has good results when small dataset is used, while on the other hand, k-means is good when huge dataset is used. Hence, it could be concluded that partitioning algorithms such as k-means are recommended for huge dataset using and hierarchical clustering for small dataset. In contrast, k-means yields better result in ideal dataset while hierarchical clustering shows good when using random dataset. k-means algorithm is sensitive towards noise in dataset which affects its result as the noise makes it difficult to cluster an object into its suitable cluster. However, both of them yields almost the same result even when changing any of the other 3 factors as it could be concluded that most software use the same procedures and idea in any algorithm implemented by them. Combination of both algorithms can enhance the performance of algorithms since both algorithms are complementary to each others.