Improved K Mean Algorithm In Data Mining Computer Science Essay

Published:

Abstract- K-means clustering is a popular clustering algorithm based on the partition of data. Clustering analysis method is one of the main analytical methods in data mining; the method of clustering algorithm will influence the clustering results directly. This proposed paper discusses the standard k-means clustering algorithm and analyzes the shortcomings of standard k-means algorithm, such as the k-means clustering algorithm has to calculate the distance between each data object and all cluster centers in each iteration, which makes the efficiency of clustering is not high. An improved k-means algorithm in order to solve this problem, requiring a simple data structure to store some information in every iteration, which is to be used in the next iteration. The improved method avoids computing the distance of each data object to the cluster centers repeatedly, saving the running time and the outstanding feature of our algorithm is its superiority in execution time.

Lady using a tablet
Lady using a tablet

Professional

Essay Writers

Lady Using Tablet

Get your grade
or your money back

using our Essay Writing Service!

Essay Writing Service

Keywords-Clustering Analysis, IK-Means Algorithm, Computational Complexity, Data Mining.

1. INTRODUCTION

Data clustering is frequently used in many fields, such as data mining, pattern recognition, decision support, machine learning and image segmentation [1-3]. As the most well known technique for performing non-hierarchical clustering, the K-means clustering [4] iteratively finds the k cancroids and assigns each sample to the nearest centroid, where the coordinate of each centroid is the mean of the coordinates of the objects in the cluster. Unfortunately, K-means clustering algorithm is known to be sensitive to the initial cluster centers and easy to get stuck to the local optimal solutions [5]. Moreover, when the size of data set is large, it takes enormous time to find the solution. In order to improve the performance of the K-means algorithm, a variety of methods have been proposed. There are a lot of variations of the K-means clustering algorithm. Here are some versions of them in recent years. Bradley and Fayyad [15] present a technique for initializing the K-means algorithm. They begin by randomly breaking the data into10, or so, subsets. They then perform a K-means clustering on each of the10 subsets, all starting at the same set of initial seeds, which are chosen randomly. The result of the 10 runs is 10K centre points. These 10K points are then themselves input to the K-means algorithm and the algorithm run 10 times, each of the 10 runs initialized using the K final centroid locations from one of the 10 subset runs. The resulting K centre locations from this run are used to initialize the K-means algorithm for the entire dataset. In this paper incremental approach to clustering that dynamically adds one cluster center at a time through a deterministic global each proced-ure consisting of N (with N being the size of the data set) executions of the K-means algorithm from suitable initial positions. Experiment results show that the IKM(Improved K-Mean) algorithm consider-ably outperforms the K-means algorithms. Khan and Ahmad [17] proposed an algorithm to compute initial cluster centers for K-means clustering. This algorithm is based on two observations that some of the patterns are very similar to each other and that is why they have same cluster membership irrespective to the choice of initial cluster centers. Also, an individual attribute may provide some information about initial cluster center. K-means is a numerical, unsupervised, non-deterministic, iterative method. It is simple and very fast, so in many practical applications, the method is proved to be a very effective way that can produce good clustering results. But it is very suitable for producing globular clusters. Several attempts were made by researchers to improve efficiency of the k-means algorithms [5]. There is an improved k-means algorithm based on weights. This is a new partitioning clustering algorithm, which can handle the datas of numerical attribute, and it also can handle the datas of symbol attribute. Meanwhile, this method reduces the impact of isolated points and the "noise", so it enhances the efficiency of clustering. However, this method has no improvement on the complexity of time. Hence this method can produce more accurate clustering results than the standard k-means algorithm, but this method does not have any improvements on the executive time and the time complexity of algorithm. This paper presents an improved k-means algorithm. Although this algorithm can generate the same clustering results as that of the standard k-means algorithm, the algorithm of this paper proposed is superior to the standard k-means method on running time and accuracy, thus enhancing the speed of clustering and improving the time complexity of algorithm. By comparing the experimental results of the standard k-means and the improved k-means, it shows that the improved method can effectively shorten the running time. How the next new cluster center is created by introducing some idea of K-mean clustering algorithm suggested by Park and Jun in [19].

II. THE K-MEANS CLUSTERING ALGORITHM

Lady using a tablet
Lady using a tablet

Comprehensive

Writing Services

Lady Using Tablet

Plagiarism-free
Always on Time

Marked to Standard

Order Now

A. The process of k-means algorithm

This part briefly describes the standard k-means algorithm. K-means is a typical clustering algorithm in data mining and which is widely used for clustering large set of datas. In 1967, MacQueen firstly proposed the k-means algorithm, it was one of the most simple, non- supervised learning algorithms, which was applied to solve the problem of the well-known cluster [2]. It is a partitioning clustering algorithm, this method is to classify the given date objects into k different clusters through the iterative, converging to a local minimum. So the results of generated clusters are compact and independent. The algorithm consists of two separate phases. The first phase selects k centers randomly, where the value k is fixed in advance. The next phase is to take each data object to the nearest center [5]. Euclidean distance is generally considered to determine the distance between each data object and the cluster centers. When all the data objects are included in some clusters, the first step is completed and an early grouping is done. Recalculating the average of the early formed clusters. This iterative process continues repeatedly until the criterion function becomes the minimum.

Supposing that the target object is x, xi indicates the average of cluster Ci, criterion function is defined as follows:

E is the sum of the squared error of all objects in database. The distance of criterion function is Euclidean distance, which is used for determining the nearest distance between each data objects and cluster center. The Euclidean distance between one vector x=(x1 ,x2 ,…xn) and another vector y=(y1 ,y2 ,…yn ),

The Euclidean distance d(xi, yi) can be obtained as follow:

The process of k-means algorithm as follow:

Input:

Number of desired clusters, k, and a database D={d1, d2,…dn}

Containing n data objects.

Output:

A set of k clusters

Steps:

1) Randomly select k data objects from dataset D as initial Cluster centers.

2) Repeat;

3) Calculate the distance between each data object di (1 <=i<=n) and all k cluster centers cj(1<=j<=k) and assign data object di to the nearest cluster.

4) For each cluster j (1<=j<=k), recalculate the cluster center.

5) Until no changing in the center of clusters.

The k-means clustering algorithm always converges to local minimum. Before the k-means algorithm converges, calculations of distance and cluster centers are done while loops are executed a number of times, where the positive integer t is known as the number of k-means iterations. The precise value of t varies depending on the initial starting cluster centers [8]. The distribution of data points has a relationship with the new clustering center, so the computational time complexity of the k-means algorithm is O(nkt). n is the number of all data objects, k is the number of clusters, t is the iterations of algorithm. Usually requiring k <<n and t <<n.

B. The shortcomings of k-means algorithm

We can see from the above analysis of algorithms, the algorithm has to calculate the distance from each data object to every cluster center in each iteration. However, by experi-ments we find that it is not necessary for us to calculate that distance each time. Assuming that cluster C formed after the first j iterations, the data object x is assigned to cluster C, but in a few iterations, the data object x is still assigned to the cluster C. In this process, after several iterations, we calculate the distance from data object x to each cluster center and find that the distance to the cluster C is the smallest. So in the course of several iterations, k-means algorithm is to calculate the distance between data object x to the other cluster center, which takes up a long execution time thus affecting the efficiency of clustering.

III. IMPROVED K-MEANS CLUSTERING ALGORITHM

The standard k-means algorithm needs to calculate the distance from the each date object to all the centers of k clusters when it executes the iteration each time, which takes up a lot of execution time especially for large-capacity databases. For the shortcomings of the above k-means algorithm, this paper presents an improved k-means method. The main idea of algorithm is to set two simple data structures to retain the labels of cluster and the distance of all the date objects to the nearest cluster during the each iteration, that can be used in next iteration, we calculate the distance between the current date object and the new cluster center, if the computed distance is smaller than or equal to the distance to the old center, the data object stays in it's cluster that was assigned to in previous iteration. Therefore, there is no need to calculate the distance from this data object to the other k-mean clustering centers, saving the calculative time to the k-1 cluster centers. Otherwise, we must calculate the distance from the current data object to all k cluster centers, and find the nearest cluster center and assign this point to the nearest cluster center. And then we separately record the label of nearest cluster center and the distance to its center. Because in each iteration some data points still remain in the original cluster, it means that some parts of the data points will not be calculated, saving a total time of calculating the distance, thereby enhancing the efficiency of the algorithm.

Lady using a tablet
Lady using a tablet

This Essay is

a Student's Work

Lady Using Tablet

This essay has been submitted by a student. This is not an example of the work written by our professional essay writers.

Examples of our work

The process of the improved algorithm is described as follows:

Input:

The number of desired clusters k, and a database D={d1, d2,…dn} containing n data objects.

Output:

A set of k clusters

Steps:

1) Randomly select k objects from dataset D as initial cluster centers.

2) Calculate the distance between each data object di (1 <= i<=n ) and all k cluster centers cj (1<=j<=k) as Euclidean distance d(di , cj) and assign data object di to the nearest cluster.

3) For each data object di, find the closest center cj and assign di to cluster center j;

4) Store the label of cluster center in which data object di is and the distance of data object di to the nearest cluster and store them in array Cluster[ ] and the Dist[ ] separately. Set Cluster[i]=j, j is the label of nearest cluster. Set Dist[i]=d(di, cj), d(di, cj) is the nearest Euclidean distance to the closest center.

5) For each cluster j (1<=j<=k), recalculate the cluster center;

6) Repeat

7) For each data object di Compute it's distance to the center of the present nearest cluster;

a) If this distance is less than or equal to Dist[i], the data object stays in the initial cluster;

b) Else For every cluster center cj(1<=j<=k), compute the distance d(di, cj) of each data object to all the center, assign the data object di to the nearest center cj. Set Cluster[i]=j; Set Dist[i]= d(di, cj);

8) For each cluster center j(1<=j<=k), recalculate the centers;

9) Until the convergence criteria is met.

10) Output the clustering results;

The improved algorithm requires two data structure (Cluster[ ]and Dist[ ] )to keep the some information in each iteration which is used in the next iteration. Array cluster[ ] is used for keep the label of the closest center while data structure Dist[ ] stores the Euclidean distance of data object to the closest center. The information in data structure allows this function to reduce the number of distance calculation required to assign each data object to the nearest cluster, and this method makes the improved k-means algorithm faster than the standard k-means algorithm. This paper propose an improved k-means algorithm, to obtain the initial cluster, time complexity of the improved k-means algorithm is O(nk). Here some data points remain in the original clusters, while the others move to another cluster. If the data point retains in the original cluster, this needs O(1), else O(k). With the convergence of clustering algorithm, the number of data points moved from their cluster will reduce. If half of the data points move from their cluster, the time complexity is O(nk/2). Hence the total time complexity is O(nk). While the standard k-means clustering algorithm require O(nkt). So the proposed k-mean algorithm in this paper can effectively improve the speed of clustering and reduce the computational complexity. But the improved k-means algorithm requires the pre-estimated the number of clusters, k, which is the same to the standard k-means algorithm. If you want to get to the optimal solution, you must test the different value of k.

IV EXPERIMENTAL RESULT

This paper selects three different data sets from the UCI [4] repository of machine learning databases to test the efficiency of the improved k-means algorithm and the standard k-means. Two simulated experiments have been carried out to demonstrate the performance of the improved k-means algorithm in this paper. This algorithm has also been applied to the clustering of real datasets. In two experiments, time taken for each experiment is computed. The same data set is given as input to the standard k-means algorithm and the improved algorithm. Experiments compare improved k-means algorithm with the standard k-means algorithm in terms of the total execution time of clusters and their accuracy Experimental operating system is Window XP, program language is VC++ 6.0. This paper uses iris, glass, letter as the test datasets and gives a brief description of the datasets used in experiment evaluation. Table 1 shows some characteristics of the datasets.

TABLE I. CHARACTERISTIC OF THE DATASETS

Dataset

Number of attrib

Number of records

Iris

4

150

Glass

9

214

Letter

16

20000

A. Experiment 1

In experiment 1, datasets of Iiris and glass are selected because they are fit to clustering analysis and their clustering results are reliable. The number of cluster k sets 3. Clustering results for the standard k-means algorithm and the improved k-means algorithm proposed in this paper are listed in tableII. The results of experiment 1 show that the improved k-means algorithm can produce the final cluster results in shorter time than the standard k-means.

TABLE II. CLUSTERING RESULTS FOR IRIS, GLASS ON THE STANDARD KMEANS AND THE IMPROVED K-MEANS.

Dataset

k-means

Running

time (s)

Improved

k-means

Running

time(s)

k-means

Accuracy %

Improved

k-means

Accuracy %

Iris

0.0586

0.0559

84.3

91.6

Glass

0.0814

0.0778

78.9

89.3

At the same time the improved k-means can enhance the accuracy of algorithm.

B. Experiment 2

In the Experiment 2, the same dataset is used on different k values. This experiment uses dataset letter containing 20000 samples for testing, k sets respectively 40,60,80,100. Figure 1 depicts the performance of the improved k-means algorithm and the standard k-means algorithm in terms of the total execution time of clusters.

Figure 1. Excution time comparison of the improved k-means algorithm and the standard k-means algorithm

The results of Experiment 2 shows that, the improved k-means algorithm is able to fully demonstrate it's superiority on the running time comparing to the standard k-means algorithm, especially when it facts large-capacity database. The improved algorithm can generate the final clustering results in relatively short period of time, so it can enhance the speed of clustering. Results of two simulated experiments show that the improved k-means clustering algorithm significantly outperforms the standard k-means algorithm in the overall execution time. So the improved algorithm proposed in this paper is feasible.

CONCLUSION

K-means is a typical clustering algorithm and it is widely used for clustering large sets of data. This paper elaborates k-means algorithm and analyses the shortcomings of the standard k-means clustering algorithm. Because the computational complexity of the standard k-means algorithm is objectionably high owing to the need to reassign the data points a number of times during every iteration, which marks the efficiency of standard k-means clustering is not high. This paper presents a simple and efficient way for assigning data points to clusters. The proposed method in this paper ensures the entire process of clustering in O(nk) time without sacrificing the accuracy of clusters. Experimental results show the improved algorithm can improve the execution time of k-means algorithm. So the proposed k-means method is feasible.