# K-Means Data Mining Algorithm


### An Agent oriented approach for the generation of initial Centroids in K-Means Data Mining Algorithm

### Abstract

*Intelligent agents are today accepted as powerful tools for data mining in a distributed environment. Artificial Intelligence (AI) algorithms are utilized to render software agents more intelligent. [1] Knowledge discovery and data mining methods have been applied to discover hidden patterns and relations in complex datasets, and clustering is one of the most important approaches used for this purpose. [2] Among the many clustering algorithms, the K-means algorithm is commonly used to find clusters because of its simplicity of implementation and fast execution. It appears extensively in the machine learning literature and in most data mining suites of tools. The algorithm is significantly sensitive to the randomly selected initial cluster centers, yet no standard method exists for selecting the initial centroids. In this paper, we present an agent oriented approach for the generation of initial centroids in the K-means data mining algorithm. We have tested this formula with both the Euclidean and City Block (Manhattan) distance measures on a number of real life datasets.*

### Keywords

Artificial Intelligence (AI), Centroids, Clusters, Data Points, Intelligent Mobile Agents, Learning Intelligent Agent (LIAgent).

### 1. Introduction

In this section we discuss the nature of an agent, how it is made intelligent, and its functioning based on a proposed formula for the generation of initial centroids in the K-means data mining algorithm. An agent can be defined as: *"An Agent is a software thing that knows how to do things that you could probably do yourself if you had the time."* (Ted Selker of IBM Almaden Research Center)

*"A piece of software which performs a given task using information gleaned from its environment to act in a suitable manner so as to complete the task successfully. The software should be able to adapt itself based on changes occurring in its environment, so that a change in circumstances will still yield the intended results."* (With thanks to G. W. Lecky-Thompson for the definition) [3][4]

An intelligent agent can be further divided into a weak notation and a strong notation. Table 1 shows the properties for both notations.

*Table 1 - weak and strong agent*

| Weak notation | Strong notation |
| --- | --- |
| Autonomy | Mobility |
| Social ability | Benevolence |
| Reactivity | Rationality |
| Proactivity | Adaptivity |
| Temporal continuity | Collaboration |
| Goal oriented | |

Actually, intelligence refers to the ability of the agent to capture and apply domain specific knowledge and processing to solve problems. An intelligent agent uses knowledge, information and reasoning to take reasonable actions in pursuit of a goal. It must be able to recognize events, determine the meaning of those events and then take actions on behalf of a user.

In order to construct an Intelligent Agent, we have to use the following topics of Artificial Intelligence: [5]

- Knowledge Representation
- Reasoning
- Learning

An intelligent and autonomous agent has the properties like Perception, Reasoning and Action which form the life cycle of an agent as shown in figure 1. [6] [7]

Knowledge Discovery in Databases (KDD) and Information Retrieval (IR) are today decisive for accessing the right information in a global world; consider, for example, the tracking of epidemic diseases or of the movements of suspected criminals. Knowledge discovery and data mining methods have been applied to discover hidden and new patterns and relations in complex datasets. [8]

Data mining is a statistical term but in Information Technology (IT), it is: *"Discovery of useful summaries of data"*. [9] Data mining is known as "Machine Learning" in Artificial Intelligence (AI). Machine Learning is an area of Artificial Intelligence (AI), concerned with the development of techniques which allow the computer to 'learn'. It is a method of creating computer programs by the analysis of datasets. The agents must be able to learn to do classification, clustering and prediction using learning algorithms. From the above discussion, it is now clear that there is an explicit link between Artificial Intelligence (AI) and Learning Intelligent Agents (LIAgents). [10] [11] The following paragraph describes the functioning of K-means clustering algorithm:

The algorithm takes as input a predefined number of clusters, the *'k'* of its name; 'means' stands for an average, the average location of all the members of a particular cluster. When dealing with clustering techniques, one has to adopt the notion of a high dimensional space, a space whose orthogonal dimensions are all the attributes of the table of data being analyzed. The value of each attribute of an example represents the distance of the example from the origin along the attribute axes. In order to use this geometry efficiently, the values in the dataset must all be numeric (categorical data must be transformed into numeric values) and should be normalized in order to allow fair computation of the overall distances in a multi-attribute space. The basic mechanism of all clustering algorithms is represented in figure 2: [12] [13]
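The normalization step mentioned above can be sketched as simple min-max scaling; the text does not specify which normalization scheme the authors used, so this is an illustrative assumption:

```python
def min_max_normalize(rows):
    """Scale every numeric attribute to [0, 1] so that no single
    attribute dominates the distance computation."""
    cols = list(zip(*rows))                      # column-wise view of the data
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]
    return [
        [(v - l) / (h - l) if h > l else 0.0     # guard against constant columns
         for v, l, h in zip(row, lo, hi)]
        for row in rows
    ]
```

After this step every attribute contributes to a distance on the same [0, 1] scale.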

The K-means algorithm is an iterative procedure whereby the final results depend on the values selected for the initial *centroids*. A *centroid* is an artificial point in the space of records which represents the average location of a particular cluster; its coordinates are the averages of the attribute values of all examples that belong to the cluster. Finding useful patterns in large datasets has attracted considerable interest recently, and one of the most widely studied problems in this area is the identification of clusters, or densely populated regions, in a multi-dimensional dataset, given the desired number of clusters 'k', a dataset of 'N' points and a distance-based measurement function. The major problem with the K-means algorithm is the choice of initial centroids: the algorithm depends upon this selection, and if the selection is accurate the result will be good, otherwise the algorithm may fail. It is the most commonly used algorithm employing a square-error criterion. It has two primary steps: the assignment step, where instances are placed in the closest class, and the re-assignment step, where the class centroids are recalculated from the instances assigned to the class. [14]

The interpretation of the flow chart in figure 3 is as follows:

Enter the number of clusters and the number of iterations, which are the required and basic inputs of the K-means algorithm, then compute the initial centroids using the formula shown in equations 1 and 2 below. Calculate the distances with either the Euclidean or the City Block (Manhattan) distance formula (equations 3 and 4). On the basis of these distances, generate the partition by assigning each sample to the closest cluster. Compute the new cluster centers as the centroids of the clusters, then again compute the distances and generate the partition. Repeat this until the cluster memberships stabilize.
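The assignment/re-computation loop described above can be sketched as follows. This is a minimal illustration, not the authors' implementation; the `dist` parameter defaults to the Euclidean distance:

```python
import math

def kmeans(points, centroids, n_iter=100, dist=None):
    """One K-means run: alternate the assignment and re-computation
    steps until cluster membership stabilizes or n_iter passes are done."""
    dist = dist or math.dist          # Euclidean by default (Python 3.8+)
    assign = None
    for _ in range(n_iter):
        # Assignment step: each point joins the cluster of its closest centroid.
        new_assign = [min(range(len(centroids)),
                          key=lambda c: dist(p, centroids[c]))
                      for p in points]
        if new_assign == assign:      # membership stabilized -> stop early
            break
        assign = new_assign
        # Re-computation step: each centroid becomes the mean of its members.
        for c in range(len(centroids)):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign, centroids
```

Passing a Manhattan distance function as `dist` reproduces the City Block variant used later in the paper.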

The formula for initial centroids is given below:

Now the initial centroid will be C(*ci, cj*), where max X, max Y, min X and min Y represent the maximum and minimum values of the X and Y attributes respectively, k represents the number of clusters, and i, j and n vary from 1 to k, where k is an integer.
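Equations 1 and 2 did not survive as text, so the exact formula cannot be reproduced here. One plausible reading of the description above, with the k centroids spaced evenly between the minimum and maximum of each attribute, can be sketched as follows; this is a hypothetical reconstruction, not necessarily the authors' formula:

```python
def initial_centroids(xs, ys, k):
    """Hypothetical reconstruction of equations 1 and 2: the i-th centroid
    is placed at the i-th of k evenly spaced positions between the minimum
    and maximum of each attribute (X and Y)."""
    step_x = (max(xs) - min(xs)) / k
    step_y = (max(ys) - min(ys)) / k
    return [(min(xs) + (i - 0.5) * step_x,
             min(ys) + (i - 0.5) * step_y)
            for i in range(1, k + 1)]
```

Any scheme of this shape is deterministic, which is the point of the proposal: the same dataset always yields the same starting centroids.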

In this way we can calculate the initial centroids; these are the starting point of the algorithm, which then proceeds using the square-error criterion.

Manhattan Distance Formula:

Where *d(xi, xj)* is the distance between *xi* and *xj*; *xi* and *xj* are the attributes of a given object, where i and j vary from 1 to N, N being the total number of attributes of a given object; i, j and N are integers. [14]
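The two distance measures used in the tests are standard; since the equation images (equations 3 and 4) were not preserved as text, the textbook definitions are written out here:

```python
def euclidean(a, b):
    """Euclidean distance: straight-line distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def manhattan(a, b):
    """City Block (Manhattan) distance: the sum of the absolute
    coordinate differences between two points."""
    return sum(abs(x - y) for x, y in zip(a, b))
```

For the same pair of points the Manhattan distance is always at least as large as the Euclidean one, which is one reason the two measures can yield different clusterings.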

There are different techniques available to choose the initial points (required as an input). The following are the different methods to calculate the initial points:

- *Corner:* The cluster centers are chosen close to the corners of the data space. [15]
- *Bins:* This method consists of dividing the space into bins and then taking random points inside each bin; this ensures that the set of initial points is distributed so as to cover the entire dataset. [15]
- *Centroid:* This method consists of choosing all the starting clusters close to the mass centroid of the dataset. Each cluster center is calculated by adding a small random perturbation to the centroid of the dataset. [15]
- *Spread:* The cluster centers are distributed randomly, trying to cover the entire space. This method is similar to Bins. [15]
- *PCA:* The data points are projected into the space of the principal component, and a clustering procedure is applied to this one-dimensional set. The cluster centers are then calculated from the clusters obtained in the one-dimensional space. [15]

Among the above mentioned methods, in order to calculate the initial points we opt for the 'centroid' method, because it has been advocated throughout the literature that the centroid method produces the best results.
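As described in Fung's overview [15], the 'centroid' method places all k starting centers at the mass centroid of the dataset, each displaced by a small random perturbation. A minimal sketch, in which the perturbation scale `eps` is an illustrative assumption:

```python
import random

def centroid_init(points, k, eps=0.05):
    """'Centroid' initialization: each of the k starting centers is the
    mass centroid of the dataset plus a small random perturbation.
    The perturbation scale eps is an illustrative assumption."""
    n = len(points)
    mass = [sum(col) / n for col in zip(*points)]          # mass centroid
    return [[v + random.uniform(-eps, eps) for v in mass]  # k perturbed copies
            for _ in range(k)]
```

The perturbation only breaks the tie between the k otherwise identical centers; the first assignment step then pulls them apart.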

### 2- Deploying K-means in Agent

We have developed a multiagent system comprising two agents: the first runs the K-means algorithm with the proposed initial centroids formula, and the second is based on the ID3 algorithm for the interpretation of its results, under the 'kaariboga' framework, an experimental Java-based mobile agent framework. [16] [17] Choosing a clustering algorithm, however, can be a difficult task; even finding just the most relevant approaches for a given dataset is not obvious. Finding a good similarity function depends strongly on the dataset, and the determinant elements are the nature of the data and the desired number of clusters. Another element is the type of input and tools that the algorithm requires, because some algorithms handle only numeric inputs and others only categorical ones. For simplicity, the tests were performed on datasets which have numeric attribute values. The chosen datasets are:

- Iris dataset: This dataset contains information about flowers. There are 3 different classes of flowers. It has 150 records with 5 attributes; after applying the normalization process, only 100 records are selected.
- Vote dataset: This dataset is about polls in the USA. There are 2 different classes of polls, i.e. Republicans and Democrats. There are 300 records and 17 attributes in total. We normalize this dataset and select the same number of records for clustering.
- Breast dataset: This is a medical dataset containing information about breast disease. There are 2 different classes of breast disease. It has 700 records and 11 attributes in total; after applying the normalization process, only 233 records are selected for clustering.
- Diabetes dataset: This is also a medical dataset, containing information about diabetes. We have divided this dataset into 5 different classes. It has 513 records and 8 attributes in total. Similarly, we normalize this dataset and select only 256 records for clustering. [18]

We created vertical partitions of each dataset by selecting the attributes which are suitable for this clustering algorithm. The clusters are created on the basis of the classes in these four datasets, i.e. 3 clusters for the dataset 'iris', 2 for 'vote', 2 for 'breast' and 5 for 'diabetes'; this is also the value of 'k' in the algorithm. Another required input is the number of iterations 'n'; for these datasets we used n = 100. There is no predefined rule for how many clusters should be created or what the number of iterations should be; it always depends upon the user. If the result does not meet the requirements, the values of 'k' (number of clusters) and 'n' (number of iterations) can be changed each time until better results are found. This is a weakness of the algorithm.

Once the clusters are discovered, they have to be interpreted in order to have some value; this is another major issue in this algorithm. There are different ways to utilize clustering results: cluster membership can be used as a label for a separate classification problem, descriptive data mining techniques like ID3 [19] (decision tree) can be used to find descriptions of the clusters, and clusters can be visualized using 2D or 3D scattered graphs. We have used 2D scattered graphs and ID3 for visualizing as well as interpreting the results of the K-means clustering algorithm.

### 3- Results and Discussion

The K-means algorithm with the proposed initial centroids formula was tested on both peer-to-peer and client-server networks. There is no difference between the results obtained in the two modes.

We demonstrate the results obtained from the dataset 'iris' in tables 2 and 3.

We have noticed that in all four datasets the results obtained from the proposed formula are satisfactory and consistent. As further tests demonstrated, the formula generalizes, i.e. it is independent of the size and nature of the data.

For the visualization of these obtained clusters we will first apply scattered graph technique only on dataset 'iris', by taking 'petal_length', 'petal_width' and 'class' attributes of this dataset. Figure 4 shows a scattered graph of all data points in 'iris' dataset before applying the K-means algorithm.

We can see there is a high density of data situated between 0 and 100. We now apply the K-means algorithm using the formula for initial centroids and create 3 clusters for this dataset (on examining the 'iris' dataset there are only 3 classes), with 100 iterations. The scattered graphs of each cluster, shown in figures 5, 6 and 7, are obtained by using the City Block (Manhattan) distance formula.

The distance is constant between 'petal_length' and 'petal_width' for all data points.

*Figure 6. The k-means algorithm applied to iris data set by initial centroids formula, generating cluster 2.*

As compared to the first cluster in figure 5, the distance between 'petal_length' and 'petal_width' is variable. This cluster has a structure close to figure 4 before applying K-means clustering algorithm.

*Figure 7. The k-means algorithm applied to iris data set by initial centroids formula, generating cluster 3.*

The distance between 'petal_length' and 'petal_width' is larger as compared to the other two clusters in figures 5 and 6. It is more or less constant up to 'petal_length' 20 and then it varies. There is a linear combination among these three clusters in figure 5, 6 and 7 to form the original dataset 'iris' before clustering shown in figure 4.

The scattered graphs of each cluster, shown in figures 8, 9 and 10, are obtained by using the Euclidean distance formula.

There is almost a constant distance between 'petal_length' and 'petal_width' up to 20 and then it varies.

As compared to the first cluster in figure 8, the distance between 'petal_length' and 'petal_width' is variable. There is a constant trend between the distances after 'petal_length' 40. This cluster has a structure more or less close to figure 4 before applying K-means clustering algorithm.

*Figure 10. The k-means algorithm applied to iris data set by initial centroids formula, generating cluster 3.*

There is a variable distance between 'petal_length' and 'petal_width' in this cluster.

We applied ID3 to each cluster in order to interpret the results obtained from the clusters. The decision rules for cluster 1 of the 'iris' dataset using the Euclidean distance are shown in figure 11.

if petal_width = 0 then Class = Irissetosa

else if petal_width = 2 then Class = Irissetosa, Irisversicolor

else if petal_length = 3 then Class = Irisversicolor

else if petal_length = 1 then Class = Irisvirginica, Irisversicolor

else Class = Irisvirginica

*Figure 11 - Decision rules for cluster 1 of dataset 'iris' (Euclidean distance)*

There are four decision rules for cluster 1 of dataset 'iris' after applying the Euclidean distance formula.
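The rules of figure 11 translate directly into code; the sketch below is a literal transcription, with the class name 'Iriscol' in the source assumed to be a truncation of 'Irisversicolor':

```python
def classify_cluster1(petal_length, petal_width):
    """Literal transcription of the figure 11 rules (cluster 1, Euclidean
    distance). 'Iriscol' in the source text appears truncated and is
    assumed here to mean Irisversicolor."""
    if petal_width == 0:
        return ["Irissetosa"]
    elif petal_width == 2:
        return ["Irissetosa", "Irisversicolor"]
    elif petal_length == 3:
        return ["Irisversicolor"]
    elif petal_length == 1:
        return ["Irisvirginica", "Irisversicolor"]
    else:
        return ["Irisvirginica"]
```

The attribute values here are the discretized ones produced by ID3, not raw centimeter measurements; a rule can return more than one class when the discretized attributes do not separate the classes fully.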

The decision rules for cluster 1 of the 'iris' dataset using the City Block distance are shown in figure 12.

if petal_length = 1 and petal_width = 0 then Class = Irissetosa

There is only one decision rule for cluster 1 of dataset 'iris' after applying the City Block distance formula, because the distance between 'petal_length' and 'petal_width' is constant, as shown in figure 5.

Similarly, the decision rules can be found for the other clusters of this dataset; the number of rules will vary from one cluster to another depending on the values of the data points in that cluster. As the distribution of data points within a cluster depends upon the choice of initial centroids, it is of paramount importance to apply the most appropriate values of initial centroids.

### 4- Conclusion

In this paper, we discussed the K-means clustering algorithm using the centroid method with a new formula for the generation of initial centroids. Instead of selecting centroids randomly, or by applying the other methods discussed in section 1, the most appropriate choice is to use this formula for the initial centroids. We also demonstrated this formula on 4 different datasets with both distance formulae (Euclidean and Manhattan). The results obtained from the K-means clustering algorithm are interpreted by using ID3 (decision rules) and 2D scattered graphs. We adopted a multiagent approach: one agent is responsible for the clustering of the physical and original datasets, and the second one for the interpretation of the information provided by the clusters. Our methodology is a prototype tested on peer-to-peer and local distributed networks. The study could be extended to large scale distributed databases so as to validate the effectiveness of the proposed methodology. For further investigation in this direction, one will undoubtedly have to take into account parameters such as data caching and the validity of the agent framework.

### References

[1] Cawsey, Alison., *The Essence of Artificial Intelligence*. ISBN: 0135717795. 1997.

[2] Rich, E., and Knight, K., Artificial Intelligence. New York: McGraw-Hill. 1991.

[3] Hermans, Bjorn., *Intelligent Software Agents on the Internet: an inventory of currently offered functionality in the information society & a prediction of (near-) future developments*. The Netherlands, 9th July, 1996.

[4] Gray, R., Cybenko, G., Kotz, D., and Rus, D., *Mobile Agents: Motivations and State of the Art*, Handbook of Agent Technology, 2002. URL: http://citeseer.ist.psu.edu/context/1209666/0.

[5] Donovan, A. M., Knowledge Discovery in Databases and Information Retrieval in Knowledge Management Systems. *Knowledge Management System, LIS385T*, The University of Austin, April 22, 2003.

[6] Mohammadian, Masoud (ed.), *Intelligent Agents for Data Mining and Information Retrieval*. ISBN: 1591401941, Idea Group Publishing © 2004.

[7] Shoham, Y., Agent-oriented programming: an overview of the framework and summary of recent research. In Knowledge representation and reasoning under uncertainty: Logic at work, ed. Masuch, M., and Polos, L. Berlin: Springer-Verlag. 1994.

[8] Mannila, Heikki., Data Mining: Machine Learning, Statistics and Databases, Proceedings of the 8th IEEE, SSDBM 96.

[9] Ullman, J. D. *Data Mining: A knowledge discovery in databases*. URL: http://www-db.stanford.edu/~ullman/mining.

[10] Bigus, J. P., *Constructing intelligent agents using JAVA - 2nd ed*. ISBN: 0-471-39601-X. 2001.

[11] Bigus, J.P., Agent Building and Learning Environment. In *Proceedings of The International Conference on Autonomous Agent 2000*, Association for computing Machinary, 108-109. Barcelona, Spain.

[12] Davidson, Ian., Understanding K-Means Non-hierarchical Clustering, SUNY Albany - Technical Report 02-2.2002.

[13] Milenova, B. L., Campos, M. M., Clustering Large Databases with Numeric and Nominal Values Using Orthogonal Projections. URL http://www.oracle.com/technology/products/bi/odm/pdf.

[14] Kantardzic, Mehmed, *Data Mining: Concepts, Models, Methods, and Algorithms*. ISBN: 0471228524, John Wiley & Sons © 2003 (343 pages).

[15] Fung, Glenn., *A Comprehensive Overview of Basic Clustering Algorithms*. June 22, 2001.

[16] Bezek, A., and Gams, Matjaz. An Agent Version of a Cluster Server, *IEEE/ACM, CCGRID 2003*.

[17] Kaariboga., *Mobile Agent Framework*. An Experimental framework URL http://www.projectory.de/kaariboga. 2004.

[18] US Census Bureau. Iris, Diabetes, Vote and Breast datasets. Publicly available at: www.sgi.com/tech/mlc/db.

[19] Quinlan, J.R. (1993). *C4.5: Programs for Machine Learning*. San Francisco: Morgan Kaufmann. ISBN: 1-55860-238-0.