# Knowledge Discovery Using Particle Swarm Optimization Accounting Essay

**Published:** **Last Edited:**

This essay has been submitted by a student. This is not an example of the work written by our professional essay writers.

In recent years there has been an enormous increase in the amount of information stored in electronic format. It has been estimated that the amount of collected information in the world doubles every 20 months and the size and number of databases are increasing even faster and the ability to rapidly collect data has outpaced the ability to analyze it. We are becoming data rich but information poor. Information is crucial for decision making, especially in business operations. As a response to those trends, the term 'Data Mining' (or 'Knowledge Discovery') has been coined to describe a variety of techniques to identify nuggets of information or decision-making knowledge in bodies of data, and extracting these in such a way that they can be put to use in the areas such as decision support, prediction, forecasting and estimation.

Automated tools must be developed to help extract meaningful information from a flood of information. Moreover, these tools must be sophisticated enough to search for correlations among the data unspecified by the user, as the potential for unforeseen relationships to exist among the data is very high. A successful tool set to accomplish these goals will locate useful nuggets of information in the otherwise chaotic data space, and present them to the user in a contextual format.

An urgent need for creating a new generation of techniques is needed for automating data mining and knowledge discovery in databases (KDD). KDD is a broad area that integrates methods from several fields including statistics, databases, AI, machine learning, pattern recognition, machine discovery, uncertainty modeling, data visualization, high performance computing, optimization, management information systems (MIS), and knowledge-based systems.

The term "Knowledge discovery in databases" is defined as the process of identifying useful and novel structure (model) in data [95-99]. It could be viewed as a multi-stage process. Those stages are summarized as follows:

Data gathering, e.g., databases, data warehouses, Web crawling.

Data cleansing; eliminate errors, e.g., GPA = 7.3.

Feature extraction; obtaining only the interesting attributes of data

Data mining; discovering and extracting meaningful patterns.

Visualization of data.

Verification and evaluation of results; drawing conclusions.

Data mining is considered as the main step in the knowledge discovery process that is concerned with the algorithms used to extract potentially valuable patterns, associations, trends, sequences and dependencies in data [95-102]. Key business examples include web site access analysis for improvements in e-commerce advertising, fraud detection, screening and investigation, retail site or product analysis, and customer segmentation.

Data mining techniques can discover information that many traditional business analysis and statistical techniques fail to deliver. Additionally, the application of data mining techniques further exploits the value of data warehouse by converting expensive volumes of data into valuable assets for future tactical and strategic business development. Management information systems should provide advanced capabilities that give the user the power to ask more sophisticated and pertinent questions. It empowers the right people by providing the specific information they need.

Data mining techniques could be categorized either by tasks or by methods. Based on tasks they are Association Rule Discovery, Classification, Clustering, Sequential Pattern Discovery, Regression, Deviation Detection etc. Most researchers refer to the first three data mining tasks as the main data mining tasks. The rest are either related more to some other fields, such as regression, or considered as a sub-field in one of the main tasks, such as sequential pattern discovery and deviation detection.

Based on methods we have followings:

Algorithms for mining spatial, textual, image and other complex data

Incremental discovery methods and re-use of discovered knowledge

Integration of discovery methods

Data structures and query evaluation methods for data mining

Parallel and distributed data mining techniques

Issues and challenges for dealing with massive or small data sets

Fundamental issues from statistics, databases, optimization, and information processing in general as they relate to problems of extracting patterns and models from data.

In our work we have limited ourselves to mainly clustering task from first category and development of algorithms for mining complex data and image data from second category.

In this chapter we have presented unsupervised pattern classification (clustering) studies for numerical and image datasets. First part deals with the application of PSO for clustering. In this part classical K-means algorithm is discussed and also it is simulated for many sample datasets. Next we have used DE and a new adaptable DE (ADE) for clustering. Later, suitable hybridization of PSO with ADE also modeled and experimented for clustering problems.

## Approaches To Clustering

Clustering algorithms are divided into two broad classes:

Centroid approaches. The centroid or central point of each cluster is estimated, and points are assigned to the cluster of their nearest centroid.

Hierarchical approaches. Starting with each point as a cluster by itself, nearby clusters are repeatedly merged,

The clustering algorithms are classified according to:

whether or not they assume a Euclidean distance, and

whether they use a centroid or hierarchical approach.

The following three algorithms are examples of the above classification.

BFR: Centroid based; assumes Euclidean measure, with clusters formed by a Gaussian process in each dimension around the centroid.

GRGPF: Centroid-based, but uses only a distance measure, not a Euclidean space.

CURE: Hierarchical and Euclidean, this algorithm deals with odd-shaped clusters.

In the following subsections, we give some of the popular clustering algorithms.

## Centroid Clustering

## The k-Means Algorithm

The k-Means algorithm is a popular main-memory algorithm. k cluster centroids are picked and points are assigned to the clusters by picking the closest centroid to the point in question. As points are assigned to clusters, the centroid of the cluster may migrate.

Example 1: In the following figure, we have five points in the 2-dimensional space. For k=2, points 1 and 2 are assigned to the two clusters, and become their centroids for the moment.

5

1

2

3

4

b

a

c

For point 3, suppose it is closer to 2, so 3 joins the cluster of 2, whose centroid moves to the point indicated as a. Suppose that point 4 is closer to 1 than to a, so 4 joins 1 in its cluster, whose center moves to b. Finally, 5 is closer to a than b, so it joins the cluster { 2, 3}, whose centroid moves to c.

## The BFR Algorithm

Based on k-means, this algorithm [103] reads its data once, consuming a main-memory-full at a time. The algorithm works best if the clusters are normally distributed around a central point, perhaps with a different standard deviation in each dimension. Figure below suggests what the data belonging to a typical cluster in two-dimensions might look like. A centroid, marked by +, has points scattered around, with the standard deviation ³ in the horizontal dimension being twice what it is in the vertical dimension. About 70% of the points will lie within the 1³ ellipse; 95% will lie within 2³, 99.9% within 3³, and 99.9999% within 4³.

## +

1 ³

2 ³

A cluster consists of:

A central core, the Discard set (DS). This set of points is considered certain to belong to the cluster. All the points in this set are replaced by some simple statistics, described below. Although called "discarded" points, these points in truth have a significant effect throughout the running of the algorithm, since they determine collectively where the centroid is and what the standard deviation of the cluster is in each dimension.

Surrounding sub-clusters, the Compression set (CS). Each sub-cluster in the CS consists of a group of points that are sufficiently close to each other that they can be replaced by their statistics, just like the DS for a cluster is. However, they are sufficiently far away from any cluster's centroid, that we are not yet sure which cluster they belong to.

Individual points that are not part of a cluster or sub-cluster, the Retained set (RS). These points can neither be assigned to any cluster nor can they be grouped into a sub-cluster of the CS. They are stored in main memory, as individual points, along with the statistics of the DS and CS.

The statistics used to represent each cluster of the DS and each sub-cluster of the CS are:

The count of the number of points, N.

The vector of sums of the coordinates of the points in each dimension. The vector is called SUM, and the component in the ith dimension is SUMi.

The vector of sums of squares of the coordinates of the points in each dimension, called SUMSQ. The component in dimension i is SUMSQi .

For k dimensions, 2k +1 are needed to compute important statistics of a cluster or sub-cluster. The mean and variance in each dimension are

The coordinate i of the centroid of the cluster in dimension i is SUMi / N.

The variance in dimension i is

## Fastmap

Fastmap [5] picks k pairs of points (ai, bi), each of which pairs serves as the "ends" of one of the k axes of the k-dimension space. Using the law of cosines, we can calculate the "projection" x of any point c onto the line ab, using only the distances between points, not any assumed coordinates of these points in a plane. The diagram is

And the formula is

Having picked a pair of points (a, b) as an axis, part of the distance between any two points c and d is accounted for by the projections of c and d onto line ab, and the remainder of the distance is in other dimensions. If the projections of c and d are x and y, respectively, then in the future (as we select other axes), the distance Dcurrent(c, d) should be related to the given distance function D by

The Fastmap algorithm computes for each point c, k projections; c1, c2,â€¦, ck onto the k axes, which are determined by pairs of points (a1, b1), (a2, b2), â€¦, (ak, bk). For i = 1, 2, â€¦, k, do the following:

Using the current distance Dcurrent, pick ai and bi, as follows:

Pick a random point c.

Pick ai to be the point as far as possible from c, using distance Dcurrent.

Pick bi to be the point as far as possible from ai.

For each point x, compute xi , using the law-of-cosines formula described above.

Change the definition of Dcurrent to subtract the distance in the ith dimension as well as previous dimensions. That is

## Hierarchical Clustering

Hierarchical clustering [5][103] is a general technique that could take, in the worst case, O(n2) time to cluster n points. The General outlines of this approach are as follows.

Start with each point in a cluster by itself.

Repeatedly select two clusters to merge. In general, we want to pick the two clusters that are closest, but there are various ways we could measure "closeness." Some possibilities:

Distance between their centroids (or if the space is not Euclidean, between their clustroids).

Minimum distance between nodes in the clusters.

Maximum distance between nodes in the clusters.

Average distance between nodes of the clusters.

End the merger process when we have \few enough" clusters. Possibilities:

Use a k-means approach -- merge until only k clusters remain.

Stop merging clusters when the only clusters that can result from merging fail to meet some criterion of compactness, e.g., the average distance of nodes to their clustroid or centroid is too high.

## The GRGPF Algorithm

This algorithm [5][103] assumes there is a distance measure D, but no Euclidean space. It also assumes that there is too much data to fit in main memory. The data structure it uses to store clusters is like an R-tree. Nodes of the tree are disk blocks, and we store different things at leaf and interior nodes:

In leaf blocks, we store cluster features that summarize a cluster in a manner similar to BFR. However, since there is no Euclidean space, the features are somewhat different, as follows:

The number of points in the cluster, N.

The clustroid: that point in the cluster that minimizes the rowsum, i.e., the sum of the squares of the distances to the other points of the cluster.

If C is a cluster, C' will denote its clustroid.

Thus, the rowsum of the clustroid is

Notice that the rowsum of the clustroid is analogous to the statistic SUMSQ that was used in BFR. However, SUMSQ is relative to the origin of the Euclidean space, while GRGPF assumes no such space. The rowsum can be used to compute a statistic, the radius of the cluster that is analogous to the standard deviation of a cluster in BFR. The formula is radius =

The p points in the cluster that are closest to the clustroid and their rowsums, for some chosen constant p.

The p points in the cluster that are farthest from the clustroid.

In interior nodes, we keep samples of the clustroids of the clusters represented by the descendants of this tree node. An effort is made to keep the clusters in each subtree close. As in an R-tree, the interior nodes thus inform about the approximate region in which clusters at their descendants are found. When we need to insert a point into some cluster, we start at the root and proceed down the tree, choosing only those paths along which a reasonably close cluster might be found, judging from the samples at each interior node.

## CURE

The outlines of the CURE algorithm [103] are:

Start with a main memory full of random points. Cluster these points using the hierarchical approach.

For each cluster, choose c "sample" points for some constant c. These points are picked to be as dispersed as possible, then moved slightly closer to the mean, as follows:

Pick the first sample point to be the point of the cluster farthest from the centroid.

Repeatedly pick additional sample points by choosing that point of the cluster whose minimum distance to an already chosen sample point is as great as possible.

When c sample points are chosen, move all the samples toward the centroid by some fractional distance, e.g., 20% of the way toward the centroid. As a result, the sample points need not be real points of the cluster, but that fact is unimportant. The net effect is that the samples are "typical" points, well dispersed around the cluster, no matter what the cluster's shape is.

Assign all points, including those involved in steps (1) and (2) to the nearest cluster, where "nearest" means shortest distance to some sample point.

In our thesis we have emphasized only on centroid based approach. The K-means is the basis of our entire work. Although K-means is a very popular approach for clustering it has few demerits which make it not applicable to wide range of applications. Some of the major disadvantages of K-means are:

Difficulty in comparing quality of the clusters produced (e.g. for different initial partitions or values of K affect outcome).

Fixed number of clusters can make it difficult to predict what K should be.

Does not work well with non-globular clusters.

Different initial partitions can result in different final clusters.

There are many approaches in data mining literature to overcome some of above problems. In recent past many researchers have shown keen interest to use evolutionary computation techniques to address some of these difficulties. In our work we have contributed few such approaches using PSO and DE to overcome some disadvantages of K-means. Our main focus has been to produce clusters that do not get trapped in local optima and also do not depend on the initial random centroids. In next two sections of this chapter we highlight the PSO and DE based clustering for several datasets including image dataset.

## Clustering Using PSO

Research efforts have made it possible to view data clustering as an optimization problem. This view offers us a chance to apply PSO algorithm for evolving a set of candidate cluster centroids and thus determining a near optimal partitioning of the dataset at hand. An important advantage of the PSO is its ability to cope with local optima by maintaining, recombining and comparing several candidate solutions simultaneously. In contrast, local search heuristics, such as the simulated annealing algorithm only refine a single candidate solution and are notoriously weak in coping with local optima. Deterministic local search, which is used in algorithms like the K-means, always converges to the nearest local optimum from the starting position of the search.

PSO-based clustering algorithm was first introduced by Omran et al. in [40]. The results of Omran et al. [40] showed that PSO based method outperformed K-means, FCM and a few other state-of-the-art clustering algorithms. In their method, Omran et al. used a quantization error based fitness measure for judging the performance of a clustering algorithm. The quantization error is defined as:

(4.1)

where Ci is the i-th cluster center and ni is the number of data points belonging to the i-th cluster. Each particle in the PSO algorithm represents a possible set of K cluster centroids as:

## =,,â€¦.,

where refers to the p-th cluster centroid vector of the i-th particle. The quality of each particle is measured by the following fitness function:

(4.2)

In the above expression, Rmax is the maximum feature value in the dataset and Mi is the matrix representing the assignment of the patterns to the clusters of the i-th particle. Each element mi, k, p indicates whether the pattern belongs to cluster Ck of i-th particle. The user-defined constants w1, w2, and w3 are used to weigh the contributions from different sub-objectives. In addition,

(4.3)

and, (4.4)

is the minimum Euclidean distance between any pair of clusters. In the above, ni,k is the number of patterns that belong to cluster Ci,k of particle i. he fitness function is a multi-objective optimization problem, which minimizes the intra-cluster distance, maximizes inter-cluster separation, and reduces the quantization error. The PSO clustering algorithm is summarized below as presented in.

Step 1: Initialize each particle with K random cluster centers.

Step 2: repeat for iteration_count = 1 to maximum_iteration

repeat for each particle i

repeat for each pattern in the dataset

calculate Euclidean distance of with all cluster centroids

assign to the cluster that have nearest centroid to

calculate the fitness function

find the personal best and global best position of each particle.

Update the cluster centroids according to velocity updating and coordinate updating formula of PSO.

Van der Merwe and Engelbrecht hybridized this approach with the k-means algorithm for clustering general dataets. A single particle of the swarm is initialized with the result of the k-means algorithm. The rest of the swarm is initialized randomly. In 2003, Xiao et al used a new approach based on the synergism of the PSO and the Self Organizing Maps (SOM) for clustering gene expression data. They got promising results by applying the hybrid SOM-PSO algorithm over the gene expression data of Yeast and Rat Hepatocytes. Paterlini and Krink have compared the performance of K-means, GA, PSO and Differential Evolution (DE) or a representative point evaluation approach to partitional clustering. The results show that PSO and DE outperformed the K-means algorithm. In [104] authors proposed a PSO based hybrid algorithm for classifying the text documents. They applied the PSO, K-means and a hybrid PSO clustering algorithm on four different text document datasets. The results illustrate that the hybrid PSO algorithm can generate more compact clustering results over a short span of time than the K-means algorithm.

In our work we have simulated few datasets to reinforce the above facts. The comparisons are made with classical K-means algorithm. We have chosen basic gbest PSO and some lbest PSO models for simulation. In lbest PSO we have chosen lbest_ring and Von_neuuman lbest model. We have also proposed a hybrid PSO model with K-means. In this approach first the K-means is run and the resultant best centroid is taken as one of the particle in the swarm of particles for PSO to continue its simulation.

## Dataset Description

For this work we have investigated five datasets, out of which four are real world datasets collected from UCI machine repository (www.ics.uci.edu) and fifth one is the artificial dataset created using a function.

Iris plants database: This is a well-understood database with 4 inputs, 3 classes and 150 data vectors.

Wine: This is a classification problem with "well behaved" class structures. There are 13 inputs, 3 classes and 178 data vectors.

Hayes Roth which has 132 data vectors with 3 classes and 5 inputs.

Diabetes data set has 768 data vectors having 2 classes and 8 inputs.

Artificial: This problem follows the following classification rule;

if

Otherwise

A total of 400 data vectors are randomly created between (-1,1).

## Results Of Simulations

This section compares the results of the K-means and PSO algorithms on five clustering problems. The main purpose is to compare the quality of the respective clustering, where quality is measured according to the

Following two criteria:

the quantization error as defined in equation (4.1);

the inter-cluster distances, i.e. the distance between the centroids of the clusters, where the objective is to maximize the distance between clusters.

For all the results reported, averages over 20 simulations are given. The PSO algorithms used 20 particles. The Hybrid PSO takes the seed from result of K-means clustering. This seed is considered as one particle in swarm of particles in PSO. For PSO, is varying as per the paper [31]. The initial weight is fixed at 0.9 and the final weight at 0.4. The acceleration coefficients and are fixed at 1.042 to ensure good convergence.

Table 4.1 summarizes the results obtained from the five clustering algorithms for the problems cited above. The values reported are averages over 20 simulations, with standard deviations to indicate the range of values to which the algorithms converge. First, consider the fitness of solutions, i.e. the quantization error, for all data sets PSO based clustering is better than K-means. However, lbest_VonNuumann provides better fitness values in terms of quantization error and inter_cluster distance for all data sets except for Wine. For Wine and Hayes Roth, Hybrid PSO gives good result. The lbest_VonNeuumann gives worst quantization error but comparatively good inter_cluster distance measure for these data sets. The standard deviations (std) found to be very close, thereby indicating the convergence of algorithms to better results.

Data Sets

Algorithm

Quantization error, std

Inter - cluster distance, std

Iris

K means

0.834, 0.21075

9.3238, 1.6821

PSO_gbest

0.026209, 0.017964

16.41, 2.4693

lbest_ring

0.026615, 0.014664

## 21.079, 3.8171

lbest_vonNneumann

## 0.011477, 0.019458

20.278, 02204

Hybrid PSO

0.69743, 0.019081

18.598, .65266

Hayes Roth

K means

11.961, 1.573

8.9353, 1.2419

PSO_gbest

0.77086, 0.0408

324.25, 5.7895

lbest_ring

3.99,3.3429

313.1, 3.4562

lbest_vonNeumann

3.8265, 0.98856

## 350.73, 23.272

Hybrid PSO

## 0.55914, 1.6488

325.51, 66.738

Wine

K means

116.29, 0.83715

2019.2,0.234

PSO_gbest

10.765, 3.7278

3272.8, 292.89

lbest_ring

33.622, 7.6328

2859.7, 339.91

lbest_vonNeumann

11.709, 1.6749

3450.8, 222.42

Hybrid PSO

## 4.9763, 3.3043

## 3598.7, 483.11

Diabetes

K means

78.984, 7.6654

20.92, 3.332

PSO_gbest

69.222, 2.4839

30.12,2.719

lbest_ring

36.98, 2.397

35.108, 2.475

lbest_vonNeumann

## 30.205, 2.6501

## 38.1074,2.4714

Hybrid PSO

48.545,3.097

32.958,3.471

Artificial

K means

0.64152, 0.011867

1.3772, .02580

PSO_gbest

0.54338, 0.0057227

1.2678, .72643

lbest_ring

0.56021, 0.004647

1.482, 0.13267

lbest_vonNeumann

## 0.5317,0.00121

## 1.662,0.11

Hybrid PSO

0.55086,0.00056684

0.9086, .16526

Table: 4.1: Comparison of clustering Results

The results discussed above presents a strong point in favor of PSO based clustering. This is the reason many researchers prefer to use PSO based clustering. However, to our knowledge there is no such PSO based clustering tool available for researchers and academicians to use and investigate. In our work we have developed a very comprehensive tool box for clustering using MATLAB for use by beginners in the research areas of clustering. Next sub-section presents a brief of the tool box developed by us.

## PSO Clustering Tool Box

Both in the academia and in the industry, the scope of research in the area PSO is expanding rapidly. It is therefore worthwhile to consider giving good quality learning to the beginners. This work is done using MATLAB (Matrix Laboratory). MATLAB is a computational environment for modeling, analyzing and simulating a wide variety of intelligent systems. It also provides a very good access to the students by providing a numerous design and analysis tools in Fuzzy Systems, Neural Networks and Optimization tool boxes.

In this work, a software tutorial for PSO has been developed to aid the teaching of PSO concepts and its applications to data clustering. The software developed offers facilities to simulate classical K-means clustering algorithm, PSO clustering and overall performance can be improved by suitable hybridization of K-Means and PSO. Our software provides the scope of experimenting with various possibilities of hybridization and the learner can choose different tuning parameters for PSO along with suitable particle sizes and iterations to obtain better clustering performances. The software is GUI based and supported by various plots and graphs for reinforcing the derived results. The ease of operation of the software is the hall mark of it.

In this work we tried hybridization in two ways. The first one is K-Means + PSO technique, where in the K-Means clustering algorithm is executed once ; the result of this algorithm is used to seed the initial swarm, while the rest of the swarm is initialized randomly. PSO algorithm is then executed as discussed in this thesis before.

The second one is PSO + K-Means technique. In this, first PSO algorithm is executed once; the resultant gbest is used as one of the centroids for K-Means, while the rest of the centroids are initialized randomly. K-Means algorithm is then executed. Our software offers the facilities of exploring these possibilities with various options of choosing parameters and number of iterations to investigate the ideas.

As the name says, this software tutorial is for learning, how PSO can be applied in the area of clustering. The main idea is to involve the user for setting the swarm size and other parameters of PSO clustering algorithm. This particular tutorial has been built using MATLAB, version 7.0 of MathWorks. For this application, three data sets have been taken namely Iris, wine and breast cancer( collected from UCI machine repository). As the data sets considered for this application are pre classified, the number of clusters taken is same as that of their number of classes.

The interface for the K-Means clustering algorithm is given below:

Fig.1: Results of K-means clustering on three datasets

Here, for this application, the result of clustering is shown in terms of intra class and inter class similarities and also quantization error, which are tabulated in table 1. A confusion matrix is also given where; an accuracy test can be made between the expected clusters and actual clusters. The time taken by the algorithm to cluster the data is also given. The screen shot is given below in fig2 for iris, wine and breast cancer datasets respectively.

## Measures/ datasets

## Iris

## Wine

## Cancer

## Intra cluster distance

1.94212

293.617

671.53

## Inter cluster distance

10.167

1474.22

1331.33

## Quantization error

0.647374

97.0723

335.765

## Time ( in sec)

24.305

1.7562

6.75965

Table1: Results of K-Means Clustering

Fig2: Results of PSO based Clustering

Here, in fig 2, scope is given for the user, to specify all the PSO parameters like swarm size, inertia of weight, and acceleration coefficients. The results of clustering are shown in the same way as in K-Means clustering and it is tabulated in table 2. As a sample results are computed taking swarm size as 3 , inertia weight as 0.72 and c1 and c2 both 1. However, the user can play with this software giving any values to see how the PSO clustering algorithm performs.

## Swarm size = 3, Inertia of weight = 0.72, c1 = 1 and c2 = 1

## Measures/datasets

## Iris

## Wine

## Cancer

## Intra cluster distance

0.648096

145.849

222.833

## Inter cluster distance

3.37355

749.14

432.382

## Quantization error

0.216032

48.6163

111.416

## Time in sec

19.9997

12.9741

76.1937

Table2: Results of gbest PSO clustering

The interface for hybridized clustering is given below:

Fig.3: Results of K-Means+PSO

Fig.4: Results of PSO+K-Means

## Swarm size = 3, Inertia of weight = 0.72, c1 = 1 and c2 = 1

## Measures/datasets

## Iris

## Wine

## Cancer

## Intra cluster distance

0.986767

148.68

334.202

## Inter cluster distance

4.95916

811.311

640.836

## Quantization error

0.328922

49.5601

167.101

## Time in sec

12.6541

14.3183

43.847

Table3: Results of K-Means + PSO Clustering algorithm

## Swarm size = 3, Inertia of weight = 0.72, c1 = 1 and c2 = 1

## Measures/datasets

## Iris

## Wine

## Cancer

## Intra cluster distance

0.621062

142.808

220.765

## Inter cluster distance

5.08348

737.112

665.667

## Quantization error

0.223687

47.9361

111.882

## Time in sec

8.75372

10.8275

38.1585

Table4: Results of PSO + K-Means Clustering algorithm

This tool gives a very good scope for the students who wants to experiment PSO clustering algorithm, by taking different sets of parameters.

## Comparative Analysis With Experimental Results

This software tutorial gives a comparative study on all the four clustering algorithms, K-Means, PSO, K-Means + PSO and PSO + K-Means. According to the experimental results obtained, it is proven that the accuracy rate of PSO + K-Means (Hybrid algorithm) is high when compared to that of other three algorithms. The comparative results are tabulated in table5. The graph is given below in fig 5

Table5: Comparative results of the 4 clustering algorithms

## Results of K-Means Clustering

## Measures/ datasets

## Iris

## Wine

## Cancer

## Intra cluster distance

1.94212

293.617

671.53

## Inter cluster distance

10.167

1474.22

1331.33

## Quantization error

0.647374

97.0723

335.765

## Time in sec

24.305

1.7562

6.75965

## Results of gbest PSO clustering

## Swarm size = 3, Inertia of weight = 0.72, c1 = 1 and c2 = 1

## Measures/ datasets

## Iris

## Wine

## Cancer

## Intra cluster distance

0.648096

145.849

222.833

## Inter cluster distance

3.37355

749.14

432.382

## Quantization error

0.216032

48.6163

111.416

## Time in sec

19.9997

12.9741

76.1937

## Results of K-Means + PSO Clustering algorithm

## Swarm size = 3, Inertia of weight = 0.72, c1 = 1 and c2 = 1

## Measures/ datasets

## Iris

## Wine

## Cancer

## Intra cluster distance

0.986767

148.68

334.202

## Inter cluster distance

4.95916

811.311

640.836

## Quantization error

0.328922

49.5601

167.101

## Time in sec

12.6541

14.3183

43.847

## Results of PSO + K-Means Clustering algorithm

## Swarm size = 3, Inertia of weight = 0.72, c1 = 1 and c2 = 1

## Measures/ datasets

## Iris

## Wine

## Cancer

## Intra cluster distance

0.621062

142.808

220.765

## Inter cluster distance

5.08348

737.112

665.667

## Quantization error

0.223687

47.9361

111.882

## Time in sec

8.75372

10.8275

38.1585

1

Fig 5: Fitness Curves

Here, the intra and inter cluster distances, quantization error and the time are marked with blue, red, green and black colors respectively. The graph shows the variations of intra and inter cluster distances and quantization error fitness curves among the four clustering algorithms: K-Means, PSO, K-means + PSO and PSO + K-Means respectively. For the graph, the intra and inter cluster distances and quantization error are marked with blue, red and green colors respectively.

## Clustering Using DE

Data clustering is a process of grouping a set of data vectors into a number of clusters or bins such that elements or data vectors within the same cluster are similar to one another and are dissimilar to the elements in other clusters. Clustering algorithms can be grouped into two main classes of algorithms, namely Supervised and Unsupervised. With Supervised clustering, the learning algorithm has an external teacher that indicates the target class to which the data vector should belong. For unsupervised clustering, a teacher does not exist, and data vectors are grouped based on distance from one another. Data clustering can be hierarchical or partitional. In our work we have confined our discussion to partitonal clustering only.

Partitional clustering algorithms, attempt to decompose the dataset into a set of disjoint clusters by optimizing a criteria that minimize the intra cluster distance and maximize inter cluster distance The overall objective is to decompose the given dataset into predefined clusters having good compactness of data objects belonging to the cluster and separable from other clusters. Hence clustering can also be treated as an optimization problem and we can apply the evolutionary techniques to get the optimum solutions. In our work we have implemented standard DE and our proposed adaptive DE for clustering problems.

K-Means algorithm falls under partitional based clustering technique. It was introduced by MacQueen'67. K in K-Means signifies the number of clusters into which data is to be partitioned. This algorithm aims at assigning each pattern of a given dataset to the cluster having the nearest centroid. K-Means algorithm uses similarity measure to determine the closeness of two patterns. Similarity can be measured using Euclidean Distance or Manhattan Distance or Minkowski Distance. In this paper, Euclidean Distance is considered as the similarity measure.

The algorithm for K-Means is as follows:

Initialize randomly cluster centroid vectors

Repeat

Assign each data vector to the cluster with closest centroid vector. The distance from the data vector to the centroid vector is determined using the equation

Recalculate the cluster centroid vectors, using

, where is the number of data vectors in cluster j Until a stopping criterion is reached

The process of K-Means clustering can be stopped when one of the following criteria are satisfied:

When the maximum number of iterations has reached

When there is no change in the newly obtained centroid vectors

For the purpose of this study, the second criterion is used.

The demerit of K-means lies in the fact that its performance is largely dependent on the initial centroid values chosen. At times due to improper initial centroid values the K-means is trapped in local optima by providing not so good clustering results. There have been considerable amount of research work done to alleviate this problem by data mining researchers. Due to space limitation details of those studies are omitted here. However, it may be worth noting here that many of the suggested approaches to overcome the above problem make use of evolutionary algorithms (EA). EAs are particularly very suitable for this kind of problems due to the population based approach wherein instead of a single initialization of centroids, a set of candidate centroids can be initialized at the beginning of simulation. As a result of which, the candidate solutions are evolved to be getting fitter and fitter due to the EA's process, thereby offering optimal solution at the end. In our work we have used DE as the EA tool to attack this problem of K-means. We have done function optimization by DE and also compared with PSO to investigate the superiority of DE. Finally we have implemented DE for clustering and also compared with PSO clustering. In following sections all comparisons are presented. But before that for sake of continuity we present briefs of DE and its mutation in the next section.

## Basics Of DE And Its Mutations Variants

Differential Evolution (DE) is a parallel direct search method developed by Storn and Price in 1997 which is a population-based global optimization algorithm .It uses a real-coded representation [58]. This approach for numerical optimization is simple to implement and requires little or no parameter tuning, but gives a remarkable performance. Like all other evolutionary algorithms, the initial population is chosen randomly.

## Classical DE

Like all other evolutionary algorithms, DE method also consists of three basic steps:

Generation of population with N individuals in the d-dimensional space, randomly distributed over the entire search domain

, where t=0,1,2,â€¦.t,t+1

Replacement of this current population by a better fit new population, and

Repetition of this replacement until satisfactory results are obtained or certain criteria of termination is met.

## Initialization

## Mutation

## Recombination

## Selection

The basic scheme of evolutionary algorithms is given below:

## Mutation

After the random generation of population, in each generation, a Donor vector is created for each .This donor vector can be created in different ways (see section 3.2).

## Recombination

Now a trial offspring vector is created by combining components from the Donor vector and the target vector . This can be done in the following way

if randi,j(0,1)<=cr

otherwise

## Selection

Selection in DE adopts Darwinian principle "Survival Of the Fittest". Here if the trail vector yields a better fitness value, it replaces its target in the next generation; otherwise the target vector is retained in the population. Hence the population either gets better (w.r.t. the fitness function) or remains constant but never deteriorates.

if

if

## DE Mutation Schemes

The five different mutation schemes suggested by Price [58] is as follows:

Scheme 1-DE/rand/1

In this scheme, to create a donor vector for each ith member, three other parameter vectors (say the o1, o2, and o3th vectors) are chosen randomly from the current population. A scalar number F is taken. This number scales the difference of any two of the three vectors and the resultant is added to the third one. For the ith donor vector, this process can be given as

Scheme 2-DE/rand to best/1

This scheme follows the same procedure as that of the Scheme1. But the difference is, now the donor vector is generated by randomly selecting any two members of the population (say the , and vectors) and the best vector of the current generation (say ). For the ith donor vector, at time t=t+1, this can be expressed as

where is a control parameter in DE and ranges between [0, 2].

To reduce the number of parameters, we consider =F.

Scheme 3-DE/best/1

This scheme is identical to Scheme 1 except that the result of the scaled difference is added to the best vector of the current population. This can be expressed as

Scheme 4-DE/best/2

In this scheme, the donor vector is formed by using two difference vectors as shown below

Scheme 5-DE/rand/2

Here totally five different vectors are selected randomly from the population, in order to generate the donor vector. This is shown below

Here F1 and F2 are two weighing factors selected in the range from 0 to 1. To reduce the number of parameters we may choose F1 = F2 = F.

The experiment we conducted in this study uses Scheme 1-DE/rand/1(equation 4).

## Procedure For DE

Randomly initialize the position of the particles

Evaluate the fitness for each particle

For each particle, create Difference-Offspring

Evaluate the fitness of the Difference-Offspring

If an offspring is better than its parent then replace the parent by offspring in the next generation;

Loop to step 2 until the criterion is met, usually a sufficiently good fitness or a maximum number of iterations.

## Function Optimization With DE And Comparison With PSO

The aim of an optimization problem is to determine a best-suited solution under a given set of constraints. Mathematically an optimization problem involves a fitness function, which needs to be maximized or minimized. Most of the traditional optimization techniques employ derivative approach to locate the optima. This approach fails to determine the global optima for multimodal optimization problems. More recently, the optimization problem is being represented as an intelligent search problem, where one or more agents are employed to determine the optima on a search space. This technique of searching is called as evolutionary based searching

All evolutionary based algorithms follow a single approach for solving an optimization problem. The approach is given below:

Initialize the population with particles or vectors

Calculate the fitness function for each particle or vector

Based on the best fitness value, update the position of the particles or vectors

The difference among different evolutionary based algorithms lies in the way they update the particles by changing the parameters.

In our experiment, we have taken two such evolutionary based search techniques PSO (Kennedy and Eberhart 1995) and DE(Storn and Price 1997). These two techniques are first tested on standard four benchmark functions [Table 1] and the comparative analysis is given with plots.[See Table 2 and Fig1]. We have coded all the programs in MATLAB 7.5 software and run with 1.86 Core2 DUO processor, 1GB RAM.

Table 1 Descriptions of Benchmark functions used in the experiment

Table 2

Optimized results of PSO and DE on the afore mentioned four Benchmark functions

Number of particles/vectors taken=10

Number of variables taken = 2

Number of Iterations = 200

Number of runs = 2

Fig1. Diagram of convergence for the two algorithms applied to all four benchmark functions.

rosenbrock1sphere1

a. Sphere b. Rosenbrock

c. Rastrigin d.Griewankgriewank2rastrigin2

## Basic Framework For DE Clustering

Data vectors can be clustered using classical DE as follows:

Initialize each vector to contain K number of randomly selected cluster centers

For I=1 to Imax do

For each vector i do

For each object in the data set Zp

Calculate the Euclidean distance to all cluster centroids Cij using equation 3

Assign Zp to the cluster Cij such that =

Change the population members according to the DE algorithm outlined in(section 3.3). Use the vectors fitness to guide the evolution of the population.

Report cluster centers and the partition obtained by globally best vector at time I=Imax

## Dataset Descriptions And Results

Three well-known real-world datasets from the machine learning repository have been considered for this experiment. They are:

Fisher's iris dataset (n=150,p=4,c=3), which consists of n objects characterized by p features (sepal length, sepal width, petal length, petal width). There are c categories in this data: Iris setosa (50), Iris versicolor (50), Iris virginica (50)

Wisconsin breast cancer dataset (n=683,p=9,c=2), which consists of n objects characterized by p features (clump thickness, cell size uniformity, cell shape uniformity, marginal adhesion, single epithelial cell size, bare nuclei, bland chromatin, normal nucleoi and mitoses). There are c categories in the data: malignant (444 objects) and benign (239 objects)

Wine recognition dataset (n=178,p=13,c=3), which consists of n objects, characterized by p features(Alcohol, Malicacid, Ash, Alcalinity of ash, Magnesium, Total Flavanoids, Nonflavanoid phenols, Proanthocyanins, Color intensity,Hue,OD280/OD315 of diluted wines, Proline) with c categories in the data: class 1(59 objects), class 2 (71 objects), class 3 (48 objects)

For PSO, we have set the inertia of weight, w=0.7 and c1=c2=2 for our experiment

For DE, we have set the cross over rate cr =0.9, the weighting factor is chosen as F=0.8[6].

These algorithms are applied for clustering three real world data sets described above. All these data sets are available at www.ics.uci.edu/~mlearn/MLRepository.html. Fitness comparison of the two algorithms, PSO and DE has been reported in Table3. Fig 2 and Fig 3 shows the three-dimensional plots of observations for PSO and DE algorithms respectively on Iris dataset.

Table 3 Mean Intracluster Distances of PSO and DE on the given three real-world data sets

Number of iterations=50

Number of vectors/particles =5

Fig2. PSO generated Three-Dimensional Clusters of IRIS dataset

psoiris-sl-sw-pl-pwpsoiris-sl-sw-pw

psoiris-sl-pl-pwpsoiris-sw-pl-pw

Fig 3:DE generated Three-Dimensional Clusters of IRIS dataset

deiris-sl-sw-pwdeiris-sl-sw-pl

deiris-sw-pl-pwdeiris-sl-pl-pw

## Conclusion

From the discussion of PSO and DE based clustering it is clearly evident that they are very good candidates for clustering. Results shown above clearly suggest that they are able to overcome the local optima problem of K-means. In this chapter we have discussed various models of PSO for clustering ands also compared with DE based clustering. In the next chapter we will discuss the Adaptive DE (ADE) suggested by us for clustering and will explore the hybridization of ADE and PSO for clustering purposes.