# Clustering Of High Dimensional Dataset Based On Pso Computer Science Essay

**Published:** **Last Edited:**

This essay has been submitted by a student. This is not an example of the work written by our professional essay writers.

The Particle Swarm Optimization clustering algorithm can generate more compact clustering results than the traditional K-means clustering algorithm. However, when clustering on high dimensional datasets, the PSO clustering algorithm is notoriously slow because its computation cost increases exponentially with the size of the dataset dimension. Dimensionality reduction techniques offer solutions that both significantly improve the computation time, and yield reasonably accurate clustering results in high dimensional data analysis. In this paper, we introduce research for developing an algorithm that combines dimensionality reduction techniques of weighted PCs with the PSO for clustering in order to reduce the complexity of high dimensional datasets and speed up the PSO clustering process. We report significant improvements in total runtime. Moreover, the clustering accuracy of the dimensionality reduction PSO clustering algorithm is comparable to the one that uses full dimension space.

The clustering of high dimensional data sets is a process that is needed in many application areas. Because traditional data clustering algorithms tend to be biased towards local optimum when applied to high dimensional data sets, Particle Swarm Optimization (PSO) has been used for solving data clustering problems in recent years [1-7]. Many researchers [1, 5, 6] have indicated that when utilizing the PSO algorithm's optimal ability, and given sufficient time, PSO generates a more compact clustering result from low dimensional data than the traditional K-means clustering algorithm as K-MEAN clustering needs prior knowledge of the data to be classified. PSO has emerged as one of the fast, robust, and efficient global search heuristics of current interest. The PSO with the automatic clustering of large unlabeled data sets requires no prior knowledge of the data to be classified. Rather, it determines the optimal number of partitions of the data "on the run." [7].However, when clustering high dimensional datasets, the PSO clustering algorithm is notoriously slow because the algorithm needs to repeatedly compute high dimension data vector similarities. Researchers have found that dimensionality reduction techniques offer solutions that both improve the computation time, and still yield accurate results in some high dimensional data analysis [8-12]. It is therefore highly desirable to reduce the dimensionality of a high dimension dataset before clustering in order to maintain tractability. Since reducing dataset dimensionality may result in a loss of information, the lower dimension representation generated by a dimensionality reduction algorithm must be a good approximation of the full high dimensional dataset.

One approach is to determine the relative importance of the features for dimension reduction and then select a subset of important features. This can be achieved in various ways, for example, by feature selection or feature extraction. There are a number of feature selection methods that determine their relative importance of the features before selecting a subset of important ones. A typical feature selection method tries to choose a subset of features from the original set that ideally is necessary and sufficient to describe the target concept. Target concepts are denoted by class labels. Data with class labels are called supervised data; data without class labels unsupervised. Although original features are selected based on some evaluation function, these methods require class labels. Hence, features selection methods largely fail for unsupervised data. Feature extraction methods create new features which are uncorrelated and retain as much variation as possible in the database. One of the most widely used methods for dimensional reduction includes principal component analysis (PCA)[13]. Despite its popularity, PCA suffers from a lack of interpretability of the original feature because the reduced dimensions are linear combinations of a large number of original features. Traditionally, two or three dimensional loading plots provide information to identify important original features in the first few principal component dimensions. However, the interpretation of what constitutes a loading plot is frequently subjective, particularly when large numbers of features are involved. In this regards the weighted principal components (PCs) with a thresholding algorithm is good one, where the weighted PC is obtained by the weighted sum of the first k PCs of interest which are found from PCA algorithm [14]. Each of the k loading values in the weighted PC reflects the contribution of each individual feature and thresholding algorithm that identifies the significant features. We use some real life datasets both small as well as high dimensional datasets as our experimental data. In the following, an introduction to Particle Swarm optimization(PSO), the automatic clustering PSO algorithm, the Dimensionality reduction technique Weighted PC's and Moving Range-Based thresholding Algorithm is discussed.

## Dimensionality reduction technique Weighted PC's

PCA is one of the most widely used multivariate data analysis techniques and is employed primarily for dimensional reduction and visualization (Jolliffe, 2002). PCA extracts a lower dimensional feature set that can explain most of the variability within the original data. The extracted features, PCi's (Yi) are each a linear combination of the original features with the loading values (, i, j=1,2,â€¦,p). The Yi's can be represented as follows:

â€¦â€¦â€¦â€¦ (1)

## â€¦â€¦â€¦

## â€¦â€¦â€¦.

The loading values represent the importance of each feature in the formation of a PC. For example, indicates the degree of importance of the jth feature in the ith PC. A two-dimensional loading plot (e.g., PC1 vs PC2 loading plot) may provide a graphical display for identification of important features in the first and second PC domains. However, the interpretation of a two-dimensional loading plot is frequently subjective, particularly in the presence of a large number of features. Moreover, in some situations, consideration of only the first few PCs may be insufficient to account for most of the variability in the data. Determination of the appropriate number of PCs (=k) to retain can be subjective. One can use a scree plot that visualizes the proportion of variability of each PC to determine the appropriate number of PCs .If a PCA loading value for the jth original feature can be computed from the first k PCs, the importance of the jth feature can be represented as follows:

,j=1,2â€¦â€¦â€¦â€¦â€¦,p â€¦â€¦â€¦â€¦â€¦â€¦â€¦â€¦â€¦â€¦â€¦â€¦..(2)

where k is the total number of features of interest and ï€ represents the weight of ith PC. The typical way to determine ï€ is to compute the proportion of total variance explained by the ith PC. ï€ can be called a weighted PC loading for the feature j. A feature with a large value of indicates a significant feature. In the next section, we will present a systematic way to obtain a threshold that determines the significance of each feature.

## Moving Range-Based Thresholding Algorithm

A moving range-based thresholding algorithm as a way to identify the significant features from the weighted PC loadings discussed in the above. The main idea of a moving range-based thresholding algorithm comes from a moving average control chart that has been widely used in quality control (Vermaat et al. 2003). A control chart provides a comprehensive graphical display for monitoring the performance of a process over time so as to keep the process within control limits (Woodall and Montgomery, 2001). A typical control chart comprises monitoring statistics and the control limit. When the monitoring statistics exceed (or fall below) the control limit, an alarm is generated so that proper remedial action can be taken. A moving range control chart is useful when the sample size used for process monitoring is one. Moreover, the average moving range control charts perform reasonably well when the observations deviate moderately from the normal distribution (Vermaat et al. 2003).

In this problem, it can consider the weighted PC loading values as the monitoring statistics. Thus, it plot these loading values on the moving range control chart and identify the significant features when the corresponding weighted PC loading exceeds the control limit (threshold). Given a set of the weighted PC loading values for individual features), ( the threshold ï€ can be calculated as follows (Vermaat et al. 2003):

ï€ ï€ ï€ ï€ ï€ ï€ ï€ ï€ ï€ ï€ ï€ ï€ ï€ ï€ ï€ ï€ ï€ ï€ (3)

Where, = is the inverse standard normal cumulative distribution function, and Î± is the Type I error rate that can be specified by the user. The range of Î± is between 0 and 1. In typical moving range control charts, can be estimated by , calculated by the average of the moving ranges of two successive observations.

â€¦â€¦â€¦â€¦â€¦â€¦â€¦â€¦..(4)

However, in our feature selection problems, because the weighted PC loading values for individual features are not ordered, we cannot simply use (4). To address this issue, we propose a different way of computing that can properly handle a set of unordered observations. Given the fact that there is no specific order of observations, they are randomly reshuffled, and are recalculated.

Therefore, we obtain a set of ,â€¦â€¦..,, where N is the number of features. The for unordered observations is calculated by

â€¦â€¦â€¦â€¦â€¦â€¦â€¦â€¦â€¦ (5)

Finally, the threshold of the feature selection method can be obtained by the following equation:

â€¦â€¦â€¦â€¦â€¦â€¦â€¦.â€¦..(6)

A feature is reported as significant if the corresponding weighted PC loading exceeds the threshold.

## Particle swarm optimization

PSO can be considered as a swarm-based learning scheme. In PSO learning process, each single solution is a bird referred to as a particle. The individual particles fly gradually towards the positions of their own and their neighbors' best previous experiences in a huge searching space. It shows that the PSO gives more opportunity to fly into desired areas to get better solutions. Therefore, PSO can discover reasonable solutions much faster. PSO define a proper fitness function that evaluates the quality of every particle's position. The position, called the global best (gbest), is the one which has the highest value among the entire swarm. The location, called it as personal best (pbest), is the one which has each particle's best experience. Based on every particle's momentum and the influence of both personal best (pbest) and global best (gbest) solutions, every particle adjusts its velocity vector at each iteration. The PSO learning formula is described as follows,

â€¦â€¦â€¦â€¦â€¦â€¦â€¦â€¦â€¦â€¦â€¦â€¦â€¦â€¦..(7)

â€¦â€¦â€¦â€¦â€¦â€¦â€¦â€¦â€¦â€¦â€¦â€¦â€¦â€¦â€¦â€¦â€¦(8)

Where m is the dimensional number, i denote the ith particle in the population, V is the velocity vector, X is the position vector and is the inertia factor, and are the cognitive and social lerning rates respectively. These two rates control the relative influence of the memory of particle and neighborhood.

## Auto-PSO clustering algorithm

Suppose that we are given a data set of N data points {}, where is the ith data point in n-dimensional space. The detailed process of the AUTO-POS clustering algorithm is described below.

The initial population P=[] is the made up of pop_size possible particles(i.e. solutions), and each string is a sequence of real numbers representing K cluster centers of the candidates cluster centers. In an n-dimensional space, the length of a particle is determined as (T X n)+T. the formula of particle is:

i=1,2,â€¦â€¦..,n, p {1,2,â€¦â€¦â€¦â€¦,pop_size} â€¦â€¦â€¦.(9)

Where T is the user-defined positive number, which denotes the maximum cluster number in the cluster set generated by the individual pth particle, and pop_size denotes the population size. it is noted that the selected cluster number will be located between 2 and T. [0,1] indicate the selected threshold value for the associated jth candidate cluster centers, i.e., . In this case, the proposed cluster centers selected rules with its related threshold valure are used to determine the active cluster center in initial populations, which is defined by

IF â‰¥ 0.5 THEN the jth candidate cluster center is ACTIVE

ELSE IF 0.5 THEN the jth candidate cluster center is INACTIVE.

â€¦â€¦â€¦â€¦â€¦..(10)

The fitness of a particle is computed with the CS measure. The CS measure is defined as

CS(K) =

= â€¦â€¦â€¦â€¦â€¦ (11)

, i=1,2,â€¦â€¦,K â€¦â€¦â€¦â€¦â€¦â€¦â€¦â€¦â€¦ (12)

Where is the cluster center of , and is the set whose elements are the data points assigned to the ith cluster, and is the number of elements in , d denotes a distance function. This measure is a function of the ratio of the sum of within-cluster scatter to between-cluster separation.

The objective of the PSO is to minimize the CS measure for achieving proper clustering results. The fitness function for each individual particle is computed by

F = â€¦â€¦â€¦â€¦â€¦â€¦â€¦ (13)

Where is the CS measure computed for the ith particle, and eps is a very small-valued constant.

Generally, the number of cluster centers is known prior and then applied with a traditional clustering algorithm to verify the correctness of the cluster validity measure. Here some candidate solutions are computed and generated by the population based evolutional PSO learning process to meet unconsidered situations when executing the formula of CS measure. In order to avoid such unconsidered condition happening, we first calculate the counting number is smaller than 2. If so , the new cluster center's position of this special particle should be updated by using the concept of average computation.

## PROBLEM STATEMENT:

In real-world applications the size of a database is usually large. The largeness can be due to the excessive number of features (variables), the huge number of instances (records),or both. In this paper we focus on clustering on that high dimensional datasets by reducing the size by decreasing the number of features. One approach is Particle Swarm Optimization (PSO) has been used for solving data clustering problems. When utilizing the PSO algorithm's optimal ability, and given sufficient time, PSO generates a more compact clustering result from low dimensional data than the traditional K-means clustering algorithm. However, when clustering high dimensional datasets, the PSO clustering algorithm is notoriously slow because the algorithm needs to repeatedly compute high dimension data vector similarities. Researchers have found that dimensionality reduction techniques offer solutions that both improve the computation time, and still yield accurate results in some high dimensional data analysis. It is therefore highly desirable to reduce the dimensionality of a high dimension dataset before clustering in order to maintain tractability. Since reducing dataset dimensionality may result in a loss of information, the lower dimension representation generated by a dimensionality reduction algorithm must be a good approximation of the full high dimensional dataset. In this paper, we introduce our research for developing an algorithm that combines dimensionality reduction techniques as data preprocessing for the PSO clustering algorithm in order to reduce the complexity of the high dimensional dataset and speed up the clustering computation time.

Describe the problem, for which you are supposed to investigate the solution

## APPROACH:

## Proposed ACDRDPSO algorithm (Automatic clustering on dimensional reduced data using PSO)

We first need to introduce the basic principles:

The proposed algorithm works in two stages:

First stage: As we are using dimensionality reduction techniques as data preprocessing, the first stage will be the dimensional reduction technique. Here weighted PC's based on moving ranged base thresholding algorithm is the dimensional reduction technique.

Second stage: In this stage Auto_PSO clustering algorithm is used for clustering reduced datasets.

Then the Pseudocode for the complete ACDRDPSO algorithm is given here

## Pseudocode for first stage

PCA extracts a lower dimensional feature set that can explain most of the variability within the original data. The extracted features, PCi's (Yi) are each a linear combination

of the original features with the loading values (, i, j=1,2,â€¦,p). The Yi's can be

represented as equation (1). The loading value represent the importance of each feature

in the formation of PC

Determination of actual number of PCs(=k) to retain , use scree plot.

PCA loading value for the jth original feature can be computed from the first k PCs, the importance of the jth feature can be represented as follows:

,j=1,2â€¦â€¦â€¦â€¦â€¦â€¦â€¦.,p

where k is the total number of features of interest and ï€ represents the weight of ith PC . The ï€ is to compute the proportion of total variance explained by the ith PC. ï€ can be called a weighted PC loading for the feature j.

Given a set of the weighted PC loading values for individual features, ( the threshold ï€ can be calculated using equation (6), where can calculated using equation (5).

Identify the significant features when the corresponding weighted PC loading exceeds the control limit(threshold). Using those significant features makes a reduced data set.

At the end of this stage we will get reduced data set, which will be input for second stage of our proposed algorithm.

## Pseudocode for second stage

Set the number of particles (pop_size), the maximum mumber of clusters T, and the constants for the PSO algorithm () and then randomly initialize position vector =[]=[], where =[] and =[]. are randomly generated by

, j=1,2 â€¦â€¦â€¦.,T, i=1,2,â€¦â€¦,n

The range of the is defined as [], where is the

maximum and minimum values of the ith dimension for a given data set , respectively.

{0,1}, j=1,2, 3,â€¦.,T is generated with random function. The form of each velocity

vector is expressed as , where

and are randomly generated by

, j=1,2 â€¦â€¦â€¦.,T, i=1,2,â€¦â€¦,n â€¦â€¦â€¦â€¦.(9)

, j=1,2,â€¦â€¦â€¦,T â€¦â€¦â€¦â€¦â€¦â€¦â€¦â€¦â€¦â€¦â€¦.(10)

Where is a positive number and experimental value is

Select the active solution (i.e. active cluster centers) of every particle from initial population with equation (10).

Determine unreasonable solutions and refine them by the simple average computation.

Evaluate the fitness function for every individual particle.

Compare every particle's fitness value with the previous particle's best solution {, p, and then close the new particle's best solution which has the highest fitness value. The formula is as follows:

Compare the fitness value of every particle's best solution with the previous overall best solution gbest(t),then replace the gbest(t) with the current particle's solution by

gbest(t+1)=

Regulate velocities and positions for every particle by equation (7) and (8).

Repeat step 2 to 7 until the predefined number of iteration is completed.

Determine the optimal cluster number, real cluster centers and final classification result.

## EXPECTED OUTCOME:

In our study,

On using dimensional reduction technique we will find reduce dimensional dataset which behaves same with original dataset with less error.

On less computation time we can get good cluster by applying proper clustering algorithm on that reduce dataset.

## CONCLUSION AND FUTURE SCOPE:

We have presented a new method of clustering in high-dimensional datasets. The proposed method combines PCA techniques and a moving range-based thresholding algorithm with automatic clustering of PSO. We first obtained the weighted PC, which can be calculated by the weighted sum of the first k PCs of interest. Each of the k loading values in the weighted PC reflects the contribution of each individual feature. To identify the significant features, we proposed a moving-range thresholding algorithm. Features are considered to be significant if the corresponding weighted PC loadings exceed the threshold obtained by a moving-range thresholding algorithm. Using this technique we get reduced dataset of original high dimensional data set. On that reduced data we applied automatic clustering PSO to get cluster. Our experimental results with real datasets will demonstrate that the proposed method could successfully find the true clusters of high dimensional data set on less computation.

Our study extends the application scope of ACDRDPSO algorithm. We hope that the procedure discussed here stimulates further investigation into development of better procedures for clustering of high dimensional datasets.