# Clustering Techniques In Data Mining Computer Science Essay

**Published:** **Last Edited:**

This essay has been submitted by a student. This is not an example of the work written by our professional essay writers.

Clustering is the process of grouping the data into classes or clusters, so that objects within a cluster have high similarity in comparison to one another but very dissimilar to objects in other clusters.

The thesis includes the details about the Clustering techniques in the Data Mining. So, first of all we will take the brief idea about the Data Mining and then we will move on to one of the most widely used functionality of the data mining as clustering and the techniques used for the cluster analysis. Further we will propose a new algorithm by comparing different pre-existing algorithms for handling huge amount of data in a reasonable processing time.

## Data Mining: -

Data mining is the task of discovering interesting patterns from large amounts of data, where the data can be stored in databases, data warehouses, or other information repositories. It is young interdisciplinary field, drawing from areas such as database system, data warehousing, statistics, machine learning, data visualization, information retrieval, and high-performance computing. Other contributing areas include neutral networks, pattern recognition, spatial data analysis, image databases, signal processing, and many application fields, such as business, economics, and bioinformatics.

Now let us first of all see the various frequently asked questions about the Data Mining: -

## Q 1. What motivated Data mining? Why is it important?

Data mining has attracted a great deal of attention in the information industry due to the wide availability of huge amount of data and the imminent need for turning such data into useful information and knowledge.

Data collected in large data repository become "data tombs"-data archives that are seldom visited. Consequently, important decisions are often made based not on the information-rich data stored in the data repositories, but rather on the decision maker's intuition, simply because the decision maker does not have the tool to extract the valuable knowledge embedded in the vast amounts of data.

Data can now be stored in many different kind of databases and information repositories. One data repository architecture that has emerged is the data warehouse, a repository of multiple heterogeneous data sources organized under a unified schema at a single site in order to facilitate management decision making.

Data warehouse technology include data cleaning, data integration, and on-line analytical processing (OLAP), that is, analysis techniques with functionalities such as summarization, consolidation, and aggregation as well as the ability to view information from different angles. Although OLAP tool support multidimensional analysis and decision making, additional data analysis tools are required for in-depth analysis, such as data classification, clustering, and the characterization of data changes over time.

Data Collection and Database Creation

Database Management Systems

Web-based database

Advanced Database Systems

Advanced Data Analysis: Data warehousing and Data mining

New generation of integrated data and Information Systems

## Q 2. What is Data Mining?

Data mining refers to extracting or "mining" knowledge from amounts of data. Remember that the mining of gold from rocks or sand is referred to as gold mining rather than rock or sand mining. Data mining should have been more appropriately named "knowledge mining from data" which is unfortunately somewhat long. "Knowledge mining," a shorter term, may not reflect the emphasis on mining on large amounts of data.

Data mining involves an integration of techniques from multiple disciplines such as database and data warehouse technology, statistics, machine learning, high-performance computing, pattern recognition, neutral networks, data visualization, information retrieval, image and signal processing, and spatial or temporal data analysis.

By performing data mining, interesting knowledge, regularities, or high-level information can be extracted from databases and viewed or browsed from different angles. The discovered knowledge can be applied to decision making, process control, information management, and query processing. Therefore, data mining is considered one of the most important frontiers in database and information systems and one of the most promising interdisciplinary development in the information technology.

## Q 3. Which kind of data uses Data mining?

In principle, data mining should be applicable to any kind of data repository, as well as to transient data, such as data streams. Thus the scope of our examination of data repositories will include: Relational databases, Data warehouses, Transactional databases, Advanced database systems, Flat files, Data streams, World Wide Web. Advanced database systems include: Object-relational databases, Specific application-oriented databases, Spatial databases, Time-series databases, Text databases, Multimedia databases.

The challenges and techniques of mining may differ for each of the repository systems.

## Q 4. What Kind of Patterns Can Be Mined?

Data mining functionalities are used to specify the kind of patterns to be found in the data mining tasks. In general, data mining tasks can be classified into two categories: descriptive and predictive. Descriptive mining tasks characterize the general properties of the data in the data base. Predictive mining tasks performs inference on the current data in order to make predictions.

In some cases, user may have no idea regarding what kind of patterns in their data may be interesting, and hence may like to search for several different kind of Patterns in parallel. Furthermore, data mining systems should be able to discover patterns at various granularity (i.e. ,different levels of abstraction ).

Now let us see what is clustering all about? Where it is used? What are its uses? Etc. and then we will provide the brief idea about the Data Types, Swarm Intelligence and the Automated Clustering.

## Clustering: -

Clustering is the process of grouping the data into classes or clusters, so that objects within a cluster have high similarity in comparison to one another but very dissimilar to objects in other clusters. It has its roots in many areas, including data mining, statics, biology, and machine learning. Cluster analysis can be performed on all Electronic customer data in order to identify homogeneous subpopulations of customers.

First we partition the set of data into groups based on data similarity, and then assign labels to the relatively small number of groups. Additional advantages of such a clustering based process are that it is adaptable to changes and helps single out useful features that distinguish different groups.

Cluster analysis in an important human activity. By automated clustering, we can identify dense and sparse regions in object space and, therefore, discover overall distribution patterns and interesting correlation among data attributes.

Clustering analysis have been widely used in numerous applications, including market research, business purpose, biology, pattern recognition, data analysis, and image processing. Clustering may also help in the identification of area of similar land use in an earth observation. It can also be used to help classify documents on the web for information discovery. It can also be used as a stand-alone data mining tool to gain insight into the data distribution or can serve as a preprocessing step for other data mining algorithms operating on the detected clusters.

## Data types: -

With many data mining system product available on the market , we may ask, "What kind of system should I choose ?" Some people may be under the impression that data mining system, like many commercial relational database system, share the same well defined operations and a standard query language, behave similarly on common functionalities. If such were the case, the choice would depend more on the systems hardware platform, compatibility, robustness, scalability, price and service. Unfortunately, this is far from the reality.

To choose a data mining system that is appropriate for your task, it is important to have multidimensional view of data mining system.

Data types: Most data mining system that are available on the market handle for matted, record-based, rational-like data with numerical, categorical, and symbolic attributes. The could be in the form of ASCII text, relational database data, or data warehouse data. It is important to check what exact format each system you are considering can handle. Some kind of data or application may require specialized algorithm to search for the patterns, and so their requirement may not be handled by of-the-self, generic data mining system. Instead, specialized data mining system may be used, where mine either text document, geospatial data , multimedia data, stream data, time-series data, biological data ,or web data, or are dedicated to specific application.

## An introduction to Swarm Intelligence: -

Swam Intelligence (SI)- This expression was introduced by 'Gerardo Beni' and 'Jing Wang' in 1989, in the context of cellular robotic systems. Swarm intelligence describes the collective behavior of decentralized, self-organized systems, natural or artificial. The concept is employed in work on artificial intelligence.

SI systems are typically made up of a population of simple agents or boids interacting locally with one another and with their environment. The agents follow very simple rules, and although there is no centralized control structure dictating how individual agents should behave, local, and to a certain degree random, interactions between such agents lead to the emergence of "intelligent" global behavior, unknown to the individual agents.

Natural examples of SI include ant colonies, bird flocking, animal herding, bacterial growth, and fish schooling. These topics have been explained little bit in the further discussion.

The application of swarm principles to robots is called swarm robotics, while 'swarm intelligence' refers to the more general set of algorithms. 'Swarm prediction' has been used in the context of forecasting problems also.

Swarming theory has been used in system design. Examples of aerospace systems utilizing swarming theory include formation flying of aircraft and spacecraft.

Flocks of birds fly in V-shaped formations to reduce drag and save energy on long migrations.

## Swarming in System Design: -

â€¢ A space system in formation flying combines data from several

spacecraft rather than flying all the instruments on one costly satellite.

â€¢ It also allows for the collection of data unavailable from a single

satellite, such as stereo views or simultaneously collecting data of the

same ground scene at different angles.

It can be viewed as the following figure-

Various algorithms in which we can see the application of swarm intelligence are listed and discussed below

Ant Colony Optimization.

Particle Swarm Optimization.

Stochastic Diffusion Search.

Gravitational Search Algorithm

These algorithms will be explained in the Review of the Literature.

## An introduction of Automated Software Clustering: -

Clustering techniques have shown promising results for the architecture recovery and re-modularization of legacy software systems. Clusters that are obtained as a result of the clustering process may not be easy to interpret until they are assigned appropriate labels.

Automatic labeling of clusters reduces the time required to understand them and can also be used to evaluate the effectiveness of the clustering process, if the assigned labels are meaningful and convey the purpose of each cluster effectively.

As the clustering process proceeds, keywords within identifiers are ranked using two ranking schemes: frequency and inverse frequency. A comparison between the ranking schemes reveals the inverse frequency scheme to form more meaningful labels, especially for small clusters.

A comparison of clustering results of the complete and weighted combined algorithms based on labels of the clusters produced by them during clustering shows that the latter produces a more understandable cluster hierarchy with easily identifiable software sub-systems.

Automated Cluster Management is a very powerful feature for cluster administrators. It allows you to set a threshold for any metric and define any action to be taken when that threshold is exceeded. Any of the built-in or custom metrics supported by Cluster Manager can be used and any cluster management shell or Linux command or script can be used as an action.

Visibility of the present invention provide a method, system and computer program product for automated cluster member management based on node capabilities. In one embodiment of the invention, a method for automated cluster member management based on node capabilities can be provided.

The method can include defining a membership policy for a cluster, the membership policy specifying a nodal configuration required for a node in a cluster. The method further can include evaluating different nodes in a computing environment against the membership policy for the cluster. Finally, the method can include associating cluster members in the cluster to only those of the nodes having respective configurations meeting the nodal configuration of the membership policy.

## Likewise, the method can include evaluating nodes already in the cluster, and disassociating cluster members in the cluster from those of the nodes having respective configurations failing to meet the nodal configuration of the membership policy.

## Fact: - What is the difference between partially automated and fully automated clusters?

## Fully automated clusters migrate virtual machines to optimize resource usage.

## State Flapping: -

The Automated Cluster Management system is sophisticated and highly configurable. One example is its ability to deal with so-called "state flapping", which is a situation where a threshold is exceeded repeatedly within a short time frame. This can, for example, happen when a CPU temperature fluctuates around a configured threshold, potentially causing the system to send out many emails in a short time frame. The system is able to detect such a situation and can be configured exactly how to deal with it.

## REVIEW OF LITERATURE: -

The review of literature will include the major clustering methods such as: -

Partition methods.

K-Means Method.

K-Medoids Method.

Hierarchical methods.

Agglomerative.

Divisive.

BIRCH.

Chameleon.

Density Based methods.

Grid Based methods.

STING.

WaveCluster.

Model Based methods.

Clustering High-dimensional Data.

Constraint Based Clustering.

Outliner Analysis.

Statistical Distribution.

Distance Based.

As well as this section will include some more topics such as: -

Swarm Intelligence Algorithms for Data Clustering.

Particle Swarm Optimization.

The Ant Colony Based Clustering Algorithms.

The PSO Based Clustering Algorithms.

Data Clustering Using the Bees Algorithm.

It is difficult to provide the categorization of clustering methods because these categories may overlap, so that a method may have features of several categories.

The major clustering methods can be classified into the following categories.

## Partition Methods:

Given a database of n objects or data tuples, a partitioning method constructs k partitions of the data, where each partition represents a cluster and k<=n. That is, it classifies the data into k groups, which together satisfy the following requirements:

(1). Each group must contain at least one object.

(2). Each object must belong to exactly one group.

The general criteria of a good partitioning is that objects in the same clusters are "close" or related to each other, whereas objects of different clusters are "far apart" or very different.

The partition method construct a partition of a database D of n objects into a set of k clusters, such that, there is minimum sum of squared distance:

Most applications adopt one of a few popular heuristic methods, such as (1) the k-means algorithm, where each cluster is represented by mean value of the objects in the cluster, and (2) the k-medoids algorithm, where each cluster is represented by one of the objects located near the center of the cluster.

These heuristic clustering methods works well for finding spherical-shaped clusters in small to medium-sized databases. To find clusters with complex shapes and for clustering very large data sets, partition-based methods need to be extended.

## K-means method: -

The k-means algorithm takes the input parameter, k, and partitions the set of n objects into k clusters so that the resulting intra-cluster similarity is high but the inter-cluster similarity is low. Cluster similarity is measured in regards to the mean value of the objects in a cluster, which can be viewed as the cluster's centroid or center of gravity.

## K-means algorithm: -

Partition objects into k nonempty subsets.

Compute seed points as the centroids of the clusters of the current partition (the centroid is the center, i.e., mean point, of the cluster)

Assign each object to the cluster with the nearest seed point

Go back to Step 2, stop when no more new assignment

Below given is an example to explain the K-means partition algorithm.

There are several weakness of the K-means partition methods: -

Applicable only when mean is defined, then what about categorical data?

Need to specify k, the number of clusters, in advance

Unable to handle noisy data and outliers

Not suitable to discover clusters with non-convex shapes

The k-means algorithm is sensitive to outliers.

Since an object with an extremely large value may substantially distort the distribution of the data

## K-medoid method: -

Instead of taking the mean value of the objects in a cluster as a reference point, we can pick actual objects to represent the clusters, using one representative object to which it is the most similar. The partitioning method is then performed based on the principle of minimizing the sum of the dissimilarities between each object and its corresponding reference point. i.e. Instead of taking the mean value of the object in a cluster as a reference point, medoids can be used, which is the most centrally located object in a cluster.

K-medoid algorithm: -

PAM (Partition Around Medoids) was one of the first k-medoids algorithms introduced, algorithm says: -

Select k representative objects arbitrarily

For each pair of non-selected object h and selected object i, calculate the total swapping cost TCih

For each pair of i and h,

If TCih < 0, i is replaced by h

Then assign each non-selected object to the most similar representative object

Repeat steps 2-3 until there is no change

There are several weakness of PAM: -

Pam is more robust than k-means in the presence of noise and outliers because a medoid is less influenced by outliers or other extreme values than a mean

Pam works efficiently for small data sets but does not scale well for large data sets.

## Hierarchical Methods:

A hierarchical method creates a hierarchical decomposition of the given set of data objects. A hierarchical method can be classified as being either agglomerative or divisive, based on how the hierarchical decomposition is formed.

The agglomerative approach, also called the bottom-up approach, starts with each object forming a separate group. The divisive approach, also called the top-down approach, starts with all the objects in same cluster.

Hierarchical method suffers from the fact that once a step (merge or split) is done, it can never be undone. This rigidity is useful in that it leads to smaller computation costs by not having to worry about a combinatorial number of different choices.

There are two approaches to improving the quality of hierarchical clustering: (1) perform careful analysis of object "linkage" at each hierarchical partitioning, such as in Chameleon, or (2) integrate hierarchical agglomeration and other approaches by first using a hierarchical agglomerative algorithm to group objects into micro clusters, and then performing macro clustering an the micro cluster using another clustering method such as iterative relocation, as in BIRCH.

## Agglomerative hierarchical clustering: -

This bottom-up strategy starts by placing each object in its own cluster and then merges these atomic clusters into larger and larger clusters, until all of the objects are in a single or until certain termination conditions are satisfied. Most hierarchical methods belong to this category. They differ only in their definition of inter-cluster similarity.

## Divisive hierarchical clustering: -

This top-down strategy does the reverse of agglomerative hierarchical clustering by starting with all objects in one cluster. It sub divides the cluster into smaller and smaller pieces, until each object forms a cluster on its own or until it satisfies certain termination condition, such as a desired number of clusters is obtained or the diameter of each cluster is within a certain threshold.

The following figure will make the above two hierarchical clustering methods more clear: -

A tree structure called a dendrogram is commonly used to represent the process of hierarchical clustering.

-:Dendrogram representation for hierarchical clustering of data objects {a,b,c,d,e}:-

Major weakness of agglomerative clustering methods

It do not scale well: time complexity of at least O(n2), where n is the number of total objects

It can never undo what was done previously.

## BIRCH: Balanced Iterative Reducing and Clustering Using Hierarchies: -

BIRCH is designed for clustering a large amount of data integration of hierarchical clustering (at the initial micro clustering stage) and other clustering methods such as iterative partitioning (at the later macro clustering stage). It overcomes the two difficulties of agglomerative clustering methods as scalability and inability to undo what was done in the previous step.

BIRCH introduces two concepts, clustering feature and clustering feature tree (CF tree).

Weakness: It handles only the numeric data, and is sensitive to the order of the data record.

A CF tree is a height-balanced tree that stores the clustering features for a hierarchical clustering: -

A non-leaf node in a tree has descendants or "children"

The non-leaf nodes store sums of the CFs of their children

A CF tree has two parameters

Branching factor: specify the maximum number of children.

threshold: max diameter of sub-clusters stored at the leaf nodes

- :A CF Tree Structure: -

## Chameleon: a Hierarchical Clustering Algorithm Using Dynamic Modeling: -

This method follows simple rules i.e. two clusters are merged only if the interconnectivity and closeness (proximity) between two clusters are high relative to the internal interconnectivity of the clusters and closeness of items within the clusters. It can be understood by the simple figure given below-

Overall framework of Chameleon: Hierarchical clustering based on k-nearest neighbors and dynamic modeling.

## Density-Based Methods:

These methods can find only spherical-shaped clusters and encounter difficulty at discovering clusters of arbitrary shapes. Other clustering methods have been developed based on the notion of density. Their general idea is to continue growing the given cluster as long as the density (number of objects or data points) in the "neighborhood" exceeds some threshold.

Major features of density based methods are:

Discover clusters of arbitrary shape

Handle noise

One scan

Need density parameters as termination condition

DBSCAN grows clusters according to a density-based connectivity analysis. OPTICS extends DBSCAN to produce a cluster ordering obtained from a wide range of parameter setting. DENCLUE cluster objects based on a set of density distribution functions.

## Grid-Based Methods:

Grid-Based methods quantize the object space into a finite number of cells that form a grid structure. All of the clustering operations are performed on the grid structure. The main advantage of this approach is its fast approaching time, which is typically independent of the number of data objects and dependent only on the number of cells in each dimension in the quantized space.

The grid-based clustering approach uses a multi-resolution grid data structure. It quantizes the object space into a finite number of cells that form a grid structure on which all of the operations for clustering are performed.

Some examples of the grid-based approach includes STING, which explores statistical information stored in the grid cells; WaveCluster, which clusters objects using a wavelet transform method; and CLIQUE, which represents a grid-and density-based approach for clustering in high dimensional data space.

## STING: Statistical information grid: -

STING is a grid based multi-resolution clustering technique in which the spatial area is divided into rectangular cells. There are usually several levels of such rectangular cells corresponding to different levels of resolution, and these cells form a hierarchical structure: each cell at a high level is partitioned to form a number of cells at the next lower level. Statistical information regarding the attributes in each grid cell (such as mean, maximum, and minimum values) is precomputed and stored. It can be shown by the figure-

Disadvantages:

All the cluster boundaries are either horizontal or vertical, and no diagonal boundary is detected.

## WaveCluster: Clustering Using Wavelet Transformation: -

It is a multi-resolution clustering algorithm that first summarizes the data by imposing a multidimensional grid structure onto the data space. It then uses a wavelet transformation to transform the original feature space, finding dense regions in the transformed space.

We can apply wavelet transform to find clusters as: -

Summarizes the data by imposing a multidimensional grid structure onto data space

These multidimensional spatial data objects are represented in a n-dimensional feature space

Apply wavelet transform on feature space to find the dense regions in the feature space

Apply wavelet transform multiple times which result in clusters at different scales from fine to coarse

## Model-Based Methods:

The model-based methods hypothesize a model for each of the clusters and find the best fit of the data to the given model. A model-based algorithm may locate clusters by constructing a density function that reflects the spatial distribution of the data points.

Model-based clustering methods attempt to optimize the fit between the given data and some mathematical model. Such methods are often based on the assumption that the data are generated by a mixture of underlying probability distributions.

It also leads to the way of automatically determining the number of clusters based on standard statistics, taking "noise" or outlier into account and thus yielding robust clustering methods.

The choice of the clustering algorithm depends both on the type of the data available and on a particular purpose of the application. If cluster analysis is used as a descriptive or exploratory tool, it is possible to try several algorithms on the same data to see what data may disclose.

Some clustering algorithms integrate the ideas of several clustering methods, so that it is something difficult to classify a given algorithms as uniquely belonging to only one clustering methods category. Furthermore, some applications may have clustering criteria that require the integration of several clustering techniques.

There are two classes of clustering tasks that require special attention, one is clustering high-dimensional data, and the other is constraint-based clustering.

## Clustering High Dimensional Data: -

Clustering high-dimensional data is a particularly important task in cluster analysis because many applications require the analysis of objects containing a large number of features or dimensions. For example, text document may contain thousands of terms or keywords as features, and DNA microarray data may provide information on the expression levels of genes under hundreds of conditions.

Many dimensions may not be relevant. As the number of dimensions increases, the data becomes increasingly sparse so that the distance measurement between pairs of points become meaningless and the average density of points anywhere in the data is likely to be low.

The figures given below show the clustering high dimensional data: -

Data in only one dimension is relatively packed

Adding a dimension "stretch" the points across that dimension, making them further apart

Adding more dimensions will make the points further apart-high dimensional data is extremely sparse

## Constraint-based Clustering: -

It is the clustering approach that performs clustering by incorporation of user-specified or application-oriented constraints. A constraint expresses a user's expectation or describes "properties" of the desired clustering results, and provides an effective means for communicating with the clustering process. Various kinds of constraints can be specified, either by a user or as per application requirements.

Constraint-based clustering finds clusters that satisfy user-specified preferences or constraints. Depending on the nature of the constraints, constraint-based clustering may adopt rather different approaches as: -

Constraints on individual objects.

Constraints on the selection of clustering parameters.

Constraints on distance or similarity functions.

User specified constraints on the properties of individual clusters.

Constraint-based clustering method groups objects based on application dependent or user-specified constraints. For example, clustering with the existence of obstacle objects and clustering under user-specified constraints, and semi-supervised clustering based on "weak" supervision (such as pairs of objects labeled as belonging to the same or different cluster).

## Outlier Analysis: -

One person's noise could be other persons signal. Outlier detection and analysis are very useful for the fraud detection, customized marketing, medical analysis, and many other tasks. Computer-based outlier analysis method typically follow either a statistical distribution-based approach, a distance-based approach, a density-based local outlier detection approach, or a deviation-based approach.

## Statistical Distribution-Based Outlier Detection: -

It uses discordancy tests depending on the-

data distribution

distribution parameter (e.g., mean, variance)

number of expected outliers

There are some drawbacks of this method-

most tests are for single attribute

In many cases, data distribution may not be known

## Distance-Based Outlier Detection-

This method was introduced to counter the main limitations imposed by statistical methods. We need multi-dimensional analysis without knowing data distribution.

Distance-based outlier: A DB(p, D)-outlier is an object O in a dataset T such that at least a fraction p of the objects in T lies at a distance greater than D from O

There are several other algorithms for mining distance-based outliers such as: Index-based algorithm, Nested-loop algorithm and Cell-based algorithm.

The various applications of the swarm intelligence are

Ant Colony Optimization.

Particle Swarm Optimization.

Stochastic Diffusion Search.

Gravitational Search Algorithm.

(1). Ant Colony Optimization

Ant colony optimization (ACO) is a class of optimization algorithms modeled on the actions of an ant colony. ACO methods are useful in problems that need to find paths to goals. Artificial 'ants' - simulation agents - locate optimal solutions by moving through a parameter space representing all possible solutions. Real ants lay down pheromones directing each other to resources while exploring their environment. The simulated 'ants' similarly record their positions and the quality of their solutions, so that in later simulation iterations more ants locate better solutions. One variation on this approach is the bees algorithm, which is more analogous to the foraging patterns of the honey bee.

## (2). Particle Swarm Optimization

Particle swarm optimization (PSO) is a global optimization algorithm for dealing with problems in which a best solution can be represented as a point or surface in an n-dimensional space. Hypotheses are plotted in this space and seeded with an initial velocity, as well as a communication channel between the particles.

Particles then move through the solution space, and are evaluated according to some fitness criterion after each timestep. Over time, particles are accelerated towards those particles within their communication grouping which have better fitness values.

## (3). Stochastic Diffusion Search

Stochastic Diffusion Search (SDS) is an agent-based probabilistic global search and optimization technique best suited to problems where the objective function can be decomposed into multiple independent partial-functions.

Each agent maintains a hypothesis which is iteratively tested by evaluating a randomly selected partial objective function parameterized by the agent's current hypothesis. In the standard version of SDS such partial function evaluations are binary, resulting in each agent becoming active or inactive. Information on hypotheses is diffused across the population via inter-agent communication.

Unlike the communication used in ACO, in SDS agents communicate hypotheses via a one-to-one communication strategy analogous to the tandem running procedure observed in some species of ant. A positive feedback mechanism ensures that, over time, a population of agents stabilize around the global-best solution.

## (4). Gravitational Search Algorithm

Gravitational search algorithm (GSA) is constructed based on the law of Gravity and the notion of mass interactions. The GSA algorithm uses the theory of Newtonian physics and its searcher agents are the collection of masses. In GSA, we have an isolated system of masses. Using the gravitational force, every mass in the system can see the situation of other masses. The gravitational force is therefore a way of transferring information between different masses.

## Various applications of swarm intelligence-

Swarm Intelligence-based techniques can be used in a number of applications.

The U.S. military is investigating swarm techniques for controlling unmanned vehicles.

Crowd Simulation.

The European Space Agency is thinking about an orbital swarm for self assembly and interferometry.

NASA is investigating the use of swarm technology for planetary mapping.

Ant-based Routing.

There are the possibility of using swarm intelligence to control nanobots within the body for the purpose of killing cancer tumors.

## Crowd Simulation

Artists are using swarm technology as a means of creating complex interactive systems or simulating crowds. For example Tim Burton's 'Batman Returns' was the first movie to make use of swarm technology for rendering, realistically depicting the movements of a group of bats using the Boids system. The Lord of the Rings film trilogy made use of similar technology, known as Massive, during battle scenes. Swarm technology is particularly attractive because it is cheap, robust, and simple.

## Ant-based Routing

The use of Swarm Intelligence in Telecommunication Networks has also been researched, in the form of Ant Based Routing. Basically this uses a probabilistic routing table rewarding/reinforcing the route successfully traversed by each "ant" (a small control packet) which flood the network. Reinforcement of the route in the forwards, reverse direction and both simultaneously have been researched: backwards reinforcement requires a symmetric network and couples the two directions together; forwards reinforcement rewards a route before the outcome is known (but then you pay for the cinema before you know how good the film is).

As the system behaves stochastically and is therefore lacking repeatability, there are large hurdles to commercial deployment. Mobile media and new technologies have the potential to change the threshold for collective action due to swarm intelligence.

## Data Clustering Using The Bees Algorithm: -

## Bees in Nature -

A colony of honey bees can extend itself over long distances in order to exploit a large number of food sources at the same time. The foraging process begins in a colony by scout bees being sent to search for promising flower patches. Flower patches with large amounts of nectar or pollen that can be collected with less effort tend to be visited by more bees, whereas patches with less nectar or pollen receive fewer bees. During the harvesting season, a colony continues its exploration, keeping a percentage of the population as scout bees. When they return to the hive, those scout bees that found a patch rated above a certain quality threshold deposit their nectar or pollen and go to the "dance floor" to perform a dance known as the "waggle dance". This mysterious dance is essential for colony communication, and contains three pieces of information regarding a flower patch: the direction in which it will be

found, its distance from the hive and its quality rating (or fitness). This

information helps the colony to send its bees to flower patches precisely, without using guides or maps. Each individual's knowledge of the outside environment is gleaned solely from the waggle dance. This dance enables the colony to evaluate the relative merit of different patches according to both the quality of the food they provide and the amount of energy needed to harvest it. After waggle dancing

on the dance floor, the dancer (i.e. the scout bee) goes back to the flower patch with follower bees that were waiting inside the hive. More follower bees are sent to more promising patches. This allows the colony to gather food quickly and efficiently.

While harvesting from a patch, the bees monitor its food level. This is necessary to decide upon the next waggle dance when they return to the hive. If the patch is still good enough as a food source, then it will be advertised in the waggle dance and more bees will be recruited to that source.

## The Proposed Clustering Bees Method-

The proposed clustering method exploits the search capability of the Bees Algorithm to overcome the local optimum problem of the other algorithms.

Below given steps shows the basic steps of the proposed clustering operation, which are essentially those of the Bees Algorithm.

1. Initialize the solution population.

2. Evaluate the fitness of the population.

3. While (stopping criterion is not met)

// Forming new population.

4. Select sites for neighborhood search.

5. Recruit bees for selected sites (more bees for

the best e sites) and evaluate fitness's.

6. Select the fittest bee from each site.

7. Assign remaining bees to search randomly and

evaluate their fitness's.

8. End While.

The algorithm requires a number of parameters to be set, namely: number of scout bees (n), number of sites selected for neighborhood searching (out of n visited sites) (m), number of top-rated (elite) sites among m selected sites (e), number of bees recruited for the best e sites (nep), number of bees recruited for the other (me) selected sites (nsp), and the stopping criterion.

## Problem Areas: -

This section of the paper will focus upon the problematic areas that are being faced by the different methods developed so far for the clustering techniques and will try to solve the problem of the response time later.

One of the important techniques for clustering is the partitioning method in which the partitions of the data is made and are represented through the clusters. This clustering method works well for finding the clusters in small and medium-sized databases. To find clusters with complex shapes and for clustering very large data sets we need to extend this technique in k-means and k-medoids method. However the k-means algorithm also has the drawback that it can be applied only when the mean of the cluster is defined as well as the value of k, the number of clusters is to be specified by the user in advance whereas the k-medoid method is robust than k-means method and works efficiently for the small data set but does not scales well for larger data sets and its efficiency depends on the sample size.

Another method proposed is the hierarchical method which creates a hierarchical decomposition of the given set of data objects. It also does not scales well. This method suffers from the fact that once a step (merge or split) is done, it can never be undone, so to improve the quality of the method we have to perform the careful analysis of the objects at each level in order to that it has the small computation cost. The BIRCH method included in it deals only with the numeric data and is also sensitive to the order of the data record. After that the method proposed was Density-based method which clusters the objects on the basis of the distance between the objects and works on small data set as well as finds the difficulty in encountering the clusters of the arbitrary shapes.

The grid-based method quantizes the object space into a finite number of cells that form a grid structure and has the dependency of the number of the cells in the grid. For too coarse structure the quality of this cluster analysis reduces to a greater extent. Since it does not considers the child and the neighbors for the construction of the parent cell therefore all cluster boundaries are horizontal or vertical so no diagonal boundaries are detected hence the accuracy is lowered inspite of the fast access.

The model based method integrates the idea of several clustering methods, so it is sometimes difficult to classify the given algorithm as uniquely belonging to only one clustering method category hence it developed two more methods as clustering high-dimensional data and constraint-based clustering.

Clustering high-dimensional data is a particularly important task in cluster analysis because many applications require the analysis of objects containing a large number of features or dimensions, but many dimensions may not be relevant. As the number of dimensions increases, the data becomes increasingly sparse so that the distance measurement between pairs of points become meaningless and the average density of points anywhere in the data is likely to be low. The accuracy of the clustering result may be degraded at the expense of simplicity of the method. The constraint based clustering is the clustering approach that performs clustering by incorporation of user-specified or application-oriented constraints. A constraint expresses a user's expectation or describes "properties" of the desired clustering results, and provides an effective means for communicating with the clustering process. Various kinds of constraints can be specified, either by a user or as per application requirements.

After summarizing the drawback of some of the major clustering methods we are going to present a new method which will overcome the shortcomings of the methods discussed above such as the new method proposed will have the fast accessing speed due to presence of functions and specially designed codes.

The problem of the uniqueness of the data is also solved by these newly developed codes which will be assigned to the data objects and the clusters. So, now let us see some of the details of the system that is being proposed.

## Proposed System: -

As we have seen that various techniques and methods are involved in the cluster analysis. After looking on their features and the drawbacks we have developed a method (Hash-Index) to analyze the clusters and the data objects in more efficient and accurate way.

Hash-Index method gets the idea from the structure which helps to organize the data objects or records in a systematic and more manageable way and increase the response time.

This method uses the 'Hash Function' to generate a 'Hash Code' which is used to partition the data objects in different groups according to the choice of Hash Function. That is for same set of data objects there can be different Hash Functions. Note: - It must be taken into consideration that the Hash code produced by the Hash Function should always be unique.

The Hash Code generated by the Hash Function is used to locate the position of the cluster in which the data object is present. The cluster can be treated as the 'Bucket' containing the data objects, so each hash code can also be called as the 'bucket number'.

Now this hash code alone will not work to uniquely identify the data object from the set of clusters because two or more data objects can reside in a cluster and each hash code represents a cluster so, we use a special term 'index' to uniquely identify the data objects in each cluster as well as set of clusters.

The 'index number' is provided to each of the data object which is to be added in the cluster. Provided each index number in a cluster should be unique. The index number can be in any form.

Now from the above discussion it must be clear that the combination of the 'Hash Code' and the 'Index Number' is unique. So for each data object we have unique combination of hash code and the index number.

Basic structure of the model or the method is: -

Hash Codes Index Numbers

In the above figure the vertical array represents the hash codes generated by the hash function and the linked are data objects which form a cluster.

When we compare Hash-Index method with other clustering methods we find several advantages of this method over others.

Advantages of Hash-Index method over different methods used in the clustering are: -

This method can search the data object very efficiently and accurately as it can not be done in Grid-Based method.

It can work on enormous amount of data which was not possible in partition method provided all the data objects are uniformly distributed in the buckets.

Since searching and retrieval is easy so no need to emphasis on every level of the partition which is the major drawback of the hierarchical method. As well as the changes in the data object or the clusters can be done which were not earlier possible in the hierarchical method.

In the Grid-Based method the neighbors are not taken into consideration whereas in Hash-Index method while granting the index number the neighbors are also considered and a appropriate place for the data object is finded.

This method works on too coarse data also which is not possible in the Grid-Based method, since there is no grid formation as well as it does does not depends on the hash values so data objects can be managed more efficiently in the clusters than other techniques.

Hash-Index method has a small computation cost due to the presence of fast calculating Hash Functions. The computation cost is the major problem of some of the good clustering techniques such as hierarchical, etc.