Clustering Is A Division Of Data Computer Science Essay

Published: Last Edited:

This essay has been submitted by a student. This is not an example of the work written by our professional essay writers.

Clustering is a division of data into groups of similar objects. Representing the data by fewer clusters necessarily loses certain fine details, but achieves simplification. It models data by its clusters. Data modeling puts clustering in a historical perspective rooted in mathematics, statistics, and numerical analysis. From a machine learning perspective clusters correspond to hidden patterns, the search for clusters is unsupervised learning, and the resulting system represents a data concept. From a practical perspective clustering plays an outstanding role in data mining applications such as scientific data exploration, information retrieval and text mining, spatial database applications, Web analysis, CRM, marketing, medical diagnostics, computational biology, control and many other applications. It is important to understand the difference between clustering (unsupervised classification) and discriminant analysis (supervised classification). In supervised classification, we are provided with a collection of labeled (preclassified) patterns; the problem is to label a newly encountered, yet unlabeled, pattern. Typically, the given labeled (training) patterns are used to learn the descriptions of classes which in turn are used to label a new pattern. In the case of clustering, the problem is to group a given collection of unlabeled patterns into meaningful clusters. In a sense, labels are associated with clusters also, but these category labels are data driven; that is, they are obtained solely from the data.

Data mining adds to clustering the complications of very large datasets with very any attributes of different types. This imposes unique computational requirements on relevant clustering algorithms. A variety of algorithms have recently emerged that meet these requirements and were successfully applied to real-life data mining problems. In recent years, the dramatic rise in the use of the web and the improvement in process industries in general have transformed our society into one that strongly depends on information. The huge amount of data that is generated by this process contains important information that accumulates daily in databases and is not easy to extract. This data is scattered in between the upper and the lower limits, applying required control or strategy to this data requires lots of computational work, for having a better control strategy. The clustered data gives us a better control efficiency and performance of our system rather than working with an unorganized scattered dataset. The field of data mining developed as a means of extracting information and knowledge from databases to discover patterns or concepts that are not evident The process usually consists of the following: transforming the data to a suitable format, cleaning it, and inferring or making conclusions regarding the data. Data analysis underlies many computing applications, either in a design phase or as part of their on-line operations. Data analysis procedures can be dichotomized as either exploratory or confirmatory, based on the availability of appropriate models for the data source, but a key element in both types of procedures (whether for hypothesis formation or decision-making) is the grouping, or classification of measurements based on either of the following:

(i) Goodness-of-fit to a postulated model,

(ii) Natural groupings (clustering) revealed through analysis.

Cluster analysis is the organization of a collection of patterns (usually represented as a vector of measurements, or a point in a multidimensional space) into clusters based on similarity. Cluster analysis is thus a tool of discovery. It may reveal associations and structure in data which, though not previously evident, nevertheless are sensible and useful once found. The results of cluster analysis may contribute to the definition of a formal classification scheme, such as a taxonomy for related process variables, parameters or objects; or suggest statistical models with which to describe control; or indicate rules for assigning new cases to classes for control and analysis purposes; or provide measures of definition, variations and change in what previously were only broad concepts. Clustering is the classification of similar objects into different groups, or more precisely, the partitioning of a data set into subsets (clusters), so that the data in each subset (ideally) share some common trait - often proximity according to some defined distance measure

Clustering is useful in several exploratory pattern-analysis, grouping, decision making, and machine-learning situations, including data mining, document retrieval, image segmentation, and pattern classification. However, in many such problems, there is little prior information (e.g., statistical models) available about the data, and the decisionmaker must make as few assumptions about the data as possible. It is under these restrictions that clustering ethodology is particularly appropriate for the exploration of interrelationships among the data points to make an assessment (perhaps preliminary) of their structure.

1.1 Issues of Clustering

The main requirements that a clustering algorithm should satisfy are the Dealing with different types of attributes:€ following

There are various attributes of data that any clustering algorithm need to satisfy, the most general taxonomy being in common use distinguishes among numeric (continuous), ordinal, and nominal variables. A numeric variable can assume any value in R. An ordinal variable assumes a small number of discrete states, and these states can be compared.

Scalability to large datasets: The data sets could be in any possible range, varying between large extremes and they need to be normalized by the clustering algorithm.

Ability to work with high dimensional data: The data could be multidimensional varying from 1, 2……n.; depending on the application data on which clustering is being applied

Development of a medical inferencing system using data clustering. The data could be multidimensional varying from 1, 2……n.; depending on the application data on which clustering is being applied.

Ability to find clusters of irregular or arbitrary shape: The shape of clusters could be any arbitrary shapes. We prefer using Euclidean distance to get a circular shape of the clusters, but still shape of clusters can not be accurately defined

Handling outliers: The data points on the boundary of clusters need to be handled; this is done in hierarchical methods by associating the boundary points to one of the clusters. While in fuzzy clustering, we associate membership functions to the points lying on the boundary of clusters.

Time complexity : Complexity of the data points in terms of time has to be taken care of while clustering.

Data order dependency: Dependency of data points on other variable can affect the clustering of data and there by the cluster centers too, so it has to be taken care before hand.

Labeling or Assignment:

Clustering Parameters

Different parameters that affects the clustering process is given below. Reliance on a priori knowledge and user defined parameters

Interpretability of results

Minimal requirements for domain knowledge to determine input parameters;

Ability to deal with noise



It can be shown that there is no absolute "best" criterion which would be independent of the final aim of the clustering. Consequently, it is the user which must supply this criterion, in such a way that the result of the clustering will suit their needs. For instance, we could be interested in finding representatives for homogeneous groups (data reduction), in finding "natural clusters" and describe their unknown properties ("natural" data types), in finding useful and suitable groupings ("useful" data classes) or in finding unusual data objects (outlier detection).

Distance : A measure of Similarity

Since similarity is fundamental to the definition of a cluster, a measure of the similarity between two patterns drawn from the same feature space is essential to most clustering procedures. Because of the variety of feature types and scales, the distance measure (or measures) must be chosen carefully. It is most common to calculate the dissimilarity between two patterns using a distance measure defined on the feature space. We will focus on the well-known distance measures used for patterns whose features are all continuous. The most popular metric for continuous features is the Euclidean distance. An important component of a clustering algorithm is the distance measure between data points. If the components of the data instance vectors are all in the same physical units then it is possible that the simple Euclidean distance metric is sufficient to successfully group similar data instances. However, even in this case the Euclidean distance can sometimes be misleading. Figure shown below illustrates this with an example of the width and height measurements of an object. Despite both measurements being taken in the same physical units, an informed decision has to be made as to the relative scaling. As the figure shows, different scaling can lead to different clustering's.

1.2.2 Distance Functions

The concept of dissimilarity (or distance) or dual similarity is the essential component of any form of clustering that helps us navigate through the data space and form clusters. By computing dissimilarity, we can sense and articulate how close together two patterns are and, based on this closeness, allocate them to the same cluster. Formally, the dissimilarity d(x, y) between x and y is considered to be a two-argument function satisfying the following conditions:

d(x,y) ≥ 0 for every x and y

d(x,x) =0 for every x

d(x,y) = d(y,x)

This list of requirements is intuitively appealing. We require a nonnegative character of the dissimilarity. The symmetry is also an obvious requirement. The dissimilarity attains a global minimum when dealing with two identical patterns, that is

d(x,x) = 0.

Distance, (metric) is a more restrictive concept, as we require the triangular inequality to be satisfied; that is, for any pattern x, y, and z we have

d(x,y) + d(y,z) ≥ d(x,z)

1.3 Components of a Clustering

Typical pattern clustering activity involves the following steps

(1) Pattern representation (optionally including feature extraction and/or selection),

(2) Definition of a pattern proximity measure appropriate to the data domain,

(3) Clustering or grouping,

(4) Data abstraction (if needed), and

(5) Assessment of output (if needed).

Feature Selection and Extraction

Interpattern Similarity


Pattern representation refers to the number of classes, the number of available patterns, data, and the number, type, and scale of the features available to the clustering algorithm. Some of this information may not be controllable by the practitioner. Feature selection is the process of identifying the most effective subset of the original features to use in clustering. Feature extraction is the use of one or more transformations of the input features to produce new salient features. Either or both of these techniques can be used to obtain an appropriate set of features to use in clustering.

Pattern proximity is usually measured by a distance function defined on pairs of patterns. A variety of distance measures are in use in the various communities. A simple distance measure like Euclidean distance can often be used to reflect dissimilarity between two patterns, whereas other similarity measures can be used to characterize the conceptual similarity between. The grouping step can be performed in a number of ways. The output clustering (or clustering) can be hard (a partition of the data into groups) or fuzzy (where each pattern has a variable degree of membership in each of the output clusters). Hierarchical clustering algorithms produce a nested series of partitions based on a criterion for merging or splitting clusters based on similarity. Partitional clustering algorithms identify the partition that optimizes (usually locally) a clustering criterion.Additional techniques for the grouping operation include probabilistic and graphtheoretic clustering methods.

1.4 Clustering Procedure

The clustering procedure includes analysis, validation and visualization of clusters. These steps of clustering are described below.

1.4.1 Cluster Analysis

The objective of cluster analysis is the classification of objects according to similarities among them, and organizing of data into groups. Clustering techniques are among the unsupervised methods, they do not use prior class identifiers. The main potential of clustering is to detect the underlying structure in data, not only for classification and pattern recognition, but for model reduction and optimization. Different classifications can be related to the algorithmic approach of the clustering techniques.

1.4.2 Data

Clustering techniques can be applied to data that is quantitative (numerical), qualitative (categorical), or a mixture of both. In this thesis, the clustering of quantitative data is considered. The data are typically observations of some physical process. Each observation consists of n measured variables, grouped into an n-dimensional row vector

1.5 Types of clustering

Hierarchical algorithms find successive clusters using previously established clusters. These algorithms usually are either agglomerative ("bottom-up") or divisive ("top-down"). Agglomerative algorithms begin with each element as a separate cluster and merge them into successively larger clusters. Divisive algorithms begin with the whole set and proceed to divide it into successively smaller clusters.

Partitional algorithms typically determine all clusters at once, but can also be used as divisive algorithms in the hierarchical clustering.

Density-based clustering algorithms are devised to discover arbitrary-shaped clusters. In this approach, a cluster is regarded as a region in which the density of data objects exceeds a threshold. DBSCAN and OPTICS are two typical algorithms of this kind.

Subspace clustering methods look for clusters that can only be seen in a particular projection (subspace, manifold) of the data. These methods thus can ignore irrelevant attributes. The general problem is also known as Correlation clustering while the special case of axis-parallel subspaces is also known as Two-way clustering, co-clustering or biclustering: in these methods not only the objects are clustered but also the features of the objects, i.e., if the data is represented in a data matrix, the rows and columns are clustered simultaneously. They usually do not however work with arbitrary feature combinations as in general subspace methods. But this special case deserves attention due to its applications in bioinformatics.

Many clustering algorithms require the specification of the number of clusters to produce in the input data set, prior to execution of the algorithm. Barring knowledge of the proper value beforehand, the appropriate value must be determined, a problem on its own for which a number of techniques have been developed.

1.6 CMeans Clustering

Partitional clustering is an important part of cluster analysis. Based on various theories, numerous clustering algorithms have been developed, and new clustering algorithms continue to appear in the literature. It is known that Occam's razor plays a pivotal role in data-based models, and partitioned clustering is categorized as a data-based model.

The three main contributions of this paper can be summarized as follows:

According to a novel definition of the mean, a unifying generative framework for partitioned clustering algorithms, called a general c-means clustering model (GCM), is presented and studied.

Based on the local optimality test of the GCM, the connection between Occam's razor and partitioned clustering is established for the first time. As its application, a comprehensive review of the existing objective function-based clustering algorithms is presented based on GCM.

Under a common assumption about partitioned clustering, a theoretical guide for devising and implementing clustering algorithm is discovered. These conclusions are verified by numerical experimental results

1.6.1 Distance measure

An important step in most clustering is to select a distance measure, which will determine how the similarity of two elements is calculated. This will influence the shape of the clusters, as some elements may be close to one another according to one distance and farther away according to another. For example, in a 2-dimensional space, the distance between the point (x = 1, y = 0) and the origin (x = 0, y = 0) is always 1 according to the usual norms, but the distance between the point (x = 1, y = 1) and the origin can be 2, or 1 if you take respectively the 1-norm, 2-norm or infinity-norm distance.

Common distance functions:

The Euclidean distance (also called distance as the crow flies or 2-norm distance). A review of cluster analysis in health psychology research found that the most common distance measure in published studies in that research area is the Euclidean distance or the squared Euclidean distance.[Full citation needed]

The Manhattan distance (aka taxicab norm or 1-norm)

The maximum norm (aka infinity norm)

The Mahalanobis distance corrects data for different scales and correlations in the variables

The angle between two vectors can be used as a distance measure when clustering high dimensional data. See Inner product space.

The Hamming distance measures the minimum number of substitutions required to change one member into another.

Another important distinction is whether the clustering uses symmetric or asymmetric distances. Many of the distance functions listed above have the property that distances are symmetric (the distance from object A to B is the same as the distance from B to A).

1.6.2 C-means Algorithm:

It minimize the index of quality defines as sum of squared distances for all points included in the cluster space to the center of the cluster


Fix the number of cluster.

Randomly assign all training input vector to a cluster .this creates partition.

Calculate the cluster center as the mean of each vector component of all vectors assigned to that cluster. Repeat for all cluster

Compute all Euclidean distances between each cluster center and each input vector.

Update partitioned by assigning each input vector to its nearest cluster minimum Euclidean distance.

Stop if the center do not move any more otherwise loop to step, where is the calculation of a cluster center.

1.7 K-Means Clustering

Simply speaking k-means clustering is an algorithm to classify or to group your objects based on attributes/features into K number of group. K is positive integer number. The grouping is done by minimizing the sum of squares of distances between data and the corresponding cluster centroid. Thus the purpose of K-mean clustering is to classify the data.

The basic step of k-means clustering is simple. In the beginning we determine number of cluster K and we assume the centroid or center of these clusters. We can take any random objects as the initial centroids or the first K objects in sequence can also serve as the initial centroids.

Then the K means algorithm will do the three steps below until convergence

Iterate until stable (= no object move group):

Determine the centroid coordinate

Determine the distance of each object to the centroids

Group the object based on minimum distance

After the completition of K-Means Clustering Algorithm on our session data set, we will apply another algorithm i.e Markov Model to find out the page which has highest occurrence among all the web pages visited by a web user under different sessions. These are some of the steps which is follows in our complete process.

Step 1. Initialize the data set of visited web page on server.

Step 2. Define the number of clusters called K.

Step 3. Determine the centroid of the data set.

Step 4. Take each sample in sequence and compute its distance from the centroid of each of the clusters and group these K clusters according to the distance from centroid of the cluster.

Step 5. Check for some common data set in each group. If there is some such value, eliminate them.

Step 6. Repeat the process from Step 2 till all items are not collectively categorized.

Step 7. Now use these K clusters in the basic training set for the prediction algorithm.

Step 8. Now perform first level Markov Model to determine the occurrence of each page visited by user.

Step 9. Remove the value having less value than the average.

Step 10. Now perform level 2 Markov Model and find the appearance of group 2 pages like AB, AC.

Step 11. Again eliminate the value having value less than the average.

Step 12. Find the after and before visited page from the group.

Step 13. Perform the strength calculation between the associated values with pair groups.

Step 14. The value with highest strength will be represented as the highest calculated / strength value.

1.8 Hierarchical Clustering

Hierarchical clustering: For agglomerative hierarchical clustering, the number of clusters is increased from 1 until the desired number of clusters is reached. On the other hand, for divisive hierarchical clustering, the number of clusters is decreased from the size of the dataset until the desired number of clusters is reached.

Data clustering task has similiar procedures:

Collect dataset.

Apply a certain clustering algorithm to get clustering results.

Test the clustering results.

If the test passes, stops. Otherwise go back to step 2 to repeat the clustering process.

Vector quantization (VQ) is a specific field of data clustering which emphasizes on algorithmic aspect of minimizing a distortion measure for a large amount of data. VQ is commonly used for image and audio data, with a goal similar to partitional clustering but a process similar to hierarchical clustering.

Hierarchical clustering is one of the most frequently used methods in unsupervised learning. Given a set of data points, hierarchical clustering outputs a binary tree (dendrogram) whose leaves are the data points and whose internal nodes represent nested clusters of various sizes. The tree organizes these clusters hierarchically, where the hope is that this hierarchy agrees with the intuitive organization of real-world data. Hierarchical structures are ubiquitous in the natural world. For example, the evolutionary tree of living organisms (and consequently features of these organisms such as the sequences of homologous genes) is a natural hierarchy. Hierarchical structures are also a natural representation for data which was not generated by evolutionary processes. For example, internet newsgroups, emails, or documents from a newswire, can be organized in increasingly broad topic domains. The traditional method for hierarchically clustering data, as given in Duda and Hart (1973) and shown in algorithm 3.1, is a bottom-up agglomerative algorithm. It starts with each data point assigned to its own cluster and iteratively merges the two closest clusters together until all the data belongs to a single cluster. The nearest pair of clusters is chosen based on a given distance measure (e.g. Euclidean distance between cluster means, or distance between nearest points).

1.9 Optimization

Optimization is the act of obtaining the best result under given circumstances. In the most general terms, optimization theory is a body of mathematical results and numerical methods for finding and identifying the best candidate solutions from a collection of alternatives without having explicitly enumerate and evaluate all possible alternatives [Anthony Sulistio]. In design, construction, maintenance of any engineering systems etc., engineers have to take many technological and managerial decisions at several stages. The ultimate goal of all such decisions is either to minimize the effort required or to maximize the desired benefit. Since effort required or the benefit desired in any practical situation can be expressed as a function of certain decision variables, optimization can be defined as the process of finding he condition that give the maximum or minimum value of a function.

Optimization problems arise naturally in many engineering applications. Control problems can be formulated as optimization problems in which the variables are inputs and states, and the constraints include the model equations for the plant. At successively higher levels, optimization can be used to determine set points for optimal operations, to design processes and plants, and to plan for future capacity[D.H. Kim].

Optimization problems contain the following key ingredients:

Variables that can take on a range of values. Variables that are real numbers, integers, or binary (that is, allowable values 0 and 1) are the most common types, but matrix variables are also possible.

Constraints that define allowable values or scopes for the variables, or that specify relationships between the variables;

An objective function that measures the desirability of a given set of variables.

The optimization problem is to choose from among all variables that satisfy the constraints the set of values that minimizes the objective function.

1.10 Particle Swarm Optimization

Particle Swarm Optimization optimizes an objective function by undertaking a population - based search. The population consists of potential solutions, named particles, which are metaphor of birds in flocks. These particles are randomly initialized and freely fly across the multi dimensional search space. During flight, each particle updates its own velocity and position based on the best experience of its own and the entire population. The various steps involved in Particle Swarm Optimization Algorithm [15] are as follows:

Step 1: The velocity and position of all particles are randomly set to within pre-defined ranges.

Step 2: Velocity updating - At each iteration, the velocities of all particles are updated according to,

where pi and vi are the position and velocity of particle i, respectively; pi,best and gi,best is the position with the 'best' objective value found so far by particle i and the entire population respectively; w is a parameter controlling the dynamics of flying; R1 and R2 are random variables in the range [0,1]; c1 and c2 are factors controlling the related weighting of corresponding terms. The random variables help the PSO with the ability of stochastic searching.

Step 3: Position updating - The positions of all particles are updated according to,

After updating, pi should be checked and limited to the allowed range.

Step 4: Memory updating - Update pi,best and gi,best when condition is met,

where f(x) is the objective function to be optimized.

Step 5: Stopping Condition - The algorithm repeats steps 2 to 4 until certain stopping conditions are met, such as a pre-defined number of iterations. Once stopped, the algorithm reports the values of gbest and f(gbest) as its solution.

PSO utilizes several searching points and the searching points gradually get close to the global optimal point using its pbest and gbest. Initial positions of pbest and gbest are different. However, using thee different direction of pbest and gbest, all agents gradually get close to the global optimum.