Data Clustering Is An Exploratory Technique Computer Science Essay

Published: Last Edited:

This essay has been submitted by a student. This is not an example of the work written by our professional essay writers.

Chapter 3


3.1 Introduction

Data clustering is an exploratory technique that has gained a lot of attention in various fields like data mining, statistics, and pattern recognition. Cluster analysis is a method to discover hidden structure in the data based on similarity. Clustering is an unsupervised classification of elements or patterns (observations, data items, or feature vectors) into groups (clusters) because it does not use pre-classified patterns. Clustering high dimensional datasets is an especially challenging task due to inherent sparsity of the data space for different correlations different subspaces in different data neighborhoods. Many clustering algorithms are proposed to find internal structure in the current data not in future data. The main aim clustering is to arrange data by finding some 'sensible' grouping of the data items. The data points belonging to the same cluster are given with same labels (Fayyad et al., 1996).

Emerging data mining applications place specific requirements on clustering techniques (Han and kamber, 2000).They are

(i) Effective treatment of high dimensional data sets

(ii) End-user understanding of the results,

(iii) Good scalability with database size and dimensionality,

(iv) The ability to discover clusters of arbitrary geometry, size and density,

(v) Detection of features relevant to clustering,

(vi) Noise tolerance,

(vii) Insensitivity to initialization and order of data input,

(viii) Handling of various data types and minimal requirements for domain knowledge.

3.1.1 Motivation

1. To approximate a large/infinite/continuous set by a finite set of representatives.

2. To find meaningful clusters in data.

3. To organize data and summarizing it through cluster prototypes used for data compression.

4. Identified structure in the data is used to gain insight into data, to generate hypothesis, to detect anomalies and to extract salient features

5. To measure the degree of similarity between objects or patterns or organisms.

6. Derived cluster prototypes can be visualized efficiently and effectively than the original data set (Lee, 1981; Dubes and jain, 1980).

7. To minimize I/O costs.

Mathematical definition of a cluster

Let X ϵRnxd a set of data items representing a set of n points xi in Rn. The goal is to partition X into K clusters Ck such that every object that belongs to the same cluster is more similar than objects in different groups. The result of the algorithm is an injective mapping X->C of data items Xi to clusters Ck.

3.2 Clustering problem

Typically the process of data clustering involves the following steps summarized by A.K. Jain et al. (Jain and Dubes, 1988; Jain et al., 1999)

(1) Object or pattern representation (may include feature extraction and/or selection),

(2) Pattern proximity

(3) Clustering or grouping,

(4) Data abstraction (optional), and

(5) Evaluation of clustering result (optional).


Pattern representation refers to the number of classes, the number of available patterns, and the number, type, and scale of the features available to the clustering algorithm. The process of identifying the most effective subset of the original features for clustering is called Feature selection. The method of transforming input features to produce new salient features is called feature extraction. Both techniques can be used to obtain an appropriate set of features to use in clustering.

Pattern proximity Pattern proximity refers to the metric that evaluates the similarity (or in contrast, the dissimilarity) between two patterns. A variety of distance measures are in use (Anderberg 1973; Jain and Dubes 1988; Diday and Simon 1976). A simple distance measure is Euclidean distance used to reflect dissimilarity between two patterns, whereas other similarity measures used to characterize the conceptual similarity between patterns (Michalski and Stepp, 1983). Distance measures are discussed in Section

The next step is How to perform clustering? The output of the clustering can be crisp or fuzzy. In crisp clustering every object is assigned exactly one clusters. In fuzzy clustering, each pattern has a variable degree of membership in each of the output clusters. Hierarchical clustering algorithms produce a nested series of partitions based on a criterion for merging or splitting clusters based on similarity. Partitional clustering algorithms identify the partition that optimizes clustering criterion. The various techniques for cluster formation is described in Section 3.2.3.

The two main goals of clustering techniques are:

1. Each group or cluster is homogenous. The elements belong to the same group are similar to each other.

2. Each group or cluster should be different from other clusters, i.e. an element that belongs to one group should be different from the elements of other groups.

Data abstraction is a method for extracting a simple and compact representation of a data set. In 1976 Diday and Simon proposed that, a typical data abstraction is a compact description of each cluster in terms of cluster prototypes or centroids.

How to evaluate the output of a clustering algorithm? All clustering algorithms produce clusters in spite of whether the given data set contain clusters or not. One solution is an assessment of the data domain than the clustering algorithm itself. The data that do not contain clusters should not be processed by a clustering algorithm. Cluster validity analysis is used for the evaluation of a clustering algorithm's output. Validity measures have objective (Dubes 1993) and used to decide the output is meaningful or not. There are three types of validation methods. An external validity measure compares the recovered structure to a predefined structure. An internal validity measure tries to determine if the structure is intrinsically appropriate for the data. A relative test compares two structures and measures their relative merit (Jain and Dubes, 1998; Dubes, 1993).

3.2.1 Pattern Representation

Data patterns are represented as points or vectors in a multi-dimensional space, where each dimension represents a distinct attribute or property describing the object.

Data matrix : Mathematically, a data set consists of n objects, each object is described by d attributes,is denoted by D = {x1, x2, . . . , xn}, where xi = (xi1, xi2, . . . , xid)T is a vector denoting the ith object and xij is a scalar denoting the jth component or attribute of xi. The number of attributes d is also called the dimensionality of the data set. Finally, the whole data set D is represented as nxd matrix. Data Normalization

The following are some common approaches to data standardization:

(1) Min-max normalization: It performs a linear transformation on the original data. Scales all numeric variables in the range [0, 1] shown in Eq. (3.1) (3.1)

(2)Z-score: For each attribute value subtract off the mean, μj, of that attribute and then divide by the attribute's standard deviation, σj. If the data are normally distributed, then most attribute values will lie between -1 and 1 (Kaufman and Rousseeuw, 1990)





μj is the mean of the jth feature and σj is the standard deviation of the jth feature.

(3) Divide each attribute value of an object by the maximum observed absolute value of that attribute. This restricts all attribute values to lie between -1 and 1. All values are positive, and thus, all transformed values lie between 0 and 1.


3.2.2 Pattern proximity

A pattern proximity measure is a basis for cluster generation to indicate how two patterns are similar to each other (Ram kumar and Swami, 1998). The proximity measure always corresponds to the pattern representation. A good proximity measure is capable of using the key features of the data domain. The existing relationship among patterns helps to determine a suitable pattern similarity measure. Data types and scales

The features of the objects have different data types measured on different data scales. The type of an attribute is determined by the set of possible values. The different types of attributes are nominal, ordinal or numeric (interval scaled, ratio scaled).

The different data scales are

1) Qualitative

a) Nominal attribute - The values of nominal attribute are symbols or names of things.

Ex: Occupation with the values teacher, dentist,Programmer and so on.

If symbols are used then code is assigned. The codes for occupation are 0-teacher,1-dentist,2-programmer, and so on.

b) Ordinal attribute- the values have a meaningful order or ranks.

Ex: Customer performance: 0-Good, 1-Better, 2-Best

2) Quantitative

a) Interval scaled variables are measured on a scale of equal size units. The values of interval-scaled attributes have order.

Example 1: Temperature attribute.

b) For ratio scaled variables, the scale has an absolute zero so that ratios are meaningful.

Ex: Height, width, length attributes. Distance measures for numerical data

Distance measures the similarity or dissimilarity between two data objects x =(x1,x2,...,xd) and y = (y1,y2,..., yd) and z=(z1,z2,. . .,zd)

Requirements on similarity or dissimilarity measures

i) Positivity: d(x,y) >=0

ii) For dissimilarity d(x,x) = 0 patterns are same. For similarity

d(x,x)>= max (d(x,y)).In this case d(x,x)=1.

iii) Symmetry: d(x,y) = d(y,x) i.e the proximity matrix is symmetric

iv) Triangle Inequality: d(x,y) <= d (x,z) + d(z,y)

Minkowski distance: The Minkowski distance between two objects x and y is defined in eq.3.8 .The Euclidean distance, Manhattan distance and maximum distance can be defined from Minkowski.


Manhattan distance: If r = 1, d is the Manhattan distance or city block distance


Euclidean distance: If r= 2, d is Euclidean distance defined in Eq.3.10


Where xj and yj are the values of the jth attribute of x and y, respectively.

The squared Euclidean distance is defined to be


If r=∞, d is the maximum distance.

The maximum distance is also called the "sup" distance. It is defined to be the maximum value of the distances of the attributes; that is, for two data points x and y in d-dimensional space, the maximum distance between them is


Mahalanobis Distance

Mahalanobis distance (Jain and Dubes, 1988; Mao and Jain, 1996) is used to reduce the distance distortion caused by linear combinations of attributes. It is defined by


Where ∑ is the covariance matrix of the data set .

Covariance matrix

Covariance is a well-known concept in statistics. Let D be a data set with n objects, each of which is described by d attributes x1, x2,. . . , xd known as variables. The covariance between two variables xr and xs is defined to be the ratio of the sum of the products of their deviation from the mean to the number of objects (Rummel, 1970), i.e.,


Where xij is the jth component of data point xi and is the mean of all data points in the jth variable, i.e.,


The covariance matrix is a dÃ-d matrix in which the entry (r, s) contains the covariance between variable xr and xs, i.e.,

Similar way, there are different measures to calculate similarity or dissimilarity between the objects for categorical data, binary data, mixed type data, and time series data.

3.2.3 Clustering methods

For numerical data, (Lorr, 1983) suggested two kinds of clusters, compact clusters and chained clusters. A compact cluster is a set of data points in which members have high mutual similarity. A compact cluster can be represented by a representative point or center (Michaud 1997). For categorical data, a mode is used to represent a cluster (Huang, 1998).A chained cluster is a set of data points in which any two data points can be reachable through a path.

In hard clustering, each object is assumed to belong to one and only one cluster.In fuzzy clustering an object can belong to one or more clusters with probabilities.

Clustering methods (Anderberg, 1973; Hartigan, 1975; Jain and Dubes, 1988; Jardine and Sibson, 1971; Sneath and Sokal,1973; Tryon and Bailey, 1973) can be broadly divided into two basic types: hierarchical and partitional clustering. Hierarchical clustering

Hierarchical algorithms creates a hierarchical decomposition of the inputs (Steinbach et al., 2000).The output of hierarchical algorithms is a structure called dendrogram (Horn, 1988) that iteratively splits the input set into smaller subsets until each subset consists of only one object. In such a hierarchy, each level of the tree represents a clustering of the input.

Input to a hierarchical algorithm is an n x n similarity matrix, where n is the number of objects to be clustered. A linkage function is an essential feature for hierarchical cluster analysis. Its value is a measure of the "distance" between two groups of objects (i.e. between two clusters). It is either single linkage, complete linkage or average linkage.

The two basic types of hierarchical clustering algorithms are (Han and Kamber, 2000)

Agglomerative algorithms (Bottom up approach): They produce a sequence of clustering schemes of decreasing number of clusters at east step. Through each step of the clustering scheme, two closest clusters are merged.

Divisive algorithms (Top down approach): These algorithms produce a sequence of clustering schemes of increasing number of clusters at each step. In each step, a selected cluster is split into two smaller clusters. The examples for these algorithms are: AGNES and DIANA.

AGNES is an agglomerative method. Initially, each cluster. The clusters are then merged step-by-step according to desired criterion. This process is repeated until all the objects are in one cluster.

DIANA,the divisive method starts by taking all the objects into one cluster.The cluster is split based on distance between the closest neighboring objects in the cluster.This process is repeated until each new cluster contains a single object.

BIRCH (Zhang, 1996) uses a hierarchical data structure called CF-tree for incrementally and dynamically clustering the incoming data points. CF-tree is a height- balanced tree that stores the clustering parameters. BIRCH finds a good clustering with a single scan of the data and improve the quality with additional scans (Halkidi et al., 2001).It is able to handle noise effectively (Zhang et al, 1996). The size of each node in the CF-tree is limited, so it holds limited number of inputs. It generates different clusters for different orders of the same input data. So BIRCH is order sensitive.

CURE (Guha et al., 1998) represents each cluster by a certain number of points generated by selecting well scattered points and then shrinking them toward the cluster centroid by a specified fraction. The algorithm is capable of identifying clusters with non-spherical shapes and with different sizes. It uses a combination of random sampling and partition clustering to handle large databases. It is also claimed to be capable of handling noise effectively.

ROCK (Guha et al., 1999) is a robust clustering algorithm for boolean and categorical data. ROCK uses the links concept to measure the proximity between a pair of data points. ROCK is scalable and will generate better quality clusters than traditional algorithms.

In hierarchical clustering, the number of output clusters is not predefined. The disadvantage of hierarchical algorithms is the difficulty in determining the termination condition, i.e. at which point the merging or splitting process stops. Partitional clustering

Partitional clustering finds clusters simultaneously by decomposing the data set into a set of disjoint clusters`. The global criterion of clustering is minimizing dissimilarity of the samples within each cluster, while maximizing the dissimilarity of different clusters. The output of the Partitioning algorithms is a bounded flat partition among the output clusters with cluster centers.

The input to the partitional algorithm is an n x d pattern matrix (K-means), where n points are embedded in a d-dimensional feature space. A partitional clustering algorithm obtains a single partition of the data instead of a dendrogram that is computationally expensive. The partitional techniques generate clusters based on optimization function. Combinatorial search of the set of possible labelings for an optimum value of a criterion is computationally prohibitive. Therefore, the algorithm is typically run multiple times with different starting states, and the best configuration obtained from all of the runs is used as the output of clustering. The examples of partitional algorithms are k-means clustering, fuzzy c-means clustering etc. Details of these algorithms are discussed in chapter 5.

Partitional algorithms are relatively scalable and simple. These are suitable for data sets with compact spherical clusters that are well separated. The limitations are poor effectiveness in high dimensional spaces, dependence on the user to specify number of clusters in advance, high sensitivity to noise and outliers, and inability to deal with non-convex clusters of varying size and density. Density based clustering

It requires clusters to have a certain number of objects in a given space.

DBSCAN (Density based spatial clustering of applications with noise)

The more popular density based algorithm is DBSCAN (Zhang et al, 1996).Cluster is defined as a maximal set of density connected points.DBSCAN searches for clusters by checking the ε-neighborhood of a point p contains atleast Minpts, a new cluster with p as a core object is created(Han and Kamber, 2000). Then it iteratively collects directly density reachable objects from these core objects, which may involve the merging of a few density reachable points into clusters. The process terminates when no new point can be added to any cluster.

Optics (Ordering points to identify the cluster structure)

Ankerst et al, 1999 proposed OPTICS to overcome the limitations of DBSCAN, Optics computes the cluster ordering of a data set that represents the density based clustering structure of the data. This analysis is used to represent the clusters graphically that helps in understanding.


Hinneburg and Keim, 1998 introduced the DENCLUE a clustering method based on a set of density distribution points. The method is built on the following ideas:

i) The influence of a data point can be modeled by a mathematical function called influence function that defines the impact of a data point within its neighborhood.

ii) Sum of the influence function applied to all the data points gives the overall density of the data space.

iii) Clusters are determined mathematically by identifying density attractors, where density attractors are local maxima of the overall density function.

The major advantages of DENCLUE are it provides good clustering for data sets with large amounts of noise, finds arbitrarily shaped clusters in high dimensional data sets. Grid based methods

The grid based clustering approach uses a multiresolution grid data structure. It divides the object space into number of cells a grid structure on which all of the operations for clustering are performed. The typical examples of grid based approach include STING and CLIQUE.

STING: statistical information grid

STING (Wang et al., 1997) divides the spatial area into rectangular cells. There are several levels of rectangular cells corresponding to different levels of resolution and these cells form a hierarchical structure. Each cell at higher level is partitioned to form a number of cells at the next lower level. Statistical parameters like count, mean (m), standard deviation(s), max, and min can easily be computed at higher level cells from the parameters at lower level cells. When the data base is populated with data, the parameters m, s, max, min of the bottom level cells are computed directly from the data. The benefits of STING over other clustering methods are it is query independent, scans the data base once to compute the statistical parameters of the cells, so the time complexity of generating clusters is O(n)


WAVECLUSTER (Sheikholeslami et al., 1998; 2000) is a multi-resolution clustering algorithm used to find dense regions in the transformed space.The original feature space is transformed by using wavelet transformation technique.A wavelet transform is a signal processing technique that decomposes a signal into different frequency subbands. A Multidimensional grid structure will be imposed on to the data space for summarization. It is a grid based and density based method. It can handle large data sets efficiently and discovers clusters with arbitrary shape. It is very fast, and efficient in detecting outliers. The computational complexity is O(n).


CLIQUE is an integrated clustering algorithm (Han and Kamber, 2000). It integrates density based and grid based clustering. It is useful for clustering high dimensional data in large databases. It performs clustering in two steps. In the first step n-dimensional data is partitioned into non- overlapping rectangular cells, identifying the dense units among those. In the second step, it generates a minimal description for each cluster. It determines the maximal region that covers the cluster of connected dense units. It then determines a minimal cover for each cluster. It is effective in finding effective subspaces of the highest dimensionality such that high- density clusters exist in those subspaces. It is insensitive to the order of input tuples. Model based clustering

Model based clustering methods attempt to define a model to describe each cluster and try to fit data into the mathematical model (Han and Kamber, 2000). The algorithms based on this approach are: neural network approaches like self-organizing feature map (SOM), probability density-based approaches, Gaussian mixture model, and Bayesian clustering. These methods are based on the assumption that the data are generated by a mixture of underlying probability distributions.

COBWEB is a popular and simple method of incremental conceptual clustering that adopts a statistical approach. The input objects are described by categorical attribute-value pairs. It creates hierarchical clustering in the form a classification tree. Each node in a classification tree is a concept and contains probabilistic description (conditional probabilities) of the concept classified under that node.

CLASSIT is an extension of COBWEB for incremental clustering of continuous data. Both are not suitable for clustering large data containing multiple dimensions.

The neural network approach to clustering tends to represent each cluster as an exemplar. New objects can be distributed to the cluster whose exemplar is most similar, based on distance measure. The attributes of an object assigned to a cluster can be predicted from the attributes of the cluster's exemplar.

3.3 Limitations of clustering

A problem with the clustering methods is that the interpretation of the clusters may be difficult. Most clustering algorithms prefer certain cluster shapes, and the algorithms will always assign the data to clusters of such shapes even if there were no clusters in the data. Some algorithms wish to make inferences about cluster structure then it is necessary to analyze the data set to check for clustering tendency. The results of the cluster analysis must be validated. (Jain and Dubes, 1988) present methods for both purposes.

Another potential problem is that the choice of the number of clusters may be critical: quite different kinds of clusters may emerge when K is changed. Good initialization of the cluster centroids may also be crucial; some clusters may even be left empty if their centroids lie initially far from the distribution of data. A research paper by Dubes in 1987 provides guidance on this key design decision.

Clustering is used to reduce the amount of data by categorization. For example in the case of the K-means algorithm the centroids that represent the clusters are still high-dimensional, and visualization methods are needed.

3.4 Conclusion

Clustering is dynamic field of research in data mining. In this chapter the basic concepts of clustering and motivation for clustering is presented. Next section explains the major steps of clustering process.In the next section several basic clustering techniques are presented. In this section hierarchical algorithms, basics of partitional clustering, density based algorithms are discussed.Grid based clustering algorithms and model based clustering algorithms are presented in the next part of the section.