Cluster Analysis in Data Mining



Methods of cluster analysis lie at the intersection of statistics and informatics, and they play an important role in data mining. Cluster analysis divides data into meaningful or useful groups (clusters). If meaningful clusters are the goal, then the resulting clusters should capture the "natural" structure of the data. In other cases, however, cluster analysis is only a useful starting point for other purposes, e.g., data compression or efficiently finding the nearest neighbors of points [4]. The aim of this paper is to present some approaches to clustering in categorical data.

Whereas methods for cluster analysis of quantitative data are currently implemented in all software packages for statistical analysis and data mining, and the differences among them in this area are small, the differences in implementations of methods for clustering qualitative data are substantial [3]. There are a number of clustering types, such as hierarchical, partitional, exclusive, non-exclusive, fuzzy, partial, complete, object, and attribute clustering; other types of clusters are also used in cluster analysis [1].

The scope of this paper is modest: to provide an introduction to cluster analysis in the field of data mining, where we define data mining to be the discovery of useful, but non-obvious, information or patterns in large collections of data. Much of this paper is necessarily devoted to providing a general background for cluster analysis, but we also discuss a number of clustering techniques that have recently been developed specifically for data mining. While the paper strives to be self-contained from a conceptual point of view, many details have been omitted. Consequently, many references to relevant books and papers are provided [4].

What is Cluster Analysis?

Clustering is considered an unsupervised classification process [5]. The clustering problem is to partition a dataset into groups (clusters) so that, by a given criterion, the data elements within a cluster are more similar to each other than to data elements in different clusters.

Cluster analysis groups data objects based on information found in the data that describes the objects and their relationships. The goal is that the objects in a group be similar to one another and different from the objects in other groups. Cluster analysis divides data into meaningful or useful groups (clusters) [3]. Cluster analysis is an exploratory discovery process: it can be used to discover structures in data without providing an explanation or interpretation. If meaningful clusters are the goal, then the resulting clusters should capture the 'natural' structure of the data. For example, cluster analysis has been used to group related documents for browsing, to find genes and proteins that have similar functionality, and to provide a grouping of spatial locations prone to earthquakes [1].

However, in other cases, cluster analysis is only a useful starting point for other purposes, e.g., data compression or efficiently finding the nearest neighbors of points [3]. Whether for understanding or utility, cluster analysis has long been used in a wide variety of fields: psychology and other social sciences, biology, statistics, pattern recognition, information retrieval, machine learning, and data mining.

In many applications, what constitutes a cluster is not well defined, and clusters are often not well separated from one another [5]. Nonetheless, most cluster analysis seeks, as a result, a partition of the data into non-overlapping groups.

To better understand the difficulty of deciding what constitutes a cluster, consider Figures (a) through (d), which show twenty points and three different ways that they can be divided into clusters. If we allow clusters to be nested, then the most reasonable interpretation of the structure of these points is that there are two clusters, each of which has three subclusters. However, the apparent division of the two larger clusters into three subclusters may simply be an artifact of the human visual system. Finally, it may not be unreasonable to say that the points form four clusters. Thus, we stress once again that the definition of what constitutes a cluster is imprecise, and the best definition depends on the type of data and the desired results [4].

Figure: (a) Original points. (b) Two clusters. (c) Four clusters. (d) Six clusters.

What is not Cluster Analysis?

We illustrate the difference between cluster analysis and other techniques used to divide data objects into groups. Cluster analysis can be regarded as a form of classification in that it creates a labeling of objects with class (cluster) labels. However, it derives these labels only from the data. As such, clustering does not use previously assigned class labels, except perhaps for verification of how well the clustering worked. Thus, cluster analysis is distinct from pattern recognition and the areas of statistics known as discriminant analysis and decision analysis, which seek to find rules for classifying objects given a set of pre-classified objects.

While cluster analysis can be useful in the previously mentioned areas, either directly or as a preliminary means of finding classes, there is much more to these areas than cluster analysis. For example, the decision of what features to use when representing objects is a key activity of fields such as pattern recognition [3]. Cluster analysis typically takes the features as given and proceeds from there. Thus, cluster analysis, while a useful tool in many areas, is normally only part of a solution to a larger problem that typically involves other steps and techniques [4].

Types of Clustering

An entire collection of clusters is commonly referred to as a clustering, and in this section, we describe various types of clustering: hierarchical versus partitional, exclusive versus non-exclusive, fuzzy versus non-fuzzy, partial versus complete, and object clustering versus attribute clustering.

Hierarchical versus Partitional: The most commonly discussed difference between collections of clusters is whether the clusters are nested or unnested or, in more traditional terminology, whether a set of clusters is hierarchical or partitional. A partitional clustering is simply a division of the set of data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset [4]. A hierarchical clustering is a set of nested clusters organized as a hierarchical tree, where the leaves of the tree are singleton clusters of individual data objects, and where the cluster associated with each interior node of the tree is the union of the clusters associated with its child nodes [1]. The tree that represents a hierarchical clustering is called a dendrogram, a term that comes from biological taxonomy.

Figure 1: (a) Nested clusters. (b) Dendrogram.

The distinction between a hierarchical and a partitional clustering is not as great as it might seem. Specifically, by looking at the clusters on a particular level of a hierarchical tree, a partitional clustering can be obtained. For example, in Figure 1, the partitional clusterings, from bottom to top, are {{1}, {2}, {3}, {4}}, {{1}, {2, 3}, {4}}, {{1}, {2, 3, 4}}, and {{1, 2, 3, 4}}. Thus, a hierarchical clustering can be viewed as a sequence of partitional clusterings, and a partitional clustering can be viewed as a particular 'slice' of a hierarchical clustering.
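To make the 'slicing' concrete, here is a minimal, self-contained Python sketch (the points and the single-linkage merging rule are illustrative choices, not taken from the cited papers): an agglomerative single-linkage procedure over four one-dimensional points records the partitional clustering obtained after each merge.

```python
def single_linkage_levels(points):
    """Agglomerative single-linkage clustering over 1-D points.
    Returns the sequence of partitions, from all singletons down to
    a single cluster -- each level is one partitional 'slice'."""
    clusters = [[p] for p in points]
    levels = [[list(c) for c in clusters]]
    while len(clusters) > 1:
        # Find the pair of clusters with the smallest inter-point distance.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]   # merge the closest pair
        del clusters[j]
        levels.append([list(c) for c in clusters])
    return levels

# Four points whose merge order mirrors the {{1}, {2, 3}, {4}} example.
levels = single_linkage_levels([1.0, 3.0, 3.2, 4.0])
for lvl in levels:
    print(lvl)
```

Reading `levels` from the first entry to the last reproduces the bottom-to-top sequence of partitional clusterings described above.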

In the hierarchical clustering of Figure 1, the set of clusters at a given level and the set of clusters of the level immediately preceding it are the same except that one of the clusters in the given level is the union of two of the clusters from the immediately preceding level [5]. While this approach is traditional and most common, it is not essential, and a hierarchical clustering can merge more than two clusters from one level to the next higher one [1].

Exclusive versus Non-Exclusive: In a non-exclusive clustering, a point can belong to more than one cluster. In the most general sense, a non-exclusive clustering is used to reflect the fact that an object may belong to more than one group (class) at a time, e.g., a person at a university may be both an enrolled student and an employee of the university. Note that the exclusive versus non-exclusive distinction is independent of the partitional versus hierarchical distinction [4].

In a more limited sense, a non-exclusive clustering is sometimes used when an object could reasonably be placed in any of several clusters. Rather than make a somewhat arbitrary choice and place the object in a single cluster, such objects are placed in all of the "equally good" clusters. This type of non-exclusive clustering does not attempt to deal with multi-class situations, but is more similar to the fuzzy clustering approach that we describe next.

Figure: (a) Non-traditional nested clusters. (b) Non-traditional dendrogram.

Fuzzy versus Non-Fuzzy: In a fuzzy clustering, every point belongs to every cluster, but with a weight that is between 0 and 1. In other words, clusters are treated as fuzzy sets. Mathematically, a fuzzy set is one where an object belongs to any set with a weight that is between 0 and 1. For any object, the sum of the weights must equal 1. In a very similar way, some probabilistic clustering techniques assign each point a probability of belonging to each cluster, and these probabilities must also sum to one. Since the cluster membership weights for any object sum to 1, a fuzzy or probabilistic clustering does not address true multi-class situations where an object belongs to multiple classes.
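As a small illustration (the inverse-distance weighting and the values below are hypothetical choices, not a method from the cited sources), fuzzy membership weights for a point can be derived by normalizing its inverse distances to the cluster centers, so that each weight lies in [0, 1] and the weights for one point sum to 1:

```python
def fuzzy_memberships(point, centers):
    """Return one membership weight per cluster center; the weights
    lie between 0 and 1 and sum to 1, as in a fuzzy clustering."""
    inv = [1.0 / (abs(point - c) + 1e-9) for c in centers]  # closer => larger
    total = sum(inv)
    return [w / total for w in inv]

# A point at 2.0 belongs more strongly to the nearer center at 1.0.
weights = fuzzy_memberships(2.0, centers=[1.0, 4.0])
print(weights)
print(sum(weights))  # always 1 up to rounding
```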

Partial versus Complete: A complete clustering assigns every object to a cluster, whereas a partial clustering does not [1]. The motivation for a partial clustering is that not all objects in a data set may belong to well-defined groups. Indeed, many objects in the data set may represent noise, outliers, or "uninteresting background."

Object Clustering versus Attribute Clustering: While most clustering is object clustering, i.e., the clusters are groups of objects, attribute clustering, i.e., the clusters are groups of attributes, can also be useful. For instance, given a set of documents, we may wish to cluster the words (terms) of the documents, as well as or instead of the documents themselves. Attribute clustering is less common than object clustering, as are clustering techniques that attempt to cluster both objects and attributes simultaneously.

Types of Clusters

Several working definitions of a cluster are commonly used and are described in this section. In these descriptions, we will use two-dimensional points as our data objects to illustrate the differences between the different types of clusters, but the cluster types described are applicable to a wide variety of data sets [4].

Well-Separated: A cluster is a set of points such that any point in a cluster is closer to every other point in the cluster than to any point not in the cluster. Sometimes a threshold is used to specify that all the points in a cluster must be sufficiently close to one another.

Center-based Cluster: A cluster is a set of objects such that each object in the cluster is closer to the 'center' of that cluster than to the center of any other cluster. The center of a cluster is often a centroid, i.e., the average of all the points in the cluster, or a medoid, i.e., the most 'representative' point of a cluster.
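The difference between a centroid and a medoid can be shown with a few one-dimensional points (illustrative values, not data from the cited papers): the centroid is a computed average that need not coincide with any data point, while the medoid is always an actual member of the cluster.

```python
def centroid(points):
    """The average of the points -- need not be a data point itself."""
    return sum(points) / len(points)

def medoid(points):
    """The data point with the smallest total distance to all the others."""
    return min(points, key=lambda p: sum(abs(p - q) for q in points))

pts = [0.0, 2.0, 3.0, 4.0, 12.0]   # the outlier 12.0 pulls the centroid
print(centroid(pts))  # 4.2 -- not one of the points
print(medoid(pts))    # 3.0 -- an actual point, less affected by the outlier
```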

Contiguous Cluster: A cluster is a set of points such that a point in a cluster is closer to one or more other points in the cluster than to any point not in the cluster.

Density-based: A cluster is a dense region of points that is separated from other regions of high density by regions of low density. This definition is often used when the clusters are irregular or intertwined, and when noise and outliers are present. Note that in such a situation the contiguous definition would find only one cluster. Also note that curves of points do not form clusters when they fade into the noise, and neither does a low-density bridge between two small circular clusters.
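The idea can be sketched with a naive density-based procedure in the spirit of DBSCAN (a simplified illustration with made-up one-dimensional points, not an algorithm from the cited sources): points with enough close neighbors seed clusters that grow through other dense points, while isolated points are left unlabeled as noise.

```python
def density_clusters(points, eps=1.0, min_pts=3):
    """Label each point with a cluster id, or None for noise points."""
    labels = [None] * len(points)
    cluster_id = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = [j for j in range(len(points))
                     if abs(points[j] - points[i]) <= eps]
        if len(neighbors) < min_pts:
            continue                      # sparse neighborhood: treat as noise
        labels[i] = cluster_id            # dense 'core' point seeds a cluster
        frontier = list(neighbors)
        while frontier:                   # grow through connected dense points
            j = frontier.pop()
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            more = [k for k in range(len(points))
                    if abs(points[k] - points[j]) <= eps]
            if len(more) >= min_pts:      # j is also a core point: expand
                frontier.extend(more)
        cluster_id += 1
    return labels

# Two dense groups plus one isolated point, which ends up as noise.
labels = density_clusters([0.0, 0.5, 1.0, 5.0, 5.4, 5.8, 20.0])
print(labels)
```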

Shared Property: More generally, we can define a cluster as a set of points that share some property. This definition encompasses all the previous definitions of a cluster; e.g., points in a center-based cluster share the property that they are all closest to the same centroid. However, the shared-property approach also includes new types of clusters. To find such clusters, a clustering algorithm would need to have a very specific 'concept' of a cluster in mind; finding clusters of this sort is sometimes called 'conceptual clustering.' However, too sophisticated a notion of a cluster would bring us into the area of pattern recognition, and thus we will only consider the simpler types of clusters here [1].

Clusters Defined via Objective Functions: Another general approach to defining a set of clusters is by using an objective function. Specifically, we define an objective function that attempts to capture the 'goodness' of a set of clusters, and then define our clustering as the set of clusters that minimizes the objective function. To illustrate, for two-dimensional points, a common objective function is the squared error, which is computed by summing the squared distance of each point to the center of its cluster. For a specified number of clusters K, we can then define our clusters to be the partitioning of the points into K groups that minimizes this objective [2]. The K-means algorithm is based on this objective function.
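The squared-error objective and the two alternating steps of K-means can be sketched as follows (one-dimensional points and a fixed initialization, chosen here purely for illustration):

```python
def sse(points, centers, assign):
    """Sum of squared distances from each point to its assigned center."""
    return sum((p - centers[assign[i]]) ** 2 for i, p in enumerate(points))

def kmeans(points, centers, iters=10):
    """A minimal Lloyd's-algorithm K-means for 1-D points."""
    for _ in range(iters):
        # Assignment step: each point goes to its nearest center.
        assign = [min(range(len(centers)), key=lambda c: (p - centers[c]) ** 2)
                  for p in points]
        # Update step: each center moves to the mean of its points.
        centers = [sum(p for i, p in enumerate(points) if assign[i] == c)
                   / max(1, sum(1 for a in assign if a == c))
                   for c in range(len(centers))]
    return centers, assign

pts = [1.0, 1.2, 0.8, 8.0, 8.4, 7.6]
centers, assign = kmeans(pts, centers=[0.0, 10.0])
print(centers)                     # close to [1.0, 8.0]
print(sse(pts, centers, assign))   # about 0.4
```

K-means is only guaranteed to reach a local minimum of this objective, so in practice it is usually restarted from several initializations and the clustering with the lowest squared error is kept.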


Cluster analysis methods will always produce a grouping, and the groupings produced may or may not prove useful for classifying objects. If the groupings discriminate between variables not used to do the grouping, and those discriminations are useful, then cluster analysis is useful [4]. Cluster analysis may be used in conjunction with discriminant function analysis: after multivariate data are collected, observations are grouped using cluster analysis, and discriminant function analysis is then used on the resulting groups to discover the linear structure of the measures used in the cluster analysis and/or of different measures.

Cluster analysis methods are not clearly established. There are many clustering types, such as hierarchical, partitional, exclusive, non-exclusive, fuzzy, partial, complete, object, and attribute clustering [1], as well as other types of clusters. There are also many options one may select when doing a cluster analysis with a statistical package. Cluster analysis is thus open to the criticism that a statistician may mine the data, trying different methods of computing the proximity matrix and linking groups, until he or she "discovers" the structure originally believed to be contained in the data. One wonders why anyone would bother to do a cluster analysis for such a purpose [5].


[1] "Cluster Analysis: Basic Concepts and Algorithms".

[2] Hana Rezanková, "Cluster Analysis and Categorical Data", Vysoká škola ekonomická v Praze, Praha.

[3] Tatiana Tchemisova, Başak Akteke-Öztürk, Gerhard Wilhelm Weber, "On Continuous Optimization Methods in Data Mining - Cluster Analysis, Classification and Regression - Provided for Decision Support and Other Applications".

[4] "An Introduction to Cluster Analysis for Data Mining", 2000.

[5] Ke-Bing Zhang, "Visual Cluster Analysis in Data Mining", thesis submitted in fulfilment of the requirements for the Doctoral Degree of Philosophy, October 2007.

[6] Erfaneh Naghieh and Yonghong Peng, "Microarray Gene Expression Data Mining: Clustering Analysis Review", Department of Computing, University of Bradford.