A Study On Datamining And Clustering Education Essay

Published: Last Edited:

This essay has been submitted by a student. This is not an example of the work written by our professional essay writers.

Data mining is a process of analyzing data from diverse perspective and summarizing it into useful information. In simple words extracting knowledge from large amount of data .The use of data mining information will increase revenue and cut costs. Data mining software's allows you to analyze data in different perspective, classify it and summarize the relationships found. Thus data mining is a method in discovering correlations and patterns among number of fields in huge relational database.

Cluster & clustering: Clustering is the process of partitioning of data into groups of similar objects (clusters). It is a technique of data modeling which gives the exact concise summaries of the data.

There are several clustering techniques like

  • partitioning methods,
  • hierarchical methods
  • density-based methods,
  • grid-based methods,
  • model-based methods,
  • methods for high-dimensional data and
  • constraint-based clustering.


Clustering is a process of grouping a set of abstract or physical objects into classes of related objects. A cluster is a group of data objects that are analogous to one another within the same cluster and dissimilar to the data object in the other cluster. The process of clustering involves firstly partitioning the set of data into groups based on data analogy and then assigning label to the groups. Thus Clustering is advantageous as it is adaptable to changes and helps in distinguish between clusters.

Cluster analysis is similar to the human activity of distinguishing cats and dogs, or identifying between different things like bikes and cars. Cluster analysis is widely used in several fields like market research, pattern recognition, data analysis and image processing. For example in business clustering helps marketers to discover distinctive groups in their customer bases and portray customer group based in purchasing pattern. Clustering partitions large data sets into groups according to their similarity thus also called as data segmentation. Cluster analysis can be used as a tool to gain insight into distribution of data, to examine the distinctiveness of each cluster and center on a particular set of clusters for additional analysis. Data clustering is causative in the areas such as data mining, statistics, machine learning, spatial database, marketing etc.

Cluster analysis tools have been built as software packages such as S -plus, SPSS and SAS. Clustering is an example of unsupervised learning in machine. For this reason clustering is a form of learning by observation and not by examples. The efforts in data mining are inattentive on discovering methods for effective cluster analysis in outsized database. The research have resulted in active themes that focus on the scalability of clustering methods and the effectiveness of methods for clustering composite shapes and types of data, high dimensional clustering techniques ,mixed numerical and categorical data in outsized database.

Clustering a challenging field

Clustering is a challenging field of research in which its prospective applications masquerade their own special needs.

These are the typical requirements of clustering in data mining:

  • Scalability: Clustering algorithms should work well with both small and large databases. The results should be stable for both large and small volumes of data .So high scalable clustering algorithms are needed.

  • Dealing with different types of attributes: Algorithms must be designed to cluster data of all types like binary, categorical, ordinal or mixture of data types.

  • Discovery of clusters with arbitrary shape: Algorithms should be developed to find clusters with random shape and size.

  • Minimal requirements for domain knowledge to determine input parameters: Clustering becomes difficult when user have to input parameters for data sets. There is incline in the quality of the cluster.

  • Dealing with noisy data: Databases contain outliers and erroneous data which will lead to poor cluster quality.

  • Incremental clustering and insensitivity to the order of input records: Algorithms should incorporate new data s and new order of inputs without different clustering patterns.

  • Dimensionality: Cluster algorithms should handle several dimensions of data .High dimensional space is challenging and such data can be sparse and highly distorted.

  • Constraint-based clustering: Clustering with real time applications would involve various constraints.

  • Interpretability and usability: Users expect clustering results to be interpretable, understandable, and usable. The cluster may need to be tied to specific semantic interpretations and applications.

Clustering methods:

The major clustering methods available are

  • Partitioning methods
  • Grid-based methods
  • Model-based methods
  • Hierarchical methods
  • Density-based methods
  • Clustering high-dimensional data
  • Constraint-based clustering

There are two types of classification- Supervised and unsupervised

Supervised classification - we might know the available data and the amount

Unsupervised classification - both the availability and amount of data are unknown. This is called CLUSTERING

Intra clusters is one in which the similarity between the data is high compared to inter cluster.

Applications of Clustering

  • Image processing
  • Economic science
  • Spatial Analysis
  • Pattern recognition

Clustering Techniques

  • Hierarchial algorithm
  • Partitioned algorithm
  • Grid based
  • Model based
  • Density based
  • Frequent pattern-based
  • Constraint based

1.Partitioned algorithm

Generally a database containing m set of clusters and n set of data are constructed.The criteria for partitioning are

Heuristic methods: K-Means, k medoids, CLARANS

Global optimal: enumerates all paritions

K Means:

Every cluster is represented by the center of the cluster.

k-means algorithm

  1. The given objects are partitioned into non empty sets

  2. Determine the mean point, of the cluster

  3. Each object to the cluster is assigned with the nearest mean point

  4. step 2 is repeated when there are no more new values(mean point)

There are various dis advantages using K mean method, the number of clusters to be dealt should be predefined. This method is not able to handle Outliers. Only one parameter i.e. mean point is alone is used for classifying the cluster which is not sufficient.

k-medoids or PAM (Partition around medoids):

Each cluster is represented by one of the objects in the cluster

We improve the methodology by using medoids (most centrally located object in a cluster) as a reference point instead of mean value used earlier.


Initially a set of medoids is selected and recursively one of the medoids are replaced by non-medoids if it improves the total distance of the resulting clustering.This method works effectively for small data sets, but does not scale well for large data sets.

CLARA (Clustering Large Applications):

Various data sets are choosed and PAM is applied on each set which gives the best clustering as the output.It deals with larger data sets than PAM but the Efficiency of each data sets depends on the sample size

CLARANS ("Randomized sampling" CLARA):

This technique is highly efficient and scalable than both PAM and CLARA .In this method the neighboring sample are drawn dynamically.The clustering process can be presented as searching a graph where every node is a potential solution, that is, a set of k medoids .If the local optimum is found, CLARANS starts with new randomly selected node in search for a new local optimum

2.Hierarchical approach:

Based on some criteria the set of data (or objects) are decomposed hierarchically.

Typical methods: Diana, Agnes, BIRCH, ROCK, CAMELEON

In this approach distance matrix is chosen as the clustering criteria. The number of clusters k is not required instead a termination condition is needed.

AGNES (Agglomerative Nesting)

Bottom up approach, which merges the dissimilar matrix or the cluster in a non descending manner. Merging is performed in such a manner that one cluster is merged into another large cluster and into another larger cluster and so on until all the objects are merged into a single cluster.This method is not scalable, time complexity is O (n2), where n is the number of total objects

DIANA (Divisive Analysis)

Top down decomposition, this is the Inverse order of Agglomerative Nesting. In this method the object is spitted into smaller objects and so on and each of them form their own cluster. This method is rarely applied.

Integration of hierarchical with distance-based clustering:

BIRCH : uses CF-tree and incrementally adjusts the quality of sub-clusters

Balanced Iterative Reducing and Clustering using Hierarchies. It uses CF(Clustering Feature )-tree and incrementally adjusts to the quality of sub-clusters.Th is approach is basis for multi phase clustering.

Step1: scan DB to build an initial in-memory CF tree (a multi-level compression of the data that tries to preserve the inherent clustering structure of the data)

Step 2: arbitrary clustering algorithm is used to cluster the leaf nodes of the CF-tree

This method is useful to handle only numeric data, and sensitive to the order of the data record.

CF-Tree in BIRCH

Clustering feature: It concisely summarizes the statistics for a given sub cluster: the 0-th, 1st and 2nd moments of the sub cluster from the statistical point of view. Efficient utilization of storage and stores the crucial measurements for cluster computation.

A CF tree is a height-balanced tree that stores the clustering features for a hierarchical clustering .A non leaf node in a tree has "children" and these nodes store sums of the CFs of their children.

A CF tree has two parameters:

  • Branching factor: specify the maximum number of children.

  • Threshold: max diameter of sub-clusters stored at the leaf nodes


Clustering categorical data by neighbor and link analysis.RObust Clustering using links. These links are used to measure similarity/proximity. Its not a distance-based clustering. Computational complexity: is of order

Algorithm: sampling-based clustering

  • Samples are selected randomly

  • Then the Cluster are linked based on the similarity/proximity.

  • These clusters are then labeled.


Hierarchical clustering using dynamic modeling. Method of clustering complex objects

The similarity between the clusters are measured based on a dynamic model. Two clusters are merged only if the interconnectivity and closeness (proximity) between two clusters are high relative to the internal interconnectivity of the clusters and closeness of items within the clusters

Cure ignores information about interconnectivity of the objects; Rock ignores information about the closeness of two clusters

A two-phase algorithm

  1. Use a graph partitioning algorithm: cluster objects into a large number of relatively small sub-clusters

  2. Use an agglomerative hierarchical clustering algorithm: find the genuine clusters by repeatedly combining these sub-clusters

Density-based approach:

In this approach Clustering is done based on density (local cluster criterion), such as density-connected points

Typical methods: DBSACN, OPTICS, DenClue


  • Clusters are discovered based on the arbitrary shape

  • Density parameters are used as termination condition

  • Effectively handles noise.

Two parameters:

  • Eps: Maximum radius of the neighbourhood

  • MinPts: Minimum number of points in an Eps-neighbourhood of that point
  • NEps(p): {q belongs to D | dist(p,q) <= Eps}


A point p is density-reachable from a point q w.r.t. Eps, MinPts if there is a chain of points p1, …, pn, p1 = q, pn = p such that pi+1 is directly density-reachable from pi


A point p is density-connected to a point q w.r.t. Eps, MinPts if there is a point o such that both, p and q are density-reachable from o w.r.t. Eps and MinPts


Density Based Spatial Clustering of Applications with Noise.Relies on a density-based notion of cluster: A cluster is defined as a maximal set of density-connected points.Discovers clusters of arbitrary shape in spatial databases with noise


  • Arbitrary select a point p

  • Retrieve all points density-reachable from p w.r.t. Eps and MinPts.

  • If p is a core point, a cluster is formed.

  • If p is a border point, no points are density-reachable from p and DBSCAN visits the next point of the database.

  • Continue the process until all of the points have been processed.


  • Ordering Points To Identify the Clustering Structure

  • Produces a special order of the database wrt its density-based clustering structure

  • This cluster-ordering contains info equiv to the density-based clusterings corresponding to a broad range of parameter settings

  • Good for both automatic and interactive cluster analysis, including finding intrinsic clustering structure

  • Can be represented graphically or using visualization techniques


Using Statistical Density Functions


  • Solid mathematical foundation

  • Good for data sets with large amounts of noise

  • Allows a compact mathematical description of arbitrarily shaped clusters in high-dimensional data sets

  • Significant faster than existing algorithm (e.g., DBSCAN)

  • But needs a large number of parameters

Grid-based approach:

based on a multiple-level granularity structure(multi-resolution grid data structure)

Typical methods: STING, WaveCluster, CLIQUE

STING: A Statistical Information Grid Approach

The spatial area area is divided into rectangular cells

There are several levels of cells corresponding to different levels of resolution

Each cell at a high level is partitioned into a number of smaller cells in the next lower level

Statistical info of each cell is calculated and stored beforehand and is used to answer queries

Parameters of higher level cells can be easily calculated from parameters of lower level cell

count, mean, s, min, max

type of distribution-normal, uniform, etc.

Use a top-down approach to answer spatial data queries

Start from a pre-selected layer-typically with a small number of cells

For each cell in the current level compute the confidence interval

Remove the irrelevant cells from further consideration

When finish examining the current layer, proceed to the next lower level

Repeat this process until the bottom layer is reached


  • Query-independent, easy to parallelize, incremental update

  • O(K), where K is the number of grid cells at the lowest level


  • All the cluster boundaries are either horizontal or vertical, and no diagonal boundary is detected

WaveCluster: Clustering by Wavelet Analysis

A multi-resolution clustering approach which applies wavelet transform to the feature space

How to apply wavelet transform to find clusters

Summarizes the data by imposing a multidimensional grid structure onto data space

These multidimensional spatial data objects are represented in a n-dimensional feature space

Apply wavelet transform on feature space to find the dense regions in the feature space

Apply wavelet transform multiple times which result in clusters at different scales from fine to coarse

Wavelet transform: A signal processing technique that decomposes a signal into different frequency sub-band (can be applied to n-dimensional signals)

Data are transformed to preserve relative distance between objects at different levels of resolution

Allows natural clusters to become more distinguishable


Input parameters

# of grid cells for each dimension

the wavelet, and the # of applications of wavelet transform Why is wavelet transformation useful for clustering?

Use hat-shape filters to emphasize region where points cluster, but simultaneously suppress weaker information in their boundary

Effective removal of outliers, multi-resolution, cost effective

Major features:

Complexity O(N)

Detect arbitrary shaped clusters at different scales

Not sensitive to noise, not sensitive to input order

Only applicable to low dimensional data

Both grid-based and density-based

CLIQUE (Clustering In QUEst)

Automatically identifying subspaces of a high dimensional data space that allow better clustering than original space

CLIQUE can be considered as both density-based and grid-based

It partitions each dimension into the same number of equal length interval

It partitions an m-dimensional data space into non-overlapping rectangular units

A unit is dense if the fraction of total data points contained in the unit exceeds the input model parameter

A cluster is a maximal set of connected dense units within a subspace

Partition the data space and find the number of points that lie inside each cell of the partition.

Identify the subspaces that contain clusters using the Apriori principle

Identify clusters

Determine dense units in all subspaces of interests

Determine connected dense units in all subspaces of interests.

Generate minimal description for the clusters

Determine maximal regions that cover a cluster of connected dense units for each cluster

Determination of minimal cover for each cluster


A model is hypothesized for each of the clusters and tries to find the best fit of that model to each other

Typical methods: EM, SOM, COBWEB

EM - Expectation Maximization

EM - A popular iterative refinement algorithm

An extension to k-means

Assign each object to a cluster according to a weight (prob. distribution)

New means are computed based on weighted measures

General idea

Starts with an initial estimate of the parameter vector

Iteratively rescores the patterns against the mixture density produced by the parameter vector

The rescored patterns are used to update the parameter updates

Patterns belonging to the same cluster, if they are placed by their scores in a particular component

Algorithm converges fast but may not be in global optima

Initially, randomly assign k cluster centers

Iteratively refine the clusters based on two steps

Expectation step: assign each data point Xi to cluster Ci with the following probability

Maximization step:

Estimation of model parameters

Conceptual clustering

A form of clustering in machine learning

Produces a classification scheme for a set of unlabeled objects

Finds characteristic description for each concept (class)

COBWEB (Fisher'87)

A popular a simple method of incremental conceptual learning

Creates a hierarchical clustering in the form of a classification tree

Each node refers to a concept and contains a probabilistic description of that concept

Frequent pattern-based:

Based on the analysis of frequent patterns

Typical methods: pCluster

User-guided or constraint-based:

Clustering by considering user-specified or application-specific constraints

Typical methods: COD (obstacles), constrained

Clustering in applications: desirable to have user-guided (i.e., constrained) cluster analysis

Different constraints in cluster analysis:

Constraints on individual objects (do selection first)

Cluster on houses worth over $300K

Constraints on distance or similarity functions

Weighted functions, obstacles (e.g., rivers, lakes)

Constraints on the selection of clustering parameters

# of clusters, MinPts, etc.

User-specified constraints

Contain at least 500 valued customers and 5000 ordinary ones

Semi-supervised: giving small training sets as "constraints" or hints