Clustering Algorithm For Mixed Dataset Cultural Studies Essay

Published: Last Edited:

This essay has been submitted by a student. This is not an example of the work written by our professional essay writers.

Clustering is the main challenge task for Data Mining. One of the aim for Data mining is Clustering i.e., grouping similar data points into number of clusters. There are different clustering algorithms for grouping data to clusters. The algorithms works on different datasets i.e., either on pure Numeric datasets or pure Categorical datasets and few of the algorithms works both mixed numeric and categorical datasets. In these paper a technique is using for mixed datasets splitting it as different datasets and performing suitable algorithms for datasets to form clusters, and combining those split clusters as a categorical datasets and finally applying a suitable clustering algorithm to get the final clusters. Previously they done the same but they used clustering algorithms which doesn't handled outliers perfectly so here I am using CURE and ROCK algorithms for Numeric and Categorical datasets .

Keywords- Clustering, mixed dataset, Numerical and Categorical datasets.


Clustering the name itself defines the challenging task in data mining, where it is one of the major technique/algorithm which is used for grouping similar objects to form clusters. Many data mining applications require partition of data form clusters through that a group of similar objects can be observed like Companies maintaining database for employees according to range of salaries, or a group of clients showing interest in government jobs. It may also helps in identification of the same group of houses in a city according to value, type and geographic location. In applications such as road traffic, telecommunications there is something to extract data measuring the evolution of parameters those are useful for decision making. Where these can be either numerical or categorical type. Cluster analysis has been widely used in number of applications such as image processing, market research, pattern recognition and finally data analysis. The datasets may be either a Numeric dataset or Categorical dataset for each type of dataset their a clustering algorithm to perform clusters i.e., group of similar objects. But the real world data contains not only numeric but also categorical which is mixed type dataset no algorithm works best for the mixed datasets. Where k means is used but it is not suitable for large datasets. Clustering is the collection of group of objects which are similar to inside clusters and dissimilar to other objects belonging to other clusters. Generally the major clustering methods are defined as "Partitioning methods", and "Hierarchical methods", Density-based methods", and "Grid based methods", "Model based methods" .

It is tough for applying ancient clustering rule directly into these forms of knowledge. Typically once individuals ought to apply ancient distance-based agglomeration algorithms to cluster these varieties of knowledge, a numeric value will be assign to each type of category to those attributes. Some of categorical values like "small", "medium", "large", can be easily changed to numeric type values, where the other attributes contains values like colours which are "green", "yellow", "orange",..etc, it can't be change to numeric values because we can't give values as they are plenty of colours that never assumed, thus it will be a challenging work to specify values to those attributes.

In this paper I have taken a real world data which is meaningful as the original dataset After that dividing the original dataset to Numeric and Categorical dataset separately. For Numeric dataset use the Hierarchical method either bottom-up or top-down approach , I am using bottom -up approach i.e., CURE(Clustering using REpresentatives). And for Categorical dataset applying the same Hierarchical clustering i.e., ROCK(Robust Clustering using links) both the CURE and ROCK are hierarchical clustering algorithms which are used for clustering the Numeric and Categorical datasets. Where the Hierarchical method can be defined as either AGNES(AGglomerative NESting) or DIANA(Divisive ANAlysis). The AGNES also named as bottom-up approach, starts with initialized object combing with other to form a group, is then form other group with other objects , it continues until the final cluster is formed or until termination condition holds. The DIANA also named as top-down approach where it exactly opposite to the AGNES, it continues from the cluster where all the objects are hold. In each iteration a main cluster is divided to small clusters until each object in the main cluster forms a separate cluster or until termination condition occurs. One more technique is Dendrogram is used to represent the process of hierarchical clustering. It shows how the objects are grouping together at each level initially starts at level0.

In base paper they divide the original dataset as numeric and categorical datasets after applying the individual clustering algorithms for both the datasets, clusters are produced. Finally the clustering results of the both categorical and numeric datasets are combined together and treated as categorical dataset on which the K-prototype algorithm is employed to produce the final clusters. The drawback of using K-prototype is it inherits the ideas of the K-means as calculating the Euclidean distance and calculating the cluster centre according to centre the objects which near to centre can form group. But it doesn't handle the outliers which are far away from the cluster centre. Now we are using ROCK algorithm for produced clusters from both numeric and categorical datasets.


A. Methodology

1. Taking the real world dataset as Original dataset splitting it to the pure Numeric dataset and pure Categorical dataset

2. Applying CURE Clustering algorithm on Numerical dataset to produce clusters

3. Applying ROCK Clustering algorithm on Categorical dataset to produce clusters

4. Combining the output clusters from the CURE and ROCK algorithms and treated as Categorical dataset

5. Applying the ROCK Clustering algorithm on the final Categorical datasets to produce final clusters

B. Cluster Divide and Conquer approach for Mixed Dataset:

Dataset with mixed knowledge kind square measure common in world. Cluster Divide and Conquer could be a

Technique to mix many runs of different cluster technique

to mix many runs of different cluster algorithmic program to urge a typical partition of original dataset. The two datasets are Numeric and Categorical.

Fig1: Overview of Algorithm Frame work

CURE(Clustering using Representatives) :

CURE is also an Hierarchical clustering algorithm used for large datasets that adopts a Middle ground between centroid based mostly and all-points approach. Its belongs to Agglomerative i.e.., bottom-up approach. Here it represents a each cluster by a fixed number of points which are generated by selecting well scattered points from the cluster, then shrink the points towards the cluster centre by a specified faction.

The main advantage of CURE is shrinking helps to intense the effects of outliers. Set a target i.e., representative point number c, for each of the clusters select c well scattered points attempting to capture the physical shape and geometry of the cluster. The chosen scattered representative points are then finally shrunk towards the centroid in a fraction of a where 0<= a<=1.

Where the below figure depicts that we selected the well selected points i.e.., 1,2,3,4,5,6

Fig 2: CURE Clustering Procedure

We can notice that "+" is the cluster centre, here 1 and 2 are far away from the centre and 3,4,5,6 are near to the cluster centre and the 3,4 are shrunk towards the centre 2&5 and 6&1 merge each other and form cluster. If they are near to the cluster centre they merge and form a final cluster.





Fig 3: Clustering Procedure

The above figure depicts that two clusters with close pair of representatives are merged to one, and after another c points are selected from original representatives and it is merged to form a another one clusters, it continues until target k clusters are found.


Random Sampling:

When all the dataset is considered as input to the algorithm, therefore the execution time will be high due to the I/O costs. The Random sampling is better answer which reduces the execution time and algorithm results are better than the traditional algorithms. To speed up the algorithm operations, random samplings are fitted in Main Memory .

Flow of Improved CURE:

Partition Sampling:

When the clusters in the dataset became less dense, random sampling with limited points become useless because it implies poor quality of clustering. Therefore to speed up the CURE proposing a simple partitioning scheme.

Partition n data points into p partition i.e., n/p each

Second, pre-cluster each partition until the number of clusters n/pq reaches in each partition for some q>1

Then, each result cluster in the first phase is used as the second pass clustering input to form the final clusters.

ROCK(Robust Clustering using linKs) :

ROCK is also one of the Hierarchical clustering that deals with concepts of links i.e., the number of common neighbours between two objects for data with non-numeric which is categorical attributes. The formal clustering algorithms for clustering data with Boolean and categorical values use distance functions such as Euclidean and Manhattan distance. However these distance functions doesn't leads to high quality clusters when clustering categorical data.

Most of the clustering algorithms follow the same method that is when clustering the similarity between the two points are merged into a single cluster. This one leads to errors where two distinct clusters have a few points or outliers that are close, therefore following the similarity between two points to make a clustering decision would leads that two clusters to be merged. ROCK takes a global approach which is considering the neighbourhoods of individual pairs of points. If two similar points have the same neighbourhoods then the two points likely belong to same one and form a cluster and thereby merged.

More formally , the two points mi and mj are neighbours i.e., if sim(mi, mj)>= θ, where sim defines the similarity function and θ considered as the threshold value which is specified by user. The number of links between mi and mj is defined as the number of common neighbours between them. If the links between those points is large then it can said that it is more likely belong to same cluster. By considering neighbouring data points in the relationship between the pair of points ROCK is more robust than the standard clustering methods which focuses only on point similarity, A good example of data containing categorical attributes is market basket data here a transaction represents one customer, and each transactions contain set of items purchased by the customer. Use to cluster the customers where that customers with similar buying pattern are in a one cluster. Use for

Characterizing different customer groups based on similar buying patterns

Targeted Marketing

Predict buying patterns of new customers based on profile i.e, if he buys bread then he must buy cheese

In market database, where attributes of data points are non-numeric, transactions viewed as records with Boolean attributes corresponding to single item i.e., TRUE if transaction contain item, FALSE if doesn't contain item. Boolean attributes are special case of categorical Attributes.

The database consists of set of transactions , each of the transactions contains set of items. According to Jaccard coefficient for sim(Ti,Tj), the similarity between the two transactions Ti and Tj is

Sim(Ti,Tj)=|Ti∩Tj| / |Ti∪Tj|

Where the more items that contains in two transactions Ti and Tj have in common and dividing by |Ti∪Tj| is the scaling factor which ensures that is between 0 and 1.

A link is defined as the number of common neighbours between two records. Intuitively, it is the number of distinct paths of length 2 that exist between a pair of points. Note that the point is considered as a neighbour itself ,there is a link from each neighbour of the "root" point back to itself through the root. Therefore if a point has x neighbours the x^2 links are due to it. Due to the links advantage is it captures the neighbourhood related information of the data i.e., a step towards more global and robust solutions

ROCK follows three steps those are Random Sampling, Clustering with links, Labelling data on database

Random Sampling:

Usually it is a large number of data

Enables ROCK to reduce the number of points considered to reduce complexity

Clusters generated by the sample points

With appropriate sample size, the quality of clustering is not affected

Clustering with links:

Determines the best pairs of clusters to merge at each step of ROCK's hierarchical clustering algorithm. For a pair of clusters Ci and Cj, link [Ci, Cj] stores the number of cross links between the clusters Ci and Cj i.e.,


Algorithm for Computing links:

Procedure compute_links(S)


1.Compute nbrlist[i] for every point i in S

2. Set link[i,j] to be zero for all i,j

3. for i:=1 to n do {

4. N := nbrlist[i]

5. for j:=1 to |N|-1 do

6. for l:=j+1 to |N| do

7. link|N[j],N[l]] := link[N[j],N[l]]+1

8. }


Algorithm for ROCK Clustering Algorithm:

Procedure cluster(S,k)//Set on n Sample points and k number of clusters

begin Compute_links(S) // Compute links

2.for each s E S do

3. q[s]:= build_loacl_heap(link,s) //link[s; j]>0

4. Q=build_global_heap(S,q) // g(s, max(a[s])

5. while size(Q)>k do{ // iterates until k clusters are left

6. u:= extract_max(Q)

7. v:= max(q[u]) // find the best clusters to merge

8. delete(Q,v)

9. w:= merge(u,v)

10. for each x E q(u) U a{v} do{// Update the q for each x

11. link[x,w]:= link[x,u]+link[x,v]

12. delete (q[x].u): delete(q[x], v)

13. insert (q[x],w,g(x,w)); insert(q[w],x,g(x,w //Create q for w

14. update (Q,x,q[x])

15. }

16. insert (Q, w, q[w]) //Update Q

17. deallocate(q[u]);deallocate (q[v])

18. }


IV. Cluster Divide and Conquer Based Method for Mixed Data

DCMCM Algorithm:

Input: The Dataset DS

Output: Each Data Object Identified

1.Splitting the Dataset DS into Numeric Dataset(NDS) and Categorical Dataset(CDS)

2. Clustering the Categorical Dataset using Categorical Clustering Algorithm using ROCK

3. Clustering the Numeric Dataset using Numeric Clustering Algorithm using CURE

4. Combining the output clusters of CDS and NDS into a Categorical Dataset

5. Clustering the Combined Categorical Dataset using ROCK Clustering Algorithm to form final clusters

The DCMCM algorithm can observed in fig 1 where Dataset split into pure Numeric and Categorical dataset. Applying well existing clustering algorithms to yield the clusters, there by combining the resulted clusters of two datasets is treated as Categorical dataset. Applying proposed algorithm i.e.., ROCK on final dataset to exploit the final clusters.


In these paper my main contribution is to provide an algorithm framework for mixed datasets and proposed an idea i.e., CURE and ROCK hierarchical clustering algorithms for Numeric and Categorical Datasets and finally ROCK applying in final Categorical dataset . From the comparison of the Existing algorithm i can say that ROCK clustering algorithm is well one to cluster the Categorical dataset

In the future work my contribution is to implement the what i have proposed algorithm for datasets. And using another alternative clustering algorithms into algorithm framework, to get further experimental changes in this methodology.


Jiawei Han and Micheline Kamber. "Data Ware Housing and Data Mining Concepts and Techniques" Third Edition 2007

A. Ahmad and L.Dey,(2007), A k-mean clustering algorithm for mixed numeric and categorical data, Data and knowledge Engineering Elsevier Publication

V.V. Cross and T.A. Sudkamp, Similarity and compatibility in Fuzzy Set Theory:assessment and Applications

S. Guha, R. Rastogi, and K. Shim, 2000. ROCK: A Robust Clustering Algorithm for Categorical Attributes. Information Systems

Zengyou He, Xiaofe i Xu, Shengchun Deng,Clustering Mixed Numeric and Categorical Data: A Cluster Ensemble Approach

1.Srinivasulu Asadi*, 2Ch. D.V. Subba Rao, 3C. Kishore and 4Shreyash Raju, Clustering the Mixed Numerical and Categorical Datasets Using Similarity Weight and Filter Method

Martin Ester, Hans-Peter Kriegel, and Xiaowei Xu. A database interface for clustering in large spatial databases. In Int'l Conference on Knowledge Discovery in Databases and Data Mining (KDD-95), Montreal, Canada, August 1995

A. Ahmed and D. Lipika. A k-mean clustering algorithm for mixed numeric and categorical data..Data and Knowledge Engineering, 63(2):503-527, 2007

H. Chung-Chian, C. Chin-Long, and S. Yu-Wei. Hierarchical clustering of mixed data based on distance hierarchy. In-formation Sciences, 177(20):4474-4492, 2007

HE Zengyou, XU Xiaofei and DENG Shengchun, Squeezer: An Efficient Algorithm for Clustering Categorical Data.

M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. InProc. of 2nd Int. Conf. on Knowledge Discovery and Data Mining (KDD), pages 226-231, Portland, Oregon, August 1996.