Customer analysis is carried out by all businesses in some form or the other. It refers to the process of understanding who our customers are, where they come from and what motivates them to buy our products and services. Recent developments in the fields of business investment, scientific research and information technology have resulted in the collection of massive data which becomes highly useful in finding certain patterns governing the data source. Customer segmentation is a way to have more targeted communication with the customers. The process of segmentation describes the characteristics of the customer groups within the data. Clustering algorithms are popular in finding hidden patterns and information from such repository of data. This paper presents how different clustering algorithms can be compared and the optimal one can be selected for the purpose of customer segmentation for a shop. The information extracted can be used to satisfy the needs of the key customers and also helps in making strategic decision as to how the business can be expanded. This paper compares the four clustering algorithms: K-means, K-medoid, Fuzzy C-means and Gustafson-kessel. The best one is selected and optimal number of clusters is found out and applied to that algorithm for customer segmentation.
Data Mining is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data (Pujari, 2001). With the widespread use of databases and the explosive growth in their sizes, organizations are faced with the problem of information overload. The problem of effectively utilizing these massive volumes of data is becoming a major problem for all enterprises. These volumes of data can be used to find patterns in the form of predicative and classification models, which play a critical role in supporting an organization's operational, tactical and strategic decisions (Cox, 2005). There are different techniques in data mining that intent to uncover these patterns and associations from large databases. Large databases in commerce driven industries such as retailing, manufacturing, insurance, banking and investments contain a staggering amount of raw data. Knowledge engineers and system analysts are now faced with challenges of maintaining a competitive edge in the world of e-commerce and rapidly changing delivery channels. To meet these challenges, they must look into the accumulated raw data to find new customer relationships, possible newly emerging lines of business and ways to cross sell existing products to new and existing customers (Cox, 2005).
Data mining can also be applied to marketing databases. This paper illustrates how data mining technique: clustering can be applied to customers' database and segment it for its marketing strategies planning. Here, I have collected six months transactional database of a shop which I would refer to as myshop. Located at two peak areas myshop sales and supplies certain items whose sales and certain features associated with the items are as follows:
i) Frequency in a month
ii) Sale of cold drink
iii) Sale of coffee
iv) Sale of tea
v) Sale of Varieties of sauces
vi) Sale of Fruits, pickles, vegetable & non-vegetable (Canned)
vii) Sale of Varieties of sweet dishes
viii) Sale of Varieties of confectionery
ix) Sale of Breads
x) Sale of Cakes
xi) Sale of Chocolate varieties
xii) Sale of Cheese products
xiii) Sale of Ice-cream products
xiv) Number of items sold
xv) Amount spent in buying items
Attracting the customers can be done in many different ways. But to do this we have to understand their behavior, their buying pattern etc. This paper provides a solution to find the targeted customers by comparing four clustering algorithms: K-means, K-medoids, Fuzzy C-means, Gustafson-Kessel; the first two being non-fuzzy while the last two are based on fuzzy technique. The optimal one is selected and segmentation is done using the dataset. A set of validation measures are applied to find out the optimal number of clusters, which is the required parameter of the algorithms. Section 2 describes the related work done in this field. Section 3 explains the four clustering techniques. Section 4 defines segmentation and explains the validation techniques for the purpose of selecting the optimal algorithm. Section 5 presents the experimental results and segmentation process. Section 6 concludes the paper.
Peng et al., (2009) completed a research which intended to provide an overview of the past and current data mining activities. They collected 1436 textual documents collected from premier data mining journals and conference proceedings to determine which data mining subjects had been studied over the last several years. They came up with five hot topics as popular techniques in data mining: clustering, data preprocessing, association analysis, classification and text mining. In the recent publication of Ngai et al.(2008), a survey of 24 journals from the last five consecutive years, was done. The research aimed to give a research summary on the application of data mining in the Customer Relation Management (CRM) domain and techniques which are most often used. It concentrated on 87 core articles. They have clearly proved that only one article discussed the visualization of data mining results within the context of CRM and that more research could be done on this issue.
A review of image segmentation by clustering was reported in Jain and Flynn (1996). They have clearly shown that segmentation by clustering can be one of the best tools of data mining and could be applied in many applications. A comparison of various clustering algorithms for constructing the minimal spanning tree and short spanning path was in Lee (1981). The importance and interdisciplinary nature of clustering is evident through its vast literature. Comparisons of various hard clustering algorithms, based on experiments have been reported in Mishra and Raghavan (1994) and Al-Sultan and Khan (1996). Wang & Garibaldi, (2005), compared FCM (fuzzy) and K-means (non-fuzzy) clustering techniques while clustering IR spectra that were collected from a large area of an axillaries lymph node tissue section.
The research could have been improved by increasing the number of algorithm from one each of fuzzy & non-fuzzy to two each such as K-means, K-medoids, Fuzzy C-means and Gustafsson-Kessel.
Cluster analysis is primarily a tool for discovering previously hidden structure in a set of unordered objects. In this case one assumes that a 'true' or natural grouping exists in the data. However, the assignment of objects to the classes and the description of these classes are unknown. By arranging similar objects into clusters one tries to reconstruct the unknown structure in the hope that every cluster found represents an actual type or category of objects. In partitioning the data only cluster centers are moved and none of the data points are moved (Han and Kamber, 2003). Thus clustering is an iterative process of finding better and better cluster centers in each step. Most popular clustering algorithms are k-means and its variations. This paper presents the k-means, k-medoids, fuzzy c-means and Gustafson-Kessel clustering algorithms.
The value 'k' stands for the number of cluster seeds initially provided for the algorithm. The seeds or the centroids can be selected arbitrarily from the objects. This algorithm takes the input parameter 'k' and partitions a set of m objects into k clusters. The technique work by computing the distance between a data point and the cluster center to add an item into one of the clusters so that intra-cluster similarity is high but inter-cluster similarity is low. A common method to find the distance is to calculate the sum of the squared difference, known as the Euclidian distance (Cox, 2005), is given in equation (1).
dk : is the distance of the kth data point
n : is the number of attributes in a cluster
Xjk : is jth value of the kth data point
Cji : is the jth value of the ith cluster center
When all the distance of the data points are calculated they are assigned to the k centroids using the minimum distance as givne in equation (2). Then the value of the new cluster centers are calculated as the average of all the objects present in the same cluster. The cluster center calculation causes the previous centroid location to move towards the center of the cluster set. This is continued until there is no change in cluster centers.
Sj(t) : is the membership of xi in jth cluster
t : is the current iteration cycle value
xi : is the ith data point
Cj : is the center of the jth cluster
Ck : is the center of the kth cluster
Pseudo code of k-means clustering algorithm (Cox, 2005) is given below:
initialise k=number of clusters
initialise Cj (cluster centers)
Set Cycle variable t=1
For i=1 to n Distribute sample points(xi)into k clusters
For j=1 to k : Calculate Sj(t) for xi applying (2)
For j=1 to k : Compute new cluster centers by calculating weighted average
Until Cj estimate stabilize
k-medoid clustering, is also a hard partitioning algorithm, which uses the same equations as the k-means algorithm. The only difference is that in k-medoid the cluster centers are the nearest data points to the mean of the data in one cluster. This simply means that in k-means the centroid, in most cases, is none of the data points. K-medoids is useful when we need the centroid to be one of the datapoints.
Fuzzy C-means Clustering
The real world data is almost never arranged according to our need. Instead, clusters have ill defined boundaries that smear into the data space often overlapping the perimeters of surrounding clusters. In most of the cases the real world data points seem to belong to more than one cluster. This happens because natural data do not happen to appear in clear-cut crisp fashion. This imprecise and qualitative knowledge, as well as handling of uncertainty can be done through the use of fuzzy sets. Fuzzy logic supports, to a reasonable extent, human type reasoning in natural form by allowing partial membership for data items in fuzzy subsets. Integration of fuzzy logic with data mining techniques has become one of the key constituents of soft computing in handling the challenges posed by the massive collection of natural data (Pal and Mitra, 2002).
A fuzzy clustering provides a flexible and robust method of method of assigning data points into clusters (cox, 2005). Each data point will have an associated degree of membership for each cluster center in the range [0,1], indicating the strength of its placement in that cluster. This becomes important for those points which lie near the boundaries, as these points can be included as a member in more than one cluster depending on the degree of membership. Equation (3) shows how the membership in the j-th cluster is calculated.
Âµj(xi) : is the membership of xi in the jth cluster
dji : is the distance of xi in cluster cj
m : is the fuzzification parameter
p : is the number of specified clusters
dki : is the distance of xi in cluster Ck
The new value of the cluster can be calculated usin the equation (4). This is simply a special form of weighted average.
Cj : is the center of the jth cluster
xi : is the ith data point
Âµ : the function which returns the membership
m : is the fuzzification parameter
Pseudo code of fuzzy c-means clustering algorithm (Cox, 2005) is given below:
initialize p=number of clusters
initialize m=fuzzification parameter
initialize Cj (cluster centers)
For i=1 to n :Update Âµj(xi) applying (3)
For j=1 to p :Update Ci with(4)with current Âµj(xi)
Until Cj estimate stabilize
The process of grouping a set of physical or abstract objects into classes of similar objects is called clustering (Han & Kamber, 2003). A Cluster is a collection of data objects that are similar to one another within the same cluster and are dissimilar to the objects in other clusters. A cluster of data objects can be treated jointly as one group in many applications. It aims at changing through large volumes of data in order to reveal useful information in the form of new relationships, patterns, or clusters, for decision-making by a user. myshop too is interested in customer segmentation and make strategic decisions from the result. Customer segmentation is a term used to describe the process of dividing customers into homogeneous groups on the basis of shared or common attributes like habits, tastes, etc. (Giha, Singh & Ewe, 2003). While this problem can be solved using the four clustering algorithms, there are certain questions to be answered:
What will be the optimal number of clusters?
Which algorithm will be chosen as the best for customer segmentation?
The above addressed problems can be solved using cluster validation, which refers to the technique whether a partition is correct and how to measure the correctness of a partition. Clustering algorithms are designed in such a way that it gives the best fit. However, the best fit sometimes may not be meaningful at all. The number of clusters might not be correct or the cluster shapes do not correspond to the actual groups in the data. Sometimes it might so happened that the data cannot be grouped in a meaningful way. One can distinguish two main approaches to determine the correct number of clusters in the data (Jansen, 2007):
Start with a sufficiently large number of clusters, and successively reducing this number by combining clusters that have the same properties.
Cluster the data for different values of c and validate the correctness of the obtained clusters with validation measures.
In order to apply the second approach, this paper uses the following validation measures (Balasko, Abonyi & Balazs, 2006):
Partition Coefficient (PC): Defined by Bezdek, it measures the amount of 'overlapping' between clusters.
Where uij is the membership of data point j in the cluster i.
Classification Entropy (CE) : It is a variation on the partition coefficient which calculates the membership values of the cluster.
where uij is the membership of data point j in the cluster i.
Partition Index (PI): It is the ratio of the sum of compactness and separation of the clusters. The individual cluster is measured with the cluster validation method which is normalized by dividing it by the fuzzy cardinality of the cluster. The sum of the value for each individual cluster is used to receive the partition index.
Separation Index (SI): The separation index uses a minimum-distance separation to validate the partitioning.
Xie and Beni's Index (XB): It is a method to quantify the ratio of the total variation within the clusters and the separations of the clusters. The lowest value of the XB index should indicate the optimal number of clusters.
Dunn's Index (DI): This index was originally designed for the identification of hard partitioning clustering. Therefore, the result of the clustering has to be recalculated. The main disadvantage of the Dunn's index is the very expensive computational complexity as c and N increase.
Alternative Dunn Index (ADI): To simplify the calculation of the Dunn index, the Alternative Dunn Index was designed. This will be the case when the dissimilarity between two clusters, measured with is rated in under bound by the triangle-inequality:
were vj represents the cluster center of the j-th cluster.
Experiments and results
Determination of optimal number of clusters
The disadvantage of using the above clustering algorithms is the need to supply the number of clusters.
Thus it is a tiresome problem for us to know what value of c will be suitable for our needs. This optimal number of cluster can be found out using the above mentioned validation measures. I have set c=2 to 10 and tried with all the four algorithms. The transactional database of myshop for six months is processed with these different values of c. Table 1 shows the result of K-means clustering with these different values of c. The first row is the values of c while the rest of the rows give the values of the corresponding validation measure for a particular value of c. As we can see the that the values of CE in table 1 is always 'NaN'(Not-a-number). This is because the classification entropy measure was designed for fuzzy partitioning methods and in this case the hard partitioning algorithm k-means is used.
Table 1: The values of all the validation measures with K-means clustering
There are different techniques of finding the optimal value of c, one of which is Elbow Criterion. Elbow Criterion is a common rule of thumb which says that one should choose the number of cluster so that adding another cluster does not add sufficient information. More specifically, when we graph a validation measure given by the clusters against the number of clusters, the first cluster will add much information (explain a lot of variance), but at some point the marginal gain will drop, giving an angle in the graph (the elbow). Figure 1 shows how curves are formed at different values of c for the different validation measures using fuzzy C-means. As we draw similar graph with the rest of the algorithms, we see that majority of the curves are formed at c=3. Thus we conclude that c=3 is the optimal value for the number of clusters.
Figure 1: values of validation measures versus number of clusters for fuzzy C-means (a) PC, (b) CE,
(c) PI, (d) SI, (e) XBI, (f) DI and (g) ADI
Finding the best algorithm
Now our next objective is to find out the best algorithm of the four given that c=3. Table 2 gives us the different values of validation measures for c=3 for the four algorithms. From the table we see that the FCM contributes to four optimum values which is the maximum of all the algorithms. Thus we can conclude that the best algorithm of the four is fuzzy C-means for the given dataset.
Table 2: The numerical values of validation measures for c=3
Choosing fuzzy c-means algorithms for our objective of customer segmentation we draw boxpots of distribution of distances from cluster centers within clusters for the fuzzy C-means algorithm with c=3 given by figure 2. The box indicates the upper and lower quartiles, while the red line inside the boxes indicates the median. Table 3 gives the result of segmentation when c=3.
Figure 2: Distribution of distances from cluster centers within clusters
for the fuzzy C-means algorithm with c=3.
Not all the results obtained are interesting, only the features with much deviation need to be selected for analysis. Table 3 gives certain features selected along with their segment values. Please refer to the introduction section for their corresponding representations.
Table 3: Deviation of the three segments from the average
Segment 1: Consists of regular customers who spent a lot of money; they buy mostly 'COLD DRINK', 'CHOCOLATES' & 'ICE-CREAM'.
Segment 2: Consists of flying customers; spent less; buy 'DESSERTS' & 'BREADS' like other customers.
Segment 3: Consist of customers who comes sometimes; spent average amount; they buy 'BREADS' & 'CHEESE PRODUCTS'.
FuTURE rESEARCH dIRECTIONS
The present research can be improved to obtain more suitable customer segmentation. The first recommendation is to use a different set of features for the customer segmentation leading to different clusters and thus different segments. To see how the outcome of the clustering can be affected by the feature values, a complete data analysis research is required. Secondly, finding optimal number of clusters can be improved by exploring varieties of techniques, more sophisticated then elbow criterion. Using genetic programming could be one option to be explored. It can also be found out by additional clustering techniques such as DB scan. We can increase the number of clusters which employ the same technique; Gath-Geva is one among others.
This research is a comparative case study of the four clustering algorithms: K-means, K-medoids, Fuzzy C-means and Gustafson-Kessel. It is applied to 8072 customer-transactions. The number of clusters on this dataset was subjectively set from 2 to 10 and tried with all the four algorithms. It was found that FCM gave the best result when the number of cluster is set to 3. Different validation methods have been proposed in the literature. However, none of them is perfect by oneself. Therefore, this research uses several indexes, the reader may explore for more indexes. This model can be used in other fields of science and social sciences.