Clustering is the division of data objects into groups of similar objects. Representing the data by fewer clusters necessarily loses certain fine details, but it achieves simplification. Data modeling puts clustering in a historical perspective rooted in mathematics and statistics. From a machine learning perspective, clusters correspond to hidden patterns, the search for clusters is unsupervised learning, and the resulting system represents a data concept. From a practical perspective, clustering plays an outstanding role in data mining applications such as scientific data exploration, information retrieval and text mining, spatial database applications, Web analysis, CRM, marketing, medical diagnostics, computational biology, and many others (Miroslav Marinov et al., 2004).
Clustering is the subject of active research in several fields such as statistics, pattern recognition, biometrics and machine learning. Data mining adds to clustering the complications of very large datasets with many attributes of different types. This imposes unique computational requirements on relevant clustering algorithms. A variety of algorithms have emerged that meet these requirements and have been successfully applied to real-life data mining problems.
Cluster analysis groups objects based on the information found in the data describing the objects or their relationships. The goal is that the objects in a group will be similar (or related) to one another and different from (or unrelated to) the objects in other groups. The greater the similarity within a group and the greater the difference between groups, the better or more distinct the clustering. The definition of a cluster is not precise, and in many applications the required clusters are not well separated from one another. Nevertheless, most cluster analysis produces, as a result, a crisp classification of the data into non-overlapping groups. To better understand the difficulty of deciding what constitutes a cluster, consider figures 2a through 2d, which show twenty points and three different ways that they can be divided into clusters. If clusters are allowed to be nested, the most reasonable interpretation of the structure of these points is that there are two clusters, each of which has three subclusters. On the other hand, the apparent division of the two larger clusters into three subclusters may simply be an artifact of the human visual system. Finally, it may not be unreasonable to say that the points form four clusters. Thus, we stress once again that the definition of what constitutes a cluster is imprecise, and the best definition depends on the type of data and the desired results.
Figure 2: a) Initial Points
Figure 2: b) Two Clusters
Figure 2: c) Six Clusters
Figure 2: d) Four Clusters
Figure 2: Types of Clusters
SOME WORKING DEFINITIONS OF A CLUSTER
There is no generally accepted definition of a cluster. However, several working definitions of a cluster are commonly used in practice (Richard C. Dubes and Anil K. Jain, 1988).
Well-Separated Cluster Definition
A cluster is a set of points such that any point in a cluster is closer to every other point in the cluster than to any point not in the cluster. Sometimes a threshold is used to specify that all the points in a cluster must be sufficiently close to one another.
Figure 3: Three well-separated clusters of 2 dimensional points.
On the other hand, in many sets of data, a point on the edge of a cluster may be closer (or more similar) to some objects in another cluster than to objects in its own cluster. Consequently, many clustering algorithms use the following criterion.
Center-based Cluster Definition:
A cluster is a set of objects such that an object in a cluster is closer (more similar) to the "center" of a cluster, than to the center of any other cluster. The center of a cluster is often a centroid, the average of all the points in the cluster, or a medoid, the most "representative" point of a cluster.
Figure 4: Four center-based clusters of 2 dimensional points.
Contiguous Cluster Definition (Nearest neighbor or Transitive Clustering):
A cluster is a set of points such that a point in a cluster is closer (or more similar) to one or more other points in the cluster than to any point not in the cluster.
Figure 5: Eight contiguous clusters of 2 dimensional points.
Density-based Cluster Definition:
A cluster is a dense region of points, which is separated by low-density regions from other regions of high density. This definition is more often used when the clusters are irregular or intertwined, and when noise and outliers are present. Note that the contiguous definition would find only one cluster in figure 6. Also note that the three curves don't form clusters since they fade into the noise, as does the bridge between the two small circular clusters.
Figure 6: Six dense clusters of 2 dimensional points.
Similarity-based Cluster definition:
A cluster is a set of objects that are "similar", and objects in other clusters are not "similar." A variation on this is to define a cluster as a set of points that together create a region with a uniform local property, e.g., density or shape.
Classification of Clustering Algorithms
This section describes the most well-known clustering algorithms. The main reason for having so many clustering methods is that the notion of a "cluster" is not precisely defined (Estivill-Castro, 2000). Consequently, many clustering methods have been developed, each using a different induction principle. Farley and Raftery (1998) suggest dividing the clustering methods into two main groups: hierarchical and partitioning methods. Han and Kamber (2001) suggest three additional main categories: density-based methods, model-based clustering and grid-based methods. An alternative categorization based on the induction principle of the various clustering methods is presented in (Estivill-Castro, 2000).
The most commonly used clustering methods are as follows:
Density-Based Connectivity Clustering
Density Functions Clustering
Hierarchical methods build the clusters by recursively partitioning the instances in either a top-down or bottom-up fashion.
These methods can be subdivided as follows:
Agglomerative hierarchical clustering
Each object initially represents a cluster of its own. Then clusters are successively merged until the desired cluster structure is obtained.
Divisive hierarchical clustering
All objects initially belong to one cluster. Then the cluster is divided into sub-clusters, which are successively divided into their own sub-clusters. This process continues until the desired cluster structure is obtained.
The result of the hierarchical methods is a dendrogram, representing the nested grouping of objects and the similarity levels at which groupings change. A clustering of the data objects is obtained by cutting the dendrogram at the desired similarity level.
The merging or division of clusters is performed according to some similarity measure, chosen so as to optimize some criterion.
The hierarchical clustering methods can be further divided according to the manner in which the similarity measure is calculated (Jain et al., 1999):
Single-link clustering: the distance between two clusters is taken to be the shortest distance from any member of one cluster to any member of the other cluster. If the data consist of similarities, the similarity between a pair of clusters is taken to be the greatest similarity from any member of one cluster to any member of the other cluster (Sneath and Sokal, 1973).
Complete-link clustering: the distance between two clusters is taken to be the longest distance from any member of one cluster to any member of the other cluster (King, 1967).
Average-link clustering: the distance between two clusters is taken to be the average distance from any member of one cluster to any member of the other cluster. Such clustering algorithms are described in (Ward, 1963) and (Murtagh, 1984).
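The three inter-cluster distances described above can be illustrated with a minimal Python sketch. It assumes points are numeric tuples compared with the Euclidean distance; the function names and the two toy clusters are hypothetical and chosen only for illustration.

```python
import math
from itertools import product


def pairwise_distances(cluster_a, cluster_b):
    """Euclidean distance between every member of one cluster and every member of the other."""
    return [math.dist(p, q) for p, q in product(cluster_a, cluster_b)]


def single_link(cluster_a, cluster_b):
    # Shortest distance between any two members of the two clusters.
    return min(pairwise_distances(cluster_a, cluster_b))


def complete_link(cluster_a, cluster_b):
    # Longest distance between any two members of the two clusters.
    return max(pairwise_distances(cluster_a, cluster_b))


def average_link(cluster_a, cluster_b):
    # Average over all cross-cluster pairs.
    distances = pairwise_distances(cluster_a, cluster_b)
    return sum(distances) / len(distances)


# Toy data (an assumption for illustration only).
cluster_a = [(0.0, 0.0), (1.0, 0.0)]
cluster_b = [(3.0, 0.0), (5.0, 0.0)]
print(single_link(cluster_a, cluster_b))
print(complete_link(cluster_a, cluster_b))
print(average_link(cluster_a, cluster_b))
```

On the same pair of clusters the three measures generally disagree, which is why the choice of linkage changes the order in which clusters are merged.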
The disadvantages of the single-link clustering and the average-link clustering can be summarized as follows (Guha et al., 1998):
Single-link clustering has a disadvantage which is known as the "chaining effect": A few points that form a bridge between two clusters may cause the single-link clustering to unify these two clusters into one.
Average-link clustering may cause elongated clusters to split and portions of neighboring elongated clusters to merge.
The complete-link clustering methods usually produce more compact clusters and more useful hierarchies than the single-link clustering methods, yet the single-link methods are more versatile.
Generally, hierarchical methods are characterized by the following strengths:
The single-link methods, for example, maintain good performance on data sets containing non-isotropic clusters, including well separated, chain-like and concentric clusters.
Hierarchical methods produce not one partition, but multiple nested partitions, which allow different users to choose different partitions, according to the desired similarity level. The hierarchical partition is presented using the dendrogram.
Figure 7: Hierarchical Clustering
Partitioning methods relocate instances by moving them from one cluster to another, starting from an initial partitioning. Such methods typically require that the number of clusters be pre-set by the user. To achieve global optimality in partition-based clustering, an exhaustive enumeration of all possible partitions would be required. Because this is not feasible, certain greedy heuristics are used in the form of iterative optimization. Specifically, a relocation method iteratively relocates points between the k clusters. The following paragraphs present various types of partitioning methods, which include the first clustering algorithms that appeared in the data mining community.
The goal in k-means is to produce k clusters from a set of n objects, so that the squared-error objective function
$$E = \sum_{i=1}^{k} \sum_{p \in C_i} \lVert p - m_i \rVert^2$$
is minimized. In the above expression, $C_i$ are the clusters, $p$ is a point in cluster $C_i$ and $m_i$ is the mean of cluster $C_i$. The mean of a cluster is given by a vector which contains, for each attribute, the mean value of the data objects in this cluster. The input parameter is the number of clusters, k, and as output the algorithm returns the centers, or means, of every cluster $C_i$, most of the time excluding the cluster identities of individual points. The distance measure usually employed is the Euclidean distance. Neither the optimization criterion nor the proximity index is restricted; both can be specified according to the application or the user's preference. The algorithm is as follows:
1. Select k objects as initial centers;
2. Assign each data object to the closest center;
3. Recalculate the centers of each cluster;
4. Repeat steps 2 and 3 until the centers do not change;
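The four steps above can be sketched in plain Python as follows. This is a minimal illustration rather than a reference implementation: the function name `kmeans`, the random choice of initial centers, the iteration cap and the toy data are all assumptions made for the example.

```python
import math
import random


def kmeans(points, k, max_iter=100, seed=0):
    """Basic k-means on a list of numeric tuples (illustrative sketch)."""
    rng = random.Random(seed)
    # Step 1: select k objects as initial centers (here: at random, an assumption).
    centers = rng.sample(points, k)
    clusters = []
    for _ in range(max_iter):
        # Step 2: assign each object to the closest center (Euclidean distance).
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        # Step 3: recalculate each center as the attribute-wise mean of its cluster.
        # An empty cluster keeps its previous center (a choice made for this sketch).
        new_centers = [
            tuple(sum(col) / len(members) for col in zip(*members)) if members else centers[i]
            for i, members in enumerate(clusters)
        ]
        # Step 4: stop when the centers no longer change.
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters


# Two well-separated toy groups (hypothetical data).
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, clusters = kmeans(points, k=2)
print(centers)
print(clusters)
```

Each pass over the data costs on the order of k times n distance computations, so the total work grows with the number of iterations, the number of clusters and the number of objects.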
The algorithm is relatively scalable, since its complexity is $O(Ikn)$, where I denotes the number of iterations and n the number of objects, and usually $k \ll n$.
PAM (Partitioning Around Medoids) is an extension to k-means, intended to handle outliers efficiently. Instead of cluster centers, it chooses to represent each cluster by its medoid. A medoid is the most centrally located object inside a cluster. As a consequence, medoids are less influenced by extreme values; the mean of a number of objects would have to "follow" these values while a medoid would not. The algorithm chooses k medoids initially and assigns each remaining object to the cluster whose medoid is closest to it, while it swaps medoids with non-medoids as long as the quality of the result is improved. Quality is also measured using the squared error between the objects in a cluster and its medoid. The computational complexity of PAM is $O(Ik(n-k)^2)$, with I being the number of iterations, making it very costly for large n and k values.
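The claim that a medoid is less influenced by extreme values than a mean can be checked with a small sketch. It assumes one-dimensional data and uses the sum of absolute distances to pick the most central object; both choices, and the helper names, are assumptions made for this illustration only.

```python
def mean(values):
    """Arithmetic mean; it is pulled towards extreme values."""
    return sum(values) / len(values)


def medoid(values):
    """The object with the smallest total distance to all other objects.

    The absolute difference is used as the dissimilarity here (an assumption);
    any other dissimilarity could be substituted.
    """
    return min(values, key=lambda candidate: sum(abs(candidate - v) for v in values))


# Four ordinary values and one outlier (hypothetical data).
values = [1, 2, 3, 4, 100]
print(mean(values))    # dragged towards the outlier
print(medoid(values))  # remains one of the ordinary objects
```

Because the medoid must be an actual object of the cluster, a single extreme value cannot move it outside the bulk of the data, whereas the mean shifts in proportion to the outlier.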
A solution to this is the CLARA algorithm, by Kaufman and Rousseeuw (1990). This approach works on several samples of size s, of the n tuples in the database, applying PAM on each one of them. The output depends on the s samples and is the "best" result given by the application of PAM on these samples. It has been shown that CLARA works well with 5 samples of size 40 + k (Kaufman and Rousseeuw, 1990), and its computational complexity becomes $O(k(40+k)^2 + k(n-k))$. Note that there is a quality issue when using sampling techniques in clustering: the result may not represent the initial data set, but rather a locally optimal solution. In CLARA for example, if "true" medoids of the initial data are not contained in the sample, then the result is guaranteed not to be the best.
The CLARANS approach works as follows:
1. Randomly choose k medoids;
2. Randomly consider one of the medoids to be swapped with a non-medoid;
3. If the cost of the new configuration is lower, repeat step 2 with the new solution;
4. If the cost is higher, repeat step 2 with a different non-medoid object, unless a limit has been reached (the maximum value between 250 and k(n-1));
5. Compare the solutions so far, and keep the best;
6. Return to step 1, unless a limit has been reached (set to the value of 2);
In the worst case, CLARANS compares an object with every other object, for each of the k medoids. Thus, its computational complexity is $O(kn^2)$, which does not make it suitable for large data sets.
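The CLARANS procedure described above can be sketched as a randomized local search over sets of medoids. The sketch below is an illustration under stated assumptions: the cost is taken to be the sum of Euclidean distances from each object to its nearest medoid, and the names `clarans`, `num_local` and `max_neighbor` are hypothetical labels for the two limits mentioned in the steps.

```python
import math
import random


def total_cost(points, medoids):
    """Sum of distances from every object to its closest medoid (assumed cost)."""
    return sum(min(math.dist(p, m) for m in medoids) for p in points)


def clarans(points, k, num_local=2, max_neighbor=250, seed=0):
    rng = random.Random(seed)
    best, best_cost = None, math.inf
    for _ in range(num_local):                       # restart limit (set to 2 in the text)
        current = rng.sample(points, k)              # randomly choose k medoids
        current_cost = total_cost(points, current)
        failures = 0
        while failures < max_neighbor:               # limit on unsuccessful swaps
            i = rng.randrange(k)                     # medoid considered for swapping
            candidate = rng.choice([p for p in points if p not in current])
            neighbor = current[:i] + [candidate] + current[i + 1:]
            neighbor_cost = total_cost(points, neighbor)
            if neighbor_cost < current_cost:         # lower cost: continue from the new solution
                current, current_cost, failures = neighbor, neighbor_cost, 0
            else:                                    # higher cost: try a different non-medoid
                failures += 1
        if current_cost < best_cost:                 # keep the best solution found so far
            best, best_cost = current, current_cost
    return best, best_cost


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
medoids, cost = clarans(points, k=2)
print(medoids, cost)
```

Every candidate swap requires a full pass over the data to recompute the cost, which is the source of the poor scalability noted above.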
Figure 8: Three applications of the k-means algorithm: (a) well-separated clusters; (b) clusters of different sizes close to each other; (c) non-convex clusters
Figure 8 presents the application of k-means on three kinds of data sets. The algorithm performs well on appropriately distributed (separated) and spherical-shaped groups of data (Figure 8(a)). In case the two groups are close to each other, some of the objects of one group may end up in a different cluster, especially if one of the initial cluster representatives is close to the cluster boundaries (Figure 8(b)). Finally, k-means does not perform well on non-convex-shaped clusters (Figure 8(c)) due to the usage of Euclidean distance. As already mentioned, PAM appears to handle outliers better, since medoids are less influenced by extreme values than means, something that k-means fails to do in an acceptable way.
Graph-theoretic methods produce clusters by means of graphs. The edges of the graph connect the instances, which are represented as nodes. A well-known graph-theoretic algorithm is based on the Minimal Spanning Tree (MST) (Zahn, 1971): the MST of the data is built and its inconsistent edges are removed to obtain the clusters. Inconsistent edges are edges whose weight is considerably larger than the average of nearby edge lengths. Another graph-theoretic approach constructs graphs based on limited neighborhood sets (Urquhart, 1982).
Single-link clusters are subgraphs of the MST of the data instances. Each subgraph is a connected component, that is to say a set of instances in which each instance is connected to at least one other member of the set, so that the set is maximal with respect to this property. Hence the subgraphs are produced according to some similarity threshold.
Complete-link clusters are maximal complete subgraphs, formed using a similarity threshold. A maximal complete subgraph is a subgraph such that each node is connected to every other node in the subgraph and the set is maximal with respect to this property.
Density-based methods assume that the points that belong to each cluster are drawn from a specific probability distribution (Banfield and Raftery, 1993). The overall distribution of the data is assumed to be a mixture of several distributions. The aim of these methods is to identify the clusters and their distribution parameters. These methods are designed for discovering clusters of arbitrary shape, which are not necessarily convex; that is, $x_i, x_j \in C_k$ does not necessarily imply that $\alpha x_i + (1-\alpha) x_j \in C_k$ for every $\alpha \in [0, 1]$.
The idea is to continue growing the given cluster as long as the density in the neighborhood exceeds some threshold. That is to say, the neighborhood of a given radius has to contain at least a minimum number of objects. When each cluster is characterized by a local mode or maximum of the density function, these methods are called mode-seeking. A great deal of work in this field has been based on the underlying assumption that the component densities are multivariate Gaussian or multinomial. An acceptable solution in this case is to use the maximum likelihood principle. According to this principle, one should choose the clustering structure and parameters such that the probability of the data being generated by such a clustering structure and parameters is maximized. The expectation maximization (EM) algorithm (Dempster et al., 1977), which is a general-purpose maximum likelihood algorithm for missing-data problems, has been applied to the problem of parameter estimation. This algorithm begins with an initial estimate of the parameter vector and then alternates between two steps (Farley and Raftery, 1998): an "E-step", in which the conditional expectation of the complete data likelihood given the observed data and the current parameter estimates is computed, and an "M-step", in which parameters that maximize the expected likelihood from the E-step are determined. This algorithm has been shown to converge to a local maximum of the observed data likelihood.
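The alternation between the E-step and the M-step can be made concrete with a small sketch for a one-dimensional Gaussian mixture. The function names, the starting parameters, the fixed number of iterations and the variance floor are assumptions made for this illustration; they are not taken from the cited papers.

```python
import math


def normal_pdf(x, mu, var):
    return math.exp(-((x - mu) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def em_gmm_1d(data, mus, variances, weights, n_iter=50):
    """EM for a 1-D Gaussian mixture (illustrative sketch)."""
    mus, variances, weights = list(mus), list(variances), list(weights)
    k = len(mus)
    for _ in range(n_iter):
        # E-step: responsibility of each component for each observation,
        # given the current parameter estimates.
        resp = []
        for x in data:
            joint = [weights[j] * normal_pdf(x, mus[j], variances[j]) for j in range(k)]
            total = sum(joint)
            resp.append([v / total for v in joint])
        # M-step: parameters that maximize the expected likelihood from the E-step.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / len(data)
            mus[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - mus[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var, 1e-6)  # floor to avoid a degenerate component
    return mus, variances, weights


# Two groups of observations around 1 and 5 (hypothetical data).
data = [0.9, 1.0, 1.1, 4.9, 5.0, 5.1]
mus, variances, weights = em_gmm_1d(data, mus=[0.0, 6.0], variances=[1.0, 1.0], weights=[0.5, 0.5])
print(mus, variances, weights)
```

As with the general algorithm, the result depends on the initial estimate, and only a local maximum of the likelihood is reached.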
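The K-means algorithm may be viewed as a degenerate EM algorithm, in which each instance is assigned entirely to its most probable cluster: $p(k \mid x) = 1$ if $k = \arg\max_{k'} \hat{p}(k' \mid x)$, and $p(k \mid x) = 0$ otherwise.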
Assigning instances to clusters in K-means may be considered the E-step; computing new cluster centers may be regarded as the M-step. The DBSCAN algorithm discovers clusters of arbitrary shape and is efficient for large spatial databases. The algorithm searches for clusters by examining the neighborhood of each object in the database and checking whether it contains more than the minimum number of objects (Ester et al., 1996).
AUTOCLASS is a widely used algorithm that covers a broad variety of distributions, including Gaussian, Bernoulli, Poisson, and log-normal distributions (Cheeseman and Stutz, 1996). Other well-known density-based methods include SNOB (Wallace and Dowe, 1994) and MCLUST (Farley and Raftery, 1998).
Density-based clustering may also employ nonparametric methods, such as searching for bins with large counts in a multidimensional histogram of the input instance space (Jain et al., 1999).
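The neighborhood search performed by DBSCAN can be sketched as follows. This is a simplified illustration, assuming Euclidean distance, a brute-force neighborhood query, and a density test of "at least `min_pts` objects within radius `eps`" (the parameter names and the convention of counting the point itself are assumptions of this sketch).

```python
import math


def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters and -1 for noise."""
    labels = {}

    def neighbors(i):
        # Brute-force query: all points within radius eps (including the point itself).
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    cluster = 0
    for i in range(len(points)):
        if i in labels:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1                 # provisionally noise; may become a border point
            continue
        labels[i] = cluster                # i is a core point: start a new cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels.get(j) == -1:
                labels[j] = cluster        # noise reached from a core point becomes a border point
            if j in labels:
                continue
            labels[j] = cluster
            reachable = neighbors(j)
            if len(reachable) >= min_pts:  # j is also a core point: keep growing the cluster
                queue.extend(reachable)
        cluster += 1
    return [labels[i] for i in range(len(points))]


points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (50, 50)]
labels = dbscan(points, eps=1.5, min_pts=3)
print(labels)
```

The isolated point is reported as noise instead of being forced into a cluster, which is one reason density-based methods cope well with outliers.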
WORKING OF BASIC CLUSTERING ALGORITHM
The K-means clustering technique is simple, and we begin with a description of the basic algorithm.
Basic K-means algorithm for finding K clusters:
1. Select K points as the initial centroids.
2. Assign all points to the closest centroid.
3. Recompute the centroid of each cluster.
4. Repeat steps 2 and 3 until the centroids don't change.
In the absence of numerical problems, this procedure always converges to a solution, although the solution is typically a local minimum. The following figures give an example of this. Figure 9a shows the case when the cluster centers coincide with the circle centers. This is a global minimum. Figure 9b shows a local minimum.
Figure 9a: A globally minimal clustering solution
Figure 9b: A locally minimal clustering solution
Choosing initial centroids
Choosing the proper initial centroids is the key step of the basic K-means procedure. It is simple and efficient to choose initial centroids randomly, but the results are often poor. It is possible to perform multiple runs, each with a different set of randomly chosen initial centroids, but this may still not work depending on the data set and the number of clusters sought. We start with a very simple example of three clusters and 16 points.
Figure 10a indicates the "natural" clustering that results when the initial centroids are "well" distributed. Figure 10b indicates a "less natural" clustering that happens when the initial centroids are poorly chosen.
Figure 10a: Good starting centroids and a "natural" clustering.
Figure 10b: Bad starting centroids and a "less natural" clustering.
We also constructed the artificial data set shown in figure 11a as another illustration of what can go wrong. The figure consists of 5 pairs of circular clusters (10 clusters in all), where the two clusters of each pair are close to each other, but relatively far from the other clusters. The probability that an initial centroid will come from any given cluster is 0.10, but the probability that each cluster will have exactly one initial centroid is $10!/10^{10} \approx 0.00036$.
There is no problem as long as two initial centroids fall anywhere in a pair of clusters, since the centroids will redistribute themselves, one to each cluster, and so achieve a globally minimal error. However, it is likely that one pair of clusters will have only one initial centroid. In that case, because the pairs of clusters are far apart, the K-means algorithm will not redistribute the centroids between pairs of clusters, and thus only a local minimum will be achieved. When starting with an uneven distribution of initial centroids, as shown in figure 11b, we get a non-optimal clustering, as shown in figure 11c, where different fill patterns indicate different clusters. One of the clusters is split into two clusters, while two clusters are joined into a single cluster.
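The probabilities involved in this example can be checked with a few lines of Python. The sketch assumes 10 equally likely clusters arranged in 5 pairs (as the 0.10 probability and the caption of figure 11a suggest) and 10 initial centroids drawn independently; the function names are hypothetical.

```python
import math


def prob_one_centroid_per_cluster(k):
    """Probability that k independently placed centroids land in k distinct,
    equally likely clusters: k! favourable assignments out of k**k."""
    return math.factorial(k) / k ** k


def prob_two_centroids_per_pair(pairs):
    """Probability that every pair of clusters receives exactly two of the
    2 * pairs centroids (a multinomial count divided by all assignments)."""
    n = 2 * pairs
    favourable = math.factorial(n) // (2 ** pairs)
    return favourable / pairs ** n


print(prob_one_centroid_per_cluster(10))
print(prob_two_centroids_per_pair(5))
```

Under these assumptions, even the weaker requirement that each pair receives exactly two centroids is met only about once in a hundred random initializations, which is consistent with the remark that an unevenly covered pair is likely.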
Figure 11a: Data distributed in 10 circular regions
Figure 11b: Initial Centroids
Figure 11c: K-means clustering result
Because random sampling may not cover all clusters, other techniques are often used for finding the initial centroids. For example, initial centroids are often chosen from dense regions, and so that they are well separated, i.e., so that no two centroids are chosen from the same cluster.
HOW THE CLUSTERING METHODS OPTIMIZE INTO VARIOUS TECHNIQUES
These methods attempt to optimize the fit between the given data and some mathematical models. Unlike conventional clustering, which identifies groups of objects, model-based clustering methods also find characteristic descriptions for each group, where each group represents a concept or class. The most frequently used induction methods are decision trees and neural networks.
In decision trees, the data is represented by a hierarchical tree, where each leaf refers to a concept and contains a probabilistic description of that concept. Several algorithms have been developed to produce classification trees for representing the unlabelled data. The most well-known algorithms are:
COBWEB: This algorithm assumes that all attributes are independent. Its aim is to achieve high predictability of nominal variable values, given a cluster. This algorithm is not suitable for clustering large databases (Fisher, 1987).
CLASSIT, an extension of COBWEB for continuous-valued data, unfortunately has similar problems as the COBWEB algorithm.
In neural network approaches, each cluster is represented by a neuron or "prototype". The input data is also represented by neurons, which are connected to the prototype neurons. Each such connection has a weight, which is learned adaptively during the learning process. The self-organizing map (SOM) is a popular neural network algorithm of this kind. It constructs a single-layered network. The learning process takes place in a "winner-takes-all" fashion: the prototype neurons compete for the current instance, and the winner and its neighbors learn by having their weights adjusted.
The SOM algorithm is successfully used for vector quantization and speech recognition. It is useful for visualizing high-dimensional data in 2D or 3D space. However, it is sensitive to the initial selection of weight vector, as well as to its different parameters, such as the learning rate and neighborhood radius.
Traditional clustering approaches generate partitions; in a partition, each instance belongs to one and only one cluster. Consequently, the clusters in a hard clustering are disjoint. Fuzzy clustering (Hoppner, 2005) extends this idea and suggests a soft clustering scheme. In this case, each pattern is associated with every cluster using some sort of membership function; that is, each cluster is a fuzzy set of all the patterns. Larger membership values indicate higher confidence in the assignment of the pattern to the cluster. A hard clustering can be obtained from a fuzzy partition by using a threshold on the membership value.
The most popular fuzzy clustering algorithm is the fuzzy c-means (FCM) algorithm. Although it is better than the hard K-means algorithm at avoiding local minima, FCM can still converge to local minima of the squared error criterion. The design of membership functions is the most important problem in fuzzy clustering; different choices include those based on similarity decomposition and centroids of clusters. A generalization of the FCM algorithm has been proposed through a family of objective functions. A fuzzy c-shell algorithm and an adaptive variant for detecting circular and elliptical boundaries have been presented.
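To illustrate how soft memberships work, the sketch below computes the commonly used fuzzy c-means membership of each point in each cluster for fixed centers, and then derives a hard assignment by thresholding. The fuzzifier `m`, the threshold value and the function names are assumptions of this example; the center-update step of FCM is omitted for brevity.

```python
import math


def fcm_memberships(points, centers, m=2.0):
    """Membership of every point in every cluster for fixed centers.

    Uses u_ij = 1 / sum_k (d_ij / d_ik) ** (2 / (m - 1)), where d_ij is the
    distance from point i to center j and m > 1 is the fuzzifier.
    """
    power = 2.0 / (m - 1.0)
    memberships = []
    for p in points:
        dists = [math.dist(p, c) for c in centers]
        if 0.0 in dists:
            # The point coincides with one or more centers: share full membership among them.
            hits = [1.0 if d == 0.0 else 0.0 for d in dists]
            row = [h / sum(hits) for h in hits]
        else:
            row = [1.0 / sum((d_j / d_k) ** power for d_k in dists) for d_j in dists]
        memberships.append(row)
    return memberships


def harden(row, threshold=0.5):
    """Hard assignment: index of the strongest cluster if its membership reaches the threshold."""
    best = max(range(len(row)), key=lambda j: row[j])
    return best if row[best] >= threshold else None


centers = [(0.0, 0.0), (4.0, 0.0)]
points = [(1.0, 0.0), (2.0, 0.0)]
u = fcm_memberships(points, centers)
print(u)
print([harden(row) for row in u])
```

The memberships of each point sum to one, so a point halfway between two centers is shared equally, while a point near one center belongs mostly, but not exclusively, to that cluster.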
The ROCK Algorithm
ROCK (RObust Clustering using linKs) (Guha et al., 1999) is a hierarchical algorithm for categorical data. Guha et al. propose a novel approach based on a new concept called the links between data objects. This idea helps to overcome problems that arise from the use of Euclidean metrics over vectors, where each vector represents a tuple in the database whose entries are identifiers of the categorical values. More precisely, ROCK defines the following:
two data objects $p_i$ and $p_j$ are called neighbors if their similarity exceeds a certain threshold $\theta$ given by the user, i.e., $sim(p_i, p_j) \geq \theta$.
for two data objects $p_i$ and $p_j$, define $link(p_i, p_j)$ as the number of common neighbors between the two objects, i.e., the number of objects that $p_i$ and $p_j$ are both similar to.
the interconnectivity between two clusters $C_i$ and $C_j$ is given by the number of cross-links between them, which is equal to $\sum_{p_q \in C_i,\, p_r \in C_j} link(p_q, p_r)$.
the expected number of links in a cluster $C_i$ of size $n_i$ is given by $n_i^{1+2f(\theta)}$. In all the experiments presented, $f(\theta) = \frac{1-\theta}{1+\theta}$.
In brief, ROCK measures the similarity of two clusters by comparing the aggregate interconnectivity of two clusters against a user-specified static interconnectivity model. After that, the maximization of the following expression comprises the objective of ROCK:
$$E_l = \sum_{i=1}^{k} n_i \cdot \sum_{p_q, p_r \in C_i} \frac{link(p_q, p_r)}{n_i^{1+2f(\theta)}}$$
Figure 12: Overview of ROCK [GRS99]: draw random samples, cluster samples with links, label data on disk
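A random sample is drawn and a hierarchical clustering algorithm is employed to merge clusters. Hence, we need a measure to identify the clusters that should be merged at every step. This measure between two clusters $C_i$ and $C_j$, of sizes $n_i$ and $n_j$, is called the goodness measure and is given by the following expression:
$$g(C_i, C_j) = \frac{link[C_i, C_j]}{(n_i + n_j)^{1+2f(\theta)} - n_i^{1+2f(\theta)} - n_j^{1+2f(\theta)}}$$
where $link[C_i, C_j]$ is now the number of cross-links between the clusters:
$$link[C_i, C_j] = \sum_{p_q \in C_i,\, p_r \in C_j} link(p_q, p_r)$$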
The pair of clusters for which the above goodness measure is maximum is the best pair of clusters to be merged.
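The notions of neighbors, links and goodness can be illustrated with a short sketch on set-valued (categorical) records. It assumes the Jaccard coefficient as the similarity function, takes $f(\theta) = (1-\theta)/(1+\theta)$, and excludes a point from its own neighbor list for simplicity; these choices, and all function names, are assumptions of this illustration rather than a faithful reproduction of ROCK.

```python
def jaccard(a, b):
    """Jaccard similarity between two sets of categorical values (assumed similarity)."""
    return len(a & b) / len(a | b)


def neighbor_sets(objects, theta):
    """For each object, the indices of the other objects whose similarity is at least theta."""
    n = len(objects)
    return [
        {j for j in range(n) if j != i and jaccard(objects[i], objects[j]) >= theta}
        for i in range(n)
    ]


def link(nbrs, i, j):
    """Number of common neighbors of objects i and j."""
    return len(nbrs[i] & nbrs[j])


def cross_links(nbrs, cluster_i, cluster_j):
    """Sum of links over all pairs with one object in each cluster."""
    return sum(link(nbrs, q, r) for q in cluster_i for r in cluster_j)


def goodness(nbrs, cluster_i, cluster_j, theta):
    """Cross-links normalized by the expected number of cross-links."""
    exponent = 1 + 2 * (1 - theta) / (1 + theta)
    n_i, n_j = len(cluster_i), len(cluster_j)
    expected = (n_i + n_j) ** exponent - n_i ** exponent - n_j ** exponent
    return cross_links(nbrs, cluster_i, cluster_j) / expected


# Five hypothetical transactions; clusters are lists of object indices.
objects = [{1, 2, 3}, {1, 2, 4}, {1, 3, 4}, {7, 8, 9}, {7, 8, 10}]
theta = 0.5
nbrs = neighbor_sets(objects, theta)
print(link(nbrs, 0, 1))
print(goodness(nbrs, [0], [1], theta), goodness(nbrs, [0], [3], theta))
```

In an agglomerative loop, the pair of clusters with the highest goodness value would be merged first, so objects that share neighbors are joined before objects that do not.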
Shared Nearest Neighbor Clustering
1) First, the k-nearest neighbors of all points are found. In graph terms, this can be regarded as breaking all but the k strongest links from a point to other points in the proximity graph.
2) All pairs of points are compared, and if
a) the two points share more than kt ≤ k neighbors, and
b) the two points being compared are among the k-nearest neighbors of each other,
then the two points and any clusters they are part of are merged.
This approach has a number of nice properties. It can handle clusters of different densities since the nearest neighbor approach is self-scaling. This approach is transitive, i.e., if point, p, shares lots of near neighbors with point, q, which in turn shares lots of near neighbors with point, r, then points p, q and r all belong to the same cluster. This allows this technique to handle clusters of different sizes and shapes. However, transitivity can also join clusters that shouldn't be joined, depending on the k and kt parameters. Large values for both of these parameters tend to prevent these spurious connections, but also tend to favor the formation of globular clusters.
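The shared nearest neighbor procedure can be sketched as follows. The sketch assumes Euclidean distance, a brute-force neighbor search, and a union-find structure to carry out the merging; the function names and toy data are hypothetical.

```python
import math


def k_nearest(points, k):
    """For each point, the set of indices of its k nearest other points."""
    result = []
    for i, p in enumerate(points):
        others = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: math.dist(p, points[j]),
        )
        result.append(set(others[:k]))
    return result


def shared_nn_clusters(points, k, kt):
    """Merge two points when they are in each other's k-nearest-neighbor lists
    and share more than kt of those neighbors."""
    n = len(points)
    nn = k_nearest(points, k)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            mutual = i in nn[j] and j in nn[i]
            if mutual and len(nn[i] & nn[j]) > kt:
                parent[find(i)] = find(j)  # merge the clusters containing i and j

    roots = [find(i) for i in range(n)]
    ids = {root: label for label, root in enumerate(dict.fromkeys(roots))}
    return [ids[root] for root in roots]


points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11)]
labels = shared_nn_clusters(points, k=3, kt=1)
print(labels)
```

Because merging is transitive, two points end up in the same cluster whenever a chain of qualifying pairs connects them, which is the behavior discussed above.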
Genetic Algorithm using Clustering
A genetic algorithm (GA), proposed by Holland, is a search heuristic that mimics the process of natural evolution and is used for optimization and search problems. GAs belong to the class of evolutionary algorithms: they use operations from evolutionary algorithms and extend them by encoding candidate solutions as strings (called chromosomes).
A GA has the following phases:
Initialization: Generate an initial population of K candidates and compute fitness.
Selection: For each generation, select µK candidates based on fitness to serve as parents.
Crossover: Pair parents randomly and perform crossover to generate offspring.
Mutation: Mutate offspring.
Replacement: Replace parents by offspring and start over with selection.
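The five phases can be sketched for a clustering task as follows. This is only one possible encoding, chosen for illustration: a chromosome is a list of cluster labels (one per point), fitness is the negative within-cluster squared error, crossover is single-point, and the best chromosome seen so far is remembered separately. The function names and all parameter values are assumptions of this sketch.

```python
import random


def sse(points, labels, k):
    """Within-cluster sum of squared distances to the cluster centroids."""
    total = 0.0
    for c in range(k):
        members = [p for p, label in zip(points, labels) if label == c]
        if not members:
            continue
        centroid = [sum(col) / len(members) for col in zip(*members)]
        total += sum(sum((a - b) ** 2 for a, b in zip(p, centroid)) for p in members)
    return total


def ga_cluster(points, k, pop_size=30, mu=0.5, generations=50, p_mut=0.05, seed=0):
    rng = random.Random(seed)
    n = len(points)

    def fitness(chromosome):
        return -sse(points, chromosome, k)

    # Initialization: random population of label strings.
    population = [[rng.randrange(k) for _ in range(n)] for _ in range(pop_size)]
    best = max(population, key=fitness)
    for _ in range(generations):
        # Selection: keep the fittest fraction mu of the population as parents.
        parents = sorted(population, key=fitness, reverse=True)[: max(2, int(mu * pop_size))]
        offspring = []
        while len(offspring) < pop_size:
            # Crossover: pair parents randomly and cut at a random position.
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n)
            child = a[:cut] + b[cut:]
            # Mutation: reassign each gene with a small probability.
            child = [rng.randrange(k) if rng.random() < p_mut else g for g in child]
            offspring.append(child)
        # Replacement: offspring replace the parents.
        population = offspring
        best = max(population + [best], key=fitness)
    return best


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
best = ga_cluster(points, k=2)
print(best, sse(points, best, 2))
```

The outcome depends on the random seed and on the parameter settings, so several runs may be needed on harder data sets.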
Other Techniques in Clustering
When performing clustering on categorical data, the techniques used are evidently based on co-occurrences of the data objects or the number of neighbors they have, and at the same time they do not deal with mixed attribute types. STIRR adopts theory from the dynamical systems area and spectral graph theory to give a solution. CACTUS employs techniques similar to the ones used in frequent item-set discovery and summarizes information in a similar way as BIRCH does.
We believe that there exist methods, not yet applied to categorical attributes, that could lead to more succinct results (recall that STIRR needs a painful post-processing step to describe its results). For instance, there are techniques employed by the machine learning community which are used to cluster documents according to the terms they contain [ST00]. It is of interest to examine the properties of these methods and investigate whether they can be effectively applied to categorical as well as mixed attribute types.
Clustering algorithms also play an important role in the medical field and in the analysis of gene expression data.
CLUSTERING ALGORITHM FOR GENE EXPRESSION AND ITS IMPLEMENTATION
DNA microarray technology is a fundamental tool in studying gene expression. The buildup of data sets from this technology measure the relative abundance of mRNA of thousands of genes across tens or hundreds of samples have underscored the need for quantitative analytical tools to examine such data. Owing to the bulky number of genes and complex gene ruling, clustering is a helpful to investigative the method for analyzing these data. The clustering divides the data into a small number of comparatively homogeneous groups or clusters. There is minimum two ways are present to be appropriate to apply cluster analysis to microarray data. One way is cluster arrays, in which samples from the different tissues, cells at different time points out a biological process treatment. Global expression profiles of various tissues or cellular states are classified using this type of clustering. Another use of this clustering is to cluster genes according to their expression levels across different conditions. This method intends to group co-expressed genes and to reveal co-regulated genes or genes that may be involved in the same pathways.
Numerous clustering algorithms have been proposed for gene expression data. For instance, Eisen, Spellman, Brown and Botstein (1998) apply an alternative of the hierarchical average-linkage clustering algorithm to identify groups of co-regulated yeast genes. Tavazoie et al. (1999) reported their success with k-means algorithm, an approach that minimizes the overall within-cluster dispersion by iterative reallocation of cluster members. Tamayo et al. (1999) used self-organizing maps (SOM) to identify clusters in the yeast cell cycle and human hematopoietic differentiation data sets. There are many others. Some algorithms require that every gene in the dataset belongs to one and only one cluster (i.e., generating exhaustive and mutually exclusive clusters), while others may generate "fuzzy" clusters, or leave some genes unclustered. The first type is most frequently used in the literature and we restrict our attention to them here. The hardest problem in comparing different clustering algorithms is to find an algorithm-independent measure to evaluate the quality of the clusters. In this chapter, introduce several indices (homogeneity and separation scores, silhouette width, redundant scores and WADP) to assess the quality of k-means, hierarchical clustering, PAM and SOM on the NIA mouse 15K microarray data. These indices use objective information in the data themselves and evaluate clusters without any a priori knowledge about the biological functions of the genes on the microarray. Begin with a discussion of the different algorithms. This is followed by a description of the microarray data pre-processing. Then we elaborate on the definitions of the indices and the performance measurement results using these indices. We examine the difference between the clusters produced by different methods and their possible correlation to our biological knowledge.
K-means is a partitioning algorithm in which the objects are classified into one of k groups, with k chosen a priori. Cluster membership is determined by calculating the centroid of each group and assigning each object to the group with the closest centroid. This approach minimizes the overall within-cluster dispersion by iterative reallocation of cluster members (Hartigan and Wong (1979)).
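In a general sense, a k-partitioning algorithm takes as input a set $S$ of objects and an integer $k$, and outputs a partition of $S$ into subsets $S_1, \ldots, S_k$. It uses the sum of squares as the optimization criterion. Let $x_r^i$ be the $r$th element of $S_i$, $|S_i|$ be the number of elements in $S_i$, and $d(x_r^i, x_s^i)$ be the distance between $x_r^i$ and $x_s^i$. The sum-of-squares criterion is defined by the cost function $c(S_i) = \sum_{r=1}^{|S_i|} \sum_{s=1}^{|S_i|} d(x_r^i, x_s^i)^2$. In particular, k-means works by calculating the centroid of each cluster $S_i$, denoted $\bar{x}^i$, and optimizing the cost function $c(S_i) = \sum_{r=1}^{|S_i|} d(x_r^i, \bar{x}^i)^2$. The goal of the algorithm is to minimize the total cost: $c(S_1) + \cdots + c(S_k)$.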
The implementation of the k-means algorithm we used in this study was the one in S-plus (MathSoft, Inc.), which initializes the cluster centroids with hierarchical clustering by default, and thus gives deterministic outcomes. The output of the k-means algorithm includes the given number of k clusters and their respective centroids.
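The centroid-based procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses the simple Lloyd-style alternation of assignment and centroid update with a random start, not the S-plus implementation with hierarchical initialization used in the study, and the names `kmeans`, `sq_dist` and `centroid` are hypothetical.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(points):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def kmeans(points, k, max_iter=100, seed=0):
    """Lloyd-style k-means; returns (labels, centroids, total_cost)."""
    rng = random.Random(seed)
    # Assumption: start from k randomly chosen objects.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each object goes to the group with the closest centroid.
        new_labels = [min(range(k), key=lambda j: sq_dist(p, centroids[j]))
                      for p in points]
        if new_labels == labels:
            break  # no object changed group, so the partition is stable
        labels = new_labels
        # Update step: recompute centroids; an empty group keeps its old centroid.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = centroid(members)
    # Total cost: sum of squared distances of objects to their own centroid.
    total_cost = sum(sq_dist(p, centroids[lab]) for p, lab in zip(points, labels))
    return labels, centroids, total_cost
```

Because the start is random, different seeds may give different local minima on harder data; a deterministic initialization such as the hierarchical one mentioned above avoids that variability.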
PAM (Partitioning around medoids)
Another k-partitioning approach is PAM, which can be used to cluster types of data for which the mean of objects is not defined or available (Kaufman and Rousseeuw (1990)). Their algorithm finds the representative object (i.e., the medoid, which is the multidimensional version of the median) of each $S_i$, denoted $\tilde{x}^i$, uses the cost function $c(S_i) = \sum_{r=1}^{|S_i|} d(x_r^i, \tilde{x}^i)$, where, as before, $x_r^i$ is the $r$th element of $S_i$ and $d$ is the distance, and tries to minimize the total cost.
We used the implementation of PAM in S-plus. PAM finds a local minimum for the objective function, that is, a solution such that there is no single switch of an object with a medoid that will decrease the total cost.
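The swap condition that characterizes a PAM solution can be illustrated with a small sketch. It is a simplified greedy swap search, not the S-plus routine: it starts from the first k objects (an arbitrary assumption) and accepts any medoid/non-medoid switch that lowers the total cost, stopping when no single switch helps. The function names are hypothetical.

```python
import math


def euclid(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def total_cost(points, medoids, dist=euclid):
    """Sum over all objects of the distance to the nearest medoid."""
    return sum(min(dist(p, points[m]) for m in medoids) for p in points)


def pam(points, k, dist=euclid):
    """Greedy swap search; returns (medoid_indices, labels, cost)."""
    medoids = list(range(k))  # assumption: naive start from the first k objects
    best = total_cost(points, medoids, dist)
    improved = True
    while improved:
        improved = False
        for pos in range(k):
            for cand in range(len(points)):
                if cand in medoids:
                    continue
                # Try switching the medoid at `pos` with object `cand`.
                trial = medoids[:pos] + [cand] + medoids[pos + 1:]
                cost = total_cost(points, trial, dist)
                if cost < best - 1e-12:
                    medoids, best, improved = trial, cost, True
    # Each object is labeled with the position of its nearest medoid.
    labels = [min(range(k), key=lambda j: dist(p, points[medoids[j]]))
              for p in points]
    return medoids, labels, best
```

Since only distances between objects are needed, `dist` can be replaced by any dissimilarity, which is what makes the medoid approach usable when a mean is not defined.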
Partitioning algorithms are based on specifying an initial number of groups, and iteratively reallocating objects among groups until convergence. In contrast, hierarchical algorithms combine or divide existing groups, creating a hierarchical structure that reflects the order in which groups are merged or divided. In an agglomerative method, which builds the hierarchy by merging, the objects initially belong to a list of singleton sets $S_1, \ldots, S_n$. Then a cost function is used to find the pair of sets $\{S_i, S_j\}$ from the list that is the "cheapest" to merge. Once merged, $S_i$ and $S_j$ are removed from the list of sets and replaced with $S_i \cup S_j$. This process iterates until all objects are in a single group. Different variants of agglomerative hierarchical clustering algorithms may use different cost functions. Complete linkage, average linkage, and single linkage methods use the maximum, average, and minimum distances between the members of two clusters, respectively.
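The merging loop and the three linkage rules can be sketched as follows. This is a naive illustration that recomputes all pairwise distances at every step, so it is far slower than production implementations; the names `agglomerate` and `LINKAGE` are hypothetical, and Euclidean distance is assumed.

```python
import math


def euclid(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Cost of merging two sets, computed from all distances between their members.
LINKAGE = {
    "single": min,                            # minimum distance
    "complete": max,                          # maximum distance
    "average": lambda ds: sum(ds) / len(ds),  # average distance
}


def agglomerate(points, n_clusters=1, linkage="average"):
    """Merge the cheapest pair of sets until n_clusters remain.

    Returns (clusters, merges): clusters are lists of point indices and
    merges records (set_a, set_b, cost) in the order the merges happened.
    """
    cost_of = LINKAGE[linkage]
    clusters = [[i] for i in range(len(points))]  # singleton sets
    merges = []
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                ds = [euclid(points[i], points[j])
                      for i in clusters[a] for j in clusters[b]]
                c = cost_of(ds)
                if best is None or c < best[0]:
                    best = (c, a, b)
        c, a, b = best
        merges.append((clusters[a], clusters[b], c))
        merged = clusters[a] + clusters[b]
        # Remove both sets (b first, since b > a) and append their union.
        del clusters[b]
        del clusters[a]
        clusters.append(merged)
    return clusters, merges
```

The recorded merge order and costs are what a dendrogram displays; stopping at `n_clusters` greater than one corresponds to cutting the hierarchy at that level.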
SOM (Self-organizing map)
SOM uses a competition and cooperation mechanism to achieve unsupervised learning. In the classical SOM, a set of nodes is arranged in a geometric pattern, typically a 2-dimensional lattice. Each node is associated with a weight vector with the same dimension as the input space. The purpose of SOM is to find a good mapping from the high-dimensional input space to the 2-D representation of the nodes. One way to use SOM for clustering is to regard the objects in the input space represented by the same node as grouped into a cluster. During training, each object in the input is presented to the map and the best matching node is identified. Formally, when input and weight vectors are normalized, for input sample $x(t)$ the winner index $c$ (best match) is identified by the condition: $\|x(t) - m_c(t)\| \le \|x(t) - m_i(t)\|$ for all $i$,
where $t$ is the time step in the sequential training and $m_i$ is the weight vector of the $i$th node. After that, the weight vectors of nodes around the best-matching node $c = c(x)$ are updated as $m_i(t+1) = m_i(t) + \alpha(t)\, h_{ci}(t)\, (x(t) - m_i(t))$, where $\alpha$ is the learning rate and $h_{ci}$ is the "neighborhood function", a decreasing function of the distance between the $i$th and $c$th nodes on the map grid. To make the map converge quickly, the learning rate and neighborhood radius are often decreasing functions of $t$. After the learning process finishes, each object is assigned to its closest node.
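The sequential training loop can be sketched as below. Several choices here are assumptions made for illustration and are not taken from the study: a Gaussian neighborhood function, linearly decaying learning rate and radius, weights initialized from randomly chosen samples, and the hypothetical names `train_som` and `best_match`.

```python
import math
import random


def best_match(x, weights):
    """Index c of the node whose weight vector is closest to x."""
    return min(range(len(weights)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(x, weights[i])))


def train_som(data, rows, cols, n_steps=2000, alpha0=0.5, sigma0=None, seed=0):
    """Sequential SOM training on a rows x cols grid; returns the weight vectors."""
    rng = random.Random(seed)
    grid = [(r, c) for r in range(rows) for c in range(cols)]
    # Assumption: each node starts at a randomly chosen input sample.
    weights = [list(rng.choice(data)) for _ in grid]
    sigma0 = sigma0 or max(rows, cols) / 2.0
    for t in range(n_steps):
        frac = t / n_steps
        alpha = alpha0 * (1.0 - frac)            # decreasing learning rate
        sigma = max(sigma0 * (1.0 - frac), 0.1)  # shrinking neighborhood radius
        x = rng.choice(data)
        c = best_match(x, weights)
        for i, node in enumerate(grid):
            # Squared grid distance between node i and the winning node c.
            d2 = (node[0] - grid[c][0]) ** 2 + (node[1] - grid[c][1]) ** 2
            h = math.exp(-d2 / (2.0 * sigma ** 2))  # Gaussian neighborhood
            # Move the weight vector a fraction alpha * h of the way toward x.
            weights[i] = [w + alpha * h * (xi - w)
                          for w, xi in zip(weights[i], x)]
    return weights
```

After training, calling `best_match(x, weights)` for each object gives the node, and hence the cluster, to which it is assigned.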
CLUSTERING PREREQUISITES IN GENE EXPRESSION
Clustering GE usually involves the following basic steps:
(1) Pattern representation: This involves the representation of the data matrix for clustering, that is, the number, type, dimension and scale of the GE profiles available. A number of these are fixed during execution of the experiment; on the other hand, certain features are controllable, such as the scaling of measurements, imputation, normalisation techniques, representations of up/down-regulation etc. An optional step of feature selection or feature extraction can be carried out.
These are two distinct procedures: the former refers to selecting the subset of the original features that would be most effective to use in the clustering procedure, the latter to the use of transformations of the input features to produce new salient features that may be more discriminative in the clustering procedure, e.g. Principal Component Analysis.
(2) Definition of pattern proximity measure: Proximity is typically measured as a distance between pairs of genes. Alternatively, conceptual measures can be used to characterize the similarity among a group of gene profiles, e.g. the mean squared residue score of Cheng and Church.
(3) Clustering the data: A clustering algorithm is used to find structures (clusters) in the dataset. Clustering methods can be broadly categorized, for example into the partitioning and hierarchical approaches described above.
(4) Data abstraction: Representation of structures found in the dataset. In GE data, this is usually human-oriented, so the data abstraction must be easy to interpret. It is usually a compact description of each cluster, through a cluster prototype or a representative selection of patterns within the cluster, such as the cluster centroid.
(5) Assessment of output: Validity of clustering results is essential to cluster analysis of GE data. A cluster output is valid if it cannot reasonably be achieved by chance or as an artifact of the clustering algorithm. Validation is achieved by careful application of statistical methods and testing hypotheses. These measures can be categorized as:
(i) Internal validation,
(ii) External validation and
(iii) Relative validation.
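As an illustration of internal validation, which uses only the data and the cluster labels, the silhouette width can be computed as sketched below. For object i, a is the average distance to the other members of its own cluster, b is the smallest average distance to the members of any other cluster, and s(i) = (b - a) / max(a, b). The function name is hypothetical, Euclidean distance is assumed, and at least two clusters are required.

```python
import math


def euclid(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def silhouette_widths(points, labels, dist=euclid):
    """Per-object silhouette s(i) = (b - a) / max(a, b); needs >= 2 clusters."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    widths = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            widths.append(0.0)  # convention for a singleton cluster
            continue
        # a: average distance to the other members of the object's own cluster.
        a = sum(dist(points[i], points[j]) for j in own) / len(own)
        # b: smallest average distance to the members of another cluster.
        b = min(
            sum(dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items() if other != lab
        )
        denom = max(a, b)
        widths.append((b - a) / denom if denom > 0 else 0.0)
    return widths
```

Values near 1 indicate an object that sits well inside its cluster, values near 0 an object between clusters, and negative values an object that is on average closer to another cluster. The mean over all objects gives a single score for comparing clusterings.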
REQUIREMENTS FOR CLUSTERING ANALYSIS
Typical Problems and Desired Characteristics
The desired characteristics of a clustering algorithm depend on the particular problem under consideration.
Clustering techniques for large sets of data must be scalable, both in terms of speed and space. It is not unusual for a database to contain millions of records, and thus any clustering algorithm used should have linear or near-linear time complexity to handle such large data sets. (Even algorithms that have a complexity of $O(m^2)$ are not practical for large data sets.) Some clustering techniques use statistical sampling. Nonetheless, there are cases, e.g., situations where relatively rare points have a dramatic effect on the final clustering, where sampling is insufficient.
Furthermore, clustering techniques for databases cannot assume that all the data will fit in main memory or that data elements can be randomly accessed. Algorithms that make these assumptions are, likewise, infeasible for large data sets. Accessing data points sequentially and not being dependent on having all the data in main memory at once are important characteristics for scalability.
Independence of the order of input
Some clustering algorithms are dependent on the order of the input, i.e., if the order in which the data points are processed changes, then the resulting clusters may change. This is unappealing since it calls into question the validity of the clusters that have been discovered. They may just represent local minima or artifacts of the algorithm.
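Order dependence is easy to demonstrate with a one-pass scheme. The sketch below uses the simple "leader" algorithm on one-dimensional values as an illustrative example of an order-dependent method; it is not an algorithm discussed elsewhere in this chapter, and the function name and threshold are hypothetical.

```python
def leader_clustering(values, threshold):
    """One-pass 'leader' clustering of 1-D values.

    Each value joins the first cluster whose leader is within `threshold`;
    otherwise it starts a new cluster and becomes that cluster's leader.
    """
    leaders, clusters = [], []
    for v in values:
        for k, lead in enumerate(leaders):
            if abs(v - lead) <= threshold:
                clusters[k].append(v)
                break
        else:
            # No existing leader is close enough: start a new cluster.
            leaders.append(v)
            clusters.append([v])
    return clusters
```

With a threshold of 1, presenting the values 1, 2, 3 in that order gives two clusters, because 3 is too far from the leader 1, whereas presenting 2 first makes it the leader and all three values end up in one cluster.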
Effective means of detecting and dealing with noise or outlying points
A point which is noise or is simply an atypical point (outlier) can often distort a clustering algorithm. By applying tests that determine if a particular point really belongs to a given cluster, some algorithms can detect noise and outliers and delete them or otherwise eliminate their negative effects. This processing can occur either while the clustering process is taking place or as a post-processing step.
However, in some instances, points cannot be discarded and must be clustered as well as possible. In such cases, it is important to make sure that these points do not distort the clustering process for the majority of the points.
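One possible post-processing test is sketched below: an object is flagged when its distance to its own cluster centroid is far above what is typical. The cutoff rule (median plus a multiple of the median absolute deviation) and the default multiplier are arbitrary illustrative assumptions, and `flag_outliers` is a hypothetical name. The sketch requires Python 3.8 or later for `math.dist`.

```python
import math
import statistics


def flag_outliers(points, labels, centroids, n_mads=5.0):
    """Return indices of objects unusually far from their own centroid.

    A distance counts as unusual if it exceeds median + n_mads * MAD of all
    object-to-centroid distances (n_mads is an illustrative choice).
    """
    dists = [math.dist(p, centroids[lab]) for p, lab in zip(points, labels)]
    med = statistics.median(dists)
    mad = statistics.median([abs(d - med) for d in dists])
    cutoff = med + n_mads * mad
    return [i for i, d in enumerate(dists) if d > cutoff]
```

Flagged objects can then be removed, down-weighted, or kept but excluded when centroids are recomputed, depending on whether the application allows points to be discarded.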
Effective means of evaluating the validity of clusters that are produced
It is common for clustering algorithms to produce clusters that are not "good" clusters when evaluated later.
Easy interpretability of results
Many clustering methods produce cluster descriptions that are just lists of the points belonging to each cluster. Such results are often hard to interpret. A description of a cluster as a region may be much more understandable than a list of points. This may take the form of a hyper-rectangle or a center point with a radius. Also, data clustering is sometimes preceded by a transformation of the original data space - often into a space with a reduced number of dimensions. While this can be helpful for finding clusters, it can make the results very hard to interpret.
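The two region descriptions mentioned above, a hyper-rectangle and a center point with a radius, can be derived from a list of points as in this minimal sketch. The function name and the returned dictionary keys are hypothetical, and `math.dist` requires Python 3.8 or later.

```python
import math


def describe_cluster(points):
    """Summarise a cluster as a bounding hyper-rectangle and a centre with radius."""
    dims = list(zip(*points))                    # one tuple of values per dimension
    lower = [min(d) for d in dims]               # lower corner of the rectangle
    upper = [max(d) for d in dims]               # upper corner of the rectangle
    centre = [sum(d) / len(d) for d in dims]     # mean of the points
    radius = max(math.dist(p, centre) for p in points)  # farthest member
    return {"lower": lower, "upper": upper, "centre": centre, "radius": radius}
```

Such a summary is compact, but it can overstate the extent of a cluster that is elongated or contains a few distant members.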
The ability to find clusters in subspaces of the original space
Clusters often occupy a subspace of the full data space, hence the popularity of dimensionality reduction techniques. Many algorithms have difficulty finding, for example, a 5-dimensional cluster in a 10-dimensional space.
The ability to handle distances in high dimensional spaces properly
High-dimensional spaces are quite different from low-dimensional spaces. In [BGRS99], it is shown that the distances between the closest and farthest neighbors of a point may be very similar in high-dimensional spaces. Perhaps an intuitive way to see this is to realize that the volume of a hyper-sphere with radius $r$ and dimension $d$ is proportional to $r^d$, and thus, for high dimensions, a small change in radius means a large change in volume. Distance-based clustering approaches may not work well in such cases. If the distances between points in a high-dimensional space are plotted, then the graph will often show two peaks: a "small" distance representing the distance between points in clusters, and a "larger" distance representing the average distance between points. If only one peak is present, or if the two peaks are close, then clustering via distance-based approaches will likely be difficult. Yet another set of problems has to do with how to weight the different dimensions. If different aspects of the data are being measured on different scales, then a number of difficult issues arise. Most distance functions will weight dimensions with greater ranges of data more highly. Also, clusters that are determined by using only certain dimensions may be quite different from the clusters determined by using different dimensions. Some techniques are based on using the dimensions that result in the greatest differentiation between data points. Many of these issues are related to the topic of feature selection, which is an important part of pattern recognition.
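The closest/farthest neighbor effect can be observed with a small experiment on uniformly random points, sketched below. The uniform distribution, the sample sizes, and the name `relative_contrast` are assumptions made for illustration; real data with cluster structure may behave differently. `math.dist` requires Python 3.8 or later.

```python
import math
import random


def relative_contrast(n_points, dim, seed=0):
    """(farthest - nearest) / nearest distance from a query point to random points."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = [math.dist(query, [rng.random() for _ in range(dim)])
             for _ in range(n_points)]
    return (max(dists) - min(dists)) / min(dists)
```

Comparing, say, `relative_contrast(200, 2)` with `relative_contrast(200, 500)` shows the ratio shrinking sharply as the dimension grows, which is the sense in which nearest and farthest neighbors become hard to tell apart.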
Ability to function in an incremental manner
In certain cases, e.g., data warehouses, the underlying data used for the original clustering can change over time. If the clustering algorithm can incrementally handle the addition of new data or the deletion of old data, then this is usually much more efficient than re-running the algorithm on the new data set.
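For centroid-based clusters, incremental handling can be as simple as keeping a running mean that is adjusted when a point is added or deleted, as in the sketch below. The class name is hypothetical, and the sketch covers only the centroid bookkeeping, not the decision of which cluster a new point should join.

```python
class RunningCentroid:
    """Centroid of a cluster that supports adding and removing points in O(d)."""

    def __init__(self, dim):
        self.n = 0
        self.mean = [0.0] * dim

    def add(self, x):
        """Fold a new point into the mean without revisiting old points."""
        self.n += 1
        self.mean = [m + (xi - m) / self.n for m, xi in zip(self.mean, x)]

    def remove(self, x):
        """Take a previously added point back out of the mean."""
        if self.n <= 1:
            self.n, self.mean = 0, [0.0] * len(self.mean)
            return
        self.mean = [(m * self.n - xi) / (self.n - 1)
                     for m, xi in zip(self.mean, x)]
        self.n -= 1
```

Each update touches only the affected cluster and costs time proportional to the number of dimensions, in contrast to re-running the clustering over the whole data set.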
APPLICATIONS OF CLUSTERING
Biology, computational biology and bioinformatics
Plant and animal ecology
Cluster analysis is used to describe and to make spatial and temporal comparisons of communities (assemblages) of organisms in heterogeneous environments; it is also used in plant systematics to generate artificial phylogenies or clusters of organisms (individuals) at the species, genus or higher level that share a number of attributes.
Clustering is used to build groups of genes with related expression patterns (also known as coexpressed genes). Often such groups contain functionally related proteins, such as enzymes for a specific pathway, or genes that are co-regulated. High throughput experiments using expressed sequence tags (ESTs) or DNA microarrays can be a powerful tool for genome annotation, a general aspect of genomics.
Clustering is used to group homologous sequences into gene families. This is a very important concept in bioinformatics, and evolutionary biology in general. See evolution by gene duplication.
High-throughput genotyping platforms
Clustering algorithms are used to automatically assign genotypes.
Human genetic clustering
The similarity of genetic data is used in clustering to infer population structures.
On PET scans, cluster analysis can be used to differentiate between different types of tissue and blood in a three-dimensional image. In this application, actual position does not matter, but the voxel intensity is considered as a vector, with a dimension for each image that was taken over time. This technique allows, for example, accurate measurement of the rate at which a radioactive tracer is delivered to the area of interest, without a separate sampling of arterial blood, an invasive technique that is most common today.
Clustering can be used to divide a fluence map into distinct regions for conversion into deliverable fields in MLC-based Radiation Therapy.
Cluster analysis is widely used in market research when working with multivariate data from surveys and test panels. Market researchers use cluster analysis to partition the general population of consumers into market segments and to better understand the relationships between different groups of consumers or potential customers, for use in market segmentation, product positioning, new product development and selecting test markets.
Grouping of shopping items
Clustering can be used to group all the shopping items available on the web into a set of unique products. For example, all the items on eBay can be grouped into unique products.
Social network analysis
In the study of social networks, clustering may be used to recognize communities within large groups of people.
Search result grouping
In the process of intelligent grouping of the files and websites, clustering may be used to create a more relevant set of search results compared to normal search engines like Google. There are currently a number of web based clustering tools such as Clusty.
Slippy map optimization
Flickr's map of photos and other map sites use clustering to reduce the number of markers on a map. This makes the map faster and reduces the amount of visual clutter.
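One simple way to reduce markers is grid-based clustering, sketched below: markers falling in the same grid cell are replaced by a single marker at their mean position, together with a count. This is an assumed illustration only; the text does not say which method any particular map site uses, and the function name and `cell_size` parameter are hypothetical.

```python
from collections import defaultdict


def cluster_markers(markers, cell_size):
    """Group (x, y) markers by grid cell; return (centre_x, centre_y, count) per cell."""
    cells = defaultdict(list)
    for x, y in markers:
        # Integer cell coordinates identify the grid square a marker falls in.
        cells[(int(x // cell_size), int(y // cell_size))].append((x, y))
    out = []
    for pts in cells.values():
        n = len(pts)
        out.append((sum(p[0] for p in pts) / n, sum(p[1] for p in pts) / n, n))
    return out
```

In practice `cell_size` would be tied to the zoom level, so that markers split apart as the user zooms in.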
Clustering is useful in software evolution as it helps to reduce legacy properties in code by reforming functionality that has become dispersed. It is a form of restructuring and hence a form of direct preventative maintenance.
Clustering can be used to divide a digital image into distinct regions for border detection or object recognition.
Clustering may be used to identify different niches within the population of an evolutionary algorithm so that reproductive opportunity can be distributed more evenly amongst the evolving species or subspecies.
Recommender systems are designed to recommend new items based on a user's tastes. They sometimes use clustering algorithms to predict a user's preferences based on the preferences of other users in the user's cluster.
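A minimal sketch of the idea: once users have been assigned to clusters by some clustering algorithm, a user's rating for an unseen item is predicted as the mean rating given by the other users in the same cluster. The data layout (nested dictionaries) and the function name are assumptions made for illustration.

```python
def predict_rating(user, item, ratings, cluster_of):
    """Predict `user`'s rating of `item` as the mean rating from cluster peers.

    ratings: {user: {item: rating}}; cluster_of: {user: cluster label}.
    Returns None when no peer in the cluster has rated the item.
    """
    peers = [u for u, c in cluster_of.items()
             if c == cluster_of[user] and u != user
             and item in ratings.get(u, {})]
    if not peers:
        return None
    return sum(ratings[u][item] for u in peers) / len(peers)
```

Items can then be ranked by predicted rating, and the highest-ranked ones that the user has not yet seen are recommended.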
Markov chain Monte Carlo methods
Clustering is often utilized to locate and characterize extrema in the target distribution.
Cluster analysis can be used to identify areas where there are greater incidences of particular types of crime. By identifying these distinct areas or "hot spots" where a similar crime has happened over a period of time, it is possible to manage law enforcement resources more effectively.
Educational data mining
Cluster analysis is used, for example, to identify groups of schools or students with similar properties.
Clustering algorithms are used for robotic situational awareness to track objects and detect outliers in sensor data.
In chemistry, clustering is used to find structural similarity; for example, 3000 chemical compounds were clustered in the space of 90 topological indices.
In climatology, clustering is used to find weather regimes or preferred sea level pressure atmospheric patterns.