An Explanation On A Spanning Tree English Language Essay


A spanning tree is an acyclic subgraph of a graph G, which contains all the vertices from G. The minimum spanning tree (MST) of a weighted graph is the minimum-weight spanning tree of that graph. With the classical MST algorithms [18, 13, 15], the cost of constructing a minimum spanning tree is O(m log n), where m is the number of edges in the graph and n is the number of vertices. More efficient algorithms for constructing MSTs have also been extensively researched [12, 7, 8]. These algorithms promise close to linear time complexity under different assumptions. A Euclidean minimum spanning tree (EMST) is a spanning tree of a set of n points in a metric space (Eⁿ), where the length of an edge is the Euclidean distance between a pair of points in the point set. The MST clustering algorithm is known to be capable of detecting clusters with irregular boundaries [24]. Unlike traditional clustering algorithms, the MST clustering algorithm does not assume a spherically shaped clustering structure of the underlying data. The EMST clustering algorithm [17, 24] uses the Euclidean minimum spanning tree of a graph to produce the structure of point clusters in the n-dimensional Euclidean space. Clusters are detected so as to achieve some measure of optimality, such as minimum intracluster distance or maximum intercluster distance [1]. The (E)MST clustering algorithm has been widely used in practice.


Once the MST is built for a given input, there are two different ways to produce a group of clusters. If the number of clusters k is given in advance, the simplest way to obtain k clusters is to sort the edges of the minimum spanning tree in descending order of their weights and remove the k − 1 edges with the heaviest weights [1, 22]. We call this approach the standard EMST clustering algorithm, or SEMST, in the rest of the paper. The second approach does not require a preset cluster number. Edges that satisfy a predefined inconsistency measure are removed from the tree. We use the inconsistency measure suggested by Zahn in [24], and therefore we call the clustering algorithm Zahn's EMST clustering algorithm, or ZEMST.
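
To make the SEMST step concrete, the following Python sketch (our own illustration, not code from the cited papers) builds the EMST with scipy and cuts the k − 1 heaviest edges; the function name semst_labels and the toy data are illustrative.

import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree, connected_components
from scipy.spatial.distance import pdist, squareform

def semst_labels(points, k):
    """Cut the k - 1 heaviest EMST edges; each remaining component is a cluster."""
    dist = squareform(pdist(points))                 # complete graph of Euclidean distances
    mst = minimum_spanning_tree(dist).toarray()      # EMST as a dense adjacency matrix
    rows, cols = np.nonzero(mst)
    weights = mst[rows, cols]
    keep = np.argsort(weights)[: max(len(weights) - (k - 1), 0)]   # drop the k-1 heaviest
    pruned = np.zeros_like(mst)
    pruned[rows[keep], cols[keep]] = weights[keep]
    _, labels = connected_components(pruned, directed=False)
    return labels

# toy usage: two well-separated blobs should come back as two clusters
pts = np.vstack([np.random.randn(30, 2), np.random.randn(30, 2) + 8.0])
print(semst_labels(pts, k=2))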

In this paper, we propose two EMST-based clustering algorithms to address the issues of undesired clustering structures and an unnecessarily large number of clusters, commonly faced by the SEMST and the ZEMST algorithms respectively. Our first algorithm assumes that the number of clusters is given. The algorithm constructs an EMST of a point set and removes the inconsistent edges that satisfy an inconsistency measure. The process is repeated to create a hierarchy of clusters until k clusters are obtained. The second algorithm partitions the point set into a group of clusters by maximizing the overall standard deviation reduction. The final number of clusters is determined by finding the local minimum of the standard deviation reduction function.

RELATED WORK

Clustering algorithms based on minimum and maximum spanning trees have been extensively studied.

In the mid 80's, Avis [2] found an O(n² log² n) algorithm for the minmax diameter 2-clustering problem.

Asano, Bhattacharya, Keil, and Yao [1] later gave an optimal O(n log n) algorithm using maximum spanning trees for minimizing the maximum diameter of a bipartition. The problem becomes NP-complete when the number of partitions is beyond two [11]. Asano, Bhattacharya, Keil, and Yao also considered the clustering problems in which the goal is to maximize the minimum intercluster distance. They gave an O(n log n) algorithm for computing a k-partition of a point set by removing the k − 1 longest edges from the minimum spanning tree constructed from that point set [1].

Zahn [24] proposes to construct an MST of a point set and delete the inconsistent edges, i.e. the edges whose weights are significantly larger than the average weight of the nearby edges in the tree. The inconsistency measure requires that one of the following three conditions holds:

1. w > w̄_N1 + c × σ_N1  or  w > w̄_N2 + c × σ_N2

2. w > max(w̄_N1 + c × σ_N1, w̄_N2 + c × σ_N2)

3. w / max(c × w̄_N1, c × w̄_N2) > f,

where w is the weight of the edge under consideration, w̄_N1 and σ_N1 (respectively w̄_N2 and σ_N2) are the average weight and standard deviation of the edges in the neighbourhood N1 (respectively N2) on either side of the edge, and c and f are preset constants. All the edges of a tree that satisfy the inconsistency measure are considered inconsistent and are removed from the tree. This results in a set of disjoint subtrees, each of which represents a separate cluster. Note that the resulting cluster structure is affected by the depth d of the neighborhoods and by the constants c and f.
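
As an illustration of how such an inconsistency test might be coded, the sketch below applies condition 2 to one tree edge, assuming the MST is stored as an adjacency list; the depth d, the constant c and all function names are illustrative choices of ours rather than values fixed by Zahn.

from statistics import mean, pstdev

def neighborhood_weights(adj, start, banned, depth):
    """Weights of tree edges reachable from start within depth hops,
    without crossing the banned edge."""
    weights, frontier, seen = [], [start], {start}
    for _ in range(depth):
        nxt = []
        for u in frontier:
            for v, w in adj[u]:
                if (u, v) == banned or (v, u) == banned or v in seen:
                    continue
                weights.append(w)
                seen.add(v)
                nxt.append(v)
        frontier = nxt
    return weights

def is_inconsistent(adj, u, v, w, c=2.0, depth=2):
    """Condition 2: the edge weight exceeds mean + c*std on both sides."""
    side1 = neighborhood_weights(adj, u, (u, v), depth)
    side2 = neighborhood_weights(adj, v, (u, v), depth)
    def bound(ws):
        return mean(ws) + c * pstdev(ws) if ws else float("inf")
    return w > max(bound(side1), bound(side2))

# toy tree: a path 0-1-2-3-4-5 with one unusually long middle edge
adj = {
    0: [(1, 1.0)],
    1: [(0, 1.0), (2, 1.1)],
    2: [(1, 1.1), (3, 9.0)],
    3: [(2, 9.0), (4, 1.2)],
    4: [(3, 1.2), (5, 0.9)],
    5: [(4, 0.9)],
}
print(is_inconsistent(adj, 2, 3, 9.0))   # the long middle edge is flagged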


Eldershaw and Hegland [6] re-examine the limitations of many clustering algorithms that assume the underlying clusters of a data set are spherical. They present a clustering algorithm by constructing a graph using Delaunay triangulation, and removing the edges between neighbors that are longer than a cut-off point. Next, they apply a graph partitioning algorithm to find the isolated connected components in the graph, and each discovered component is treated as a cluster. Similar to Zahn's MST clustering algorithm, this algorithm divides a point set into a certain number of clusters at once by removing all edges in the graph that are longer than a threshold. Unlike in Zahn's method, they choose a cut-off point which corresponds to the "global" minimum of a function that measures how well the consistent edges and the inconsistent edges in the graph are separated.

More recently, Päivinen [16] proposed a scale-free minimum spanning tree (SFMST) clustering algorithm which constructs a scale-free network and outputs clusters containing highly connected vertices and those connected to them.

The MST clustering algorithm has been widely used in practice. Xu (Ying), Olman and Xu (Dong) [22] use an MST to represent multidimensional gene expression data. They point out that an MST-based clustering algorithm does not assume that data points are grouped around centers or separated by a regular geometric curve. Thus the shape of a cluster boundary has little impact on the performance of the algorithm. They describe three objective functions and the corresponding clustering algorithms for computing a k-partition of the spanning tree for any predefined k > 0. The first algorithm simply removes the k − 1 longest edges so that the total weight of the k subtrees is minimized. The second objective function is defined to minimize the total distance between the center and each data point in a cluster. The algorithm first removes k − 1 edges from the tree, which creates a k-partition. Next, it repeatedly merges a pair of adjacent partitions and finds its optimal 2-clustering solution. They observe that the algorithm quickly converges to a local minimum. The third objective function is defined to minimize the total distance between the "representative" of a cluster and each point in the cluster. The representatives are selected so that the objective function is optimized. This algorithm runs in exponential time in the worst case.

Oleksandr Grygorash, Yan Zhou and Zach Jorgensen (04031882.pdf) proposed two EMST-based clustering algorithms: the first, which assumes a given cluster number, is called the hierarchical EMST clustering algorithm, or HEMST; the second, which does not, is called the maximum standard deviation reduction clustering algorithm, or MSDR.

HEMST clustering algorithm

Given a point set S in Eⁿ and the desired number of clusters k, the hierarchical method starts by constructing an MST from the points in S. The weight of an edge in the tree is the Euclidean distance between the two end points. Next, the average weight w̄ of the edges in the entire EMST and its standard deviation σ are computed; any edge with a weight w > w̄ + σ is removed from the tree. This leads to a set of disjoint subtrees ST = {T1, T2, . . .}. Each of the subtrees Ti is treated as a cluster, which has a centroid ci. If the number of subtrees |ST| < k, then k − |ST| additional longest edges are removed from the entire edge set of ST to produce k disjoint subtrees. If |ST| > k, a representative point is identified for each subtree. The representative point ri for a cluster Ti ∈ ST is defined as the point p ∈ Ti that is closest to the centroid ci of Ti; in other words, d(p, ci) = min_{pj ∈ Ti} d(pj, ci). Once all the representative points are found, each point in a particular subtree is replaced with the representative point of the subtree, thus reducing the number of points in S to |ST|. An EMST is constructed from the representative points of the clusters, and the same tree partitioning process is repeated. When |ST| = k, the clustering process is considered complete, having produced the required k clusters.
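
The following Python sketch condenses the HEMST loop described above under several simplifications: it returns only the final representative points and their labels, and it does not propagate memberships back to the original points. The helper name hemst_sketch and the reliance on scipy routines are our choices, not part of the original paper.

import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree, connected_components
from scipy.spatial.distance import pdist, squareform

def hemst_sketch(points, k):
    """Prune heavy EMST edges, collapse subtrees to representative points, and
    repeat until exactly k subtrees remain."""
    pts = np.asarray(points, dtype=float)
    while True:
        mst = minimum_spanning_tree(squareform(pdist(pts))).toarray()
        rows, cols = np.nonzero(mst)
        weights = mst[rows, cols]
        cut = weights > weights.mean() + weights.std()        # w > w_bar + sigma
        extra = (k - 1) - int(cut.sum())                       # too few subtrees?
        for idx in np.argsort(weights)[::-1]:                  # heaviest edges first
            if extra <= 0:
                break
            if not cut[idx]:
                cut[idx] = True
                extra -= 1
        pruned = mst.copy()
        pruned[rows[cut], cols[cut]] = 0
        n_sub, labels = connected_components(pruned, directed=False)
        if n_sub == k:
            return pts, labels
        # more than k subtrees: collapse each one to the member nearest its centroid
        reps = []
        for s in range(n_sub):
            members = pts[labels == s]
            centroid = members.mean(axis=0)
            reps.append(members[np.argmin(np.linalg.norm(members - centroid, axis=1))])
        pts = np.asarray(reps)

pts = np.vstack([np.random.randn(25, 2), np.random.randn(25, 2) + 6.0])
final_pts, labels = hemst_sketch(pts, k=2)
print(len(final_pts), np.bincount(labels))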

Instead of removing the k−1 longest edges all at once to create a k-partition, our algorithm first partitions the point set into a number of more compact clusters. Subsequently, a new partitioning process is repeated on the EMST constructed from a much smaller set of representative points. Each representative point is close to the centroid of the subset created in the previous round. The algorithm eventually outputs k representative points r1, . . . , rk for the final required k clusters. Each point in the given point set is grouped according to its membership in a particular subtree. A point p is assigned to a cluster i if p ∈ Ti.

MSDR clustering algorithm


The HEMST algorithm assumes that the desired number of clusters is given in advance. In practice, determining the number of clusters is often coupled with discovering the cluster structure. In this section, we present another EMST based clustering algorithm, the maximum standard deviation reduction clustering algorithm (MSDR), which does not require a predefined cluster number. The MSDR algorithm first constructs an EMST from the given point set S. Next, it computes the standard deviation of the edges in the EMST and removes an edge to obtain a set of two disjoint subtrees such that the overall standard deviation reduction is maximized. This edge-removing process is repeated to create more disjoint subtrees until the overall standard deviation reduction is within a threshold. The desired number of clusters is obtained by finding the local minimum of the standard deviation reduction function.

Given a point set S, our algorithm groups the data points in such a way that each pair of points in a group is either directly or indirectly close to each other in the metric space. Two points are directly close to each other if the distance between them is small. They are indirectly close to each other if they are far apart but there exists a point in the same group to which both points are close. This objective allows us to detect clusters that have more complex geometric shapes than spherical clusters. For a given point set and the corresponding minimum spanning tree, we partition the MST into a set of disjoint subtrees S_K = {T_1, T_2, ..., T_K} so as to maximize the overall standard deviation reduction

Δσ(S_K) = σ(T_0) − σ(S_K),

subject to Δσ(S_K) − Δσ(S_{K−1}) > ε, where T_0 denotes the original EMST, S_K denotes the final partition S_K = {T_1, T_2, ..., T_K} of T_0 that results in the maximum overall standard deviation reduction, σ(T_0) denotes the standard deviation of the edge weights in T_0, and σ(S_K) denotes the weighted average of the standard deviations of the edge weights in the disjoint trees T_j, j = 1, ..., K, of S_K:

σ(S_K) = Σ_{j=1}^{K} (|T_j| / |T_0|) · σ(T_j),

with |T_j| the number of edges in T_j. Δσ(S_K) denotes the maximum standard deviation reduction that leads to the partition S_K = {T_1, T_2, ..., T_K}, and Δσ(S_{K−1}) denotes the maximum standard deviation reduction leading to the immediate predecessor of S_K, i.e. S_{K−1} = {T_1, ..., T_{K−1}}; ε is a small positive value that determines when the iterative edge-removing process stops. The desired number of clusters is determined by applying polynomial regression to the values Δσ(S_K)_i produced in each iteration i of the edge-removing process. The critical point of the regression function with a positive second derivative (a local minimum) is chosen as the final number of clusters. The points in each subtree T_j ∈ S_K are the members of one cluster.
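
A greedy reading of the MSDR procedure might look like the following Python sketch: at each step the MST edge whose removal yields the largest standard deviation reduction is cut, and the process stops when the incremental reduction falls below ε. The edge-count weighting of σ(S_K), the value of epsilon, the brute-force re-evaluation of every edge and all function names are our assumptions for illustration; the polynomial-regression step that picks the final cluster count is omitted.

import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree, connected_components
from scipy.spatial.distance import pdist, squareform

def weighted_sigma(tree, labels, n_sub):
    """Edge-count weighted average of the per-subtree standard deviations."""
    rows, cols = np.nonzero(tree)
    total, count = 0.0, 0
    for s in range(n_sub):
        in_s = (labels[rows] == s) & (labels[cols] == s)
        w = tree[rows[in_s], cols[in_s]]
        if len(w):
            total += len(w) * w.std()
            count += len(w)
    return total / count if count else 0.0

def msdr_sketch(points, epsilon=1e-3):
    """Greedily cut the edge giving the largest standard deviation reduction
    until the extra reduction falls below epsilon; returns cluster labels."""
    tree = minimum_spanning_tree(squareform(pdist(points))).toarray()
    rows, cols = np.nonzero(tree)
    sigma0 = tree[rows, cols].std()              # sigma(T0)
    prev_reduction = 0.0
    while True:
        best = None
        for r, c in zip(*np.nonzero(tree)):      # try removing every remaining edge
            trial = tree.copy()
            trial[r, c] = 0.0
            n_sub, labels = connected_components(trial, directed=False)
            reduction = sigma0 - weighted_sigma(trial, labels, n_sub)
            if best is None or reduction > best[0]:
                best = (reduction, r, c)
        if best is None or best[0] - prev_reduction < epsilon:
            break
        tree[best[1], best[2]] = 0.0
        prev_reduction = best[0]
    _, labels = connected_components(tree, directed=False)
    return labels

pts = np.vstack([np.random.randn(20, 2), np.random.randn(20, 2) + 7.0])
print(msdr_sketch(pts))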

N.S. Päivinen and T.K. Grönfors (01631514.pdf) proposed a modified scale-free MST clustering method. When constructing a scale-free minimum spanning tree (SFMST), reversed distances are used as the edge weights; the edge with the biggest weight is added to the tree in each step of the construction algorithm. (In fact, the resulting spanning tree could be said to be a maximum spanning tree.) If d(i, j) denotes the distance between the ith and jth nodes (measured in the Euclidean metric), the edge weights w₀(i, j) can be defined as w₀(i, j) = ⌈max_{i,j} d(i, j)⌉ − d(i, j), where ⌈·⌉ denotes rounding up to the nearest integer; this prevents any edge weight from being zero. If a node gets "enough" edges, meaning that the number of edges exceeds a predefined threshold value, each additional edge gives an extra fitness bonus to the highly connected node, or rather to all the possible edges originating from that node. The fitness value was defined as w_new(i, j) = w₀(i, j) + n·cⁿ, where n is the number of edges and c is a constant, 0.5 < c < 1 [8].

One way to get rid of the threshold value telling when to update the weights is to use a "gravity-like" weight updating. First, the original edge weights are set to w₀(i, j) = 1/d(i, j)², and the weights are updated every time an edge is added to the tree. The weight updating is defined as w_new(i, j) = n·cⁿ / d(i, j)², where c is a constant with the same value range as before and n is the number of edges. In this study, different values of c ranging from c = 0.6 to c = 0.98 were tested. This version of SFMST clustering is called the modified SFMST method, and its performance is examined in this study.
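
Written out as code, the two weighting schemes above are simply the following small functions (d, n and c as in the text; function names are ours):

import math

def w0_threshold(d, d_max):
    """Reversed distance with ceiling, so weights never reach zero."""
    return math.ceil(d_max) - d

def w_new_threshold(w0, n, c):
    """Fitness bonus once a node has 'enough' edges: w0 + n * c**n."""
    return w0 + n * c ** n

def w0_gravity(d):
    """Gravity-like base weight, 1 / d^2."""
    return 1.0 / d ** 2

def w_new_gravity(d, n, c):
    """Gravity-like updated weight, n * c**n / d^2, applied after every added edge."""
    return n * c ** n / d ** 2

print(w_new_gravity(d=2.0, n=5, c=0.8))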

An SFMST can be seen as being composed of hubs, bunches and branches. A hub is a highly connected node (having at least m neighbors; the value of m can be defined in many ways depending on the situation), a bunch is a hub together with all its neighbors, and a branch is a chain of nodes. Two nodes are neighbors if they are linked by an edge. The constant c affects the number of hubs in the SFMST: a smaller value means more (smaller) hubs, while a larger value corresponds to fewer hubs with many links; see Figures 3-5 for an example of the effect of c. It could also be said that the parameter c indicates how converged the SFMST structure is. To obtain a clustering from an SFMST, edges need to be removed from the structure. In this study, a cluster in an SFMST is defined to be a hub and its neighbors. If two hubs are connected directly to each other, or there is only one node between them, they belong to the same cluster. If there is a branch coming from a bunch node, it is assigned to the same cluster. See Figure 1 for three examples. Using this definition, it is possible that there are nodes which do not belong to any cluster; they are referred to as non-clustered nodes.

In the paper Finding the Optimal Number of Clusters from Artificial Datasets (04097652.pdf), scale-free minimum spanning trees (SFMSTs) were constructed from artificial test datasets, and the number of clusters, based on the distribution of the edge lengths, as well as the clustering itself, was obtained from the structure. In this study, we claim that the scale-free clustering method [3] presented earlier finds the clustering and the number of clusters in a single execution round: there is no need to compute different clusterings from which the best one has to be selected.

Clustering methods

In this study, three clustering methods were used: the nearest neighbor (nn) and k-means clustering [2] as references, and a scale-free clustering method. The nearest neighbor, or single linkage, method can be realized with a minimum spanning tree [6]. The last method, scale-free clustering, is based on a minimum spanning tree (MST) clustering method [7], in which the data points constitute a complete weighted graph with the distances between the points as the weights. The minimum spanning tree of this graph is then constructed, and by removing some edges from it, a clustering is obtained with the remaining connected components defined as clusters.

The main problem with MST clustering is how to define the inconsistent edges that are to be removed. Usually the inconsistency is defined using the edge lengths and standard deviations [7]. The edge length distribution may play a crucial role in the selection of the inconsistent edges; Duda et al. show an example where the edge length distribution of a minimal spanning tree is bimodal, and if all the edges of intermediate or long length are removed, a clustering of the dataset is achieved [2].

In the scale-free clustering method, a minimum spanning tree of the dataset is constructed in such a way that high connectivity is preferred, that is, the nodes which already have many edges are more likely to get more connections [3]. The resulting MST has a scale-free structure, and is thus called a scale-free minimum spanning tree (SFMST). In this study the modification of our SFMST construction method [8] was used. There is one control parameter which determines the exact structure of the SFMST, and thus by varying the value of this parameter, different SFMSTs can be obtained.

A clustering can be obtained from the SFMST by removing some edges. In this study, the removed edges are selected with the help of the edge length distribution. It is known that the average edge length in scale-free graphs depends logarithmically on the number of nodes, but the probability distribution function can take different forms. A lognormal distribution was fitted to the edge length data; a lognormal distribution [1] is a probability distribution related to the normal distribution in such a way that if x is a random variable distributed lognormally, then ln(x) is distributed normally. The number of bins in the histogram was automatically determined with the Freedman-Diaconis rule [12]. Now, if the edge length histogram is truncated at the point where it first reaches zero, thus ignoring the edges corresponding to the isolated bars at the rightmost end of the histogram, the resulting structure is a collection of subtrees and it can be taken as a clustering.
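
A small sketch of the truncation rule follows, assuming numpy's built-in Freedman-Diaconis binning is an acceptable stand-in for the procedure in the paper; the function name and the synthetic edge lengths are illustrative.

import numpy as np

def truncation_cutoff(edge_lengths):
    """Return the length above which edges are removed (first empty histogram bin)."""
    counts, bin_edges = np.histogram(edge_lengths, bins="fd")   # Freedman-Diaconis binning
    empty = np.nonzero(counts == 0)[0]
    if len(empty) == 0:
        return np.inf                      # histogram never reaches zero: keep all edges
    return bin_edges[empty[0]]             # left edge of the first empty bin

lengths = np.concatenate([np.random.lognormal(0.0, 0.3, 200), [8.0, 9.5]])
cutoff = truncation_cutoff(lengths)
kept = lengths[lengths < cutoff]           # edges beyond the cutoff are removed
print(cutoff, len(kept))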

Defining the Number of Clusters

Model-based cluster analysis has also been used as a clustering method which automatically finds the number of clusters present in the dataset. The basic idea behind model-based clustering is the assumption that the dataset is generated by a mixture of probability distributions in which each component represents a different cluster. In fact, in the k-means method, one is fitting Gaussian mixtures to the data [14]. Probability function estimation has also been proposed as a method to estimate the number of clusters without constructing the clustering itself [15].

Zhiqiang Xie et al. (04406228.pdf), aiming at the low efficiency of earlier MST (minimum spanning tree) clustering algorithms for gene expression data, put forward a modified IMST (improved minimum spanning tree) clustering algorithm applicable to the general problem. By eliminating the k − 1 longest edges from the MST, the tree is divided into k subtrees [6]. Obviously, if the edges between different clusters are longer than the edges within the same cluster, this algorithm is good enough. But when different clusters are connected by a series of short edges, or when there are noises or outliers, this simple method may not give the best clustering result. In order to determine automatically how many times a subtree should be divided, the algorithm can detect whether a new subtree consists of outliers, and it can obtain k clusters correctly by eliminating the outliers and increasing the effective number of dividing steps.

In the IMST algorithm, an element a is chosen from the sample database and all elements are divided into three parts, namely an upper bound data set, a lower bound data set and a middle set, according to definitions 1, 2 and 3. The distances between elements of the upper bound data set and elements of the lower bound data set are then not calculated; that is to say, there are no tree branches of the MST between elements of the upper bound data set and those of the lower bound data set, so the efficiency of constructing the MST is improved considerably. An element with the shortest distance to a is selected from the upper bound data set and from the lower bound data set, and each such element is made the root of the right subtree and the left subtree respectively; the remaining elements continue to be divided according to this recursive scheme until only the middle sets of the nodes remain. The nodes' middle sets are connected and adjusted with the algorithm described in part 3.3; they become nodes of the MST and finally the MST is complete. Depending on practical needs, the MST is divided into k + 1 subtrees after the k longest edges are deleted, using the clustering algorithm of eliminating the longest edges together with the IMST clustering algorithm. Matrices are then constructed for the corresponding subtrees. The node that has the maximal degree in the matrix of a subtree becomes the center of a cluster. According to the distance relationship between every node and the medoid, the classification of a subtree can be finished quickly, and so the clustering partition is complete.

Prasanta K. Jana and Azad Naik (Ref-1MST.pdf) proposed a validity index as follows.

B. Validity Index

Validity index is generally used to evaluate the clustering results quantitatively. In this paper we focus on the validity index, which is based on compactness and isolation. Compactness measures the internal cohesion among the data elements whereas isolation measures separation between the clusters [18]. We measure the compactness by Intra-cluster distance and separation by Inter-cluster distance, which are defined as follows.

Intra-cluster distance: this is the average distance of all the points within a cluster from the cluster centre, given by

Intra = (1/N) Σ_{i=1}^{k} Σ_{x ∈ C_i} d(x, z_i),    (1)

where N is the total number of points, C_i is the ith cluster and z_i is its centre.

Inter-cluster distance: this is the minimum of the pairwise distances between any two cluster centers, given by

Inter = min_{i ≠ j} d(z_i, z_j),  i, j = 1, 2, ..., k.    (2)

In the evaluation of our clustering algorithm, we use the validity index proposed by Ray and Turi [19], defined as the ratio

Validity = Intra / Inter.    (3)

Threshold value: this denotes the limit such that two points become disconnected if the distance between them is greater than this limit.

III. PROPOSED ALGORITHM

The basic idea of our proposed algorithm is as follows. We first construct the MST using Kruskal's algorithm and then set a threshold value and a step size. We then remove those edges from the MST whose lengths are greater than the threshold value. We next calculate the ratio between the intra-cluster distance and the inter-cluster distance using equation (3) and record the ratio as well as the threshold. We update the threshold value by incrementing it by the step size. Every time we obtain a new (updated) threshold value, we repeat the above procedure. We stop when the threshold value reaches its maximum, at which point no MST edges can be removed and all the data points belong to a single cluster. Finally, we find the minimum value of the recorded ratios and form the clusters corresponding to the stored threshold value. The algorithm has two extreme cases: 1) with a zero threshold value, every point becomes a separate cluster; 2) with the maximum threshold value, all the points lie within a single cluster. Therefore, the proposed algorithm searches for the optimum value of the threshold for which the intra-inter distance ratio is minimum; this optimum value must lie between the two extreme values of the threshold. However, in order to reduce the number of iterations, we never set the initial threshold value to zero.
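
The procedure can be sketched as the following Python loop, where the intra- and inter-cluster distances follow the definitions above; the initial threshold, the step size and all function names are illustrative choices of ours.

import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree, connected_components
from scipy.spatial.distance import pdist, squareform

def intra_inter_ratio(points, labels, n_clusters):
    """Validity index: average point-to-centre distance over minimum centre separation."""
    if n_clusters < 2:
        return np.inf
    centers = np.array([points[labels == i].mean(axis=0) for i in range(n_clusters)])
    intra = np.mean([np.linalg.norm(p - centers[l]) for p, l in zip(points, labels)])
    inter = min(np.linalg.norm(centers[i] - centers[j])
                for i in range(n_clusters) for j in range(i + 1, n_clusters))
    return intra / inter

def threshold_sweep(points, start=0.5, step=0.5):
    """Remove MST edges longer than a growing threshold; keep the best partition."""
    mst = minimum_spanning_tree(squareform(pdist(points))).toarray()
    rows, cols = np.nonzero(mst)
    weights = mst[rows, cols]
    best = (np.inf, None)
    thr = start
    while thr <= weights.max():               # stop once no edge could still be removed
        pruned = mst.copy()
        pruned[rows[weights > thr], cols[weights > thr]] = 0
        k, labels = connected_components(pruned, directed=False)
        ratio = intra_inter_ratio(points, labels, k)
        if ratio < best[0]:
            best = (ratio, labels)
        thr += step
    return best[1]

pts = np.vstack([np.random.randn(30, 2), np.random.randn(30, 2) + 8.0])
print(threshold_sweep(pts))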

Outlier Detection

Caiming Zhong, Xueming Lin and Ming Zhang (05193793.pdf) present a graph-cut based method to detect outliers in two phases. In the first phase, a two-round-MST based graph is constructed and a graph-cut criterion is defined to cut the graph. In the second phase, they define the outlier factor of clusters (OFC) and the outlier factor of an individual object within a cluster (OFintra), and identify outliers using the two factors. Even though the proposed method is graph-cut based, the k-nearest neighborhood is not used.

Jiang, Tseng and Su (Twophaseclu..pdf), in the paper Two-phase clustering for outlier detection, use a modified k-means (MKP) in the first phase to detect clusters and, in the second phase (OFP), build an MST from the centroids of the set C.

John Peter (ijcam5n5_6.pdf, 5167.pdf), in his paper An Efficient Algorithm for Local Outlier Detection Using Minimum Spanning Tree (LODMST), proposed a minimum spanning tree based clustering algorithm for detecting outliers without prior knowledge of the number of clusters. The algorithm partitions the dataset into an optimal number of clusters. Small clusters are then determined and considered as outliers. The rest of the outliers (if any) are then detected in the remaining clusters by temporarily removing an edge (the Euclidean distance between objects) from the data set and recalculating the weight function (MSTWF). If a noticeable change occurs in the weight function, then one of the points is considered an outlier, based on the degree number of the point. The algorithm uses a new cluster validation criterion based on the geometric property of the data partition of the dataset in order to find the proper number of clusters. The algorithm works in two phases: the first phase creates the optimal number of clusters, whereas the second phase detects outliers. The key feature of the algorithm is that it finds noise-free/error-free clusters for a given dataset without using any input parameters.

Our Local Outlier Detection using Minimum Spanning Tree based Clustering (LODMST) algorithm is based on a minimum spanning tree and does not require a predefined cluster number. The algorithm constructs an EMST of a point set and removes the inconsistent edges that satisfy the inconsistency measure. The process is repeated to create a hierarchy of clusters until the optimal number of clusters (regions) is obtained. Using the optimal number of clusters, outliers can easily be detected.

LODMST Clustering Algorithm

Given a point set S in Eⁿ, the hierarchical method starts by constructing a minimum spanning tree (MST) from the points in S. The weight of an edge in the tree is the Euclidean distance between the two end points, so we name this MST EMST1. Next, the average weight Ŵ of the edges in the entire EMST1 and its standard deviation σ are computed; any edge with W > Ŵ + σ, or the current longest edge, is removed from the tree. This leads to a set of disjoint subtrees ST = {T1, T2, ...}. Each of these subtrees Ti is treated as a cluster. We propose a new algorithm, the Local Outlier Detection Using Minimum Spanning Tree algorithm (LODMST), which does not require a predefined cluster number. The algorithm works in two phases. The first phase partitions EMST1 into subtrees (clusters/regions). The centers of the clusters or regions are identified using the eccentricity of points; these points are the representative points of the subtrees of ST. A point ci is assigned to cluster i if ci ∈ Ti. The group of center points is represented as C = {c1, c2, ..., ck}. These center points c1, c2, ..., ck are connected and a second minimum spanning tree, EMST2, is constructed, as shown in Figure 4. The Euclidean distance between a pair of clusters can be represented by a corresponding weighted edge. Our algorithm is also based on the minimum spanning tree but is not limited to two-dimensional points. The second phase of the algorithm determines the outliers. Following the definition of small clusters in [30], we define a small cluster as a cluster with fewer points than half the average number of points across the optimal number of clusters. We first detect small clusters (outliers) among the optimal number of clusters. To detect the outliers in the rest of the clusters (if any), we use a new approach based on a minimum spanning tree based weight function. For any undirected graph G, the degree of a vertex v, written deg(v), is equal to the number of edges in G which contain v, that is, which are incident on v. First we compute the value of the minimum spanning tree based weight function (MSTWF) for each cluster. Then we temporarily remove an edge (the Euclidean distance between objects) from the cluster and recalculate the weight function value. If the removal of an edge causes a noticeable decrease in the weight function value, then the object (point) connected with the edge is considered an outlier, based on its degree number; otherwise, it is not. Formally, we define the weight function of MST based clustering as

MSTWF_i = Σ_{i=1}^{|C|} Σ_{j=1}^{|E|} W_ij(e)        (1)

where |C| and |E| are the optimal number of clusters and the number of edges in the cluster (subtree), respectively, and W_ij(e) is the weight of the jth edge of the ith cluster. The weight function represents the (Euclidean) sum of distances between the objects (points) in the cluster produced by the clustering algorithm. When scanning the cluster (MST), the edges are ordered from smaller to larger lengths. Then we define the threshold as

THR = max(L_i − L_{i−1}) × t        (2)

where L_i is the largest length in the order and t ∈ [0, 1] is a user-defined parameter. Let MSTWF_i be the weight function produced by the clustering algorithm for each cluster (subtree), and let MSTWF_ij be the weight function produced by the algorithm after removing the edge with weight W_ij from the cluster (subtree). Subtracting MSTWF_ij from MSTWF_i gives the difference between the two values, expressed as

MSTDWF_i = MSTWF_i − MSTWF_ij        (3)
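
Read literally, equations (1)-(3) for a single cluster reduce to the short sketch below; the tie between a flagged edge and the particular outlier point (chosen via the degree number) is only indicated in the text and is therefore not coded here, and the value t = 0.5, like all function names, is an arbitrary illustration.

import numpy as np

def mst_weight_function(edge_weights):
    """MSTWF for one cluster: the sum of its MST edge weights."""
    return float(np.sum(edge_weights))

def threshold(edge_weights, t=0.5):
    """THR = max gap between consecutive sorted edge lengths, times t."""
    ordered = np.sort(edge_weights)              # smaller to larger lengths
    return float(np.max(np.diff(ordered)) * t)

def noticeable_edges(edge_weights, t=0.5):
    """Indices of edges whose temporary removal drops MSTWF by more than THR."""
    wf = mst_weight_function(edge_weights)
    thr = threshold(edge_weights, t)
    flagged = []
    for j, w in enumerate(edge_weights):
        wf_without = wf - w                       # MSTWF_ij after removing edge j
        if wf - wf_without > thr:                 # MSTDWF_i compared against THR
            flagged.append(j)
    return flagged

print(noticeable_edges([1.0, 1.2, 0.9, 1.1, 6.0]))   # the unusually long edge stands out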

Here, we use a cluster validation criterion based on the geometric characteristics of the clusters, in which only the inter-cluster metric is used. The LODMST algorithm is a nearest-centroid-based clustering algorithm, which creates subtrees (clusters/regions) of the data space. The algorithm partitions a set S of data in the data space D into n regions (clusters). Each region is represented by a centroid reference vector. If we let p be the centroid representing a region (cluster), all data within the region (cluster) are closer to the centroid p of the region than to any other centroid q:

R(p) = {x ∈ D | dist(x, p) ≤ dist(x, q) ∀q}        (4)

Thus, the problem of finding the proper number of clusters of a dataset can be transformed into the problem of finding the proper regions (clusters) of the dataset. Here, we use the MST as a criterion to test the inter-cluster property. Based on this observation, we use a cluster validation criterion called Cluster Separation (CS) in the LODMST algorithm [10].

Cluster separation (CS) is defined as the ratio between the minimum and the maximum edge of the MST, i.e.,

CS = E_min / E_max        (5)

where E_max is the maximum-length edge of the MST, which represents the two centroids that are at maximum separation, and E_min is the minimum-length edge in the MST, which represents the two centroids that are nearest to each other. The CS thus represents the relative separation of centroids. The value of CS ranges from 0 to 1. A low value of CS means that two centroids are too close to each other and the corresponding partition is not valid; a high CS value means the partitions of the data are even and valid. In practice, we predefine a threshold to test the CS. If the CS is greater than the threshold, the partition of the dataset is valid, and the data set is partitioned again by creating a further subtree (cluster/region). This process continues until the CS is smaller than the threshold; at that point, the proper number of clusters is the current number of clusters minus one. The CS criterion finds the proper binary relationship among clusters in the data space. The setting of the threshold for the CS is practical and depends on the dataset: the higher the value of the threshold, the smaller the number of clusters will be. Generally, the value of the threshold will be > 0.8 [10]. Figure 3 shows the CS value versus the number of clusters in hierarchical clustering; there the CS value falls below 0.8 when the number of clusters is 5, so the proper number of clusters for that data set is 4. Furthermore, the computational cost of CS is low because the number of subclusters is small. This makes the CS criterion practical for the LODMST algorithm when it is used for clustering large datasets to detect outliers.
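
Equation (5) translates directly into code; the example centroids and the 0.8 threshold check below are illustrative (the threshold value follows the text).

import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform

def cluster_separation(centroids):
    """CS = E_min / E_max over the MST built on the cluster centroids."""
    mst = minimum_spanning_tree(squareform(pdist(centroids))).toarray()
    edges = mst[np.nonzero(mst)]
    return edges.min() / edges.max()

centroids = np.array([[0.0, 0.0], [5.0, 0.0], [0.0, 6.0], [5.0, 6.0]])
valid = cluster_separation(centroids) > 0.8   # keep splitting while CS exceeds the threshold
print(cluster_separation(centroids), valid)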

John Peter (john.pdf), in Meta Similarity Fine Clusters Using Dynamic Minimum Spanning Tree with Self-Detection of Best Number of Clusters, proposed the Dynamically Growing Minimum Spanning Tree for Fine Meta Clustering (DGEMSTFMC) algorithm, which works in three phases. The first phase of the algorithm creates rough clusters by removing outliers from the data set with guaranteed intra-cluster similarity, whereas the second phase removes local outliers from the rough clusters. The third phase creates a dendrogram using the fine clusters as objects with guaranteed inter-cluster similarity. The first phase uses a divisive approach, whereas the second phase uses an agglomerative approach; both approaches are combined in the algorithm to find the best meta-similarity fine clusters.

DGEMSTFMC Clustering Algorithm

Given a point set S in Eⁿ, the hierarchical method starts by constructing a minimum spanning tree (MST) from the points in S. The weight of an edge in the tree is the Euclidean distance between the two end points, so this MST is named EMST1. Next, the average weight Ŵ of the edges in the entire EMST1 and its standard deviation σ are computed; any edge with W > Ŵ + σ, or the current longest edge, is removed from the tree. This leads to a set of disjoint subtrees ST = {T1, T2, ...} (divisive approach). Each of these subtrees Ti is treated as a cluster. Oleksandr Grygorash et al. proposed a minimum spanning tree based clustering algorithm [35] which generates k clusters, and a previous algorithm [27] generates k clusters with centers, which are used to produce meta-similarity clusters. Both of these minimum spanning tree based algorithms assume the desired number of clusters in advance. In practice, determining the number of clusters is often coupled with discovering the cluster structure. Hence a new algorithm is proposed, the Dynamically Growing Minimum Spanning Tree for Fine Meta Clustering (DGEMSTFMC) algorithm, which does not require a predefined cluster number. The algorithm works in three phases. The first phase creates a rough clustering by partitioning EMST1 into subtrees (clusters/regions). The centers of the clusters or regions are identified using the eccentricity of points; these points are the representative points of the subtrees of ST. A point ci is assigned to cluster i if ci ∈ Ti. The group of center points is represented as C = {c1, c2, ..., ck}. These center points c1, c2, ..., ck are connected and a second minimum spanning tree, EMST2, is constructed, as shown in Fig. 4. The Euclidean distance between a pair of clusters can be represented by a corresponding weighted edge. The algorithm is also based on the minimum spanning tree but is not limited to two-dimensional points. There are two kinds of clustering problems: one minimizes the maximum intra-cluster distance and the other maximizes the minimum inter-cluster distance. This algorithm produces clusters with both intra-cluster and inter-cluster similarity. The second phase of the algorithm finds a fine clustering by removing local outliers from the rough clusters. The third phase constructs a dendrogram, which can be used to interpret the inter-cluster distances.

This new algorithm is neither a single link clustering algorithm (SLCA) nor a complete link clustering algorithm (CLCA) type of hierarchical clustering, but is based on the distance between the centers of clusters. This approach leads to new developments in hierarchical clustering. The level function, L, records the proximity at which each clustering is formed. The levels in the dendrogram indicate the least amount of similarity by which points in different clusters differ. This piece of information can be very useful in several medical and image processing applications. To detect the outliers in the clusters, the degree number of points in the clusters is used: for any undirected graph G, the degree of a vertex v, written deg(v), is equal to the number of edges in G which contain v, that is, which are incident on v [13].

S. John Peter (20100238.pdf), in A Novel Algorithm for Meta Similarity Clusters Using Minimum Spanning Tree (CEMST), proposes two minimum spanning tree based clustering algorithms. The first algorithm produces k clusters with centers and guaranteed intra-cluster similarity. The second algorithm creates a dendrogram using the k clusters as objects with guaranteed inter-cluster similarity. The first algorithm uses a divisive approach, whereas the second uses an agglomerative approach; both approaches are used to find meta-similarity clusters.

1. CEMST Clustering Algorithm

Given a point set S in Eⁿ and the desired number of clusters k, the hierarchical method starts by constructing an MST from the points in S. The weight of an edge in the tree is the Euclidean distance between the two end points. Next, the average weight Ŵ of the edges in the entire EMST and its standard deviation σ are computed; any edge with W > Ŵ + σ, or the current longest edge, is removed from the tree. This leads to a set of disjoint subtrees ST = {T1, T2, ...} (divisive approach). Each of these subtrees Ti is treated as a cluster. Oleksandr Grygorash et al. proposed an algorithm [14] which generates k clusters. We modified the algorithm in order to generate k clusters with centers, and hence we named the new algorithm the Center Euclidean Minimum Spanning Tree (CEMST) algorithm. Each center point of the k clusters is a representative point for its subtree of ST. A point ci is assigned to cluster i if ci ∈ Ti. The group of center points is represented as S = {c1, c2, ..., ck}.

2. EMSTU Clustering Algorithm

The result of the CEMST algorithm consists of k clusters with their centers. These center points c1, c2, ..., ck are connected and a minimum spanning tree is again constructed, as shown in Figure 3. The Euclidean distance between a pair of clusters can be represented by a corresponding weighted edge. Our algorithm is also based on the minimum spanning tree but is not limited to two-dimensional points. There are two kinds of clustering problems: one minimizes the maximum intra-cluster distance and the other maximizes the minimum inter-cluster distance. Our algorithms produce clusters with both intra-cluster and inter-cluster similarity. We propose the Euclidean Minimum Spanning Tree Updation (EMSTU) algorithm, which converts the minimum spanning tree into a dendrogram that can be used to interpret the inter-cluster distances. This new algorithm is neither a single link clustering algorithm (SLCA) nor a complete link clustering algorithm (CLCA) type of hierarchical clustering, but is based on the distance between the centers of clusters. This approach leads to new developments in hierarchical clustering. The level function, L, records the proximity at which each clustering is formed. The levels in the dendrogram indicate the least amount of similarity by which points in different clusters differ. This piece of information can be very useful in several medical and image processing applications.

P. Murugavel and M. Punithavalli (IJCSE11-03-01-157[1].pdf): in this paper, three partition-based algorithms, PAM, CLARA and CLARANS, are combined with k-medoid distance based outlier detection to improve the outlier detection and removal process.

Svetlana Cherednichenko (2005 MSc .Pdf), Outlier Detection in Clustering: this thesis presents a theoretical overview of outlier detection approaches. A novel outlier detection method, called the Clustering Outlier Removal (COR) algorithm, is proposed and analyzed. It provides efficient outlier detection and data clustering capabilities in the presence of outliers, and is based on filtering the data after the clustering process. The method is divided into two stages. The first stage performs the k-means process. The main objective of the second stage is an iterative removal of objects which are far away from their cluster centroids; the removal occurs according to a chosen threshold.

Moh'd Belal Al-Zoubi (ejsr_28_2_16.pdf), in An Effective Clustering-Based Approach for Outlier Detection, presents a method for outlier detection based on clustering approaches. We first perform the PAM clustering algorithm. Small clusters are then determined and considered as outlier clusters. The rest of the outliers (if any) are then detected in the remaining clusters by calculating the absolute distances between the medoid of the current cluster and each of the points in the same cluster.

The basic structure of the proposed method is as follows (a small code sketch of Steps 3 and 4 is given after the listing):

Step 1. Perform PAM clustering algorithm to produce a set of k clusters.

Step 2. Determine small clusters and consider the points (objects) that belong to these clusters as outliers.

For the rest of the clusters (not determined in Step 2)

Begin

Step 3. For each cluster j, compute the ADMPj and Tj values.

Step 4. For each point i in cluster j,

if ADMPij > Tj then classify point i as an outlier; otherwise not.

End
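
A sketch of Steps 3 and 4 for a single cluster is given below. ADMP_ij is taken here to be the Euclidean distance between point i and the medoid of cluster j, and since the excerpt does not give the rule for computing T_j, the threshold is assumed, purely for illustration, to be 1.5 times the mean ADMP of the cluster; all function names are ours.

import numpy as np

def medoid(cluster):
    """Point of the cluster minimizing the total distance to all other points."""
    d = np.linalg.norm(cluster[:, None, :] - cluster[None, :, :], axis=-1)
    return cluster[np.argmin(d.sum(axis=1))]

def admp_outliers(cluster, factor=1.5):
    """Indices of points whose ADMP exceeds the (assumed) threshold T_j."""
    m = medoid(cluster)
    admp = np.linalg.norm(cluster - m, axis=1)     # ADMP_ij for every point i
    t = factor * admp.mean()                       # assumed form of T_j (illustrative)
    return np.nonzero(admp > t)[0]

cluster = np.vstack([np.random.randn(40, 2), [[9.0, 9.0]]])
print(admp_outliers(cluster))                      # the far-away point is reported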