A cluster is a group of data objects that are similar to one another and dissimilar to the data objects in other clusters. Data clustering is a young scientific discipline under vigorous development. A large number of research papers are scattered across conference proceedings and periodicals, mostly in the fields of data mining, statistics, machine learning, spatial databases, biology, and marketing, with different emphases and different techniques. Owing to the large amounts of data collected from different sources, cluster analysis has recently become a highly active topic in data mining research.
Partitioning clustering is efficient and conceptually simple, but it often fails to unveil the natural cluster structure of complex hierarchical data. Hierarchical clustering produces output in the form of a hierarchical structure, which is more informative than the unstructured set of clusters formed by partitioning clustering. During the last decades, hierarchical clustering has become very popular in various scientific disciplines, such as molecular biology, medicine, and economics. However, well-known hierarchical clustering algorithms such as Single Link often fail to detect the true clusters in real data sets. The dendrogram generated by Single Link is usually very complex and difficult to interpret. Moreover, the performance of many hierarchical clustering algorithms is very sensitive to noise and outliers. How can we find a meaningful hierarchical cluster representation of a given data set? In this paper, we present a GUI-based tool that enhances clustering by synchronization with a hierarchical clustering algorithm in order to identify meaningful levels of the cluster hierarchy that correspond to high-quality clusters. For effective clustering, we use the synchronization algorithm Sync combined with the Minimum Description Length (MDL) principle. The key idea is to regard each data object as a coupled phase oscillator that interacts dynamically with similar objects on different levels. Hierarchical clustering inspired by synchronization proceeds in the following stages. Starting from the initial conditions, each object runs independently with its own phase. As time evolves, the objects in the densest regions synchronize with a small interaction range and form local clusters. Then, in a sequential process, more and more objects synchronize, and clusters are produced with a larger interaction range. Finally, the whole population synchronizes and shares a common phase.
Thus, with different interaction ranges, the dynamical process of synchronization reveals the whole structure of the data set on all scales, from the micro-scale at an early stage up to the macro-scale. At each scale, outliers are detected effectively because they behave differently and hardly synchronize with any of the cluster objects. The principle of synchronization thus allows a natural clustering of complex multi-scale data sets with outliers to be detected.
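The dynamics described above can be illustrated with a small sketch. It assumes the Kuramoto-style update commonly used in synchronization-based clustering, in which each object moves, per dimension, by the mean of sin(y - x) over all objects y within the interaction range (the object itself included). The names `sync_step`, `synchronize`, `eps`, `max_steps`, and `tol` are illustrative assumptions and are not taken from the tool itself.

```python
import math

def sync_step(points, eps):
    """One synchronous update: each object moves toward its eps-neighbourhood."""
    new_points = []
    for p in points:
        # Assumption: the neighbourhood contains every object within Euclidean
        # distance eps, including p itself, so it is never empty.
        nbrs = [q for q in points if math.dist(p, q) <= eps]
        new_points.append([
            x + sum(math.sin(q[d] - x) for q in nbrs) / len(nbrs)
            for d, x in enumerate(p)
        ])
    return new_points

def synchronize(points, eps, max_steps=50, tol=1e-6):
    """Repeat the update until no object moves by more than tol."""
    for _ in range(max_steps):
        nxt = sync_step(points, eps)
        shift = max(math.dist(a, b) for a, b in zip(points, nxt))
        points = nxt
        if shift < tol:
            break
    return points

data = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [2.0, 2.0], [2.1, 2.0], [9.0, 9.0]]
result = synchronize(data, eps=0.5)
```

In this toy run the first three objects collapse onto a common position, the next two form a second group, and the isolated object at (9, 9) has no neighbours other than itself and therefore never moves, which is how an outlier shows up.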
The rest of the paper is organized as follows. Section 2 reviews related work on clustering. Section 3 describes the GUI-based tool in detail. Section 4 presents the experimental results. Section 5 draws conclusions.
2 RELATED WORK
Hierarchical clustering algorithms decompose a data set into several levels of partitions, represented by a dendrogram. One of the best-known hierarchical clustering approaches is Single-Link. Initially, every data object is placed in its own cluster; in every step, the two closest clusters are merged until all objects belong to one cluster. Several alternative merging criteria have been proposed, such as Average-Link and Complete-Link (for a detailed survey see ). The hierarchy obtained from the merging order is visualized as a dendrogram. For a real data set, the dendrogram is often very complex. If a data set has N objects, the generated dendrogram contains N - 1 layers, and it is therefore difficult to find optimal splitting levels that correspond to meaningful clusters. Outliers may also cause the so-called single-link effect, in which two clusters are difficult to separate because a chain of objects connects them. The CURE technique uses several representative points to evaluate the similarity between clusters, which allows it to form clusters of arbitrary shape and avoid the single-link effect. However, for a given data set, it is still difficult to define appropriate splitting levels that correspond to meaningful clusters. Furthermore, the dendrogram is created to record the clustering process and does not display the true hierarchical structure of a data set. The well-known OPTICS algorithm analyzes hierarchical data from the perspective of density. It provides the reachability plot as a more intuitive and transparent way to visualize the hierarchical cluster structure of large data sets. However, for many real data sets, the reachability plot is very smooth and does not reveal the hierarchical clusters. In combination with MDL, the hierarchical synchronization algorithm generates an interpretable cluster tree that consists only of meaningful levels, each representing a clustering of high quality.
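To make the merging procedure concrete, the following is a minimal sketch of naive Single-Link agglomerative clustering. The function name `single_link_merges` and the toy points are hypothetical, and the cubic-time search is chosen for readability rather than efficiency.

```python
import math

def single_link_merges(points):
    """Return the merge history of naive single-link agglomerative clustering."""
    clusters = [[i] for i in range(len(points))]  # every object starts alone
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single-link distance: the closest pair across two clusters.
                d = min(math.dist(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((sorted(clusters[a]), sorted(clusters[b]), d))
        clusters[a] = clusters[a] + clusters[b]  # merge b into a
        del clusters[b]                          # b > a, so index a stays valid
    return merges

toy_points = [(0, 0), (0, 1), (5, 5), (5, 6), (20, 20)]
history = single_link_merges(toy_points)
```

For N objects the history always has N - 1 entries, one per dendrogram layer, which is why choosing meaningful splitting levels by inspection becomes impractical for large N.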
Besides the cluster tree, the output of hierarchical synchronization includes the locality-quality diagram, which allows the user to assess the quality of the cluster hierarchy comprehensively over all levels. The figure below shows the generic framework of hierarchical clustering.
The partitioning clustering algorithm Sync starts the dynamical interaction among objects with a small value of the interaction range ε and then increases it stepwise until all objects synchronize into a single cluster. The Minimum Description Length (MDL) principle is used to find the best cluster structure: whenever the clusters are a good representation of the data structure, they can be used for efficient coding (or compression) of the data set, which results in a minimal MDL value. The MDL principle can also be applied to hierarchical data analysis by considering not only the global minimum of the MDL value but all stable local minima. The key observation is that if a data set exhibits a hierarchical cluster structure, the MDL values show several distinct stable local minima.
Specifically, when the interaction range ε starts with a small value, the objects in the densest regions synchronize and are regarded as a cluster. If the synchronized objects form reasonable clusters that reflect the data structure at the micro-scale, the coding cost of these clusters yields a relatively low local MDL value. As ε is increased with step size Δε, a hierarchical structure in the data causes a whole interval of interaction ranges to produce similar clustering results, and thus an interval of stable MDL values. Then, in a sequential process, as ε increases further, more and more objects of lower local density tend to synchronize. Likewise, if these newly synchronized clusters indicate a meaningful level of the hierarchical structure of the data, they produce a new interval of relatively low and stable MDL values over a range of ε. Finally, all objects may merge once the interaction range ε is large enough.
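The search for intervals of stable MDL values can be sketched as follows. The snippet assumes that a coding cost has already been computed for every interaction range in the sweep and only groups consecutive ranges whose costs barely change. The names `stable_plateaus`, `rel_tol`, and `min_len`, as well as the cost values in the usage example, are hypothetical and serve only to illustrate the idea.

```python
def stable_plateaus(eps_values, mdl_values, rel_tol=0.02, min_len=3):
    """Return (eps_start, eps_end, mean_cost) for runs of nearly constant cost."""
    plateaus = []
    start = 0
    for i in range(1, len(mdl_values) + 1):
        # Close the current run at the end of the sweep or when the cost jumps
        # by more than rel_tol relative to the previous value.
        ended = i == len(mdl_values) or (
            abs(mdl_values[i] - mdl_values[i - 1])
            > rel_tol * abs(mdl_values[i - 1])
        )
        if ended:
            if i - start >= min_len:
                run = mdl_values[start:i]
                plateaus.append(
                    (eps_values[start], eps_values[i - 1], sum(run) / len(run))
                )
            start = i
    return plateaus

# Made-up sweep: two stable intervals separated by jumps in the coding cost.
eps_sweep = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
mdl_sweep = [900, 520, 515, 518, 700, 410, 408, 409, 411, 650]
levels = stable_plateaus(eps_sweep, mdl_sweep)
```

Each returned interval is a candidate level of the cluster hierarchy; the thresholds would need tuning for real coding costs.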
The Kuramoto model plays a very important role in the concept of synchronization. Kuramoto's theory of the synchronization transition of globally coupled phase oscillators has been extended to populations in which each oscillator has a different coupling strength. Beyond the transition, even oscillators with very small couplings may participate in the synchronized ensemble, provided that their natural frequencies are close enough to the synchronization frequency. In finite systems, numerical realizations reveal that the transition is preceded by a regime of clustering in which the population splits into internally synchronized groups of various sizes.
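As an illustration, the sketch below integrates the standard Kuramoto equation d(theta_i)/dt = w_i + (K_i / N) * sum_j sin(theta_j - theta_i) with a per-oscillator coupling K_i, and measures coherence with the usual order parameter r. The explicit Euler scheme, the step size, and the function names are assumptions made for this example only.

```python
import cmath
import math

def order_parameter(phases):
    """Magnitude r of the mean phase vector: 0 = incoherent, 1 = synchronized."""
    return abs(sum(cmath.exp(1j * th) for th in phases) / len(phases))

def kuramoto_step(phases, freqs, couplings, dt):
    """One explicit Euler step of the Kuramoto equation with couplings K_i."""
    n = len(phases)
    return [
        th + dt * (w + k / n * sum(math.sin(other - th) for other in phases))
        for th, w, k in zip(phases, freqs, couplings)
    ]

def simulate(phases, freqs, couplings, dt=0.05, steps=400):
    """Integrate the model for steps * dt time units and return the phases."""
    for _ in range(steps):
        phases = kuramoto_step(phases, freqs, couplings, dt)
    return phases

# Ten oscillators with identical natural frequency and spread-out start phases.
start = [0.3 * i for i in range(10)]
coupled = simulate(start, [1.0] * 10, [2.0] * 10)
uncoupled = simulate(start, [1.0] * 10, [0.0] * 10)
```

With a sufficiently strong coupling the order parameter of `coupled` approaches 1, whereas without coupling the phases merely rotate and the coherence of `uncoupled` stays at its initial value.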
3 A GUI-BASED TOOL
In this section we describe in detail how the GUI tool works, using the synchronization concept and a hierarchical clustering algorithm. The figure below shows the synchronized clusters produced by the GUI tool.
Step 1: The large data set is divided into a number of clusters according to the given value of k.
Step 2: Following the Kuramoto model, the objects in the clusters initially run independently.
Step 3: If the coupling strength S between two objects is S = 0, those two objects synchronize together.
Step 4: Finally, the Kuramoto model is applied through the Sync algorithm to form the various synchronized clusters.
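The last step, reading off the synchronized clusters, might look like the following sketch. It assumes that the objects have already been moved by the synchronization dynamics and that objects whose final positions coincide up to a small tolerance belong to the same cluster. The names `extract_clusters` and `tol` are hypothetical, and the text does not specify how the tool itself performs this step.

```python
import math

def extract_clusters(final_points, tol=1e-3):
    """Group object indices whose synchronized positions coincide within tol."""
    representatives = []  # one position per discovered cluster
    clusters = []         # object indices per cluster
    for idx, p in enumerate(final_points):
        for c, rep in enumerate(representatives):
            if math.dist(p, rep) <= tol:
                clusters[c].append(idx)
                break
        else:
            # No existing cluster is close enough: open a new one.
            representatives.append(p)
            clusters.append([idx])
    return clusters

# Hypothetical final positions after synchronization.
final_positions = [(0.03, 0.03), (0.03, 0.0301), (2.05, 2.0), (9.0, 9.0), (2.05, 2.0002)]
groups = extract_clusters(final_positions)
```

Groups that contain a single object can be reported as outliers, since such objects did not synchronize with any other object.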
4 EXPERIMENTAL WORK
All experiments were run on a 2.50 GHz Intel Core i5 machine with 4 GB of main memory running the Windows 7 operating system. We implemented the GUI tool in Java. The experimental results below show that clustering by synchronization using a hierarchical clustering algorithm gives more accurate synchronized clusters than clustering by synchronization with the Kuramoto model.
We evaluated the GUI tool on the Tic-Tac-Toe Endgame data set, which contains the set of possible board configurations at the end of a game. It includes 958 instances (legal tic-tac-toe endgame boards) and 9 attributes, each corresponding to one tic-tac-toe square.
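Because the nine attributes are categorical, they need a numeric encoding before a distance-based method can be applied. The sketch below assumes the comma-separated layout of the UCI Tic-Tac-Toe Endgame file, in which each square is given as x, o, or b (blank) and is followed by a class label. The one-hot encoding and the name `encode_board` are illustrative choices and not necessarily what the tool uses.

```python
SYMBOLS = ("x", "o", "b")  # player x, player o, blank square

def encode_board(line):
    """Turn one comma-separated record into a 27-dimensional one-hot vector and its label."""
    *squares, label = line.strip().split(",")
    if len(squares) != 9:
        raise ValueError("expected 9 board squares, got %d" % len(squares))
    vector = []
    for s in squares:
        # Three indicator values per square, in the order given by SYMBOLS.
        vector.extend(1.0 if s == sym else 0.0 for sym in SYMBOLS)
    return vector, label

vec, label = encode_board("x,x,x,x,o,o,x,o,o,positive\n")
```

With this encoding, two boards that differ in one square always differ in exactly two vector components, so every mismatch contributes equally to the Euclidean distance.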
5 CONCLUSION AND FUTURE WORK
In this paper, we presented a GUI-based tool for clustering by synchronization that uses a hierarchical clustering algorithm and the MDL principle to obtain accurate synchronized clusters. We evaluated the tool on a real data set and compared its performance with clustering by synchronization using the Kuramoto model. We conclude that the GUI-based tool with the hierarchical clustering algorithm is more accurate and more time-efficient than clustering by synchronization using the Kuramoto model. Future work will focus on exploiting the powerful concept of synchronization for subspace clustering. In addition, we will investigate data visualization techniques based on the simulated object movement. As a long-term goal, we want to integrate simulation closely into the data mining process in order to design robust algorithms.