This essay has been submitted by a student. This is not an example of the work written by our professional essay writers.
Abstract- Cluster analysis, primitive exploration with little or no prior knowledge, consists of research developed across a wide variety of communities. Most clustering techniques ignore the fact about the different size or levels - where in most cases, clustering is more concern with grouping similar objects or samples together ignoring the fact that even though they are similar, they might be of different levels. For really large data sets, data reduction should be performed prior to applying the data-mining techniques which is usually performing dimension reduction, and the main question is whether some of these prepared and preprocessed data can be discarded without sacrificing the quality of results. Existing clustering techniques would normally merge small clusters with big ones, removing its identity. In this study we propose a method which uses divide and conquer technique to improve the performance of the K-Means clustering method.
Keywords: Clustering, K-Means, Euclidean Space, Object Similarity, High Dimensional Data
Inmost of the diagnostic and therapeutic applications of the image guided intervention, vessel extraction is an important prerequisite step for quantitative analysis of the vessel tree. During the intervention procedure cardiologist have to rely on 2D fluoroscopic images which are acquired in real time. Inaddition to common difficulties of fluoroscopic images such as low signal to noise ratio and improper illumination, the need for reliable processing of a large amount of data in a limited computation time makes the real-time segmentation of vascular structures a challenging task. Althoughmany techniques exist for vessel segmentation in the literature, only a few methods are suitable to meet intra-operative and real-time constraints. Generally, there are two schemes described in the literature for extracting vessel structures from angiograms 1) segmenting a complete vascular structure [ ] and 2) directly explore the boundaries or centerline using an exploratory tracing algorithm [ ]. The first scheme consists of a wide range of vessel extraction techniques in which the primary goal is to identify the image pixels located on the area covered by vessels. It requires extensive processing of image pixels and generally relies on adaptive segmentation, followed by extracting blood vessel centerlines and branch point analysis. Typically, this scheme requires much time and computational resources to complete the task, scales poorly with image size, and does not adapt to partially provide useful results [18,19,20,21 az article optimal sche..,maghaleh hassan, ]. The second approach, employed by this paper and several others [Can et al.],[?], is referred to as vessel tracking, vectorial tracking, or tracing. These methods usually avoid preprocessing steps which require processing of every image pixel such as enhancement filters, edge detection and morphological operators, and try to execute as few numerical operations as possible. In tracing methods the first step is to collect a number of initial points (seed points) by exploring the image along a grid of pixel-wide lines (Fig.1), identifying the pixels with local gray-level minima. Then by utilizing a set of verification rules, local properties of the initial points are examined to test the validity of each seed point. In the third step other pixels on the region of interest are explored recursively .
A large number of related studies have been performed on the exploratory tracing of the vessels in both angiograms , ophthalmic X-ray images [Can et al,...]. In the literature, previous techniques can be classified into automatic and semi-automatic techniques. Semi-automatic techniques are commonly used in quantitative coronary analysis (QCA) and require manual user inputs such as initial and end point or direction and width of the vessels to accomplish the segmentation task.
In this present paper, we intend to propose a method for
We focus Since the 2D
Separation of vessel from backgraound can be non-trivial
Similar samples will have small value for the cosine; i.e. the angle between their vectors will be small.
Many studies have been done in clustering are more focused on the distance but not the level or size of the samples. Samples might be of the same group but different level - for example fisherman and fishing company. In human related data sets it could be educational level, level of income, wealth and so on. Fig.1 shows two similar objects based on cosine similarity but they are of different in terms of level/size.
There is a lack of studies in large datasets with high dimensionality. Most studies have been done with high dimensional data are proposed the number of dimensions. However, they suffer from the inaccuracy and unreliability of the result. By working with smaller subspaces, we can use Euclidean distance without worrying about the complexity and system resources. Thus we are able to improve the existing clustering techniques in terms of efficiency and accuracy.
Our proposed method is not to reduce dimension but to select subspaces by clustering and perform clustering based on these subspaces. We expect that it would be able to cluster samples of similar size and level. This method can be more efficient and accurate than a single one pass clustering. Some hypothesis has been considered:
- H1 - Proposed method would be able to group samples of similar size and find similarity among them.
- H2 - Proposed method is faster than single step clustering due to use of divide and conquers technique.
- H3 - Proposed method is more accurate than a single step clustering.
- H4 - Proposed method would allow Euclidean distance to be used in high dimensional data.
In this study we assume that the space is orthogonal and dimensions for all objects are the same and finally we use ordinal data type because of the application. Next section describes related works and in continue framework and methodology will be explained. Finally results of experiments is presented and discussed about it.
II. RELATED WORKS
There is a general categorization for high dimensional data set clustering:
1-Dimension reduction, 2-Parismonious models, 3-Subspace clustering . Feature selection and feature extraction are most popular techniques in dimension reduction. It is clear that in both methods we will have losing information which naturally affects accuracy. Ref.  reviewed the literature on parsimonious models and Gaussian models from the most complex to simplest which yields a method similar to the K-Means approach. When we have low dimensional spaces these methods aren't able to work well. there are two main approaches for subspaces methods first class centers are considered on a same unknown subspace, second each class is located on specific subspace . The idea of subtopics or subgroups is appropriate for document clustering and text mining . Tensor factorization as a powerful technique has been used in . Inconsistency has been shown in those public data sets because of outlier.
Ref.  proposed a framework which integrates subspace selection and clustering. Equivalency between kernel K-Means clustering and iterative subspace selection has been shown. Ref.  proposed a general clustering improver scheme which is included two main steps. In first step it uses intrinsic properties of the data set for dimension reduction after that several iteration of a clustering algorithms are applied, each with different parameter. Base on BIC criterion the best result will be selected. There are some weaknesses for this method e.g. since BIC fits a model to specific data distribution it can not be used to compare models of different data sets. ref.  presented a semi-supervised clustering method base on spherical K-Means via feature projection which is tailored for handling sparse high dimensional data. They first formulated constraint-guided feature projection then applied the constraint spherical K-Means algorithm to cluster data with reduced dimension.
Ref.  proposed two methods of combining objective function clustering and graph theory clustering.
- Method 1: incorporates multiple criteria into an objective function according to their importance, and solves this problem with constrained nonlinear optimization programming.
- Method 2: consists of two sequential procedures:
- A traditional objective function Clustering for generating the initial result
- An auto associative additive system based on graph theory clustering for modifying the initial result.
Ref.  Proposed sequential combination methods for data clustering
- In improving clustering performance they proposed the use of more than one clustering method.
- They investigated the use of sequential combination clustering as opposed to simultaneous combination and found that sequential combination is less complex and there are improvements without the overhead cost of simultaneous clustering.
In clustering points lying in high dimensional spaces, formulating a desirable measure of "similarity" is more problematic. Recent research shows that for high dimensional spaces computing the distance by looking at all the dimensions is often useless, as the farthest neighbor of a point is expected to be almost as close as its nearest neighbor .
To compute clusters in different lower dimensional subspaces, recent work has focused on projective clustering, defined as follows: given a set P of points in Rd and an integer K, partition into subsets that best classify into lower dimensional subspaces according to some objective function. Instead of projecting all the points in the same subspace, this allows each cluster to have a different subspace associated with it . Proposed model will be more effective and achieves significant performance improvement over traditional method for most clustering. It will be able to cluster samples of similar size and level and be more efficient and accurate than a single one pass clustering. there are some delimitations for this method. First space should be orthogonal it means there is no correlation among attributes of an object. Second base on application that is used all attributes in an object have the same kind of data types. Without losing generality it is possible to extend proposed method to other kinds of data types. Samples are considered in a same number of dimensions.
In this paper we present K-means divide and conquer clustering by select subspaces by clustering and perform clustering based on these subspaces.
III. RESEARCH MODEL
In this study, our framework includes five main steps as shown in fig.2. After preprocessing and normalization we apply for each object to calculate the object's length. After calculating the object's length, we will cluster the objects base on their length and in final step we separately cluster each group based on their features.
- The data has to be cleaned before we do the mining. Since the data is obtained from a survey in a Persian website, some of the labels need to be renamed.
- There are also values which are outside the range. The range is supposed to be between 1 to 5.
- There are also missing or unfilled values
- If it is less than 1, set to 1.
- If it is greater than 5, set to 5.
- Ignore the object by removing it.
- Assign 1 to cell.
The data are of ordinal type therefore it has to be normalized by using the following formula.
Where Vnew is the normalized value, Vold is the previous value which we want to be normalized and mf is the maximum value of the considered range.
3) Combinational method for Clustering
In this step is used to calculate the object size then we put the samples with the same size/level into the same cluster.
4) Divide subspace clustering
In this step we divide the data set into a number of subspaces. Fig.2 shows that subspaces will be further clustered based on their features. It should be noted that in this step we separate objects based on their size not their features or attributes.
5) Subspace clustering
Finally, we cluster each subspace that was created by the previous K-means clustering. After the second clustering we should have objects of almost the same size/level and similarity value in the same cluster.
Using the frame work that we mentioned before, we developed an experiment for evaluating our method. We have conducted a comparison of data mining tools to select a suitable platform for conducting our experiments. We decided to use SPSS Clementine against its alternatives such as Oracle Data Mining and WEKA platforms. SPSS Clementine 12.0 is a good support for visual analysis and enabled us to query records that belong to each cluster. In case of our experiment, SPSS Clementine 12.0 supports wider range of algorithms and has better data preparation.
V. EXPERIMENTAL PROCEDURE
In our study a data set from a Persian website (www.peivand.org) has been used. In this web site there are some psychology questions which are answered by users. Base on user's answers a table is extracted that includes different features of user's personality. For each user 80 feature is recorded, each of which is ranked between 1 and 5. Based on this dataset (which is included 750 records) in first clustering step we will find people who are generally in the same level but different personality. In second clustering step people who have similar personality is found.
VI. RESULTS AND DISCUSSION
In this section, we report and discuss the results of our experiments. For evaluating the effect of utilizing our proposed method, we compared the number of iterations required to converge the K-Means algorithm. In the first experiment, the number of clusters is 8 and clustering has been done without any grouping. The error is defined as the summation of the sample distances from the center of each cluster.
In the first step, objects categorized to four groups base on their length and each group clustered base on similarity. By comparing the results of these two experiments, we observe that the number of iterations in the second experiment is reduced. It means that with the same number of clusters in both experiments we have less iterations in proposed method. On the other hand we are able to extract clusters with small number of members whereas they are merged with bigger clusters in regular K-Means clustering. Thus we can conclude that the proposed method is more accurate and efficient than original K-Means clustering method.
VII. CONCLUSIONS AND FUTURE WORKS
Using two steps clustering in high dimensional data sets with considering size of objects helps us to improve accuracy and efficiency of original K-Means clustering. When objects are clustered base on their size, in fact, we are using subspaces for clustering. It causes achieving more accurate and efficient results. For this purpose we should consider orthogonal space which means that there should be no correlation among attributes of objects and size of dimension should be equal in all objects.
For future work we will investigate applying this method in different domain e.g. text mining. Also effects of different parameters in clustering for example K and number of dimensions should be studied.
- C.Bouveyran, S. Girard , C.Schmid : Technical Report 1083M, LMC-IMAGE 2008
- C.Fraley and A. Raftery: model based clustering discriminate analysis and density estimation, journal of American statistics association, 97, 611-631 ,2002
- H. Bock, on the interface between cluster analysis, principal component clustering and multidimensional scaling, 1987
- Jeffery L. Solka: text data mining Theory and methods, vol 2, 94-112, 2008
- H. Huang, C. Ding, D. Luo, T. Li, simultaneous tensor subspace selection and clustering, KDD08, lasvegas USA ,2008.
- J. Ye, Z. Zhao, M. Wu, discriminative K-Means for clustering, proceeding of the annual conference, 2007.
- R. Varshavsky, D. itorn and M. Linial, cluster algorithm optimizer: A framework for large datasets , ISBRA pp 85- 96,springer 2007.
- W. Tang, H. Xiong, S. Zhang, J. Wu, Enhancing semi- supervised clustering, KDD07 california USA , ACM 2007.
- Yuntao Qian and Ching Y. Suen ,Clustering Combination Method, IEEE 2000.
- Yuntao Qian and Ching Y. Suen, Sequential combination methods for data clustering, Journal of computer science,2002.
- A. Raftery and N. Dean, variable selection for model based clustering, journa of American statistical association, 101(473):168-178,2006.
- Pankaj K. Agarwal and Nabil H. Mustafa, K-Means Projective clustering, PODS Paris France, ACM 2004.