Clustering Is Mostly One Of First Steps Computer Science Essay

Published:

This essay has been submitted by a student. This is not an example of the work written by our professional essay writers.

A detailed survey, study and analysis of the algorithms has also been done as part of this research work and is presented here. Clustering algorithms can be classified based on the structure of the clusters generated, whether it yields partial or complete clusters, or whether it is exclusive or overlapping. Though there are different ways in which clustering algorithms can be categorized, we would like to present them categorized as partition based algorithms, Hierarchical based algorithms, density based algorithms, model based methods, pattern based algorithms and so on.

2.2 Partition based Clustering Methods:

The most popular class of clustering algorithms is the combinatorial optimization algorithms also referred to as iterative relocation algorithms. These algorithms minimize a given clustering criterion by iteratively relocating data points between clusters until an optimal partition is attained. Partition-based clustering attempts to directly decompose the data set into a set of disjoint clusters. In a basic iterative algorithm, such as K-means or K-medoids, convergence is local and the globally optimal solution cannot be guaranteed. Partition based algorithm tend to create clusters that are spherical in shape as they are distance based.

The k-means algorithm (JB McQueen, 1967), the most widely known, is a typical partition based algorithm. The k-means clustering algorithm has four steps: initialization, assigning, calculation, iteration. Given a prespecified number K, the algorithm partitions the data set into K disjoint subsets. Initially the algorithm starts with a randomly initialized cluster centers. The objects that are closer to the center are grouped together to form a cluster. With the objects assigned to the clusters, new cluster centers are formed. This procedure iterates until there is no change in the cluster centers. The clusters are formed based on an objective function that minimizes the distance between the object and the cluster center. This algorithm is simple and fast. An Analysis on this algorithm shows that K-means algorithm converges after a few iterations.

Several variations of K-means algorithms have been applied to gene expression data. K means+ method (Heng Huang et al) addresses the number of clusters dependency and degeneracy. K-means+ method can automatically partition genes into a reasonable number of clusters and then the informative genes are selected from clusters. An enhanced K-means clustering algorithm (Jirong Gu ;Jieming Zhou 2009) has proposed an algorithm for refining initial points which is capable of reducing execution time. This algorithm improves solutions for large data by refining the initial conditions. This refinement of initial points in the K-means algorithm greatly improves the efficiency of k-means algorithm.

The global k-means clustering algorithm (Likas et al., 2003) presents a deterministic global optimization method that does not rely on any initial parameter values but uses the k-means algorithm as a local search procedure. Instead of arbitrarily selecting initial values for all cluster centers, this technique optimally adds one new cluster center at each stage in an incremental way. Adil M. Bagirov and Karim Mardaneh developed a new or a modified version of the global k-means algorithm. This algorithm computes clusters incrementally and to compute k-partition of a data set it uses k âˆ’ 1 cluster centers from the previous iteration. An important step in this algorithm is the computation of a starting point for the k-th cluster center. This starting point is computed by minimizing so-called auxiliary cluster function. The proposed algorithm computes as many clusters as a data set contains with respect to a given tolerance.

Lu Y et al, 2004 have proposed a new clustering algorithm, Incremental Genetic K-means Algorithm (IGKA). IGKA is an extension of the clustering algorithm, the Fast Genetic K-means Algorithm. The main idea of Incremental Genetic K-means Algorithm is to calculate the objective value, Total Within-Cluster Variation (TWCV) and to cluster centroids incrementally whenever the mutation probability is small. Incremental Genetic K-means Algorithm inherits the salient feature of Fast Genetic K-means Algorithm of always converging to the global optimum. An Efficient Unified K-Means Clustering Technique (P. Valarmathie et al, 2011) is also an enhancement of k-mean algorithm were the initial number of clusters is determined using Expectation Maximization (EM) algorithm, a model based methodology. This approach decides the number of clusters by minimizing the squared error function and maximizing the correctness ratio value. Pavan et al (2010) have proposed a SPSS (Single Pass Seed Selection) algorithm which is an extension of K-means++ which works well with high dimensional data sets.

Hemalatha and Vivekanandan, 2008 have proposed an enhanced version of k-means clustering algorithm which is claimed to be parallel and distributed. Garg and Jain (2006) have done a comparison on some of the existing variations of k-mean algorithms. They have used the synthetic sets of high dimensional data as benchmark for evaluating the algorithms and have also proposed some criteria for comparison of these clustering algorithms. The Cluster Afï¬nity Search Technique (CAST) is an partitional algorithm proposed by [Ben-Dor et al. 1999] to cluster gene expression data. Abdelghani Bellaachia et al, have developed and enhanced CAST algorithm, called ECAST (Enhanced cluster affinity search technique), that uses a dynamic threshold.

A comparative analysis of k-mean based algorithm namely global k-means, efficient k-means, k-means++ and x-means was done by Parvesh Kumar Siri Krishan Wasan, 2010. The analysis shows that performance of these algorithms can be improved further with the help of fuzzy logic and rough set theory to yield better quality of clusters

2.3 Hierarchy based Clustering Methods:

Unlike partition-based clustering, hierarchical clustering generates a hierarchical series of nested clusters which can be graphically represented by a tree, called dendrogram. The branches of a dendrogram denote the formation of clusters by indicate the similarity between the clusters. The levels in the dendogram also mark the number of clusters obtained. Similar objects are place together by reordering the objects such that the branches of the corresponding dendrogram do not cross. Hierarchical clustering algorithms can be further separated into agglomerative approaches and divisive approaches based on how the hierarchical dendrogram is constructed. Agglomerative algorithms follow a bottom-up approach. Initially each data object is considered as an individual cluster, and at each step, the closest pair of clusters are merged together until all the groups are merged into one cluster. Divisive algorithms follow top-down approach. It starts with a single cluster containing all the data objects and, at each step split, until single clusters of individual objects are formed.

Bernard Chen et al have proposed a novel hybrid approach that combines the merits of hierarchical and k-means clustering. This approach is different from other methods as it initially carries out hierarchical clustering to decide location and number of clusters and then run the K-means clustering as the next step. This approach also provides a mechanism to handle outliers.

Ranjan and Khalil (2007) have worked with the statistical approaches in hierarchical clustering and have also done a comparison on the linkage methods which can assist us in knowing the functionalities of many genes. Vijendra (2011) has presented a detailed review of various subspace and density based clustering algorithms, their efficiencies and inefficiencies on different data sets. Zhou et al (2007) have proposed a Join-Prune algorithm that shows momentous gain in runtime and quality.

Md. Nurul Haque Mollah et al have introduced a robust hierarchical clustering algorithm for gene expression data analysis. This algorithm proves to provide improved performance than the traditional hierarchical algorithm in the presence of outliers. Feng Luo et al have proposed a new hierarchical growing self-organizing tree (HGSOT) algorithm to cluster 112 rat CNS genes and have observed that five clusters similar to Wen et al's original HAC result can be successfully obtained.

A new hierarchical clustering algorithm which reduces susceptibility to noise was proposed by Ziv Bar-Joseph et al. This algorithm allows up to k siblings to be related directly and produces a single optimal order tree as resulting tree. a k-ary tree is efficiently constructed, where each node can have up to k children, An optimal ordering of the leaves is done and k clusters are combined at each step. The algorithm proves to be more robust against noise and missing values.

An enhanced hierarchical clustering algorithm designed by Geetha T et al 2010 reduces the time taken to analyze large datasets. The method scans the dataset and calculates distance matrix only once and the result of hierarchical clustering is represented as a binary tree. The algorithm finds the number of clusters with the help of cut distance and measures the quality with validation index in order to obtain high quality clusters.

2.4 Density based clustering methods:

Density based clustering algorithms are appropriate when the clusters have irregular shapes. The method works by putting points together in high density areas as members of the same cluster and treating low density areas as boundaries separating two clusters. There are two approaches commonly used to identify high and low density areas. One approach is to define a neighbourhood with a small radius around each data point and the minimum number of objects to be placed in that area for being considered as high density area. The points at the edges of the cluster are put to a density test and if they are found to be in the high density area, the cluster grows. The second approach is to define an influence function to each point. The total influence at any point is the sum of the influences from all the points. The influence will be high from nearby points and low from far away points. A threshold value should be defined to separate high influence areas from low influence areas.

Daxin Jiang et al 2003 have dealt with the problem of effectively clustering time series gene expression data by proposing an algorithm DHC, a density-based, hierarchical clustering method. The density-based approach has been used to produce clusters of high quality and robustness. The algorithm adopts the structure of hierarchical clustering and mining result is in the form of a two trees namely density tree and attraction tree .The density tree is the final result and it uncovers the embedded clusters in a data set. The attraction tree is an intermediate result that helps in the further investigation of the inner-structures, the borders and the outliers of the clusters.

Rosy Das et al presents an incremental clustering algorithm based on a density-based algorithm. The method was experimented with real-life datasets and compared with some well-known clustering algorithms in terms of z-score cluster validity measure. Seokkyung Chung et al have presented a novel density clustering algorithm, which utilizes a neighborhood defined by k-nearest neighbors. The clustering algorithm was developed based on KNN density estimation. For an efficient k-nearest neighbor search, different dimensionality reduction methods that are relevant for gene expression data were explored.

Sauravjyoti Sarmah and Dhruba K. Bhattacharyya have suggested a clustering technique (GenClus) for gene expression data which is capable of handling incremental data. It is designed based on density based approach and uses no proximity and therefore avoids the restrictions placed by them. The main advantage is that it retains the regulation information and is able to handle datasets updated incrementally.

2.5 Model based clustering approach:

The model based methods hypothesize a model for each of the clusters and find the best fit of the data to that model. Typical model based methods involve statistical approaches, probability models or neural network approaches.

Kohonen neural network, also called as self organizing maps (SOMs) is a two layer architecture for unsupervised clustering that was proposed by kohonen(1982). Amel Ghouila et al have proposed a multi SOM clustering technique that overcomes the problem of the estimation of cluster numbers. The difficulty to find clear boundaries by SOM is overcome by combining SOM with k-means. Dali Wang et al 2002 applied a novel model of SOM, called double self-organizing map (DSOM) to cluster gene expression data. DSOM finds the appropriate number of clusters clearly and visually depicts the appropriate number of clusters. A novel validation technique, known as figure of merit (FOM) has been employed to validate the clusters. Xiang Xiao et al have proposed a hybrid clustering approach based on Self- Organizing Maps and Particle Swarm Optimization. The algorithm improves the rate of convergence by adding a conscience factor to the Self-Organizing Maps algorithm and attempts to generate a more compact clustering result than SOM.

A model-based clustering method for time-course gene expression data was presented by Fang-Xiang Wu. The presented method uses Markov chain models (MCMs) to account for the inherent dynamics of time-course gene expression patterns. An assumption that expression patterns in the same cluster were generated by the same MCM is made by the algorithm. For the given number of clusters, the presented method computes cluster models using an EM algorithm and an assignment of genes to these models is done by maximizing their posterior probabilities.The quality of the clusters are evaluated by using the average adjusted Rand index (AARI).

2.6 Pattern based clustering techniques:

It is a general characteristic of pattern-based clustering algorithms to treat attributes and objects interchangeably (HANS-PETER KRIEGEL et al 2009). Thus they are also called biclustering, coclustering, two-mode clustering, or two-way clustering algorithms. In the technique suggested by Larry T. H. Yu et al the emerging patterns (EPs) and projected clustering techniques are integrated for effective clustering of gene expression data. The key concept of the resulted EP-based projected clustering (EPPC) algorithm is to introduce the readability and strong discriminatory power of EPs (introduced by Dong and Li) in the dimension projection process of the projected clustering so that the predictability of the projected clusters can be improved.

Sabita Barik et al 2010 have proposed an efficient frequent pattern based clustering to find the gene which forms frequent patterns showing similar phenotypes leading to specific symptoms for specific disease. This hybridized Fuzzy FP-growth approach not only outperforms the Apriori with respect to computational costs, but also builds a tight tree structure to keep the membership values of fuzzy region to overcome the sharp boundary problem and it also takes care of scalability issues as the number of genes and condition increases.

Daxin Jiang et al 2006 have addressed the two important issues of pattern-based clustering. They are large number of highly overlapping clusters which makes it tough to identify interesting patterns and lack of a general model for pattern-based clustering. A general quality-driven approach to mining top-k quality pattern-based clusters has been proposed and experimented with real world microarray datasets. Yinghui Yang et al investigate using a pattern-based clustering approach to group customer web transactions. An objective function has been defined that maximizes in order to achieve a good clustering of customer transactions along with a new algorithm, Greedy Hierarchical Itemset-based Clustering (GHIC), groups customer transactions such that item sets generated from different cluster show completely different patterns.

2.7 Biclustering

Biclustering,Â co-clustering, orÂ two-modeÂ clusteringÂ is aÂ data miningÂ technique which allows concurrentÂ clusteringÂ of the rows and columns of aÂ matrix. The term was first introduced by MirkinÂ and recently by Cheng and ChurchÂ inÂ gene expressionÂ analysis. Different genes have different expression levels according to their specific function at each condition.Â BiclusteringÂ identifies groups ofÂ genes with similar expression patterns under a specific subset of conditions. These conditions may correspond to different time-points, for example in times series expression data. A good number of biclustering algorithms have been proposed for grouping gene expression data.

Sara C. Madeira and Arlindo L. Oliveira have analyzed a large number of existing biclustering techniques and classified them according to the type of biclusters, the patterns of biclusters discovered, the methods used to do the search, the approaches used to assess the solution, and the target applications. K-biclusters clustering (KBC Algorithm), proposed by Tsai and Chiu (2010), minimizes the dissimilarities between genes and bicluster centers. Additionally it tries to minimize the residue within the clusters and to involve as many conditions as possible. An iterative co-clustering algorithm that mainly concentrates on user defined constraints and minimizes the sum squared residue was addressed by Pensa and Boulicaut, (2008). This algorithm does not identify the overlapping biclusters.

Yang et al have proposed a novel bi-clustering algorithm for generating non-overlapping clusters of genes and conditions and this information is used to construct transcription factor interaction networks. Chun Tang et al has presented a new framework for unsupervised analysis of gene expression data which applies an interrelated two-way clustering approach on the gene expression matrices. This algorithm is able to find important gene patterns and to perform class discovery on samples simultaneously.

The concept of biclustering on gene expression data was introduced by Cheng and Church, 2000. Cheng and Church proposed several greedy row/column removal/addition algorithms that are then combined in an overall approach that makes it possible to find a given number of K biclusters. The FLOC (FLexible Overlapped biClustering) algorithm suggested by Jiong Yang et al, 2003 is based on the bicluster definition used by Cheng and Church but performs simultaneous bicluster identification. It is also robust againts missing values, which are handled by taking into account the bicluster volume (number of non-missing elements). The SAMBA algorithm (Statistical-Algorithmic Method for Bicluster Analysis) proposed by Tanay et al 2002 uses probabilistic modeling of the data and graph theoretic techniques to identify subsets of genes that jointly respond across a subset of conditions

Spectral biclustering, is based on the observation that checkerboard structures in matrices of expression data can be found in eigenvectors corresponding to characteristic expression patterns across genes or conditions. In addition, these eigenvectors can be readily identified by commonly used linear algebra approaches, in particular the singular value decomposition (SVD), coupled with closely integrated normalization steps. Kluger et al present a number of variants of the approach, depending on whether the normalization over genes and conditions is done independently or in a coupled fashion and then apply spectral biclustering to a selection of publicly available cancer expression data sets, and examine the degree to which the approach is able to identify checkerboard structures. Kung et al have introduced a new biclustering method based on a modified variant of the Non-negative Matrix Factorization (NMF) algorithm that produces a sparse representation of the gene expression data matrix, making possible in this way, its use as a biclustering algorithm.

S.Y. Kung et al have worked on a multi-metric and multi-substructure biclustering analysis for gene expression data. Multivariate and multi-subscluster analysis is very helpful in identifying and classifying biologically related groups in genes and conditions. The algorithm successfully yields highly discriminant and precise classification based on known ribosomal gene groups.

Akdes Serin and Martin Vingron 2011 present a fast biclustering algorithm called DeBi (Differentially Expressed BIclusters). The algorithm is based on frequent itemset mining, a well known data mining approach. It discovers maximum size homogeneous biclusters in which each gene is strongly associated with a subset of samples. The performance of DeBi is evaluated on a yeast dataset, on synthetic datasets and on human datasets. E. Yang et al have proposed a novel non-overlapping bi-clustering algorithm and show how this information can be interpreted to support in the construction of transcription factor interaction networks.

Anindya Bhattacharya and Rajat K. De, 2011 have come out with a new correlation-based biclustering algorithm called bi-correlation clustering algorithm (BCCA). BCCA produces a diverse set of biclusters of co-regulated genes over a subset of samples where all the genes in a bicluster have a similar change of expression pattern over the subset of samples. The existence of common transcription factors binding sites for all the genes in a bicluster serves as a proof that the group of genes in a bicluster are co-regulated. Biclusters determined by BCCA also express highly enriched functional categories.

Alain B. Tchagang, and Ahmed H. Tewfik developed a Robust Biclustering Algorithm that uses basic linear algebra and arithmetic tools. This algorithm is simple as there is no need to solve any optimization problem. . Noureen et al. (2009) have proposed a simple and efficient biclustering algorithm (BiSim) which proves to be very simple when compared the Bimax algorithm. It reduces the complexity and extra computation when compared to Bimax. Bimax proposed by Prelic et al 2006 uses a simple data model reflecting the fundamental idea of biclustering. This method has the benefit of providing a basis to investigate the usefulness of the biclustering concept, independently of interfering effects caused by approximate algorithms, and the effectiveness of more complex scoring schemes and biclustering methods in comparison to a plain approach.

2.8 Fuzzy clustering methods:

In hard clustering (hierarchical, k-means etc), microarray data is divided into distinct clusters, where each data element belongs to exactly one cluster. InÂ fuzzy clustering (also referred to asÂ soft clustering), microarray data elements can belong to more than one cluster, and associated with each element is a set of membership levels. These indicate the strength of the association between that data element and a particular cluster. Fuzzy clustering is a process of assigning these membership levels, and then using them to assign microarray data elements to one or more clusters.

Seo Young Kim et al 2006 examined clustering methods based on fuzzytype, and compared the performance of fuzzy-possibilistic c-means clustering using DNA microarray data. The examination has shown that fuzzy-possibility c-means clustering substantially improves the findings obtained by others. FPCM clustering proved to be more accurate and consistent than hierarchical clustering or the K-means method.

Fuzzy modifications of K-means include Fuzzy C-Means (FCM)( Dembele Kastner 2003) and Fuzzy clustering by Local Approximations of MEmberships (FLAME) (Fu and Medico 2007). In both, genes are assigned a cluster membership degree indicating percentage association with that cluster, but the two algorithms differ in the weighting scheme used to determine gene contribution to the mean. For a given gene, FCM membership value of a set of clusters is proportional to its similarity to cluster mean. The contribution of each gene to the mean of a cluster is weighted, based on its membership grade. Membership values are adjusted iteratively until the variance of the system falls below a threshold. These calculations require the specification of a degree of fuzziness parameter which is problem specific.

Hybrid fuzzy c-means clustering technique proposed by Valarmathie et al (2009), combines Fuzzy C-Means with Expectation Maximization algorithm to determine the precise number of clusters and to interpret them efficiently

2.9 Rough set based clustering methods:

Of late the concept of Rough sets has also been introduced into clustering and a few clustering algorithms have been developed based on rough set theory.

Pradipta Maji and Sushmita Paul 2010 applied rough-fuzzy c-means (RFCM) algorithm to discover co-expressed gene clusters. The pearson correlation based initialization method is used to select initial prototypes. The effectiveness of the RFCM algorithm and the initialization method, along with a comparison with other related methods has been demonstrated on five yeast gene expression data sets using standard cluster validity indices and gene ontology based analysis.

JUN-HAO ZHANG et al 2010 implemented rough fuzzy k-means clustering algorithm in matlab. With the assistance of the lower and upper approximation of rough sets, the rough fuzzy k-means clustering algorithm improves the objective function and further the distribution of membership function for the traditional fuzzy k-means clustering.

Ruizhi Wang et al 2007 have presented a novel approach (ROB) to find potentially overlapping biclusters in the framework of generalized rough sets. This method mainly consists of two phases. First, it generates a set of highly coherent seeds (original biclusters) based on two-way rough k-means clustering. The membership of data object is the ratio as shown below:

(1)

where d(v,mj) is the distance between itself and the centroid of cluster mj. And then, the seeds are iteratively adjusted (enlarged or degenerated) by adding or removing genes and conditions based on a proposed criterion. The method is illustrated on yeast gene expression data. The result is a set of biclusters of maximum size, with stronger coherence, and particularly with a reasonable degree of overlapping simultaneously. By associating each bicluster with a lower and an upper approximation, the approach dynamically adjusts the memberships of genes and conditions. This approach proves to work better than Cheng & Church biclustering algorithm and FLOC(FLexible Overlapped Biclusters).

Lijun et al used a new method combining correlation based clustering and rough sets attribute reduction together for gene selection from gene expression data is proposed. Correlation based clustering is used as a filter to eliminate the redundant attributes, and then the minimal reduct of the filtered attribute set is reduced by rough sets. The correlation coefficient between two genes is given as

(2)

where var() represent the standard deviation and cov() is covariance. A successful gene selection method based on rough sets theory is presented. The experimental results indicate that rough sets based method has the potential to become a useful tool in bioinformatics[8].

Jung-Hsien et al presents a novel rough-based feature selection method for gene expression data analysis. The method (RBFNN) finds the relevant features without requiring the number of clusters to be known a priori and identify the centers that approximate to the correct ones. The average distances between two seed points is calculated using the following formula,

(3)

For each cluster, the algorithm finds the number of data points in the upper bound and the lower bound. The method introduces a scheme that combines the rough-based feature selection method with radial basis function neural network.

2.10 Validation Techniques

In the previous sections, a review of variety of clustering algorithms was presented. Clustering of gene expression data results in groups of co-expressed genes, groups of samples with common characteristics, or "blocks" of genes and samples involved in specific biological processes. Though Cluster analysis acts as a tool to speed up and automate data processing, most of the cluster analysis carried out on genomic data is quite far from this end. This is mainly due to the properties like containing many more variables than samples, has high levels of noise and may have multiple missing values. These properties cause troubles to many traditional clustering methods and make cluster validation very essential. Another interesting part of this is that different clustering algorithms, or even the same clustering algorithm when using different parameters, generally result in completely different sets of clusters. Therefore, it is very important to compare various clustering results and select the one that best fits the "true" data distribution. Cluster validation is the process of assessing the quality and reliability of the cluster sets derived from various clustering techniques. The quality of a cluster as specified by Daxin et al 2004 is mainly defined based on the similarity of the objects within a cluster (homogeneity) and the dissimilarity between two different clusters(separation).

Maria Halkidi et al 2001 have addressed an important issue of clustering process regarding the quality assessment of the clustering results which is also related to the inherent features of the data set under concern. A review of clustering validity measures and approaches available in the literature has been presented.

Many different indices of cluster validity have been proposed, such as the Bezdek's partition coefficient, the Dunn's separation index, the Xie-Beni's separation index, Silhouette index, Davies-Bouldin's index, and the Gath-Geva's index, etc. A detailed analysis of the indexes is not within the scope of this research work. A few of them are listed here.

The Silhouette index Sj, which characterizes the heterogeneity and isolation properties of a given cluster, Xj (j = 1,â€¦, c), is given as

where m is number of samples in Sj. The Silhouette width s(i) for the ith sample in cluster Xj is defined as

where a(i) is the average distance between the ith sample and all of the samples included in Xj ; 'max' is the maximum operator, and b(i) is the minimum average distance between the ith sample and all of the samples clustered in Xk (k = 1,â€¦, c; kâ‰  j).

The Dunn index identifies sets of clusters that are compact and well separated. For any partition where Xi represents the ith cluster of such partition, the Dunn's validation index, D, is defined as:

where (Xi, Xj) defines the distance between clusters Xi and Xj; (Xk) represents the intracluster distance of cluster Xk, and b is the number of clusters of partition U.

The Davies-Bouldin index aims at identifying sets of clusters that are compact and well separated. The Davies-Bouldin validation index, DB, is defined as:

}

where U, , , and b are defined as in the previous equation. Small values of DB correspond to clusters that are compact, and whose centres are far away from each other. Therefore, the cluster configuration that minimizes DB is taken as the optimal number of clusters, b.

2.11 Summary

A detailed study on the different categories of clustering algorithms and the methods proposed under each and every category has been done and presented here. A vast study was done to figure out how clustering has grown in all dimensions. Clustering algorithms have been evolving throughout the years. New methodologies have been proposed and experimented. The important fact that has been noted is that a clustering algorithm that produces promising results in a particular dataset proves not to be that efficient when applied on a different experimental dataset. In some cases, different algorithm produces entirely different results when applied on the same dataset. The main difficulty faced by bioinformaticians is in selecting an appropriate algorithm that would better suit the dataset. No single clustering algorithm can be chosen as the best algorithm. Researchers usually choose a well known clustering method that is readily available and easy to use.