This essay has been submitted by a student. This is not an example of the work written by our professional essay writers.
Gene Clustering is one among the most popular issues involve in the field of Bio informatics and is defined as the process of grouping related genes in the same cluster. Among the various algorithm proposed for clustering, most of the researchers used the fuzzy c-means and there hybridization with some other methods to deal with the problem of premature convergence in fuzzy c-means clustering algorithm, but the results obtained were not satisfactory because the gene expression has huge amounts of ambiguous and uncertain biological data which requires advanced computing tools for processing such data. Particle Swarm Optimization (PSO) one of the variant of Swarm Intelligence (SI) has recently emerged as a nature inspired algorithms, especially known for their ability to produce low cost, fast and reasonably accurate solutions to complex search problems. PSO based Fuzzy C- Means algorithm were proposed but they all uses the traditional PSO algorithm. In traditional PSO algorithm each particle is attracted toward the best ever position discovered by any particle in the swarm, that limits the exploration capability. Instead if particle learn from the experience of the neighbouring that has better fitness than itself, the swarm can be more effectively and efficiently explored. So a method based on hybridization of fuzzy c-means and Fitness Distance Ratio based PSO is proposed. Initially this approach distributes the membership on the basis of the distance between sample and cluster centre making membership meet the constraints of FCM then the ratio of relative fitness and the distance of other particle is used to determine the direction in which each component of the particle position needs to be changed. The experiments were conducted on four real data sets and results shows that F-FDRPSO performs significantly better than FPSO and FCM algorithm.
Keywords-Fuzzy c-means clustering, Particle Swarm Optimization, Fitness Distance Ratio based Particle Swarm Optimization.
Bioinformatics field is an application of computational methods which makes biological discoveries. During the past few years a tremendous growth have been made for gathering biological information in the form of genomes, protein sequences, gene expression data. These information were analysed, interpreted and apply on the various Bioinformatics problems such as Gene Clustering Problem, Molecular Docking Problem, Multiple Sequence Alignment Problems, Phylogenetic tree construction, RNA Secondary Structure Prediction, Protein Secondary Structure Prediction, and Fragment Assembly Problem. Among the various problems, Gene Clustering is one among the most popular issues involve in the field of Bio informatics and is defined as the process of grouping related genes in the same cluster. A Gene is defined as a sequence of nucleotide bases, which carry the required information for the purpose of protein formation, that helps in the construction of the structural components and Clustering is a process of combining input patterns into a set of disjoint groups based on the similar functionality such that the patters in each cluster are more similar to each other than patterns from different clusters. So, a Gene clustering can be defined as the process of grouping Gene into cluster based on similar Gene expression level. Gene expression refers to a process through which the coded information of a gene is converted into structures operating in the cell. It provides the physical evidence that a gene has been "turned on" or activated. Gene Clustering is becoming popular nowadays because of increasing power of computing and matured microarray technology. Microarray technology allows monitoring huge amount of gene expression level simultaneously for whole genome through a single chip only.
Number of researchers have put their efforts for clustering the Gene Expression Data in which firstly Hartigan.et.al  apply the concept of clustering on Gene expression data using classic K-Means algorithm. But it has the drawback of easily getting trapped in local optima. To overcome this problem Zhihua. et. al  proposed a modified form of K-Means algorithm known as PK-Means in which Particle Pair Optimizer is combined with K-Means clustering algorithm. The PK-Means algorithms outperform K-Means due to a factor of exploration added into it. A modified version of PK means is introduced by Yau-King Lam et al. , in which a concept of clustering matching is introduced. Each cluster centroid in a particle is matched with its corresponding centroid in the best particle based on the closest distance. Yau King Lam et al.  proposes a novel algorithm XK-means which hybridize PSO with K-means so as to integrate the concept of exploration with exploitation. This is done by using K-means and an exploratory vector. The exploratory vector is added to each centroid before K-Means iteration. The algorithm includes two parameters namely ai and bi respectively. These parameters are problem dependent. The value of parameter bi is set in such a way that it decrease monotonically. As a result the search space gets explored and the exploitation level gets increased. The complexity of the proposed algorithm is low in comparison to K-means and PK-means algorithm.
A neural network based Gene clustering approach was introduced by Xiao. et al.  which hybridizes Self Organizing Maps (SOM) and PSO for clustering genetic data. The proposed SOM/PSO algorithm works in two stages. In the first stage of the proposed algorithm, SOM is used to cluster the dataset. In this stage either regular SOM or SOM with a conscience can be used. The SOM normally runs for 100 iterations and generates a group of weights. In the second stage, PSO is initialized with the weights produced by SOM in the first stage. Then PSO is used to refine the clustering process. The experimental result shows that the proposed algorithm outperforms the average quantization error and the typographic error as compared to SOM.
Zhen. et. al.  proposes a Particle Swarm Optimizer (PSO) based memetic algorithm (MA)  and named it as memetic K-means algorithm (MKMA). The aim of proposed algorithm is to minimize the sum of squared distance. MA combines both the global and local search. The global search in MKMA is done with the help of one of the variants of PSO i.e. Comprehensive learning particle swarm optimizer (CLPSO) . CLPSO maintains the diversity of the swarm which help in improving the exploration efficiency of the global search. They conduct experiments on two gene expression datasets and reveal that MKMA has consistently attained better performance in comparison with than of K-means, Fuzzy K-means and PK-means. Although the PSO algorithm converges fast, but lacks in global convergence and may get trapped in local optima. So Sun et.al  proposed the Quantum-behaved Particle Swarm Optimization (QPSO) algorithm. Although QPSO possesses better global convergence behavior than PSO, but it also suffers from premature convergence. To overcome the problem of QPSO Jun Sun et.al  introduced the revised version of QPSO by adding a Multi-Elistic strategy for Quantum-behaved Particle Swarm Optimization (QPSO) known as Multi-Elitist QPSO model, which is drawn from Swagatam et.al , to update the gbest position of the QPSO algorithm. In the MEQPSO algorithm, the particle's search is not based on the global position i.e. gbest but on the bases of the promising search region so that the particle has more chance to search in this region and obtain the global optimal solution. As a result, the MEQPSO have a stronger global search ability and better overall performance that the original QPSO.
This paper presents a novel hybrid Fitness Distance Ratio based Particle Swarm Optimization (FDRPSO)  based fuzzy c-means algorithm for Gene Clustering named as F-FDRPSO. Particle Swarm Optimization is a newly emerged evolutionary technique based on swarm intelligence theory and FDR-PSO is one of the variant of PSO. In FDR-PSO each particle not only learns from itself and group best particle but also learns from the experience of the neighbouring particle that has better fitness than itself. So the search space can be more effectively and efficiently explored.
The rest of the paper is organized as follows. In section 2, the detail of Fuzzy c-means and FDR-PSO is presented and a novel approach F-FDRPSO is proposed. In section 3, we evaluated the proposed algorithm with four test datasets and compare it with FPSO and Fuzzy c-means. In the last section, we summarized the experiment result and reach the conclusions.
2. F-FDRPSO : A HYBRID FDRPSO Based FUZZY C-MEANS Algorithm
Although Fuzzy C-Means is one of the most popular methods for gene clustering due to its simplicity but facing with the problem of premature convergence. To deal with this problem PSO based Fuzzy C- Means algorithm is proposed but it uses the traditional PSO algorithm, where each particle is attracted toward the best ever position discovered by any particle in the swarm. This limits the exploration. So to enhance the performance, a novel method based on hybridization of FCM with FDRPSO is proposed called F-FDRPSO which maintains the merits of both FCM and PSO algorithms.
2.1 FUZZY C-MEANS ALGORITHM
Fuzzy C-Mean (FCM) algorithm  was developed by Dunn in 1973 and improved by Bezdek in 1981. It is basically an unsupervised clustering algorithm which has been applied to wide range of problems involving clustering, classifier design and feature analysis. FCM is a method of clustering which allows one piece of data to belong to two or more clusters. The clusters are formed using distance between data points and cluster centers.
Fuzzy c-means partition n data points d=(p1,p2,p3,..........pn) into c(1<c<n) clusters with Z=(z1,z2,z3.......zn) centroids. The classification result is expressed in terms of matrix U= [Âµik]cÃ-n, where Âµik is the membership degree of data point k with respect to cluster i and satisfies the following conditions:
0 Âµik1 i=1,2,......c, k=1,2,.......n (1)
= 1 k=1,2 ,......n (2)
0 n i=1,2 ,....c (3)
The objective function of FCM algorithm is to minimize the Eq (4):
J(p)=m || xk - Vi||2 (4)
where m>1 is a real number that governs the influence of membership grades, Vi is the cluster center of cluster i and xk is the vector of data point and ||xk - Vi||2 represents the Euclidean distance between xk and Vi.
Vi = (5)
= [ ] -1 (6)
Where is the membership degree of data point k in cluster i.
The Fuzzy c means clustering is defined as follows:
Step 1: Let t = 0. Select an initial fuzzy pseudo-partition.
Step 2: Calculate the c cluster centers by Equation. (5) for and the chosen value of m.
Step 3: Compute by Equation. (6) and update.
Step 4: Compare and If | âˆ’ â‰¤ Îµ, then stop; otherwise, increase t by one and return to Step 2.
In the above algorithm, the parameter m>1 is selected to suit the problem under consideration. The partition becomes fuzzier with increasing m and there is currently no theoretical basis for an optimal choice for its value .
2.2 PARTICLE SWARM OPTIMIZATION
The Particle Swarm Optimization, proposed by James Kennedy and Russel Eberhardt in 1995  is an algorithm for creating swarms of particle which moves in the search space i.e. Problem space searching for their goal (based on problem definition). The concept is stated as, each particle updates its position and velocity based on its own best value and best value in the whole swarm. For a swarm of n particles the ith particle is represented by a position denoted as xi = (xi1, xi2, . . . , xid) where n is the number of particles. Except the position, each particle of a swarm is represented in D dimensional space with a velocity vi = (vi1, vi2, . . . , vid ). The particles explore in the search space with a velocity that is dynamically adjusted according to its own and neighbor's performances. The standard PSO method updates the velocity and position of each particle according to the equations given below.
where c1 and c2 are two positive acceleration constants, Pbest and Gbest are the best positions found so far by the ith particle and all the particles respectively, t is the iteration count and Ï‰ is an inertia weight which is usually, linearly decreasing during the iterations. In Eqation. (7), there are three parts: the first one show the current speed of particle i.e. the present state, the second term is known as the cognition term which shows the thought of the particle itself and the last and third term is a social term that shows the ability of information sharing among the swarms.
2.3 Fitness Distance Ratio based Particle Swarm Optimization
Thanmaya.et.al  proposed a modified particle swarm optimization called Fitness Distance Ratio based Particle Swarm Optimization (FDRPSO) in which each particle is moved towards other nearby particles that have better fitness, instead of just the best position discovered so far.
The learning process defined in the standard PSO is based on a particle's own experience and the best particle in the swarm. The FDRPSO algorithm adds a new dimension to this approach: each particle also learns from the experience of the neighbouring particles that have a better fitness than itself. This approach results in changes in the velocity update equations, although the position update equations remain unchanged. The FDR-PSO algorithm selects only one other particle at a time when updating each velocity dimension. This particle is chosen to satisfy two criteria:
1. It must be near the particle being updated.
2. It should have visited a position of higher fitness.
The simplest and most robust variation was to update each velocity component by selecting a particle that maximizes the ratio i.e. the dth dimension of the ith particle's velocity is updated using a particle called the Nbest, with prior best position Pj, chosen to maximize
Fitness Distance Ratio= (9)
where |...| denotes the absolute value. The above expression is called the Fitness-Distance-Ratio.
The proposed F-FDRPSO algorithm for gene clustering problem can be stated as-
Input: dataset, number of cluster.
Output: Minimized value of Objective Function.
1: Initialize the position and velocity of each particle including Gbest, Pbest, Ï‰, c1, c2, c3 along with the fuzzy parameters as fuzzy index, number of clusters, where c1, c2, c3, Ï‰ are the constants, Pbest is the best position of the particle discovered so far, Gbest is the group best position in the swarm. Nbest is the position of particle calculated by the ratio of the fitness distance.
2: Find cluster centre for each particle using Equation. 5.
3: Using the value of cluster centre compute the value of objective function according to Equation. 4.
4: Store each particle position as best position Pbest and each particle Fitness, then choose the particle that has best fitness as Gbest.
5: Calculate the Fitness Distance Ratio using Equation. 9 and compute the Nbest value accordingly.
6: Update the position and velocity of each particle according to Equation. 10 and 11.
7: Based on the updated value, recomputed the fitness value of each particle compare the fitness of each particle with its previous best fitness Pbest, if better than it, and then set the current position as Pbest.
8: Compare the fitness of each particle with the group best previous fitness, if better than it, then set the current position as; Gbest.
9: The clustering process stops when the maximum number of iterations is reached, or when the objective function improvement between two consecutive iterations is less than the minimum amount of improvement specified, if preconditions not met, then return to Step 2; preconditions are met, then stop iteration, output the solution.
3. EXPERIMENTAL RESULTS
A. PARAMETER SETTINGS
This paper presented a novel F-FDRPSO for clustering Gene Expression Data. The main objective of this study is to access the relative performance of the proposed F-FDRPSO with respect to FPSO and FCM. The performance is measured with respect to the minimization of objective function value as in Equation. (4). For the comparative study, we coded these methods by using the fuzzy tools available in MATLAB. For our experimental tests, we used a Dual Core processor with (CPU 2.60 GHz 4GB RAM) and the values of different parameters used are as follows: w=0.72, c1 = c2 = c3=1.49, m=2, Îµ = 0.00001 and number of iterations = 100.
B. PERFORMANCE METRIX
Mean Squared Error
The Mean Squared Error (MSE) is used to measure the compactness inside the clusters, and is defined as:
where is a data point i , is its cluster's centroid and Cj is its cluster.
C. TEST DATASET
For evaluating the FCM, FPSO, F-FDRPSO methods, four real world dataset were considered:
1. The Iris plants data set is from the UCI Machine Learning Repository. This is perhaps the best known database to be found in pattern recognition literature. The data set has 150 points containing 50 instances of each of three types of Iris plants.
2. The Wine data set is from the UCI Machine Learning Repository. This data set has 178 data points and 13 attributes. Each data point is classified to three classes.
3. The training Image Segmentation data set is from the UCI Machine Learning Repository. This data set has 2310 data points, which contain information of an image in terms of 19 categorical attributes. Each data point is classified to seven classes (i.e., brick face, sky, foliage, cement, window, path, grass).
4. The Reduced Yeast Cell Cycle (RYCC) data set used in this research is from . It is a data matrix with 384 rows and 17 columns. Each row represents a gene with 17 dimensions where each dimension corresponds to a point in the time series. It contains 384 genes that are grouped based on the five phases of the cell cycle: G1/M, G1, S, G2, and M. The microarray samples, collected at 17 time points taken in ten-minute intervals, cover nearly two full cell cycles (170 min).
Although the main objective of this research is to cluster the Genes of Yeast cell cycle dataset but we have also tested the method on three standard dataset. As shown in Table 1, the comparison on three standard dataset is given and In Table 2 the Yeast cell cycle dataset is tested with varying the fuzzyness parameter value as m=2 and m=1.25 respectively. In both cases the result obtained by proposed method outperforms the FCM and FPSO methods.
The average variation of the MSE performance over the iteration of the schemes with the datasets, obtained in a 10 typical runs and with m=2 and m=1.25, is shown in fig.1and fig.2 respectively. From the result depicted in Fig.1, it can be seen that the MSE of the FPSO in compare to F-FDRPSO, converges faster. Also with m=1.25 the graph of F-FDRPSO show that the value of MSE decreasing initially but with the increase in the number of iteration the F-FDRPSO converges to lower value of MSE than FPSO.
Gene cluster analysis is the first step in discovering the function of gene in bioinformatics. Although K-means clustering is one of the most popular clustering algorithm in which the data objects are partitioned into k clusters(where the number of clusters is known in advance), such that each data object is belongs to exactly one cluster still it is not applicable for real data sets having no definite boundaries between the clusters. In such condition Fuzzy C-Means, due to its fuzziness feature has become the most well-known and powerful method in cluster analysis, but the random selection in centre points makes iterative process falling into the local optimal solution easily. To overcome this problem, recently many evolutionary algorithms such as genetic algorithm (GA), simulated annealing (SA), ant colony optimization (ACO), and particle swarm optimization (PSO) have been applied. Evolutionary algorithm, such as Particle Swarm Optimization (PSO) is incorporated with the fuzzy c-means clustering to achieve both exploration and exploitation. This kind of integrated algorithm can give better clustering results in term of compactness within the clusters, but the problem with the PSO based Fuzzy C-Means is that each particle is attracted toward the best ever position discovered by any particle in the swarm which limit the exploration. Instead if particle learn from the experience of the neighbouring particle that has better fitness than itself then the swarm can be more effectively and efficiently explored. This concept is used in the proposed FDRPSO based Fuzzy c-means (F-FDRPSO) method. The experimental results over four well known data set namely iris, wine, image and Yeast cell cycle data set shows that the proposed method is efficient and produce much better results in comparison to PSO based fuzzy c-means and Fuzzy c -means.