# Class Separability Weighted Random Subspace Method

This paper proposes a random subspace method (RSM) weighted by a class separability criterion, which modifies the model combination phase to improve performance and avoid a drawback of the standard RSM. The RSM lowers the error rate and is insensitive to noise because its ensemble is built from base-learners trained on randomly selected feature subsets. However, this randomness can also admit classifiers trained on subsets with low class separability, which damages the final ensemble decision and accuracy.

The proposed J3 weighted RSM is designed to eliminate this drawback of the standard RSM. Each randomly selected subset is scored with the J3 criterion, and the score determines the voting weight in the model combination phase, so that classifiers trained on poor subsets receive a smaller vote. Two models are suggested on this basis: the J3 weighted RSM and the optimized J3 weighted RSM. In the J3 weighted RSM, the computed J3 value is multiplied directly by the base-learner's class assignment posterior, and the final decision is made by a weighted averaging rule. The optimized J3 weighted RSM uses pattern search to find the optimal min-max normalization range of the J3 values before they are multiplied by each posterior. The effect of the proposed models, and of the J3 range, on the 10-fold cross-validation error rate is investigated at various subset dimensionalities using three data sets from the UCI Machine Learning Repository. The results show that both models achieve a lower error rate at lower subset dimensionality than the standard RSM.

Keywords: Random subspace method; Class separability measure; J3 criteria; Pattern search; Weighted averaging.

## 1. Introduction

An ensemble of classifiers combines several classifiers to perform a classification task jointly. The main objective of ensemble construction is to improve on the predictive performance of a single learner, which has made it a popular research topic in machine learning (Garcia-Pedrajas and Ortiz-Boyer, 2008; Diaz and Rao, 2007; Okun and Priisalu, 2009; Shang et al., 2011; Ozcift, 2011; Kim and Oh, 2008; Zhu and Yang, 2008).

Different resampling, weighting, and subspacing techniques, and their effect on classification performance, have been studied extensively. Breiman (1996) and Freund and Schapire (1996) proposed the Bagging and Boosting ensemble methods, which construct ensembles by resampling and weighting observations. In contrast, Ho (1998) proposed the random subspace method (RSM), which trains each learner on a randomly selected subset of features in order to improve the generalization error. The RSM, rooted in stochastic discrimination theory (Kleinberg, 2000), has since been applied to a range of pattern recognition problems (Kuncheva et al., 2010; Lai et al., 2006; Nanni and Lumini, 2008; Sun and Zhang, 2007; Genuer et al., 2010).

Applications of the RSM have shown that random feature subspacing decreases error rates considerably. However, the random selection of feature subspaces is also the main drawback of the RSM, because a randomly selected set of features may have poor discrimination capability. A classifier trained on a subspace with poor class separability is itself poor and can damage the ensemble, even though subspace selection reduces the error rate relative to bagging and boosting on noisy classification tasks.

Some combined ensemble methods have been studied to eliminate the drawbacks of the individual methods. Combining the RSM with bagging has been reported to be successful, whereas combining boosting with the RSM has not (Tao et al., 2006). Garcia-Pedrajas and Ortiz-Boyer (2008) reported a new method for combining the RSM and boosting, called the Not so Random Subspace Method (NsRSM), in which different subspaces are selected at the boosting stage to minimize error. However, studies of this type have focused on the model generation phase rather than on model combination (Merz, 1999). Furthermore, the need for classifier combination comes into tension with efforts to make each individual classifier more accurate.

Our approach uses optimized classifier combination to reduce the drawback of the RSM. Without altering the nature of the RSM, namely the random selection of subspaces, we aim to decrease the error rate through a weighted voting stage. A class separability criterion based on the scatter matrices of each randomly selected subspace is computed and used as a weight coefficient at the voting stage. The damage caused by a classifier trained on a poor feature combination is thus reduced by giving it a low voting weight. Furthermore, the normalization range of the J3 criterion is investigated using the pattern search optimization (Psearch) method to obtain a lower error rate.

This paper is organized as follows: Section 2 summarizes the methods integrated in the proposed model. Section 3 describes the proposed J3 weighted and optimized J3 weighted RSM. Section 4 reports the results and advantages of the proposed models, and finally, Section 5 states the conclusion of our work.

## 2. Methodology

### 2.1. Random subspace method

The RSM is an ensemble construction technique proposed by Ho (1998). Unlike other ensemble methods such as bagging and boosting, it modifies the feature space to construct an ensemble of learners, as described in Algorithm 1 (Panov and Dzeroski, 2007), in order to decrease the error rate.

## Algorithm 1.

We randomly select p* features from the original c-class data set S = {x1, x2, …, xi} with p-dimensional feature vectors, where p* < p. The subspaced data are then used as inputs to the base learners Ci. At this stage, however, one or more subsets can have low class separability, which damages the majority voting stage that makes the final decision (Stepenosky et al., 2006). The RSM therefore offers an elegant solution for high-dimensional and noisy data classification, but the possibility of drawing poor feature subsets is its drawback.

In the model combination phase, weighting, known as weighted majority voting, can be used to reduce this drawback of the RSM (Neo and Ventura, 2012):
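The subspacing step can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions rather than the authors' implementation: the names `draw_subspaces` and `project`, and the `seed` parameter, are hypothetical.

```python
import random


def draw_subspaces(p, p_star, n_learners, seed=0):
    """Draw one random subset of p_star feature indices per base learner."""
    if not 0 < p_star < p:
        raise ValueError("p_star must satisfy 0 < p_star < p")
    rng = random.Random(seed)  # seed is an assumption, used for repeatability
    # Features are sampled without replacement inside each subset.
    return [sorted(rng.sample(range(p), p_star)) for _ in range(n_learners)]


def project(rows, subset):
    """Keep only the selected feature columns of every sample."""
    return [[row[j] for j in subset] for row in rows]


# Example: five learners, each trained on 3 of 30 features.
subsets = draw_subspaces(p=30, p_star=3, n_learners=5)
```

Each base learner would then be trained on `project(rows, subset)` for its own subset.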

(1)

where yi is the final decision and Wj is the weight vector. The damage done by poor base-learners can thus be reduced by giving a smaller vote to poor classifiers. Furthermore, a Bayesian perspective offers a better basis for weighting in the model combination phase: the class assignment posteriors of each base-learner carry rich information about the probability of belonging to each class, which makes weighting more effective (Lam and Suen, 1997). However, finding the optimal weighting coefficients for either combination model, so as to reduce the poor feature subset drawback of the RSM, remains a problem (Nanni and Lumini, 2008).

### 2.2. Class separability measure

A class separability measure (CSM) quantifies the discrimination power of a feature subset and is widely used in feature selection (Wang et al., 2011; Song et al., 2007). The J3 criterion, based on scatter matrices, is the most commonly applied technique because of its simplicity (Theodoridis and Koutroumbas, 2008). It is computed from the within-class scatter matrix (SW), the between-class scatter matrix (SB), and the mixture scatter matrix (SM).
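As an illustration of weighted majority voting on hard labels, the sketch below sums the weights of the classifiers that support each label and returns the label with the largest total. The function name and the example weights are hypothetical, and the rule shown is a common form of weighted voting, not necessarily the exact expression used by the authors.

```python
from collections import defaultdict


def weighted_majority_vote(predictions, weights):
    """Return the label backed by the largest total classifier weight."""
    if len(predictions) != len(weights):
        raise ValueError("one weight per classifier is required")
    totals = defaultdict(float)
    for label, weight in zip(predictions, weights):
        totals[label] += weight
    return max(totals, key=totals.get)


# Three weak learners vote for class 1, two strong learners vote for class 0.
print(weighted_majority_vote([1, 1, 1, 0, 0], [0.1, 0.1, 0.1, 0.9, 0.8]))  # prints 0
```

With equal weights the same call would return class 1, which is the simple majority.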

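In their standard form (Theodoridis and Koutroumbas, 2008), these are:

$$S_W = \sum_{i=1}^{c} P_i \Sigma_i \tag{2}$$

$$S_B = \sum_{i=1}^{c} P_i (\mu_i - \mu_0)(\mu_i - \mu_0)^{T} \tag{3}$$

$$S_M = S_W + S_B \tag{4}$$

$$J_3 = \mathrm{trace}\left\{ S_W^{-1} S_M \right\} \tag{5}$$

where $\mu_i$ is the mean vector of class $\omega_i$, $\mu_0$ is the global mean vector, and $\Sigma_i$ and $P_i$ are the covariance matrix and prior probability of class $\omega_i$, respectively. The J3 criterion can therefore be used to quantify the feature subsets selected in the RSM.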
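A minimal NumPy sketch of a scatter-matrix J3 computation follows. It assumes the textbook definition $J_3 = \mathrm{trace}(S_W^{-1} S_M)$ from Theodoridis and Koutroumbas (2008), estimates the priors from class frequencies, and uses a pseudo-inverse to tolerate a singular within-class scatter matrix; the function name `j3_criterion` and the last two choices are assumptions, not details taken from the paper.

```python
import numpy as np


def j3_criterion(X, y):
    """Compute J3 = trace(pinv(S_W) @ S_M) for samples X (n x p) and labels y."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    n, p = X.shape
    mu0 = X.mean(axis=0)                      # global mean vector
    S_w = np.zeros((p, p))
    S_b = np.zeros((p, p))
    for c in np.unique(y):
        Xc = X[y == c]
        prior = len(Xc) / n                   # assumption: prior = class frequency
        diff = (Xc.mean(axis=0) - mu0).reshape(-1, 1)
        # bias=True gives the maximum-likelihood covariance of the class
        S_w += prior * np.atleast_2d(np.cov(Xc, rowvar=False, bias=True))
        S_b += prior * (diff @ diff.T)
    S_m = S_w + S_b                           # mixture scatter matrix
    # assumption: pseudo-inverse guards against a singular S_W
    return float(np.trace(np.linalg.pinv(S_w) @ S_m))
```

A subset whose classes lie far apart relative to their spread yields a larger J3 than one whose classes overlap, which is the property the weighting relies on.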

### 2.3. Pattern search optimization

Psearch, proposed by Torczon (1997) and improved by Audet and Dennis (2003, 2006), is a direct search method that finds the minima of a function without requiring its gradient. Unlike traditional optimization methods, it can therefore be applied successfully to non-differentiable, stochastic, or discontinuous functions. An optimization problem can be stated as follows:

$$\min_{x \in \Omega} f(x) \tag{6}$$

where $x$ is the vector of design parameters, $f$ is the objective function, and $\Omega$ is the constraint set, described as:

with (7)

In brief, Psearch computes a sequence of points that moves nearer and nearer to the optimal point of the fitness function. At each step, the algorithm searches a set of points around the current point, called a "mesh", for a point where the value of the fitness function is lower than at the current point. If such a point is found, it becomes the current point at the next step. The mesh is constructed by adding scalar multiples of a fixed set of vectors, called "pattern vectors", to the current point. Detailed information, the algorithm, and a flow chart for Psearch can be found in Güneş and Tokan (2010) and Căleanu et al. (2011).

In our case, Psearch is used to find the optimal normalization range of the J3 criterion, in order to reduce the damaging effect of a base-learner trained on a poor feature subset. The objective is therefore to choose the J3 weighting that yields a lower RSM error rate than the standard RSM with a simple majority voting stage, as detailed in the next section.
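The polling idea can be illustrated with compass search, one of the simplest members of the pattern search family. The sketch below is an assumption-laden simplification: the pattern vectors are the positive and negative coordinate directions, the step is halved after a failed poll and kept after a successful one, and constraints are ignored. It is not the Psearch variant used in the paper.

```python
def pattern_search(f, x0, step=0.5, tol=1e-6, max_iter=1000):
    """Minimise f without gradients by polling +/- step along each axis."""
    x = list(x0)
    fx = f(x)
    for _ in range(max_iter):
        if step < tol:
            break
        improved = False
        for i in range(len(x)):
            for direction in (1.0, -1.0):
                trial = list(x)
                trial[i] += direction * step      # one mesh point
                f_trial = f(trial)
                if f_trial < fx:                  # accept the first improvement
                    x, fx, improved = trial, f_trial, True
                    break
            if improved:
                break
        if not improved:
            step *= 0.5                           # failed poll: refine the mesh
    return x, fx


# A non-differentiable objective is handled without any special treatment.
best_x, best_f = pattern_search(lambda v: abs(v[0] - 0.3) + abs(v[1] + 1.0), [0.0, 0.0])
```

Because only function values are compared, the objective may be an error rate estimated by cross-validation, which is piecewise constant in the weighting range.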

## 3. Proposed System

We suggest a new model combination technique to eliminate the drawback of the RSM caused by the possible selection of poor feature subsets. The proposed model combines the J3 criterion, Psearch optimization, and the standard RSM. In the RSM stage, the k-nearest neighbor (k-NN) classifier is used as the base-learner to construct the ensemble, and 10-fold cross-validation (10-fold CV) is applied to train and test the proposed system. The system can be built with either of two J3 weighting models: the J3 weighted RSM or the optimized J3 weighted RSM. The more complex of the two, the optimized J3 weighted RSM, is presented in Fig. 1.

## Fig. 1.

Briefly, the steps of the model written in italics are integrated into the RSM by obtaining subset information and inserting weighting coefficients. The optional 10-fold CV is not required to build a classifier with the proposed model; it is used to determine the 10-fold CV error, a well-known and successful way to estimate the error rate of a model and compare it with other methods (Polat and Güneş, 2009).

More formally, consider a binary classification task on data x with i instances and p-dimensional feature vectors, in which the proposed RSM model assigns unknown instances to the classes cj, where j ∈ {0, 1}. In the subspacing stage, n p*-dimensional feature subsets, one per learner, are randomly selected from the C(p, p*) possible combinations for each fold, until the number of feature subsets reaches FS = 10n. The J3 array is then computed to quantify each subset, as described in (2), (3), (4) and (5). The output of the k-NN classifier is represented as the class posterior probabilities P(c0|xi) and P(c1|xi), obtained by counting the samples of each class among the k nearest neighbors by Euclidean distance ($k_0$, the number of c0 samples, and $k_1$, the number of c1 samples):

$$P(c_0 \mid x_i) = \frac{k_0}{k} \tag{8}$$

$$P(c_1 \mid x_i) = \frac{k_1}{k} \tag{9}$$
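The k-NN posterior estimate can be sketched as the fraction of each class among the k nearest training samples. The function name `knn_posteriors` and the 0/1 label encoding are assumptions made for this illustration.

```python
import math


def knn_posteriors(train_X, train_y, x, k):
    """Estimate (P(c0|x), P(c1|x)) from class counts among the k nearest samples."""
    if not 0 < k <= len(train_X):
        raise ValueError("k must be between 1 and the number of training samples")
    # Sort training samples by Euclidean distance to the query point.
    by_distance = sorted(
        (math.dist(row, x), label) for row, label in zip(train_X, train_y)
    )
    nearest_labels = [label for _, label in by_distance[:k]]
    k1 = sum(nearest_labels)      # assumption: labels are encoded as 0 or 1
    k0 = k - k1
    return k0 / k, k1 / k


train_X = [[0.0], [1.0], [2.0], [10.0], [11.0]]
train_y = [0, 0, 0, 1, 1]
print(knn_posteriors(train_X, train_y, [9.0], k=3))
```

For the query at 9.0, two of the three nearest samples belong to class 1, so the estimate favors class 1 by two to one.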

Before weighting, the normalization range of the J3 array should be optimized to find the optimal voting weights. For this purpose, the Psearch algorithm is used to find the minimum and maximum points of the min-max normalization. In other words, Psearch optimizes the model combination phase by searching for the weighting range that minimizes the error rate. The weights are then multiplied by each classifier's posteriors, so that classifiers trained poorly on subsets with low class separability receive a smaller vote. Finally, the class decision (yi) of the RSM for the ith sample (xi) is made by the averaging rule described as:

(10)

(11)

(12)
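The normalization and weighted averaging steps can be sketched as follows. The linear min-max formula, the normalised weighted mean, the function names, and the example J3 values and posteriors are all assumptions chosen for illustration; they show the mechanism rather than the authors' exact expressions.

```python
def minmax_scale(values, lo, hi):
    """Linearly rescale values so the smallest maps to lo and the largest to hi."""
    v_min, v_max = min(values), max(values)
    if v_max == v_min:
        # All subsets equally separable: give every learner the same weight.
        return [hi] * len(values)
    scale = (hi - lo) / (v_max - v_min)
    return [lo + (v - v_min) * scale for v in values]


def weighted_average_decision(posteriors, weights):
    """Pick the class with the larger weight-averaged posterior.

    posteriors holds one (P(c0|x), P(c1|x)) pair per base-learner.
    """
    total = sum(weights)
    score0 = sum(w * p0 for w, (p0, _) in zip(weights, posteriors)) / total
    score1 = sum(w * p1 for w, (_, p1) in zip(weights, posteriors)) / total
    return 0 if score0 >= score1 else 1


j3_values = [1.2, 5.0, 9.0]                        # hypothetical subset scores
posteriors = [(0.8, 0.2), (0.4, 0.6), (0.4, 0.6)]  # hypothetical k-NN outputs
weights = minmax_scale(j3_values, lo=0.1, hi=1.0)
print(weighted_average_decision(posteriors, [1.0, 1.0, 1.0]))  # plain averaging
print(weighted_average_decision(posteriors, weights))          # J3 weighted averaging
```

In this example plain averaging selects class 0 because of the confident learner trained on the least separable subset, while the J3 weighted average selects class 1. The pair `(lo, hi)` is the normalization range that a search procedure would tune.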

Psearch, with initial points (0.1-1), searches for the optimal J3 normalization range, that is, the range giving the minimum 10-fold error rate described by

(13)

where f = 1, 2, …, 10 indexes the folds for each classifier, k = 1, 2, …, n indexes the classifiers, and TP, TN, FP, and FN are the true positive, true negative, false positive, and false negative classification results, respectively (Mert et al., 2011).

Psearch is expected to have a balancing effect on the averaging stage that is independent of the data and of the J3 array range. However, the suggested system can be simpler, without the Psearch step and the data-dependent normalization step, when the classifier output posteriors are equal or nearly equal. To illustrate, assume that P(c0|xi) = P(c1|xi) or P(c0|xi) ≈ P(c1|xi): what is the final decision? In this case, the J3 criterion can be multiplied directly by the posteriors, so that an instance with equal or nearly equal class assignment posteriors is assigned according to the decisions of the classifiers trained on subsets with higher J3 values. In conclusion, an RSM weighted by the class separability of its subsets, with or without Psearch, is proposed to decrease the error rate by eliminating the drawback of the RSM. Its effect is investigated on well-known data in the next section.
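A cross-validated error rate built from these four counts can be sketched as below. The sketch assumes the usual misclassification rate, (FP + FN) divided by the total number of test samples, averaged over folds; whether the paper's expression averages in exactly this way is not known, and the function names are hypothetical.

```python
def error_rate(tp, tn, fp, fn):
    """Misclassification rate from the four confusion-matrix counts."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("at least one test sample is required")
    return (fp + fn) / total


def cv_error(fold_counts):
    """Mean error over folds; each entry is a (tp, tn, fp, fn) tuple."""
    return sum(error_rate(*counts) for counts in fold_counts) / len(fold_counts)


# Two hypothetical folds of 100 test samples each.
print(cv_error([(40, 50, 5, 5), (45, 45, 10, 0)]))
```

A function of this kind, evaluated for a candidate normalization range, is what a derivative-free optimizer would minimise.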

## 4. Experimental Results

We perform experiments to compare the classification performance of the proposed J3 weighted and Psearch optimized J3 weighted averaging RSMs with that of the standard RSM and k-NN. Three data sets from the UCI Machine Learning Repository (Frank and Asuncion, 2010) are used to measure the 10-fold CV error of the classification models: Wisconsin Diagnostic Breast Cancer (WDBC) (Street et al., 1993), Parkinsons (Little et al., 2007), and Heart-Statlog.

First, the three data sets are classified by k-NN to find the best classifier parameter in the model generation phase, as described in Section 2.1. The k value giving the lowest 10-fold CV error is determined for each data set, and the results are plotted in Fig. 2.

## Fig. 2.

The graph indicates that the lowest error rates for the WDBC, Parkinsons, and Heart data sets are 0.06503, 0.1487, and 0.3298, obtained with the 5-NN, 3-NN, and 5-NN classifiers, respectively. These k values are therefore used as the parameters of the k-NN base-learners, to generate more accurate individual models in the RSM.

Second, the effect of the J3 weighted RSM on the error rate, as a function of the selected subset dimensionality, is compared with that of the standard RSM using the k-NN models described above for each data set. The results of the J3 weighted RSM without Psearch optimized normalization are compared with those of the standard RSM, for 100 base-learners, in Fig. 3.

## Fig. 3.

Both the standard RSM and the J3 weighted RSM decrease the error rate on all three data sets compared with the k-NN classifier. However, the J3 weighted RSM has advantages over the standard RSM. For the WDBC data it attains the lowest error, although the subset dimensionality at the lowest error rate rises from two to three; in general, the J3 weighted RSM gives a lower error on the WDBC data. For the Parkinsons data, the lowest error rates of the two methods are equal, but the J3 weighted RSM reaches its lowest error rate at a lower subset dimensionality, which reduces computational cost. In contrast to these two data sets, on the Heart data the J3 weighted RSM differs only slightly from the standard RSM, and the two can be considered equal. The proposed J3 weighted RSM can therefore be an option for reducing the error of standard RSM classification tasks whose output posteriors are equal or nearly equal. However, it is a data-dependent solution. For this reason, Psearch optimization of the J3 normalization range is used to find the optimal voting weights according to the class separability of each subset. The Psearch optimized J3 weighted RSM is compared with the J3 weighted RSM in Fig. 4.

## Fig. 4.

Psearch optimization of the J3 normalization range yields an acceptable decrease in error rate for the Heart and Parkinsons data sets. It gives the lowest error rate when subset dimensionalities of six and four are selected, and a lower error rate than the J3 weighted RSM at the remaining subset dimensionalities. For the WDBC data, the lowest error rates of the J3 weighted and optimized J3 weighted RSMs are equal, at a three-dimensional feature subset, while the optimized model gives a lower error rate at the remaining subset dimensionalities. The results of the proposed models are summarized and compared in Table 1.

## Table 1

In summary, the J3 class separability criterion can be used, either directly or after optimization, to eliminate the drawback of the standard RSM by assigning voting weights to each classifier's posteriors. Compared with the standard RSM, the proposed models can reduce the error rate both at the selected lower subset dimensionality and across all subset dimensionalities.

## 5. Conclusion

Our suggested model aims to eliminate the drawback of the random subspace method (RSM) caused by the selection of subsets with poor class separability. To this end, the J3 class separability criterion is computed to quantify the randomly selected feature subsets and is used as a voting weight in the model combination phase, to prevent such subsets from damaging the ensemble decision.

The range of the J3 criterion is the most influential factor and should be investigated to make the final class decision more accurate. The proposed system is therefore divided into two J3 weighting models: the J3 weighted RSM and the Pattern Search (Psearch) optimized J3 weighted RSM. Experiments with the J3 weighted RSM on well-known data sets, namely Wisconsin Diagnostic Breast Cancer (WDBC), Parkinsons, and Heart-Statlog (Heart), have shown that directly multiplying J3 by each class assignment posterior of the base-learners is a data-dependent solution: it can give a lower error than the standard RSM, or reach the lowest error rate at a lower subset dimensionality. This lower error rate is thought to arise because weighting assigns samples with equal or nearly equal posteriors according to the decisions of the classifiers trained on subsets with high class separability. From this point of view, Psearch is used to find the optimal J3 normalization range, which prevents unnecessary voting rights from being given. The optimized J3 weighted RSM is therefore applied to the data sets to show the effect of the J3 range on the error rate, in comparison with the standard RSM and the proposed J3 weighted RSM. The resulting 10-fold cross-validation (10-fold CV) error shows that optimizing the J3 weighting in the model combination phase decreases both the error rate and the required subset dimensionality relative to the proposed J3 weighted RSM and the standard RSM.

Finally, the proposed models, the J3 weighted and optimized J3 weighted RSMs, can be successfully integrated into the standard RSM to eliminate the poor subset drawback that damages the final ensemble decision. The optimized model can be more successful but is time consuming. At a minimum, the J3 weighted RSM can be used to decrease the error rate at lower subset dimensionality, which also offers a way to reduce computational cost.

## Disclosure Statement

The authors declare that there is no conflict of interest.