This essay has been submitted by a student. This is not an example of the work written by our professional essay writers.
Random Forests is an ensemble classification algorithm with a wide range of advantages. It is robust to noise, does not overfit, offers explanation and visualization of its output, detects variable interactions, estimates the importance of input variables and provides an internal mechanism for the estimation of the generalization error as the forest building progresses. These characteristics differentiate the Random Forests algorithm from other ensemble methods and rank it among the top highly effective classifiers. The aim of this work is twofold. First, it aims to propose modifications of the Random Forests algorithm which improve its prediction performance and second to provide a comparison of the proposed modifications with those already existing in the literature. The modifications proposed in this work intend to increase the strength and decrease the correlation of individual trees of the forest and to improve the function which determines how the output of the base classifiers will be combined. We conduct experiments on a selection of classification datasets, analyzing the resulting accuracy. The experiments demonstrate that the proposed modifications affect positively the performance of the Random Forests algorithm and they provide comparable, and in most of the cases, even higher results than the existing modifications.
Keywords: Random Forests, ensemble methods, weighted voting.
An active area of research in machine learning is the combination of classifiers, the commonly called "ensemble methods". Ensemble methods are learning algorithms that construct a set of classifiers and then classify unknown data by taking a (weighted) vote of their predictions (Dietterich 2000). Several authors have demonstrated the advantage of the ensemble methods over the individual classifier models and have noted that they can significantly improve the performance of learning (Kuncheva 2004, Oza et al. 2005). A necessary and sufficient condition for an ensemble classifier to be more accurate than any of its individual base classifiers is the individual classifiers in the ensemble to be accurate and diverse (make their errors on different parts of the input space).
Many methods have been developed concerning the construction of an ensemble (Dietterich 2000). These methods are grouped in to those that manipulate a) training examples to generate multiple hypotheses, b) the set of input features available to the learning algorithm, c) the output targets that are given to the learning algorithm, d) those that inject randomness into the learning algorithm and e) those that use Bayesian voting. The most popular variants of ensemble methods are Bagging (Breiman 1996), Boosting (Freud et al. 1997), Random Subspace methods (Ho 1998) and, the last decade, Random Forests (Breiman 2001).
Bagging and Boosting rely on "resampling" techniques to obtain different training sets for each one of the base classifies. More specifically, in Bagging each base classifier is trained with a set generated by randomly drawing with replacement of N examples, where N is the size of the original training set. The combination of base classifiers is made by majority voting. Boosting is focused to produce a series of classifiers and the training set used for each classifier is chosen based on the performance of the earlier classifier(s) in the series. The classifiers are added one at a time and are trained on the data on which have been "hard" for the previous ensemble members. In the Random Subspace methods, one also modifies the training data. However, this modification is performed in the feature space. More specifically, each base classifier is built on a different subset of features randomly chosen from the original feature set. The outputs of the models are then combined, usually by a simple majority vote.
Random Forests constructs a set of tree-based learners. Each base learner is grown on a boostrap sample of the dataset. As the tree is constructed, a random sample, of fixed predefined size, of predictors is drawn before each node is split. The split of the node is based on the best of the randomly chosen predictors. The procedure is repeated for each node of the tree which is fully grown and not pruned. Then, each tree of the forest casts a vote for the instance being classified and the predicted class is determined by a majority voting procedure (Breiman 2001).
Random Forests present a variety of advantages over other ensemble methods. It provides estimated about the importance of the input variables and it detects the interactions between them. It incorporates methods for handling missing values and it can balance the error in class population of unbalanced datasets. In Random Forests, there is no need for a separate test set to get an unbiased estimate of the generalization error. It is estimated internally through the use of out-of-bag instances (one third of the training instances not taking part in the construction of the tree). Finally, Random Forests is less sensitive to noise data compare to other ensemble methods and it does not overfit.
Although, the mechanisms that explain the good performance of the Random Forests have been detected, there are not fully exploited in order the potential of this method to be fully utilized. A possible extension of Random Forests concerns the increase of strength, the decrease of correlation or the improvement of combination of tree-based classifiers consisting of the forest.
For this purpose different studies have been reported in the literature each one addressing, separately, one of the above issues. More specifically, Robnik-Sikonja (2004), Bernard et al. (2008a), Rodriguez et al. (2006), Lemmond et al. (2009) concentrated on the construction of the base classifiers in order to be more strength and less correlated. On the other hand, Robnik-Sikonja (2004), Tsymbal et al. (2006), Gunter et al. (2004), Bernard et al. (2008b) and Gatnar (2005) focused on finding the best combination function of base classifiers, since each one of the base classifiers has different impact on processing different instances. The contribution of this work is twofold. First, it proposes modifications of the Random Forests algorithm which aim to improve either the construction of the forest or the voting mechanism as those reported previously and second, and more important, it proposes modifications that address both factors which affect the performance of the algorithm (construction and voting).
The paper is organized into five sections. Section 2 describes the Random Forests classification algorithm. In Section 3 modifications of the algorithm proposed in this work and those reported in the literature are presented. Section 4 presents the results produced by the application of classical Random Forests and its modifications to 7 classification datasets. Section 5 compares the extensions of the Random Forests presented in this work and Section 6 concludes the work.
2. The Random Forests Algorithm
A Random Forests (RF) is a classifier consisting of a collection of tree-structured classifiers. Each classifier votes for one of the classes and an instance being classified is labeled with the winning class (Breiman 2001). The collection of trees is called an ensemble. For the construction of each tree of the forest a new subset of samples is selected from the data set (boostrap sample). The tree is built to the maximum size without pruning. The tree is just grown until each terminal node contains only members of a single class. Only a subset of the total set of features is employed as the candidate splitters of the node of the tree. The number of the selected features () remains constant throughout the construction of the forest. If a popularity of the trees agrees on a given classification, then that is the predicted class of the sample being classified. It must be noted that there is a built-in estimate of the uncertainty in each new classification. More specifically, the distribution of the votes indicated weather nearly all trees agree or weather there are many dissenters.
The Random Forests algorithm has also an internal mechanism for the estimation of the misclassification error, called "out-of-bag" error estimate. The idea is to estimate the expected predictive accuracy by running each observation through all the trees that were formed without using this observation. It has been shown that for each data point, approximately one-third of the boostrap samples will not contain it, so approximately two-thirds of the generated trees can be used for its classification. If be the class that gets most of the votes for the data point, the proportion of times that differs from the true class of the data point is the out-of-bag estimate. The out-of-bag proportion of votes for class at input is given by
where is the indicator function, is the tree classifier and is the set of out-of-bag instances for classifier.
According to Breiman (2001) theorem the generalization error of the Random Forests algorithm (which is the average of the out-of-bag errors of the trees) can be bounded by:
where is the "strength" of the set of classifiers and is a measure of how correlated two randomly chosen trees are on average, standardized by their variability (average correlation of the forest).
The strength is the expected value of the margin and is given by:
where is the space of instances and the target class of the instances.
The average correlation is computed as the variance of the margin over the square of standard deviation of the forest. Thus, is given by:
The is the margin function and it measures the extent to which the average number of votes for the right class exceeds the average vote for any other class. It is given by:
In order the performance of the Ranodm Forests algorithm to be improved some modifications were proposed. The modifications aim to decrease the correlation among the trees of the forest, to increase the strength of the tree classifiers and affect the voting mechanism. They can be grouped into three categories: a) those that modify the construction of the forest, b) those that modify the voting procedure, and c) those that address both construction of the forest and voting mechanism. The modifications are reported on Table I and are going to be described in detail on the next section. As it is indicated in Table I, some of the modifications are reported in the literature and some are proposed in this work. An extended grouping of modifications is provided on Fig. 1.
Group A: Construction of the forest
Proposed in this work
Group B: Voting mechanism
Proposed in this work
Group C: Construction and Voting
Proposed in this work
RF with ReliefF
RF with wv
RF with me and wv 1
RF with me
SFS - RF
RF with me and wv 3
SBS - RF
RK- RF with wv-1
RK - RF with me
Modified SFS - RF
RK- RF with me and wv-1
Modified SBS - RF
Optimal RF and RK-RF
RF: Random Forests
me: multiple estimators
wv: weighted voting
Optimal RF with me and RK- RF
Clustering - RF
Optimal RF with me
Table I: Modifications of the Random Forests algorithm.
Figure 1: Categories of the Random Forests modifications.
3. Modifications of the Random Forests algorithm
3.1. Modifications of the construction of the forest
According to Eq. (2) the performance of the Random Forests algorithm is affected by two factors: a) the strength of each tree, and b) the inter-tree correlation. The methods described in this section aim to increase the first factor and decrease the second one. This is achieved by addressing several issues concerning the construction of the trees of the forest. They are mainly concentrated on modifying the procedure of selecting the features in each splitting node and the evaluation measure that determines the node impurity.
In classical Random Forests algorithm the Gini index is used as the evaluation measure in node splitting. The Gini index is given by:
where is the probability for the class is the probability the feature having the value , is the probability for the class given that the feature has the value and is the number of the different values of the feature .
The Gini index measures the impurity of the class value distribution before and after the split on evaluated attribute. Thus, it cannot detect strong conditional dependencies among attributes. Based on this disadvantage of the Gini index Robnik Sikonja (2004) proposed the replacement of the Gini index by ReliefF (RF with ReliefF). ReliefF evaluates partitioning power of attributes according to how well their values distinguish between similar instances. ReliefF randomly selects an instance and then searches for of its nearest neighbors from the same class, called nearest hits, and also nearest neighbors from each one of the different classes, called nearest misses. It updates the quality estimation of the attribute depending on their values for, hits and misses. The update formula is given by:
where is the prior probability of the class, is the number of randomly chosen instances, and is given by:
The second modification concerns the replacement of the Gini index as the sole attribute evaluation measure with several others, decreasing in this way the correlation but retaining the strength of tree classifiers (RF with me) (Robnik Sikonja 2004). The evaluation measures commonly used in classification problems are Gini index, Gain Ratio , ReliefF, Minimum Description Length and myopic ReliefF . The formulas for the above evaluation measures are given by:
where is the size of the training set, is the number of training instance with -th value of given attribute, is the number of training instances from class, is the number of instances from class with -th value of the attribute.
An appealing characteristic, which makes Random Forests algorithm particularly effective, is the random features selection. Recall that, at each step in growing a tree, classic recursive partitioning examines all variables to determine the best split. By contrast, Random Forests, usually, picks of the variables at random, taking the best split among them. This extra level of selection makes the different trees in the forest less similar by allowing highly correlated variables to play nearly equivalent roles (otherwise, the slightly more predictive variable would always be chosen). Thus, random feature selection de-correlates the trees, lowering the prediction error. Consequently, an important parameter of Random Forests algorithm is the number of features randomly selected at each node during the tree induction process (reducing reduces both the correlation and the strength, increasing increases both). Although the determination of the specific parameter is crucial for the accuracy of the Random Forests algorithm, the choice of the value of is usually made either by a greedy search that tests every possible value to choose the optimal one, either by choosing a priori one of the arbitrary values commonly used in the literature. These values are
A modification proposed by Bernard et al. (2008a) (RK - RF) is focused on the setting of hyperparameter . More specifically, selection is still used, but it is randomly chosen for each splitting node. Instead of fixing the value of, so that is identical for all the decision trees, a new value of is randomly chosen at each node of the trees, and used for this current node splitting only.
Rodriguez et al. (2006) proposed a modification which is based on the utilization of a linear combination of features in each splitting node (Rotation Forest). It consists in splitting the feature set randomly into k subsets, running Principal Component Analysis (PCA) separately on each subset. Then, it reassembles a new extracted feature set, while keeps all the principal components. The data is transformed linearly into the new features. A tree classifier is trained with this new dataset. The reason for keeping all the principal components is because it is preferable the variability information of the data to be preserved in the new feature space. The low inter-tree correlation is achieved by the k rotations that take place to form the new features. A similar approach is the employment of Linear Discriminant Analysis (LDA) for taking linear combination of features (Lemmond et al. 2009). Linear Discriminant Analysis allows decision hyperplanes of any orientation in multidimensional feature space to separate the data, in contrast to the conventional Random Forests algorithm, whose boundaries are limited to hyperplanes orthogonal to the axis corresponding to the feature yielding the best split.
In this work, a combination of Random Forests with multiple estimators and random selection of the number of variables at each node of the tree is proposed (RK with me). The specific modification exploits the advantages of the two combined modification in order the correlation and the strength to be decreased and increased, respectively, to the maximum degree. The proposed combination ensures that each tree of the forest will be different from the rest trees, since it is built using a different evaluation measure. The decrease of the classification error is also achieved by strengthening the classifiers, since the trees are constructed using a different predictive variable at each node.
The modifications described above belong to the first group of modifications, which interfere in the construction of the forest trying to find the best number of selected features, the best node impurity measure or a combination of both. The second category of modifications stems for the voting mechanism.
3.2. Modifications of the voting mechanism
Since not all trees are equally successful in labeling all instances different weighted voting procedures were developed which replace classical majority voting approach and assign weights to the votes of the trees. For this purpose, different weighted voting schemes are employed. Some of them are reported in the literature and some of them are proposed in this work. The modifications of voting mechanism can be grouped into the following categories: a) those that assign weights to the votes of the trees, and b) those that select a subset of trees of the forest that are most predictive and less correlated.
Robnik Sikonja (2004) uses internal estimates to identify the instances that are most similar to the one we wish to classify and then weights the votes of the trees with the strength they demonstrate on these near instances (RF with wv-1). More specifically, for each instance being classified some of its most similar instances are found. The similarity measure that is used is given by Eq. (12):
The classification of an instance, with one of the trees, puts the instance to one of the leaf nodes of the tree. All the training instances that belong to the same leaf node increase their similarity score by one. As the procedure is repeated for all the trees the frequency of co-occurrence forms a measure of similarity between instance and the training instances. The similarity is normalized by dividing with the number of trees. Then most similar instances are selected and classified with each tree where they are in the out-of-bag set. Through this procedure the margin of each tree is computed (Eq. 5). The trees with negative margin are discarded while the rest trees vote using as weights the average margins on the similar instances when they are in the out-of-bag set.
Tsymbal et al. (2006) modifies the combination of trees by taking into account their local predictive performance (RF with wv-5). One such technique is dynamic integration where local predictive performance is first estimated for each base model based on the performance of similar instance, and then it is used to calculate a corresponding weight for combining predictions with Dynamic Selection (DS), Dynamic Voting (DV) or Dynamic Voting with Selection (DVS). In Dynamic Selection the classifier with the lower predicted error is selected. In Dynamic Voting each base classifier receives a weight proportional to its local accuracy and then weighted voting is used, and in Dynamic Voting with Selection base classifiers with the highest local errors are discarded and locally weighted voting is applied with the remaining classifiers. The weight of each base classifier is given by:
where is the indicator function, is the size of neighborhood, is the set of out-of-bag instances for classifier , is a distance-based relevance coefficient and is the margin of model on th nearest neighbor of . The distance-based relevance coefficient reflects the similarity between two instances and is given by:
where is given by:
where expresses the range of values of attribute However, the intrinsic similarity measure of the Random Forests algorithm can be used. Thus, can be given by:
Margin is defined, for a classifier with crisp outputs, by:
The proximity between the instance being classified and the training instances is the core of the next weighted voting scheme proposed by (Cunningham 2008) (RF with wv-2). According to this scheme for each unknown instance we classify the distance between this instance and all training instances as:
where is the unknown instance, is a feature of the set of features , is the weight assigned to feature, is the value of feature in the unknown instance , is the value of feature in the training instance and is given as:
The nearest neighbors are selected, based on this distance metric, to determine the class of . More specifically, the class of is the one which maximizes:
where is equal to one if the class label are matched and zero otherwise.
An important characteristic of the above weighted voting schemes is the distance function which determines the similarity between instances. Difference distance functions were tested in order to conclude to the value difference metric () (RF with wv-3) (Wilson et al. 1997). The distance function is given as:
where two forms of were used:
is the conditional probability for the output class given that the attribute has value . is the number of classes. Using this distance measure two values are considered to be closer if they have similar classifications.
Another approach based on the performance of the base classifiers is the one proposed by Hu et al. (2006). It is based on the maximally diversified multiple decision tree algorithm (RF with wv-4). The input of the algorithm is a set of trained trees and a test instance . The output of the algorithm is the class which maximizes given as:
where is the accuracy of the trained tree The index ranges from 1 to number of trees and ranges from 1 to .
A totally different approach is followed on the sixth weighted voting scheme (RF with wv-6). The weights of the trees are determined through a well-known optimization procedure called genetic algorithms (Gunter et al. 2004). The fitness function is defined as the recognition rate of the forest when using weighted voting.
The second group of modifications that affect voting mechanism are based on the idea of feature selection and clustering techniques. The aim is to determine whether or not it is possible to select a subset of trees from a forest that is able to outperform this forest.
The tree selection approaches proposed by Bernard et al. (2008b) are based on classical feature selection techniques, Sequential Forward Selection (SFS) and Sequential Backward Selection (SBS). In SFS approach, at each iteration, a tree of the forest is added to the current subset. If the tree increases the performance of the ensemble then is retained otherwise is discarded. In the same manner, in SBS approach, at each iteration, a tree is removed from the current subset. If the removal of the tree increases the accuracy of the sub forest then it is definitely discarded otherwise is retained. The iterative procedure is terminated when a maximum number of iterations are reached. This number determines the number of base classifiers in the final subset. Another, commonly used, criterion is the convergence of the accuracy.
Recall that we are interested not only in finding strong classifiers but also less correlated. For this purpose a modified version of the above tree selection processes is proposed in this work. The difference is the criterion that should fulfill the tree that is added or discarded (Modifies SFS and Modified SBS respectively). It should increase the accuracy and decrease the correlation of the current sub forest.
Since the above methods are known to be suboptimal, because the sequential process makes each iteration depend on the previous one and not all possible combinations are explored, another modification is proposed in this work. According to this modification tree is added to the current subset. If the tree increases the accuracy and decrease the correlation then it is retained otherwise all the combinations with previous trees are checked, by adding trees that were discarded and removing trees that were added.
Finally, clustering techniques are employed in order to select trees from the forest. Application of clustering techniques prerequisites an answer to the following questions: a) what should be clustered (natures of data), b) what clustering algorithm should be used, c) which is the number of clusters, and d) how is the number of clusters determined. Gatnar (2005) applied k-means algorithm to the data produced using Yule's and Hamann's diversity measures. The data were grouped into 30 clusters. For each cluster, the tree base classifier with the highest accuracy has been selected to form the committee. Finally, majority voting was used for the final classification. The reason for selecting number of clusters equal to 30 is not justified.
In this work k-means clustering algorithm was applied too. However, eight diversity measures were used to produce eight different datasets on which the algorithm was tested. The diversity measures are the following (Gatnal 2005, Kuncheva et al. 2003):
Yule's Q statistic
It is used to assess the similarity of two classifiers' outputs. It is given by:
where is the number of instances, classified correctly ( or incorrectly () by classifier , and correctly ( or incorrectly () by classifier j.
It expresses the correlation between the outputs of two classifiers:
Double Fault measure
It is the percentage of test instances for which both classifiers make wrong predictions:
Fail non fail disagreement measure
It is defined as the percentage of test instances for which the classifiers make different predictions but for which one of them is correct:
It is a binary similarity coefficient that is given by the difference between the matches and mismatches as a proportion of the total number of entries:
is equal to 0 when the agreement of the two classifiers equals to that expected by chance, and is equal to 1 when the two classifiers agree on every example. It is defined as:
where and are given by Eq. 30a and 30b:
is the number of instances in the data set, recognized as class by the first classifier and as class by the second one, is the number of instances recognized as by the first classifier, and is the number of instances recognized as by the second classifier.
Interrater agreement measure
It measures the level of agreement between two classifiers with the correction for chance:
Plain disagreement measure
It is equal to the proportion of the instances on which the classifiers make different predictions:
where is the class assigned by classifier to instance , and , if , otherwise .
For the determination of the number of clusters nine validity indexes were tested. The validity indexes are the following (Albalate et al. 2009, Sharan et al. 2003, Kim et al. 2001):
where is the optimum number of clusters, is the total number of objects in the dataset, and is the intra-cluster dispersion. It is defined as the total sum of square distances of the objects to their cluster centroids.
where is the intra-cluster distance, defined as the average distance of all the cluster objects to the cluster centroid and is the distance between the clusters and . The optimum number of clusters corresponds to the minimum value of .
Krazanovski and Lai
The represents an object that belongs to -th cluster, the is the centroid of the -th cluster and is the number of features. The optimum corresponds to the maximum of Krazanovski and Lai index.
where is the average distance of the object to all objects of the same cluster, and is the average distance of the object to the objects of the closest cluster.
Calinski and Harabasz
where and are the sum-of-squares of within and between clusters respectively.
where is a clustering solution and is the similarity between fingerprint of element and element .
which measures the average similarity between clusters.
Weighted inter-to-intra- cluster ratio
where and are the similarity between and within clusters respectively.
Under-Over partition measure
where and are the normalized under and over partition measure respectively. The under partition measure is defined as the mean of mean intra cluster distance and the over partition measure is given as the ratio of the number of clusters to the minimum inter cluster distance.
The table below (Table II) denotes for which value (max/min) of the validity indexes the optimal number of cluster is obtained.
Krazanovski and Lai
Calinski and Harabasz
Weighted inter-to-intra- cluster ratio
Under-Over partition measure
Table II: Relation between optimal number of clusters and validity indexes.
3.3. Modification of the Construction and integration of the ensemble
In the above sections, different modifications of the Random Forests algorithm were described. The modifications reported on the first section aimed to increase the strength and diversity of the base classifiers while the modifications described in the second section aimed to optimize the voting mechanism.
Although, increasing strength and diversity of the ensemble or optimizing the voting mechanism, increases the performance of the Random Forests algorithm, is not enough. An ensemble integration method which decides how to combine strength and less correlated base classifiers is needed. The methods of the third category propose an integration method that fulfills the above characteristics of an ensemble. In order to achieve this, combination of methods belonging to the previous categories are employed. The combinations are reported in Table I (Group C).
4. The dataset
For the evaluation of the Random Forests algorithm 7 datasets were used. The 6 datasets are publicly available at the UCI Machine Learning Repository and the last one consists of data from patients with Alzheimer's disease. This dataset is a result of a previous work (Tripoliti E.E. et al. 2009). A short description of the datasets is provided on Table III.
No. of samples
No. of features
No. of classes
Breast Cancer Wisconsin (Diagnostic)
The features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass.
The radar data was collected by a system in Goose Bay, Labrador.
Connectionist Bench (Sonar, Mines vs. Rocks)
The dataset contains signals from a variety of different aspect angles, spanning 90 degrees for the cylinder and 180 degrees for the rock.
Pima Indians Diabetes
Data were collected from female patients at least 21 years old of Pima Indian heritage.
The features correspond to material components of glass such as aluminum, magnesium etc.
Statlog (Vehicle Silhouettes)
The features were extracted from the silhouettes by the HIPS (Hierarchical Image Processing Systems) extension BINATTS, which extracts a combination of scale independent features.
The features were extracted from fMR images of patients with Alzheimer's disease.
Table III: Short description of the datasets.