Decision Tree Learning Algorithm English Language Essay

Published: Last Edited:

This essay has been submitted by a student. This is not an example of the work written by our professional essay writers.

Decision tree induction is a simple yet powerful learning and classification model. Decision tree learning offers tools for discovery of relationships, patterns and knowledge from data in databases. The volume of data in databases is growing to quite large sizes, both in the number of attributes and instances. Decision tree learning using a very large set of records in a database is quite complex task and is usually a very slow process, which is often beyond the capabilities of existing computers. There are various issues related to decision trees and training data sets. To handle these issues various approaches have been proposed in the past by different researchers. This chapter summarizes these proposed approaches to handle various issues related to decision tree learning and issues related to handling problems with data. The emphasis is given on issues which help to optimize the process of decision tree learning. \par A decision tree is a classifier in the form of tree structure that contains decision nodes and leaves. It assigns a class value to an instance. In tree construction process, partitioning of attributes is according to splitting criteria that implements better use of available attributes and also implies computational efficiency in classification. The tree construction takes polynomial time concerning the number of attributes and inputs, as no backtracking is required \cite{7,8,9,10,11}. Some of the important approaches and issues are been introduced in this chapter. \\ The issues are:


\item Decision tree learning algorithm and various splitting criteria such as gain, gini index, twoing rule etc.

\item Pruning techniques- These includes the techniques for pruning overfitted decision tree technique. Some prominent methods includes error based pruning, cost complexity pruning and reduced error pruning.

\item Decision tree learners- Decision tree learning algorithms such as CHAID, ID3, C4.5 and CART.

\item Data preprocessing- Data preprocessing includes techniques for feature subset selection, data sampling, outlier detection and handling missing data. This topic is generalized and is not restricted to techniques for decision tree.

\item Soft computing approach for decision tree learning \& optimization these include use of Neural networks, Evolutionary techniques and Fuzzy logic.

\item Handling large data set approaches such as parallel, distributed, scalable and meta decision tree.

\item Surveys on decision tree learning.

\item Decision tree learning softwares available and some of the commonly used benchmark datasets.

\item Other issues like incremental induction of decision tree and oblique decision trees.

\item Applications of decision trees.


\par Decision tree algorithms construct trees by recursively partitioning a training set. A training set consists of set of attributes and a class label. An attribute can have real, Boolean or ordinal values. A decision node states a test to be carried on a particular attribute value of an instance. A branch is present for each probable output of the test. Thus, a tree is traversed from the root to a leaf of the decision tree to identify the class of the instance. The specified class at the leaf is the classification by the decision tree. The classification accuracy, defined as the percentage of correctly classified instances in the test data, specifies the performance of decision tree. The generalized decision tree algorithm is explained here.\\

\textbf{The Tree Construction Algorithm} \\

The tree construction algorithms use a divide and conquer approach to construct a decision tree. It evolves a decision tree for a given training set \emph{T$_t$} consisting of set of training instances. An instance denotes values for a set of attributes and a class. Let the classes be denoted by \{\emph{C$_1$, C$_2$, $\cdot$$\cdot$$\cdot$, C$_n$\}}.

\par Initially, the class frequency is computed for instances in training set \emph{T$_t$}. If all instances belong to same class, node \emph{K} with that class is constructed. However, if set \emph{T$_t$} contains instances belonging to more than one class, the test for selecting attribute for splitting is executed and the attribute satisfying splitting criteria is chosen for the test at the node. The training set \emph{T$_t$} is then partitioned into k exclusive subsets \{\emph{T$_t$$_1$, T$_t$$_2$, $\cdot$$\cdot$$\cdot$, T$_t$$_k$\}} on the basis of this test and the algorithm is recursively applied on each nonempty partition. The algorithm for construction of a decision tree is given below. \\

\par Construct (\emph{T$_t$})


\item Calculate freq(\emph{C$_i$,T$_t$)}.

\item If (all instances belong to same class), \\ return leaf.

\item For every attribute \emph{A} test for splitting criteria.\\

Attribute satisfying test is test node \emph{K}.

\item Recur construct (\emph{T$_t$$_i$}) on each partition \emph{T$_t$$_i$}.\\

Add those nodes as children of node \emph{K}.

\item Stop.


\section{Selecting the Best Attribute for Classification}

The selection of test attribute at each node in the decision tree is an important process. Various approaches have been proposed to select the best possible attribute. The approaches are categorized by Ben-Bassat \cite{12} as use of information theory, distance measure and dependence measure. Some of the approaches are discussed below.

\subsection{Using Information Theory}

Last and Maimon \cite{13} expressed that objective of process of data classification is to maximize the information gain as it leads to increase in classification accuracy. Quinlan \cite{8} used information gain and gain ratio in decision tree algorithm. The gain is defined as the information obtained from a message based on its probability \emph{P}. For any set of instances \emph{T$_t$}, the probability that an instance belongs to class \emph{C$_i$} is given as


P=\frac {freq(C_i,T_t)}{|T_t|} %


\par Where \emph{$|$T$_t$$|$} is number of instances in set \emph{T$_t$} and freq(C$_i$,T$_t$) denotes the number of instances in \emph{T$_t$} that belong to class \emph{C$_i$}. Now, the average information contained in set \emph{T$_t$} regarding class association of instances, called entropy of set \emph{T$_t$}, and is calculated in bits as


info(T_t)= \sum^{k}_{i=1}{P_i} \times log_2(P_i)%


Where \emph{k} is the number of classes in set \emph{T$_t$}. The test \emph{X} performed at a node on the preferred attribute provides subsets \{\emph{T$_t$$_1$, T$_t$$_2$, $\cdot$$\cdot$$\cdot$, T$_t$$_k$\}}. The information by this partitioning process is calculated as the sum over these subsets, is given as


info_x(T_t)= \sum^{k}_{i=1}\frac{|T_t_i|} {|T_t|}\times info(T_t_i) %


The reduction in entropy due to partitioning of \emph{T$_t$} with test \emph{X} on the preferred attribute, denoted as \emph{gain}(\emph{X}), is calculated as


gain(X)= info(T_t)-info_x(T_t) %


The attribute which provides maximum information gain is selected. The problem with above approach for selection of test attribute at a node is that it is biased towards attributes with many values as compared to attributes with less values and it leads to large decision trees that poorly generalize the problem. This problem can be eliminated using normalization of gain criterion and use of gain ratio. The gain ratio calculates ratio of information generated by partitioning \emph{T$_t$} and is expressed as


gain\; ratio(X)= gain(X)/split info(X)%


The \emph{split info(X)} calculates information gained by splitting training set \emph{T$_t$} into \emph{k} subsets on test \emph{X}. The equation is


split\; info(X) = -\sum^{k}_{i=1}\frac{|T_i|} {|T|}\times log_2\left(\frac{|T_i|} {|T|}\right)%


The attribute on which test obtains maximum gain ratio is selected. This approach has problem that it tends to favour attributes for which split info(\emph{X}) is very small. Another problem is that gain ratio can be calculated only when the split info(\emph{X}) is nonzero. To overcome this problem, Quinlan suggested computing information gain over all attributes and considering attributes with information gain which is at least as large as average of information gain over all attributes. The use of gain ratio provides better accuracy and complexity of classifier.

Jun et al. \cite{14} suggested a modification in above entropy calculation where the base of the logarithm is the number of successors to the node and have shown that this approach can handle huge amount of data efficiently.

\subsection{Using Distance Measure}

A classification and regression tree uses \emph{gini} index as impurity measure for selecting attribute that is based on distance measure. These attribute evaluation criteria computes separability, deviation or discrimination between classes \cite{15}.

\emph{Gini} index for a data set \emph{T$_t$} is defined as,


gini(T_t) = 1-\sum^{k}_{i=1}{P^2_i}%


Where \emph{p$_i$} indicates the relative frequency of class \emph{i} in the data set \emph{T$_t$}.

The attribute with the largest reduction in impurity is used for splitting the node's records. After splitting \emph{T$_t$} into two subsets \emph{T$_t$$_1$} and \emph{T$_t$$_2$} with sizes \emph{N$_1$} and \emph{N$_2$} respectively then the \emph{gini} index of the split data is defined as


gini(T_t)_ s_p_l_i_t = \frac{N_1}{N}gini(T_1)\times \frac{N_2}{N}gini(T_2)%


Breiman et al. \cite{15} identified that the \emph{gini} index has a problem in handling large number of classes. In such a case, binary criterion called \emph{twoing} index is used which is based on dividing the multiple classes into two super classes and then calculating the best split on the attribute based on these super classes. Murthy et al. \cite{16} explained this \emph{twoing} rule as follows. In the beginning the set is the complete training set \emph{T}, is divided into two non-overlapping subsets, \emph{T$_L$} and \emph{T$_R$} by hyperplane \emph{H}. The impurity measure at the start checks if \emph{T$_L$} and \emph{T$_R$} are homogeneous and belongs to the same category and in that case return minimum zero impurity. The value to be computed is defined as


Twoing\; value=\frac{|T_L||T_R|}{n^2} \left(\sum^{k}_{i=1} \left| \frac{L_i}{|T_L|} - \frac{R_i}{|T_R|}\right|\right)}^2 %


Where \emph{$|$T$_L$$|$} and \emph{$|$T$_R$$|$} represents the number of instances on the left and right of a split at node \emph{T$_t$} and the number of instances at node \emph{T$_t$} are represented by \emph{n}. The number of instances in category \emph{i} on the left and right of the split are represented by \emph{L$_i$} and \emph{R$_i$} respectively.

Mantras \cite{17} introduced distance based attribute selection measure. The attribute selected with this criterion is in partition, a partition which is closest to the correct partition of the subset of training set related to the node. Chandra et al. \cite{18} proposed a novel node splitting criteria called as distinct class based splitting measure (DCSM) for decision tree induction. It is motivated by concept of \emph{gini} index. The measure gives importance to the number of distinct classes in a partition. The DCSM criterion is combination of the product of two terms. The first term handles the number of diverse classes in every child partition. With increase in number of different classes in a partition the first term increases. Consequently these purer partitions are favored. The second term decreases when there are more instances of a particular class as against the total number of instances in that partition. As a result amalgamation favors purer partitions.

Rounds \cite{19} presented attribute selection criteria based on the Kolmogorov-Smirnov distance that keeps an optimal classification decision at each node. Utgoff and Clouse \cite{20} used the same Kolmogorov-Smirnov distance measure with improvements to take care of attribute with multiclass and missing attribute values.

Martin \cite{21} has done empirical comparative analysis of splitting methods like distance, orthogonality, a Beta function and two chi-squared tests. Buntine and Niblett \cite{22} recommended alternative investigational methods and presented additional results for splitting rules for decision tree induction.

\section{Decision Tree Pruning}

Decision trees are often large and complex and as a result they may become inaccurate and incomprehensible. The causes of overfitting in decision trees are the noisy data, unavailability of training samples for one or more classes and insufficient training instances to represent the target function. Decision tree pruning removes one or more subtrees from a decision tree. It makes overfitting trees more accurate in classifying unseen data.

\par Various methods have been proposed for decision tree pruning. These methods selectively replace a subtree with a leaf, if it does not reduce classification accuracy over pruning data set. Pruning may increase the number of classification errors on the training set but it improves classification accuracy on unseen data.

\par Pruning techniques can be divided into two groups. The techniques in first group approximately compute the probability of misclassification of a subtree and then make pruning decision using an independent test set called pruning data set. In second group, iterative grow and prune method is used during the construction of a tree. Some important pruning methods are reviewed here briefly.

\subsection{Cost Complexity Pruning}

Cost complexity pruning \cite{15} is used in the CART system. Starting with initial unpruned tree \emph{T$_0$} that is constructed from complete training set, this algorithm constructs a chain of progressively smaller pruned trees \emph{T$_0$, T$_1$, T$_2$, ..., T$_k$} by replacing one or more subtress in the predecessor tree \emph{T$_i$} with best possible leaves. The method prunes those subtrees that give the lowest increase in error for the training data. The increase in error α is defined to be the average increase in error per leaf of the subtree as


\alpha=\frac{M}{N(L-1)} %


Where \emph{N} represents the number of training instances, \emph{M} represents the number of instances misclassified as a result of pruning and \emph{L} represents number of leaves in the pruned subtree. The cost complexity of a tree is defined as ratio of number of correctly classified instances to misclassified instances in training data plus number of leaves in that tree multiplied by some parameter $\alpha$. To prune a tree, we calculate values of $\alpha$ for each non-leaf subtree of tree \emph{T$_i$} and prune all the sub trees having low value for $\alpha$ such that the resulting tree \emph{T$_i$} has the same cost-complexity as \emph{T$_i$}. If the size of training set is large, separate training and pruning sets are used, otherwise the use of cross validation method is suggested \cite{23,24,25}.

\subsection{ Reduced-Error Pruning}

The reduced error pruning proposed by Quinlan \cite{25} is a bottom-up approach in which the non-leaf subtrees are replaced with best possible leaf nodes provided that these replacements reduce the classification error on the pruning data set.

The process continues towards the root node until the pruning decreases error. The process assures smallest and most accurate decision trees with respect to the test data \cite{26}.

\subsection{Critical Value Pruning}

Mingers \cite{26} proposed critical value pruning, which uses the information gathered during tree construction. It sets a threshold called a critical value to select a node for pruning. Various measures such as gain, info gain etc. can be used to select the best attribute at the test node. If the value of selection criterion is smaller than this threshold value the subtree is replaced with a best possible leaf. However, if the subtree contains at least one node having value greater than the threshold, the subtree cannot be pruned \cite{23,24}.

\subsection{Pessimistic Error Pruning}

Pessimistic error pruning, a top down approach proposed by Quinlan \cite{25} uses error rate estimates to make decisions concerning pruning the subtrees similar to cost complexity pruning. It calculates classification errors on training data and does not require separate pruning set. Since the classification errors estimated from training set cannot provide best pruning results for unseen data, this pruning technique assumes that each leaf classifies a certain fraction of instances with error. To reflect these errors, it adds a continuity correction for binomial distribution to the derived training error of a subtree \emph{t}. This continuity correction is calculated as


r'(t) =\frac{[e(t)+0.5]}{N}%

\end{equation} Where \emph{N} represents the total number of instances covered by that node and e(t) represents the derived error rate of a subtree on training data. However, as the corrected misclassification estimation by a sub tree is expected to be optimistic, the algorithm calculates standard error at node from previous estimation of classification error by equation (2.11). Quinlan recommends pruning a sub tree if its corrected estimate of error is lower than that for the node by at least one standard error \cite{24,26}.

\subsection{Error-Based Pruning}

Error-based pruning is the default pruning method for the well-known C4.5 decision tree algorithm \cite{8}. Instead of using a pruning set it uses error estimates. The method assumes that the errors are binomially distributed and calculates error estimates from the training data. The number of instances covered by the leaf of the tree are used to estimate errors. The equation for confidence \cite{23} \emph{CF} (Certainty Factor) provided by Quinlan is as follows.


CF=\sum^{E}_{x=0} {N \choose x} p^x(1-p)^{N-x}%


Where \emph{N} and \emph{E} represent number of instances covered and misclassified by a leaf node. The upper limit of confidence level \emph{U$_C$$_F$} is the solution for \emph{p}. The number of predicted errors for that leaf are calculated by multiplying number of cases covered by that leaf with upper limit of confidence level. The number of predicted errors of subtree is calculated as sum of predicted errors by its branches.

\par A bottom-up approach \cite{24} is used for error-based pruning. If the number of predicted errors for the leaf is not greater than the sum of the predicted errors for the leaf nodes of that subtree then subtree is replaced with that leaf.

\subsection{Minimum Description Length Pruning }

Mehata et al. \cite{27} and Quinlan and Rivest \cite{28} utilized MDL principle for decision tree pruning. The principle of minimum description length used here states that a classifier that compresses the data is a preferable inducer. The MDL pruning method selects decision tree with less number of bits required to represent it. The size of the decision tree is measured as number of bits required for encoding the decision tree. The method searches for decision tree that maximally compresses the data \cite{24}.

\subsection{Optimal Pruning}

Optimal pruning algorithm constructs smaller pruned trees with maximum classification accuracy on training data. Breiman et al. \cite{15} first suggested a dynamic programming solution for optimal pruning algorithm. Bohanec and Bratko \cite{29} introduced an optimal pruning algorithm called OPT which gives better solution. Almuallim \cite{30} proposed an enhancement to OPT called OPT-2. It is also based on dynamic programming and has flexibility in various aspects and is easy to implement.

\subsection{Improvements to Pruning Algorithms}

Matti Kääriäinen \cite{31} analyzed reduced error pruning and proposed a new method for obtaining generalization of error bounds for pruning the decision trees. Error-based pruning has been blamed for the general effect of under-pruning. Hall et al. \cite{32} proved that if the certainty factor value \emph{CF} is appropriately set for the data set, error-based pruning constructs trees that are essentially steady in size, in spite of the amount of training data. The \emph{CF} calculates the upper limit of the probability of an error at a leaf. Oates and Jenson \cite{33} presented improvements to reduced error pruning to handle problems with large data sets. Macek and Lhotsk \cite{34} presented a technique for pruning of decision trees based on the complexity measure of a tree and its error rate. The technique utilizes the Gaussian complexity averages of a decision tree to compute the error rate of classification. Frank \cite{35} enhanced performance of standard decision tree pruning algorithm. The performance is enhanced with statistical significance of observations. Bradford et al. \cite{36} proposed pruning decision trees with misclassification cost with respect to loss. Scott \cite{37} proposed algorithms for size-based penalties and sub-additive penalties.

\par Bradley and Lovell \cite{38} proposed a pruning technique that is sensitive to the relative costs of data misclassification. They implemented two cost-sensitive pruning algorithms, by extending pessimistic error pruning and minimum error pruning technique. Cai1 et al. \cite{39} proposed cost-sensitive decision tree pruning CC4.5 to deal with misclassification cost in the decision tree. It provides three cost-sensitive pruning methods to handle with misclassification cost in the decision tree. Mansour \cite{40} proposed pessimistic decision tree pruning based on tree size. A graphical frontier-based pruning (FBP) algorithm is proposed by Huo et al. \cite{41} which provides a full spectrum of information while pruning the tree. The FBP algorithm starts from leaf nodes and proceeds towards root node with local greedy approach. The authors further proposed combination of FBP and cross validation method.

In decision tree learning pre-pruning handles noise and post-pruning handles the problem of overfitting. Faurnkranz \cite{42} proposed two algorithms to combine pruning and post pruning operations. A method for pruning of oblique decision trees was proposed by Shah and Sastry \cite{43}.

\subsection{Comparison of Pruning Methods}

The empirical comparative analysis is one of the important methods to compare the performance of various available algorithms. Quinlan [25] examined and empirically compared tree cost complexity pruning, reduced error pruning and pessimistic pruning on some data sets. These methods have demonstrated significant improvement in terms of size of the tree. Cost complexity pruning tends to produce smaller trees than reduced error pruning or pessimistic error pruning where as in case of classification accuracy, reduced error pruning is somewhat superior to Cost complexity pruning.

\par Esposito et al. \cite{24} presented comparative analysis six well-known pruning methods. Each method has been critically reviewed and its performance has been tested. The paper provides study of theoretical foundations, computational complexity and strengths and weaknesses of the pruning methods. According to this analysis, reduced-error pruning outperforms other methods. In addition, MEP, CVP and EBP tend to under prune whereas reduced-error pruning tends to over prune.

\par Similarly, Mingers \cite{26} analyzed five pruning methods with four different splitting criteria. The author has provided the analysis based on size and accuracy of the tree. This work showed that minimum-error pruning is extremely sensitive to the number of classes in the data and is the least accurate method. Pessimistic error pruning is bad on certain datasets and needs to be handled with care. Critical value, cost complexity, and reduced-error pruning methods produced trees with low error rates on all the data sets with consistency. He further clarified that there is no evidence of relation between splitting criteria and pruning method. Windeatt \cite{23} presented empirical comparison of pruning methods for ensemble classifiers. It has been proved that error based pruning performs best for ensemble classifiers. From above studies we can conclude that reduced error pruning and cost complexity pruning methods are the promising pruning methods as compared to other available methods.

\section{Decision Tree Learners}

Researchers have developed various decision tree algorithms over a period of time with enhancement in performance and ability to handle various types of data. Some important algorithms are discussed below. \\

\textbf{CHAID}: CHAID (CHi-squared Automatic Interaction Detector) is an initial decision tree learning algorithm, which is an extension of the AID (Automatic Interaction Detector) and THAID (Theta Automatic Interaction Detector) procedures. It works on principal of adjusted significance testing. It was developed by Kass \cite{44} in 1980. CHAID is easy to interpret and can be used for classification and detection of interaction between variables. After detection of interaction between variables it selects the best attribute for splitting the node, such that each child node is made of a collection of homogeneous values of the selected attribute. The method can handle missing values. It does not imply any pruning method.\\

\textbf{ID3}: ID3 (Iterative Dichotomiser 3) decision tree algorithm is developed by Quinlan \cite{45}. It is based on Occam's razor, which states that simplest hypothesis should be adopted. Here Occam's razor is incorporated through use of information entropy. ID3 uses information gain as splitting criteria. Information gain decides how effectively the attribute separates training instances according to their class. ID3 does not use pruning method. It cannot handle numeric attributes and missing attribute values. When training data contains noise the performance of the algorithm is degraded.\\

\textbf{C4.5}: The C4.5 algorithm \cite{8} is improvement over ID3 algorithm. The algorithm uses information gain as splitting criteria. It can accept data with categorical or numerical values. To handle continuous values it generates threshold

and then divides attributes with values above the threshold and values equal to or below the threshold. The default pruning method is error-based pruning. As missing attribute values are not utilized in gain calculations the algorithm can easily handle missing values.\\

\textbf{CART}: Classification and regression tree (CART) proposed by Breiman et al. \cite{15} constructs binary trees. The word binary implies that a node in a decision tree can only be split into two groups. CART uses gini index as impurity measure for selecting attribute. The attribute with the largest reduction in impurity is used for splitting the node's records. It can accept data with categorical or numerical values and also handle missing attribute values. It uses cost-complexity pruning. It can also generate regression trees.

\section{Data Pre-Processing}

The knowledge discovery process consists of an iterative sequence of subtasks such as selection of data subset handling noise and missing data etc. Data pre-processing techniques used in decision trees are discussed here.

\subsection{Feature Subset Selection}

The real-world applications provide us numerous attributes that can be used for learning. When given data set contains large number of attributes the classification performance of inductive method may degrade. The solution is to use a quality subset of those attributes. Feature selection improves performance of learning algorithms by finding a minimal subset of relevant features. Feature selection removes irrelevant, noisy and repeated features and keeps the most relevant features.

\par Ron Kohavi \cite{46} proposed the use of a search with probabilistic estimates for probing a space. The method results in accurate and comprehensible trees. Caruana and Freitag \cite{47} proposed a caching system with hill climbing in attribute space for feature subset selection and found that hill climbing in attribute space can get significant improvement in classification performance. Mark Last et al. \cite{48} proposed an algorithm that computes fuzzy information gain as quality measure. The resulting tree size is reduced. Wu and Flach \cite{49} proposed merging of heuristic measures and exhaustive search method to get optimal subset. Goodman-Kruskal measure and Fisher's exact test are used to rank the feature.

\par Huan et al. \cite{50} proposed a monotonic measure, which is accurate and fast. The proposed method reduces error rate. Grabczewski and Jankowski \cite{51} proposed two feature selection algorithms that are based on separability of spilt value criteria and verified that they are good alternatives to the available most popular methods with respect to classification accuracy. Bins and Draper \cite{52} proposed three-stage algorithm for large data sets. The algorithm is combination of the relief algorithm to remove irrelevance, K-means to remove redundancy and a standard combinatorial feature selection algorithm.

\par Several authors have proposed evolutionary computation based methods for feature selection. Guerra-Salcedo et al. \cite{53} proposed hybrid genetic feature selection approach that is fast and accurate. Landry et al. \cite{54} proposed a feature selection technique based on the genetic programming. Bala et al. proposed \cite{55} use of genetic algorithms for evaluations of features.

Filter and wrapper approaches are also used for feature subset selection. Filters remove irrelevant features from data set. Filters work independently of any induction and it takes place before induction process. The filter approach does not consider the effect of subset selection on performance of induction algorithm.

\par The Wrapper approach uses a statistical re-sampling technique such as cross validation along with the induction algorithm. The induction algorithm with some objective function is used to evaluate the selected feature subset. Legrand and Nicolas \cite{56} proposed a hybrid technique that combines filter and wrapper approaches with principle of preferences aggregation. Hall and Smith \cite{57} proposed a correlation based filter approach to select feature subset. Duch et al. \cite{58} proposed inexpensive filters based on information theory. Hall \cite{59} proposed faster and accurate filter algorithm useful for continuous and discrete domains. Yuan et al. \cite{60} proposed a two-stage feature selection algorithm of filter and wrapper approach to get benefit of both approaches. Initially the filter approach eliminates irrelevant features and then the wrapper approach eliminates redundant features. Lanzi \cite{61} proposed genetic algorithm with filter that is faster than standard genetic algorithm for feature selection.

\subsection{Outlier Detection }

The problem of outlier detection and noise elimination is an important issue in data analysis. The removal of outliers improves data quality and hence classification performance. Several researchers have proposed various approaches for data cleaning. These include use of MDL principle, Neural networks, filters, Occam's razor and some other methods. The approaches are discussed below.

\par John George \cite{62} proposed a method that removes a misclassified training instance from training data and rebuilds tree on filtered data, the process is repeated till all such instances are removed from training data. Misclassified instances are identified using tree classifier as a filter. The classifier built on clean data improves prediction accuracy. Arning et al. \cite{63} proposed framework for the problem of outlier detection. Similar to human beings, it observes all instances for similarity with data sets and it treats dissimilar data set as an exception. A dissimilarity function is used to find out an outlier. Guyon et al \cite{64} proposed training of convolutional Neural networks with local connections and shared weights. The Neural network is trained with minimizing mean-square-error cost function with backpropogation algorithm. Unclean data is applied as input and several increasing strict levels of cleaning are forced. The classification error is used as an objective function.

\par Gamberger and Lavrac \cite{65} proposed conditions for Occam's razor applicability in noise elimination. The Occam's principle states that when there are many hypotheses, one should choose simplest hypothesis that is correct for all the training instances or maximum training instances. This hypothesis is expected to capture the information inbuilt in the problem and provide accurate classification accuracy on unseen data. Knorr and Ng \cite{66} proposed unified outlier detection system. Subsequently, they \cite{67} proposed and analyzed some algorithms for detecting distance-based outliers. Tax and Duin \cite{68} proposed outlier detection that is based on the instability of the output of simple classifiers on new objects.

\par Broadly and Friedl \cite{69} introduced a method for detecting mislabeled instances. The approach is to employ a set of learning algorithms to build classifiers to detect mislabeled instances. The method is based on the technique of removing outliers in regression analysis \cite{70}. An outlier in this case is an instance that comes from a different probability distribution. Gamberger et al. \cite{71} proposed saturation filter. It is based on principle that detection and removal of noisy instances from training data to induce less complex and more accurate hypothesis. The removal of noisy instances from training data reduces the complexity and the Less Complex Correct Hypothesis (CLCH) value. The saturation filter checks the saturation of training data with use of CLCH value. A source of all possible correct instances in a given problem domain is called target concept. A good representation of target concept in the inducted hypothesis is called target theory. The saturated training data set can be employed for induction of stable target theory.

\par Schwarm and Wolfman \cite{72} proposed Bayesian methods for data cleaning; the method detects errors and corrects errors using Bayesian methods. The Bayesian methods utilizes dependencies between attributes in participated manner and uses expert knowledge of the relationships between the attributes. Ramaswamy et al. \cite{73} proposed algorithm for distance-based outliers that is based on the distance of a point from its nearest neighbor. The solution ranks each point based on its distance to its nearest neighbor. The higher points in this ranking indicate outliers.

\par Raman and Hellerstein \cite{74} proposed an interactive framework for data cleaning that integrates transformation and discrepancy detection. This framework progressively constructs transformations by adding or undoing transforms, in an instinctive, graphical manner using a spreadsheet-like interface. The effect of a transform on instances is available immediately on screen. In the background, the system continues to infer the structure of the data in terms of user-defined domains and uses appropriate algorithms to check it for outliers. The proposed frame structure progressively constructs a transformation as outliers are found, and cleans the data without scripting complex programs or enduring long delays.

\par Kubika and Moore \cite{75} presented system for learning explicit noise. The system detects corrupted fields and uses non-corrupted fields for consequent modeling and analysis. Verbaeten and Assche \cite{76} proposed three ensemble based methods for noise elimination in classification problems. The first one is base classification algorithm in which ILP extension of C4.5 is used. This extension uses logical queries as test values instead of attribute values at test nodes. The second filter technique proposed is voting filter removes outlier if all or majority of classifiers misclassify the instance. The last technique is boosting filters in this method Adaboost is used and after n number of rounds instances with highest weighs are removed. Loureiro et al. \cite{77} proposed a method that applies hierarchical clustering methods for outlier detection. Xiong et al. \cite{78} proposed a hyperclique-based noise removal system to provide superior quality association patterns. The hyperclique pattern contains items those are strongly correlated to each other. The existence of an item in one matter strongly indicates the existence of every other item that belongs to the same hyperclique pattern. The h-confidence threshold designates the strength of this association. The higher the threshold, the stronger is the relationship. The system discovers all hyperclique patterns for a given h-confidence threshold and removes any objects that are not belonging to any hyperclique pattern.

Seung Kim et al. proposed \cite{79} fast outlier detection for very large log data S. Hido et al. \cite{80} proposed statistical outlier detection using density ratio estimation.

\subsection{Data Sampling}

Sampling is the process of taking a subset of instances that represents the entire population. The representativeness is the primary concern in statistical sampling. Sampling is done since it is impossible to test every single individual in the population. It is also desirable to save time, resources and effort while conducting the research. The sample must have sufficient size to justify statistical analysis. George John and Pat Langley \cite{81} experimented on static and dynamic sampling and found that dynamic sampling is robust as compared to static sampling. Jenson and Oates \cite{5} experimented on data sampling and proved that as size of the training dataset increases, size of tree also increases where as classification accuracy does not increase significantly. Foster et al. \cite{82} concluded that progressive sampling can be remarkably efficient.

\subsection{Handling Missing Attribute Values}

Missing attribute values is one of the important and common problems in the real world data sets. While collecting data some attribute values from a tuple are lost. It creates problem for training as well as testing, because it reduces classification accuracy. Several researchers have addressed the problem of handing missing attribute values. Little and Rubin \cite{84} divided the methods for handling missing data into three categories; the categories are ignoring and discarding data, parameter estimation, and imputation. Imputation is procedure of substituting missing values of attributes with some plausible values. Imputation is further divided as case substitution, mean or mode imputation, hot and cold deck and prediction model. Batista \cite{85} experimented k-nearest neighbour imputation and found that it performs better than mean or mode imputation method of C4.5 algorithm. Friedman et al. \cite{86} suggested ignoring every tuple with missing attribute values from training instances. Authors found that it may result in loss of bias information due to ignoring.

\par Quinlan experimented on problem of missing attribute values \cite{87} and concluded that ignoring cases with missing values hampers the classification accuracy and depends on attribute to attribute. Quinlan substituted missing values with most common test outcomes and found that it performs well in some cases but poorly in others.

Kuligowski and Barros \cite{88} proposed a use of a back propagation Neural network for estimation of missing data by using concurrent rainfall data from neighboring gauges. Brockmeier et al. \cite{89} experimented empirical comparative analysis of deletion and imputation Techniques.

Abebe et al. \cite{90} proposed a use of a fuzzy-rule-based model for substitution in missing rainfall data using data from neighboring stations. The authors have provided empirical comparative analysis of results using the fuzzy-rule-based model and results using an ANN model and a traditional statistical model. The fuzzy-rule-based model performs slightly better.

\par Sinharay et al. \cite{91} experimented on the use of multiple imputations for the analysis of missing data. Khalil et al. \cite{92} proposed cyclic federation of data intended for budding ANN models to estimate missing values in monthly surplus datasets. Bhattacharya et al. \cite{93} used ANN models to substitute the missing values of wave data. Fessant and Midenet \cite{94} proposed use of a self-organizing map (SOM) for imputation of data along with the multilayer perceptron (MLP) and hot deck methods.

\par Musil et al. \cite{95} provided empirical comparative analysis on list wise deletion, mean substitution, simple regression, regression with an error term and the EM algorithm. Junninen et al. \cite{96} experimented on univariate linear, spline and nearest-neighbor interpolation algorithm, multivariate regularized expectation-maximization algorithm, nearest-neighbor, self-organizing map, multilayer perceptron (MLP) as well as hybrid methods.

M. Subasi, et al. \cite{97} proposed new imputation method for incomplete binary data. Amman Mohammad Kalteh \& Peder Hjorth \cite{98} experimented on imputation of missing values with self organizing map, multilayer perceptron, multivariate nearest neighbor, regularized expectation maximization algorithm and multiple imputation for precipitation runoff process data set. Daniel and Kenward \cite{99} proposed a method for increasing the robustness of multiple imputations.

\section{Class Distribution}

The classifier performance is affected by varying class distribution of the training instances. The experiments by Gary Mitchell Weiss \cite{101} specify that the naturally happening class distribution is not all the time best for learning. A balanced class distribution should be preferred to make a robust classifier. To reduce learning cost it is essential to control the quantity of training data used for learning. Gary Mitchell Weiss proposed a budget-sensitive progressive-sampling algorithm for selecting training instances in such circumstances. The proposed algorithm forms a class distribution that performs fairly well for near optimal learning.

\section{Soft Computing Approaches in Decision Tree Learning}

Decision tree algorithm requires enhancement pertaining to different problems. The problems and use of soft computing techniques as a solution are discussed here.

\subsection{Evolutionary Techniques}

A decision trees is called optimal if it correctly classifies the data set and has minimal number of nodes. The decision tree algorithms use local greedy search method by means of information gain as target function to split the data set. The decision trees generated by these methods are efficient with classification accuracy but they often experience the disadvantage of excessive complexity. Construction of optimal decision tree is identified as NP-Complete problem [9]. This fact leads the use of genetic algorithms that provide global search through space in many directions simultaneously. The genetic algorithm is used to handle combinatorial optimization problems. Different authors have proposed a use of methodologies that integrates genetic algorithms and decision tree learning in order to evolve optimal decision trees. Although the methods are different the goal is to obtain optimal decision trees.

\par A. Papagelis and D. Kalles \cite{102} proposed GATree, a genetically evolved decision trees. The genetic algorithms use binary string as initial populations but GATree uses binary decision trees as initial populations. A binary decision tree that includes one decision node with two different leaves. Initially to construct such initial trees a random attribute is selected, if that attribute is nominal valued one of its possible values is randomly selected and in case of continuous attributes an integer value from its minimum to maximum range is randomly selected. Thus the size of the search space is reduced. Two arbitrary nodes from population of subtrees are selected and nodes of those subtrees are swapped to perform crossover operation. In view of the fact that a predicted class value depends just on leaves, the crossover operator does not affect the decision trees consistency. An arbitrary node of a preferred tree is selected and it substitutes the node's test-value with a new arbitrary chosen value to perform mutation. In case if the arbitrary node is a leaf, it substitutes the installed class with a new arbitrary chosen class. Validation is performed after crossover and mutation to get final decision tree. The fitness function for evaluation is percentage of correctly classified instances on the test data set by the decision tree. The results show compact and equally accurate decision trees as compared to standard decision tree algorithms.

\par Similarly Z. Fu proposed GAIT \cite{103} algorithm. The algorithm constructs a set of different decision trees from different subsets of the original data set by using a decision tree algorithm C4.5, on small samples of the data. The genetic algorithm uses these trees as its initial populations. The selection operation selects decision trees from pool by random selection mechanism. The crossover operation exchanges subtrees between the parent trees whereas mutation exchanges subtrees or leaf inside the same tree. The fitness criterion for evaluation is the classification accuracy. The validation on fitness function is performed after crossover and mutation to get final decision tree that are smaller in size. Similar approaches are proposed and experimented in \cite{104,105}.

\subsection{Neural Networks}

Neural networks can be used to enhance the decision tree learning. Zhou and Jiang \cite{106} proposed a variation of C4.5 decision tree algorithm named NeC4.5 that utilizes Neural network ensemble to preprocess the training data for decision tree construction. The training set may contain noise and thus the classification accuracy of the data set is reduced.

\par The algorithm trains a Neural network ensemble and the trained ensemble is used to produce a new training set. It substitutes the preferred class labels of the original training tuples with the output from the trained ensemble. Some extra training tuples produced by the trained ensemble are also added to the new training set. The new training set is used for training C4.5. The processed training data by Neural network improves classification accuracy of the decision tree classifier.

\subsection{Fuzzy Decision Tree}

The fuzzy decision tree provides elevated comprehensibility and the elegant performance of fuzzy systems. Fuzzy sets and fuzzy logic permits the modeling of language related uncertainties. In addition to this, it provides a symbolic outline for knowledge comprehensibility, the capability to represent fine knowledge details and the ability in dealing with problems of noise and inexact data.

\par The tree construction procedure for fuzzy decision tree is similar to decision tree construction algorithm. The splitting criteria are based on fuzzy boundaries but the procedures for inference are dissimilar in fuzzy decision tree. The fuzzy decision trees \cite{107} have fuzzy decisions at each branching point. It makes calculating the best split difficult to some extent if the attributes are continuous valued or multivalued. Constructing smaller fuzzy decision trees is valuable as they contain more information in internal nodes.

\par The enhancements to the fuzzy decision tree algorithms are as follows. Zeidler and Schlosser \cite{108} proposed use of membership function to discretize the attributes for handling continuous valued attributes. Janikow \cite{109} optimized the fuzzy component of a fuzzy decision tree using a genetic algorithm. Myung Won Kim et al. \cite{110} proposed an algorithm that determines an appropriate set of membership functions for each attribute. The algorithm uses histogram analysis with application of the genetic algorithm to tune the initial set of membership functions. A fuzzy decision tree with given set of membership functions is constructed. Fajfer and Janikow \cite{111} described bottom-up fuzzy partitioning in fuzzy decision trees, a complement of top down technique. The proposed algorithm is useful to partition continuous valued attributes into fuzzy sets. Guetova et al.\cite{112} proposed Incremental fuzzy decision trees. The algorithm gets equivalent results to non-incremental methods.

\section{Handling Large Data Set}

Handling large size data on currently available computing systems is a challenging task. Approaches like parallel, scalable and Meta decision tree are disused here in brief.

\subsection{Parallel and Distributed Decision Tree Algorithms}

Parallelization is a renowned, conventional means to speed up classification tasks with large amounts of data and complex programs. In the data mining applications, the size of dataset is growing that leads us to find out computationally efficient, parallel and scalable algorithms with the objective to get optimal accuracy in a reasonable amount of time with parallel processors. The algorithms work in parallel using multiple processors to construct a single reliable model.

\par Kubota et al. \cite{113} explained two methods for parallelizing decision tree algorithm, intra-node and the inter-node parallelization. Intra-node parallelization practices the parallel processing in single node and Inter-node parallelization practices the parallel processing among multiple nodes. Intra-node parallelism is further classified in record parallelism, attribute parallelism and their combination. Authors have implemented and experimented these four types of parallelizing methods with four kinds of test data. The performance analysis from these experiments states that there is a relation between the characteristics of data and the parallelizing methods. The combination of various parallelizing approaches is the most effectual parallel method

\par Kufrin \cite{114} proposed a framework for decision tree construction on shared and distributed memory multiprocessor. The method builds parallel decision trees that overcome limitation of serial decision tree on large-scale training data. Narlikar \cite{115} proposed parallel structure of a decision tree-based classifier for memory-resident datasets on SMP. The structure uses two types of divide-and-conquer parallelism, intra-node parallelization and the inter-node parallelization with lightweight Pthreads. Experimental verification on large datasets signifies that the space and time performance of the tree construction algorithm scales with the data size and number of processors.

\par Joshi et al. \cite{116} proposed ScalparC, a scalable parallel classification algorithm for mining large datasets with decision trees using MPI on Cray T3D system. This implementation confirms scalability and efficiency of ScalparC for wide range of training set and wide range of processors. Hall et al. \cite{117} presented combining decision trees learned in parallel. The proposed algorithm builds decision trees with n disjoint data subsets of a complete dataset in parallel and after that converts them into rules to combine into a single rule set. The experiments on two datasets illustrate that there is enhancement of around 40\% in quantity of rules generated by decision tree. Zaki et al. \cite{118} proposed parallel algorithm for building decision tree on shared memory multiprocessors and it was verified that it achieves good speedup. Srivastava et al. \cite{119} presented two parallel formulations for decision tree induction as synchronous tree induction and partitioned tree induction. Authors proposed a hybrid method that implements the high-quality features of these formulations. The experimental results illustrate the high speedups and scalability in processing.

\par Kubota et al. \cite{120} proposed a parallel decision tree algorithm on a PC cluster. Plain parallelization of decision tree is not efficient due to load imbalance. The proposed algorithm is a better parallel algorithm with data redistribution. The parallel algorithm's performance on benchmark data demonstrate that it provides an improvement in speed of 3.5 times, in the best case and equal performance even in the worst case.

Jin and Agrawal \cite{121} proposed parallel decision tree construction with memory and communication efficiency. The approach achieves very low communication volume; no need to sort the training records during the execution and combining shared memory and distributed memory parallelization. Jin et al. \cite{122} proposed use of SMP machines with a chain of techniques that includes full replication, full locking, fixed locking, optimized full locking and cache-sensitive locking for parallelization of data mining algorithms. The results state that among full replication, optimized full locking and cache sensitive locking, there is no clear conqueror. Any of these three techniques can outperform other technique depending upon machine and dataset. These techniques perform considerably better than the other two techniques. In decision tree construction, combining different techniques is found to be critical for obtaining high performance. Li Wenlong et al. \cite{123} proposed parallel decision tree algorithm based on combination.

\par Similarly distributed decision trees learning algorithms are proposed. Jie Ouyang et al. \cite{124} proposed Chi-Square test based decision trees induction in distributed environment. Kanishka Bhaduri et al. \cite{125} proposed distributed decision-tree induction in peer-to-peer systems. Bin Liu et al. \cite{126} proposed data mining in distributed data environment.

\subsection{Scalable Decision Trees}

Advances in technologies create a large volume of data. The large amount of knowledge in this data can be utilized to improve decision-making process of an organization. Scalability is one of the important issues in decision tree learning, a brief review of scalability handled by researchers is presented here.

\par SLIQ \cite{127} a fast scalable decision tree classifier adopts the presorting scheme in the growing phase of the tree that eliminates the sorting of data at each node of decision tree. The training data are sorted just once for each numeric attribute at beginning of tree growth phase. In this method a separate list for each attribute of the training data and a separate list for the class labels of each instance is created. An entry in an attribute list has two columns, first contains an attribute value and the second contains an index into the class list. An entry of the class list also has two columns, first one contains a class label and the second contains a reference to a leaf node of the decision tree. As every leaf node represents a partition of the training data on the path from the node to the root, the class list identifies the partition to which an instance belongs. The presorting process combined with a breadth first tree growing strategy enables SLIQ to scale for disk resident large data sets and can handle both numerical and categorical data. SLIQ uses gini index as splitting criteria, uses a new tree-pruning algorithm that is inexpensive and results in compact and accurate trees.

\par Shafer et al. \cite{128} proposed SPRINT that provides scalability, parallelism and removes memory restriction. It achieves parallelism by its design, which allows multiple processors to work together. In this algorithm list are created as are created in SLIQUE. Initially an attribute list for each attribute in the data set is created. The entries in these lists called attribute records that consists an attribute value, a class label and the index of the record. Initial lists for continuous attributes are sorted by attribute value. If the complete data does not fit in memory, attribute lists are kept on disk. Thus memory restrictions are solved. The initial lists formed from the training set are linked with the root of the classification tree. As the algorithm executes the tree is grown and nodes are split to create new children, the attribute lists for each node are partitioned and associated with the children. The order of the records in the list is maintained while partition and thus partitioned lists never require resorting. The algorithm uses gini index as splitting criteria. The results demonstrate good scale up and speedup on large data set. The size up performance is also good because the communication costs for exchanging split points and count matrices does not change as the training set size is increased.

\par Gehrke et al. \cite{129} proposed a framework called Rainforest that provides approach for implementing scalability in decision tree algorithms with large data sets. Rainforest makes refinement to some initial steps of decision tree construction algorithm. Algorithm creates only one attribute list for all categorical attributes jointly. It creates the histograms for splitting information and thus avoids additional scan. The refinement is made up to this step, afterwards remaining part conventional decision tree algorithm proceeds.

\par The improvements claimed are as follows. The best splitting criteria available can be exploited for classification in a scalable manner. The algorithm claims performance improvements of greater than a factor of five over the Sprint algorithm, which is the known fastest scalable classification algorithm. Alsabti et al. proposed CLOUDS \cite{130} a decision tree classifier for large datasets. The proposed algorithm samples for splitting points on numeric attributes followed by estimate procedure to narrow search space of best split. CLOUDS reduces computational and I/O complexity as compared to benchmark classifiers with quality in terms of accuracy and tree size. Gehrke et al. proposed \cite{131} BOAT an approach for optimistic decision tree construction. It uses small subset of data for initial decision tree construction and refines it to construct final tree. With only two scans of training data it can construct several levels of decision tree and thus it is claimed to be faster by factor three. It can handle insertion and deletion of data in dynamic databases and thus it is first scalable incremental decision tree approach.

\subsection{Meta Decision Trees}

Meta learning \cite{132} technique integrates distinct learning processes. Several meta-learning methods are been proposed for integrating autonomously learned classifiers in a parallel or distributed computing environment.

\par The process of constructing meta classifiers can be divided into two sub-processes. First process is to build a diverse set of base-level classifiers and second process is to combine predictions by base-level classifiers. Several approaches can be used to generate base-level classifiers to single data set. The approaches are using different learning algorithms or using a single learning algorithms. Voting, stacking and cascading are combining techniques. Meta decision tree combines multiple base level decision tree classifiers. The Meta decision tree leaf points the base level classifier, whereas ordinary decision tree leaves specify classification. Todorovski and Dzeroski \cite{133,134} modified C4.5 to develop algorithm MLC4.5 for learning Meta decision tree.

\par Todorovski and Dzeroski compared meta decision trees with ensemble learning methods bagging and boosting and found to perform better \cite{135}. The important issues, approaches and applications of meta-learning are discussed in \cite{136,137}.


The work on literature review of decision trees and related issues is reviewed here. Murthy \cite{9} in his paper covered multi-disciplinary existing work on decision trees. The basics and terminologies of decision trees construction, details of tree construction methods are reviewed. The paper provides summarized existing surveys and also discusses work on determining splits at tree nodes, various pruning techniques. The other sections on several different topics relevant to tree construction such as sample size and dimensionality, work on improving greedy induction, incorporating costs, estimating probabilities from decision trees are provided. The work also compares tree based data exploration to alternative methods such as multivariate statistical methods and Neural networks. At the end some recent, real-world applications of decision trees are discussed.

Safavian and Landgrebe \cite{138} presented a survey of available design approaches for decision tree classifier. The paper provides potential advantages of decision tree classifier over single-state classifiers. The observations concerning the relation between decision trees and Neural networks are also presented. This review also presents the different issues in decision tree classifiers along with the probable lacunas of each method.

Rokach and Maimon \cite{139} analyzed various issues in decision tree construction. The survey of basics of decision trees, decision tree complexity metrics, algorithmic framework for decision trees, details of various splitting criteria, pruning methods are presented. In addition the issues like missing attribute values, misclassification cost, various decision tree inducers, handling large database with decision trees are presented.

\section{Other Issues}

Induction of incremental decision trees and oblique decision trees are some of the important issues in decision tree learning and these issues are discussed here.

\subsection{Incremental Tree Induction}

The normal decision tree-learning algorithm builds decision tree on currently available training set. If new training cases are offered, the previously built decision trees are not useful and we have to execute the decision tree algorithm once again from scratch to add new cases in decision tree. Utgoff \cite{140} developed incremental decision tree learning algorithm. The incremental decision tree algorithm adds new training instances to current decision tree without reconstruction of the decision tree. The results produced are equivalent to the standard decision tree learning algorithms.

Reynolds and Shehri \cite{141} proposed use of cultural algorithm based evolutionary programming to direct incremental decision tree learning. The proposed algorithm constructs a tree with minimum number of nodes and uses of few variables for its construction. It is found that evolutionary approaches enhance the quality of built trees over incremental tree approach ITI in certain cases.

\subsection{Oblique Decision Tree}

The Oblique decision tree is proposed for applications where the instances have numeric attribute values. The oblique classifier builds decision trees that contain linear combinations of one or more attributes at each internal node. Most of the decision tree induction algorithms generate tests at each internal node that involve a single attribute of the data at each internal node the trees. These tests are equivalent to hyperplanes that are parallel to one of the axes in the attribute domain and the resulting trees are called as axis-parallel trees. The oblique decision tree partitions the space of examples with both oblique and axis-parallel hyperplanes. The oblique hyperplanes are more complicated than finding axis-parallel partitions, demanding greater computational effort.

\par Murthy et al. \cite{142} extended the work of Breiman et al. \cite