# Decision Tree Classification Models Computer Science Essay

**Published:** **Last Edited:**

This essay has been submitted by a student. This is not an example of the work written by our professional essay writers.

Traditional data analysis methods are obsolete and have been superseded by data mining. Data-mining can be divided into two camps, predictive and descriptive. Of the different types of descriptive data-mining types, decision tress are considered due to them being easy to understand. The evolution of the univariate decision tree from ID3 and CART to C5.0 are discussed. The use of multivariate decision trees are also considered. The hierarchy of one method over another based on predictive accuracy is still open to much debate.

## Introduction

The traditional way to analyze data, relies at its core on a data analyst with domain knowledge becoming intimately familiar with the data. This analyst acts as an interface between the data and the wider users community, who look to use this generated analysis of the work to solve real-world problems. An example of this would be a scientist who reviews papers in his subject are, creates a literature review and disseminates this amongst his peers within his chosen research field. Likewise in the field or business analysts who carry out analysis on market conditions, and then write detailed reports on current trends on markets for large multinationals.

For these individuals, this form of manual sifting through data and undertaking analysis is slow, expensive, mundane and the results returned are highly subjective. To automate data analysis as much as possible, technology must be used to take advantage of these huge repositories of information. To process the large amounts of data that they can hold, it would be impossible to do so by conventional means. It is imperative from both an economic and scientific perspective to employ computational techniques to help to discover useful patterns and structures from these massive volumes of data held in these databases so as to aid analysis. The technique for carrying out this is data-mining. "Data mining is a process of nontrivial extraction of implicit, previously unknown potentially useful in formation from data in databases", Fayed et al (1996). Data mining is all pervasive and one can find its application in all fields of human endeavour. In business where it can be used in customer relationship management to model customer churn. To the sciences where it is used to analyze interstellar bodies or used to target molecules for drug discovery or in medicine to aid in diagnosis.

Data mining can be divided into two main types. The division is based on the goal of the data mining exercise; these are prescriptive data-mining or descriptive data mining. Predictive models are used to anticipate explicit values, formed from patterns that are taken from known results. Descriptive data mining goals are different. They try to diagnose patterns in the data, that render information concerning what the data contains. One of the most commonly used types of descriptive data-mining is data classification. "Data classification is the ability to classify a set of data based on their values in certain attributes", Agrawal et al (1993). In classification, a training set of data is derived from a current population, each record in the training set having a class identity associated with it.

The goal of the classification is firstly to examine the training data and originate a precise and succinct description, or a model for every class using the characteristics in the data. Put more simply, to train a model to recognize the class attribute based on the values of the other attributes. This classification rule is then proved using a different subset of the entire data set known as the test set. Once a model is developed, proved and shown to be accurate enough for its desired use, it can be utilized to categorize future unclassified and unknown records. This may vary depending on the field it is used in, some fields for example medical diagnosis would require extreme levels of accuracy. A number of classification models exist these include neural networks, statistical models like linear/quadratic discriminants, decision trees and genetic models. For the purpose of this study, decision trees have been chosen because they are particularly useful in data-mining. They can be constructed comparatively fast when considered with other means of classification. They are straight forward and easy to understand. They are easy to follow and allow for human interpretation and allow it to be feasible for an analyst to acquire knowledge of what factors are critical for a specific classification, Quinlan (1993).

## Methods

All decision trees work the same, that is they employ the idea of "recursive partitioning, or divide-and-conquer". Decision trees split the data into subsets by selecting a collection of conditions on an independent variable. The decision tree that is created can be viewed as a diagram that holds one node for each subset. The node pertaining to the first collection of cases is known as the root node. For every node resulting from this, there is a boundary to each subset derived from this node. This is associated with the appropriate characteristics of the split for that subset. This process of splitting proceeds recursively within every subgroup until some preconditions are satisfied, resulting in the error not being able to be decreased anymore, Breiman et al (1984).

## Discussion, The evolution of decision tree algorithms

The use of decision trees in data mining can trace its development back to the work of Hunt, Marin, and Stone (1966). From this early work a number of rudimentary systems for generating decision trees were developed. These included CART Breiman et al. (1984) and ID3 Quinlan (1986). These were latter being followed by the C4.5 Quinlan (1993) and C5.0, Quinlan (2003). One of the first decision tree data-mining algorithms was ID3 and it was developed by Quinlan (1993). ID3 works by using a set of pre-classified training data which is used to train the selection process, or splitting. ID3 then utilizes an entropy function measure, the mean uncertainty in the classification when an additional test is employed, to divide the node. Through out each step, ID3 selects the test that returns the highest drop in classification uncertainty.

Following on from the development of ID3 came CART,(Classification and Regression tree). CART procedures were developed by Morgan and Sonquistin (1963) in the early 1960s. These were used by Breiman et al. (1984) for data-mining in the 1980's. The aim of the CART data mining algorithm is the keeping down as much as possible of the average misclassification cost which are caused by incorrect classifications. This is achieved through the use of the Gini diversity index. CART algorithm decision trees can be defined by every branch of the tree ending in a terminal node with every observation a part of one and precisely one terminal node. Therefore every terminal node is uniquely identified by a set of rules. CART was the first algorithm to attempt to overcome the problem of over-fitting.

To avert over-fitting, there are two general approaches pre-pruning and post pruning. The first is by use of a stopping rule that forbids some collections of training cases from being separated (pre pruning). Secondly by taking away some of the composition of the decision tree after it has been synthesized (post pruning). CART utilizes a pruning method known as minimal cost complexity pruning, which is based on the inherent prejudice that the re-substitution error of a tree increases linearly proportionally to the amount of leaf nodes, Prodromidis and Stolfo (1999). This type of pruning functions by relating a complexity parameter (Î±) to the decision trees amount of terminal nodes. The tree is then pruned by lessening as much as possible the linear combination of both the size and error rate. The level of pruning employed is varied by the changing the weighting of the complexity parameter.

When a large data set is at your disposal, selection of the most apt complexity parameter so as to keep to a minimum the true error can be achieved by use of a holdout set. By application of each decision tree to the holdout set, a judgement as to the true error of each tree can be made. The complexity parameter relating to the tree that decreases the error the most are exploited as the pruning parameter to crop the tree created from the entire data set. The pruning step is comparatively quick, but due to the use of a holdout set, construction of a second tree is necessary to be built involving the whole data set in order to make effective use of all the data. This need for the building of a second tree increases the induction time by a factor of two. When using small data sets, to increase the size of data consumed in the creation of the tree and better the error estimates, a 10 fold cross validation method is employed, Stone (1974) and Kohavi (1995). This involves the creation of ten trees with each validated from a tenth of the available data (testing data) and trained using the other 90 % (training data). Collating the results of the ten trees a very trustworthy evaluation of the optimal tree size can be made. The benefits to using 10 fold cross validation are that all data is used for both training and validation and every observation is used to test the models validity once. This does lead to disadvantages in terms of the time taken to build the tree and the consumption of resources as it necessitates multiple trees being synthesized and pruned. However it has a preference or aptitude over C4.5 to synthesize smaller trees, Oates and Jensen (1997).

C4.5 like its predecessor ID3, constructs decision trees utilizing the idea of information entropy,

Quinlan (1993). The aim of creation of all decision tree's should be to create a decision tree that is as small as feasibly possible. This is due to smaller trees being easier to understand and in agreement with Occam's razor, 'the simplest explanation is usually the correct one' and are biased to return higher predictive accuracy. As it is not possible to ensure the tree is configured as small as possible, C4.5 (as does ID3) utilizes a greedy search algorithm. A greedy algorithm is an algorithm that makes decisions optimized locally as opposed to utilizing globally optimal decisions, Quinlan (1986). At each node, the attribute chosen is the optimal fit for that subset at that moment in time. Both information gain and gain ratio are used in the C4.5 algorithm as splitting criterion. C4.5 also attempts to deal with missing values. It does this by not building a separate decision tree for the subset of cases which it cannot be determined what class to place them in, in the training set of data. Instead these data are divided into smaller subgroups of cases and added to the known subsets relating to previously identified outcomes.

Another benefit of C4.5 over ID3 and like CART is that it attempts to avoid over-fitting.

It does this by removing branches which do not help in the categorization by exchanging them for leaf nodes in a single from the bottom up pass, after the tree has been created. The criterion for which branches to remove is drawn up through the use of error-based pruning. Error based pruning works on the following assumption, that is that the error in the set of patterns wrapped by a leaf of a tree proceeds by a binomial distribution, Quinlan (1993). The upper limit of confidence of the probability of misclassification can then be computed from an already defined confidence level (default 25%), Windeatt and Ardeshir ( 2001). The predicted error rate is derived from the product of the upper bound of confidence by the number of patterns wrapped by the leaf. If the amount of predicted errors is less than that for the sub-tree comprising the leaf, then the sub-tree is substituted for the leaf.

C5.0 is the commercial and updated version of C4.5 (which is an open-source implementation and subsequently free) algorithm from Rulequest Research, Quinlan (2003). Rulesoft assert that their implementation offers a large number of advantages over the original C4.5 algorithm. These are accuracy, speed and memory usage (Quinlan 2003). C5.0 algorithm also offers new functionality over C4.5 algorithm in terms of variable misclassification costs. This allows the construction of classifiers which reduce the anticipated misclassification costs as oppossed to error rates.

C5.0 algorithm also uses boosting, based on the work of Freund and Schapire (1995), by implementation of the AdaBoost algorithm. Boosting is a means of bringing together rules that exhibit some amount of success to create highly accurate predictions. At the core of boosting is the synthesis of a number of classifiers as opposed to one. Upon classification every classifier votes for its own predicted class. The votes are then tallied and a classification is made based upon this, Kohavi and Kunz (1997). Due to decision trees developed using C5.0 and boosting being large and difficult to understand. Freund and Schapire(1997) developed alternating decision trees. An early implementation of this was the ADTree algorithm, Freund and Mason (1999). The structure of alternating decision trees is somewhat different to that of decision trees. It is characterized by the use of decision nodes and prediction nodes, Schapire and Singer (1998). As in decision trees a decision node divides sets of data into two segments, each associated with a specific prediction node. The classification of an instance can then be calculated by the sign of the sum of the prediction values on the decision path. In this way, both classification and a measure of confidence in the prediction is delivered.

C5.0 also supports the use of winnowing, Quinlan (2003). Winnowing is the ability of the classifier to disregard or ignore attributes which have only a minimal relevance before a classifier is created. The purpose of this is it leads to smaller easier to understand decision trees with higher predictive accuracy.

For all the decision trees so far discussed, the decision trees derive tests at every node which pertains to only one attribute of the data. They are contained within hyperplanes that are parallel to a single axis in the attribute space, these trees are referred to as axis-parallel. What makes these types of decision trees so compelling is that these unsophisticated univariate tests are easy to interpret by an analyst. The problem however with these trees is that they can cause highly complex and imprecise trees to be created, if the data is better split by hyperplanes that are not parallel to a single axis. Use of algorithms utilizing multivariate splits can result in construction of more accurate trees. They do have some disadvantages. They consume more computational resources than the axis-parallel algorithms and are not as favoured as the axis-parallel trees due to the tests being harder to understand. The trees produced though may be relatively simple to comprehend. CART with linear combinations, CART-LC was the original oblique decision tree algorithm to be synthesized.

CART-LC, works in the following manner. Generation of Hyperplanes occurs followed by testing them until the benefits of using multivariate tests are realized by comparison to a previously defined constant. Due to oblique splits being more difficult to understand than a more basic univariate split, Breiman and co-workers (1984) innovated the employment of a "backward deletion procedure", Cantu-Paz and Kamath (2003) . This modifies the composition of the split so as to simplify it. This is achieved by removing variables that make little or no contribution to the potency of the split. This is known as hill climbing. This iterative simplification procedure, works by taking away one variable at a time as to such a time that no more variables may be gotten rid of. CART-LC algorithms suffers from a number of limitations such as CART-LC being fully deterministic. What this means is as there is no inherent procedure for escaping local minima of the impurity measure, although such minima may be a common ocurrence. CART-LC also suffers from the phenomena that sometimes it makes changes to its decision tree that increments upwardly the impurity of a split. This feature was provided to allow it to avoid some local minima. Due to this feature, no upper bound on the time expended at any node in the synthesis of the decision tree is found. Another limitation of CART-LC is that it will produce only one tree for each data set. OC1 can be considered an improvement of CART-LC by including some substantial enhancements which were designed to overcome these problems. Murthy and co-workers(1994) created OC1, which utilizes an improvised pairing of hill-climbing as used in previously described algorithm and randomization. Randomization is used to counter the problems CART-LC developed escaping local minima of the impurity measure. Randomization can be achieved in two ways. Firstly by retrying the perturbation algorithm through the use of another ad-hoc generated hyperplane. Alternatively, if and when hill climbing hits a local minimum of the impurity measure, the hyperplane can be disturbed in any direction by the required amount. It also ensures that the time expended at any node in the creation of tree is bounded.

The core benefits to using decision tree classifiers are the high velocity in the construction of the model, the ability to process large volumes of data and the high amount of predictability in the classifier generated. However whether Decision Tree algorithms are more accurate than other similar techniques is up to debate and seems more dependent on a number of factors including to what domain the application is being applied to and and the effectiveness of the particular implementation. Lim,Loh's and Shih's (2000), comparitive study of twenty two Decision Tree methods and an assorted collection of other data mining algorithms including statistical and ANN methods on thirty two data sets found that the mean error rates of most of the algorithms surveyed are sufficiently alike to deduce that their fluctuations are statistically insignificant. Similarly the StatLog Project, carried out a comparison on the accuracy of several decision tree algorithms as opposed to some non-decision tree algorithms on a significant number of data sets, Mitchie et al (1994). The results of the study found that no algorithm was inherently more accurate over another on the data set studied.

Other advantages of decision trees is the decision tree produced are relatively easy to understand and can therefore be turned into code remarkably easily, this ability lends itself to the decision tree being transformed into SQL statements which can then be applied to a database. Thus allowing the easy efficient classification of large record sets, Agrawal et al (1992). However this may not be the case when oblique decision trees are used as with multivariate the tests they create are harder to comprehend.

## Conclusion

Decision trees are attractive to use in datamining for a number of reasons. They are simple to understand and resulting tress can be easily translated into rulesets, which can then be implemented programatically to solve real world problems. Decision trees have developed into two basic camps, univariate and multivariate. Multivariate decision trees however are not as widely used due to them being more resource hungry then univariate trees and the tests they develop being harder to discern. There still rages much debate to which type of decision can be considered most accurate.