# Decision tree algorithms


Abstract-Decision tree algorithms occupy an important place among the classification models of data mining. In the literature, algorithms use the entropy concept or the Gini index to form the tree. The shape of the classes and their closeness to each other are some of the factors that affect the performance of an algorithm. In this paper we introduce a new decision tree algorithm which employs a data (attribute) folding method and the variation of the class variable over the branches to be created. A comparative performance analysis between the proposed algorithm and C4.5 is presented.

Keywords-Classification, decision tree, split, pruning, entropy, Gini index.

### I. INTRODUCTION

Classification has gained a great deal of importance in the literature and has many application areas, from medicine to astronomy and from banking to text classification. The aim of classification is to find the similar data items which belong to the same class. The term class may be considered as the dependent variable in statistics. For example, just as tail length, ear size, number of teeth, etc. are variables that may vary from one species to another, the class value ('cat' or 'dog') is determined according to the values of those other variables. Classification is a predictive model of data mining that predicts the class of a dataset item using the values of some other variables [2],[3].

If

D = {t1, t2, ..., tn}

is a dataset, where each ti is a record with more than one attribute (data field), and

C = {C1, C2, ..., Cm}

represents the m classes, then classification defines a mapping

f : D → C, with each ti assigned to some Cj.

Each Cj is a distinct class and has its own records, i.e.

Cj = { ti | f(ti) = Cj, 1 ≤ i ≤ n, and ti ∈ D }.
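The mapping above can be sketched in code. The following illustrative Python fragment is our own; the classifier `f` and the records are hypothetical, not from the paper:

```python
# A classifier f maps each record t_i in D to a class C_j;
# the classes C_j then partition the dataset D.

def partition_by_class(D, f):
    """Group the records of dataset D into classes according to classifier f."""
    classes = {}
    for t in D:
        classes.setdefault(f(t), []).append(t)
    return classes

# Toy example: a hypothetical rule classifying animals by ear size.
D = [{"ears": "small"}, {"ears": "large"}, {"ears": "small"}]
f = lambda t: "cat" if t["ears"] == "small" else "dog"
print(partition_by_class(D, f))
```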

So far, many algorithms have been introduced with different models. Classification algorithms may be categorized as follows:

• Distance based algorithms
• Statistical algorithms
• Neural networks
• Genetic algorithms
• Decision tree algorithms

K-Nearest Neighbor is one of the best known distance-based algorithms; in the literature it has different versions such as single link, complete link and K-Most Similar Neighbor. Distance-based algorithms use Euclidean or Hamming distance measures, or a similarity measure such as Jaccard, Cosine or Dice, to determine the class of each item [6],[17].
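As a minimal illustration of a distance-based classifier, the sketch below implements k-nearest-neighbor voting under Euclidean distance. All names and data here are our own, not from the paper:

```python
import math
from collections import Counter

def euclidean(a, b):
    """Euclidean distance between two numeric points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train, query, k=3):
    """train: list of (point, label) pairs; returns the majority label
    among the k training points closest to query."""
    neighbors = sorted(train, key=lambda pl: euclidean(pl[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

train = [((0, 0), "A"), ((0, 1), "A"), ((1, 0), "A"),
         ((5, 5), "B"), ((6, 5), "B")]
print(knn_predict(train, (0.5, 0.5), k=3))  # → A
```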

Regression and Bayesian algorithms are among the most frequently used statistical classification algorithms in the literature. Bayesian algorithms predict the class based on the probability of belonging to that class. Regression, on the other hand, after the model is determined (i.e. linear or non-linear), predicts the class using coefficients produced from the data set [7].
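A compact sketch of the Bayesian idea, assuming categorical attributes and simple Laplace smoothing; the interface and data are hypothetical, not from the paper:

```python
from collections import Counter, defaultdict

# Naive-Bayes style prediction: the chosen class maximizes
# P(class) * product over attributes of P(attribute=value | class).

def train_nb(rows, labels):
    priors = Counter(labels)                 # class frequencies
    cond = defaultdict(Counter)              # (attr_index, class) -> value counts
    for row, c in zip(rows, labels):
        for i, v in enumerate(row):
            cond[(i, c)][v] += 1
    return priors, cond

def predict_nb(model, row):
    priors, cond = model
    n = sum(priors.values())
    best, best_p = None, -1.0
    for c, pc in priors.items():
        p = pc / n                           # prior probability of class c
        for i, v in enumerate(row):
            p *= (cond[(i, c)][v] + 1) / (pc + 2)  # Laplace-smoothed likelihood
        if p > best_p:
            best, best_p = c, p
    return best

rows = [("small", "short"), ("small", "short"), ("large", "long")]
labels = ["cat", "cat", "dog"]
model = train_nb(rows, labels)
print(predict_nb(model, ("small", "short")))  # → cat
```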

Neural networks and genetic algorithms are other classification methods [4],[5]. Neural network algorithms predict the class of a data item with the help of weights w calculated over many scans of the dataset [15].

Genetic algorithms, introduced by John Holland in 1975 [1],[2], are numerical optimization algorithms inspired by natural evolution and directed random search techniques; using crossover and mutation, genetic algorithms create the best population to classify the data items [3],[4],[6].

Decision tree algorithms build a tree which also yields a series of if-then-else rules to classify the data items. ID3, C4.5, CART and SPRINT are among the best known decision tree algorithms. Since we propose a decision tree algorithm in this paper, we discuss the decision tree model and some of the algorithms in the next section [3].

### A. Tree Based Classification

A tree approach is used for classification, which is a model of data mining [8],[9]. To reach the classes through the shortest path, the most suitable attribute is chosen as the root, and the algorithm recursively determines the most suitable attribute in the dataset to be the next node. Here, 'the most suitable' means the attribute that divides the rest of the dataset into two or more roughly equal parts; the attribute that comes closest to this is chosen.

After choosing the root node, the algorithm adds an arc to the root for each predicate. Nodes are added to the new branches in the same way as for the root node. If the pre-assigned stopping criterion is reached during this process, the arcing stops and the end of the final branch is labeled with one of the classes. Algorithms differ from one another in the splitting criterion they exercise; in the literature, entropy and the Gini index are the splitting criteria most often employed.
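The recursive construction described above can be sketched as follows. The scoring function and the stopping rule (node purity) below are simplified stand-ins of our own, not the paper's criteria:

```python
from collections import Counter

def majority_score(rows, labels, attr):
    """Stand-in splitting criterion: the fraction of records classified
    correctly if each branch predicted its own majority class."""
    total = 0
    for value in set(r[attr] for r in rows):
        branch = [labels[i] for i, r in enumerate(rows) if r[attr] == value]
        total += Counter(branch).most_common(1)[0][1]
    return total / len(rows)

def build_tree(rows, labels, attrs, score):
    if len(set(labels)) == 1:                 # stopping criterion: pure node
        return labels[0]
    if not attrs:                             # no attributes left: majority class
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: score(rows, labels, a))
    tree = {"attr": best, "branches": {}}
    for value in set(r[best] for r in rows):  # one arc per attribute value
        idx = [i for i, r in enumerate(rows) if r[best] == value]
        tree["branches"][value] = build_tree(
            [rows[i] for i in idx], [labels[i] for i in idx],
            [a for a in attrs if a != best], score)
    return tree

rows = [{"ears": "small", "tail": "long"},
        {"ears": "small", "tail": "short"},
        {"ears": "large", "tail": "long"},
        {"ears": "large", "tail": "short"}]
labels = ["cat", "cat", "dog", "dog"]
tree = build_tree(rows, labels, ["ears", "tail"], majority_score)
print(tree)
```

Here "ears" perfectly separates the classes, so it becomes the root and each branch is immediately labeled.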

### B. Tree Based Classification Algorithms

In the literature, there are different approaches to building the tree. One is to use the entropy concept: the ID3 and C4.5 algorithms use entropy to select node representatives [10]. If we represent the class probabilities as p1, p2, ..., pm, then their sum equals 1 [10],[11].
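For reference, the entropy of a class distribution with probabilities p1, ..., pm is E = -Σ pi log2(pi); this is the quantity ID3 and C4.5 seek to reduce at each split. A small sketch (our own code):

```python
import math

def entropy(labels):
    """Entropy E = -sum(p_i * log2(p_i)) of the class distribution
    observed in a list of class labels."""
    n = len(labels)
    counts = {}
    for c in labels:
        counts[c] = counts.get(c, 0) + 1
    return -sum((k / n) * math.log2(k / n) for k in counts.values())

# A 50/50 two-class node is maximally impure; a pure node has entropy 0.
print(entropy(["cat", "cat", "dog", "dog"]))  # → 1.0
```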

### III. OUR CONTRIBUTION

We have developed an algorithm that chooses the right attribute to split the dataset into equal parts. The algorithm is introduced for databases which have categorical values. Obviously, categorical values need special treatment, since it is not possible to calculate the average, standard deviation, sum, etc. of the entries.

We first determine the center of each attribute (CA). CA is calculated as in Eq. 9:

CA = N / NV (9)

Here N represents the total number of cases, i.e. the number of rows in the dataset, and NV is the number of distinct variable values in the current attribute. For example, consider the attribute as a vector:

Vector = {A,B,A,B,A,A,A,A,A,B,A,C,C,C}

Here N = 14 because there are 14 values in the vector, and NV = 3 because there are 3 distinct variables in the vector: A, B and C.
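The CA computation for this vector can be written directly (our own sketch of Eq. 9):

```python
# CA = N / NV: N is the number of cases (rows), NV the number of
# distinct values in the attribute column.

def center_of_attribute(column):
    N = len(column)          # total number of cases
    NV = len(set(column))    # number of distinct variable values
    return N / NV

vector = ["A", "B", "A", "B", "A", "A", "A", "A", "A", "B", "A", "C", "C", "C"]
print(center_of_attribute(vector))  # 14 / 3 ≈ 4.667
```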

The Equal-Split Parameter (EP) is the sum of the absolute differences between CA and each NVi:

EP = Σi |CA - NVi|

Here NVi represents the count of each variable of the attribute. For the vector above we have NVi values for i = 1, 2, 3, namely NVA, NVB and NVC. Thus,

NVA = 8; NVB = 3 and NVC = 3.
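Putting the two definitions together, EP for the example vector works out as follows (our own sketch):

```python
from collections import Counter

# EP = sum over distinct values of |CA - NV_i|, where NV_i is the
# count of value i in the attribute column and CA = N / NV.

def equal_split_parameter(column):
    N = len(column)
    counts = Counter(column)              # NV_i for each distinct value
    CA = N / len(counts)                  # center of the attribute
    return sum(abs(CA - nv_i) for nv_i in counts.values())

vector = ["A", "B", "A", "B", "A", "A", "A", "A", "A", "B", "A", "C", "C", "C"]
# NV_A = 8, NV_B = 3, NV_C = 3, CA = 14/3:
print(equal_split_parameter(vector))  # |14/3-8| + |14/3-3| + |14/3-3| ≈ 6.667
```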

Without regard to the class attribute, the EP value measures how evenly the attribute folds into its variable types. If the attribute has two distinct variables, it is folded in two; if it has three, four or five distinct variables, it is folded in three, four or five parts accordingly. When the attribute is folded by the number of its variables, the length of the folded attribute should be minimal in order to generate an efficient decision tree that reaches the leaves through the shortest available path. Hence, the minimum EP value identifies the right attribute, the one that reduces the attribute to the minimum length.

The EP value gives an idea of which attribute to choose for splitting the dataset; however, EP folds only the attribute, not the dataset itself. When the class attribute is taken into account, the EP value alone is not enough, although it would suffice for clustering. Thus, the distribution of the class attribute over the branches is an important parameter for determining whether the attribute really folds the whole dataset into a minimum length. Therefore, we calculate the variation of the class attribute over the branches.

### IV. A COMPARATIVE ANALYSIS

The C4.5 algorithm is one of the most frequently used algorithms in the literature. In this part of the paper we discuss the classification performance of C4.5 and of the proposed algorithm. C4.5 classifies the data more accurately than ID3 does, because it employs the gain ratio, i.e. the information gain normalized by the split information.
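The gain ratio is standardly defined as the information gain of an attribute divided by its split information, which penalizes attributes that fan out into many small branches. A sketch with toy data of our own:

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy -sum(p * log2(p)) of a list of class labels."""
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain_ratio(values, labels):
    """GainRatio(A) = InformationGain(A) / SplitInfo(A) for one
    attribute column (values) against the class column (labels)."""
    n = len(labels)
    gain, split_info = entropy(labels), 0.0
    for v, cnt in Counter(values).items():
        branch = [lab for val, lab in zip(values, labels) if val == v]
        gain -= (cnt / n) * entropy(branch)        # weighted branch entropy
        split_info -= (cnt / n) * math.log2(cnt / n)
    return gain / split_info if split_info else 0.0

# A perfectly class-separating binary attribute scores the maximum ratio.
values = ["x", "x", "y", "y"]
labels = ["cat", "cat", "dog", "dog"]
print(gain_ratio(values, labels))  # → 1.0
```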

This gives C4.5 an advantage when the class borders are very close to each other; in such cases C4.5 performs better. Likewise, our algorithm is superior to C4.5 when the distance between the classes is small.

Consider Table II for a decision tree. If we apply the C4.5 algorithm, the root node will be attribute A1, with a gain ratio of 0.1245, whereas attribute A2 has a lower gain ratio of 0.0608.

When the proposed algorithm is applied, attribute A1 has a BC value of 1 and attribute A2 a BC value of 2.5396, so the same result as that produced by the C4.5 algorithm is obtained.

If Fig. 3 and Fig. 4 are examined, it is clear that both algorithms have chosen the right attribute as the root node, because attribute A1 produces a two-branch, better-balanced tree in comparison with attribute A2. Attribute A2 not only produces three branches, but those branches are also less balanced than A1's.

### References

1. Coley, D. A. (1999). An Introduction to Genetic Algorithms for Scientists and Engineers. World Scientific, Singapore, 188p.
2. Pham, D.T., and Karaboga, D., 2000. Intelligent Optimization Techniques. Springer, London, Great Britain, 261p.
3. Han J., & Kamber, Micheline. (2001). Data Mining Concepts and Techniques, Morgan Kaufman Publishers Academic Press.
4. Fausett, L. (1994). Fundamentals of Neural Networks, Prentice-Hall, New Jersey.
5. Maulik, U., & Bandyopadhyay, S. (2000). Genetic algorithm-based clustering technique. Pattern Recognition, vol. 33. Pergamon.
6. Bill F. (Ed.) (1992). Information retrieval : data structures & algorithms. Prentice Hall.
7. Breiman, L., Friedman, J. H., Olshen, R. A., & Stone, C. J. (1984). Classification and regression trees. Monterey, Calif., U.S.A.: Wadsworth, Inc.
8. Shafer, J. C., Agrawal, R., & Mehta, M. (1996). SPRINT: A Scalable Parallel Classifier for Data Mining. Proc. of the 22nd International Conference on Very Large Databases, Mumbai (Bombay), India, Sept. 1996.
9. Mitchell T.(1997). Machine Learning, McGraw-Hill International.
10. Quinlan,J.Ross. (1987). Simplifying decision trees, International Journal of Man-Machine Studies,issue: 27(3), (pp. 221 - 234).
11. Breiman L., & Friedman J. H., & Olshen R. A., & Stone C. J. (1984). Classification and Regression Trees, Wadsworth, Belmont.
12. Mehta M., & Agrawal R., & Rissanen J. (1996). SLIQ: A Fast Scalable Classifier for Data Mining, Proceedings of 5th International Extending Database Technology Conference.France. (pp. 18-32). Springer-Verlag, London.
13. Agrawal R. & Shafer J.C. (1996). Parallel Mining of Association Rules, Proceedings. of IEEE Transactions on Knowledge and Data Engineering, Vol. 8, No. 6. (962- 969). IEEE Educational Activities Department. USA.
14. Hettich, S. , & Bay, S. D. (1999). The UCI KDD Archive, Department of Information and Computer Science, University of California, Irvine, CA. Retrieved September 1, 2008, from http://kdd.ics.uci.edu.
15. Pham D.T., & Chan A.B.(1998). Control Chart Pattern Recognition using a New Type of Self Organizing Neural Network. Proceedings of the Institution of Mechanical Engineers, Part I: Journal of Systems and Control Engineering. Vol 212, No 1, (pp. 115-127). Professional Engineering Publishing.
16. Keogh, E. & Pazzani, M. (2001). Derivative Dynamic Time Warping. In First SIAM International Conference on Data Mining (SDM'2001), Chicago, USA.
17. Alcock R.J., & Manolopoulos Y.(1999). Time-Series Similarity Queries Employing a Feature-Based Approach. 7th Hellenic Conference on Informatics. Ioannina,Greece.