A Model Of Decision Tree Classification Computer Science Essay

Published: Last Edited:

This essay has been submitted by a student. This is not an example of the work written by our professional essay writers.

Classification is a classical problem in machine learning and data mining. For given set of training data tuples, each having a class label, represented by a feature vector, here the role is to algorithmically build a model that predicts the class label of an unknown test tuple based on the tuple¿½s feature vector.We know that the most popular classification models is the decision tree model. Decision trees are widespread because they are practical and easy to understand. Rules can also be mined from decision trees easily. Many algorithms, such as ID3[2] and C4.5[3] have been devised for decision tree creation. These algorithms are widely adopted and used in a wide range of uses such as image recognition, medical diagnosis[4], credit rating of loan applicants, scientific tests, fraud detection, and target marketing. In traditional decision-tree classification, a feature (an attribute) of a tuple is either categorical or numerical. For the latter, a precise and definite point value is usually supposed. In many applications, however, data uncertainty is common. The value of a feature/attribute is thus best captured not by a single point value, but by a range of values giving rise to a probability distribution. A simple way to handle data uncertainty is to abstract probability distributions by immediate statistics such as means and variances. We call this method Averaging. Another method is to consider the complete information carried by the probability distributions to build a decision tree. We call this method Distribution-based. In this paper we study the problem of creating decision tree classifiers on data with uncertain numerical attributes. Our goals are (1) to formulate an algorithm for building decision trees from uncertain data using the Distribution-based method; (2) to investigate whether the Distribution-based method could lead to a higher classification accuracy compared with the Averaging method; and (3) to establish a theoretical establishment on which pruning techniques are derived that can meaningfully improve the computational efficiency of the Distribution-based algorithms. Before we investigate into the details of our data model and algorithms, let us discuss the sources of data uncertainty and give some examples. Data uncertainty rises naturally many applications due to various causes. We briefly discuss three classes here: dimension errors, data staleness, and repeated measurements. a) Measurement Errors: Data obtained from measurements by physical devices are often imprecise due to measurement errors. As an example, a tympanic (ear) thermometer measures body temperature by computing the temperature of the ear drum via an infrared sensor. A typical ear thermometer has a frequent calibration error of _0:2_C, which is about 6.7% of the normal range of operation, For example, it is about 24% of measurements are off by more than 0:5_C, or about 17% of the operational range. Another source of error is quantization errors introduced by the digitization Process. Such errors can be properly handled by assuming an appropriate error model, such as a Gaussian error distribution for random noise or a uniform error distribution for quantization errors. b) Data Staleness: In some uses, data values are nonstop changing and recorded information is always stale. For this One example is location-based tracing system. The whereabouts of a mobile device can only be guessed by imposing an uncertainty model on its last reported location[6]. A typical uncertainty model requires information about the moving speed of the device and whether its movement is limited (such as a car moving on a road network) or unlimited (such as an animal moving on plains). Typically a 2D probability density function is defined over a bounded c) Repeated Measurements: Perhaps the most common source of uncertainty comes from repeated measurements. For example, a patient¿½s body temperature could be taken several times during a day, When we inquire about a patient¿½s temperature, or wind speed, or the temperature of a certain section of the aircraft, which values shall we use? Or, would it be better to apply all the information by considering the ¿½Breast Cancer¿½ dataset reported in [7]. This dataset contains a number of tuples. Each tuple matches to a microscopic image of stained cell nuclei. A characteristic image contains 10¿½40 nuclei. One of the features mined from each image is the average radius of nuclei. We mention that such a radius measure contains a few sources of uncertainty: (1) an average is taken from a large number of nuclei from an image, (2) the radius of an (irregularly-shaped) nucleus is took by averaging the length of the radial line segments defined by the centroid of the nucleus and a large number of sample points on the nucleus¿½ perimeter, and (3) a nucleus¿½ perimeter was drawn by a user over a fuzzy 2D image. Yet another source of uncertainty comes from the limitation of the data collection process. For example, a study may ask a question like, ¿½How many hours of TV did you watch each month?¿½ A usual defendant would not reply with an strict answer. Rather, a range (e.g., ¿½6¿½8 hours¿½) is generally replied, why because the respondent is not so sure about the answer. here the survey can restrict an answer to fall into a few pre-set categories (such as ¿½2¿½4 hours¿½, ¿½4¿½7 hours¿½, etc.). However, this restriction needlessly limits the defendants¿½ choices and adds noise to the data. Also, for preserving privacy, sometimes point data values are transformed to ranges on purpose before magazine.

This paper is organized as follows. In the next section we describe some related work. In the section 3, we will introduce how to handle the uncertain data. We will introduce the concept of Averaging and UDT algorithms. Section 4 contains the details of the GUI Based Tool. Section 5 contains the experimental results. Section 6 contains the conclusions and summary


There has been significant research interest in uncertain data organization in recent years. Data uncertainty has been broadly classified as existential uncertainty and value uncertainty. Existential uncertainty seems when it is uncertain whether an object or a data tuple exists. For example, a data tuple in a relational database could be related with a probability that represents the confidence of its presence[8]. ¿½Probabilistic databases¿½ have been applied to semi-structured data . Value uncertainty, on the other hand, appears when a tuple is known to exist, but its values are not known precisely. A data item with value uncertainty is usually denoted by a pdf over a finite and bounded region of possible values. Well-studied topic on value uncertainty is ¿½imprecise queries processing¿½. The answer to such a query is associated with a probabilistic surety on its correctness. For example, indexing solutions for range queries on uncertain data, solutions for aggregate queries such as nearest neighbor queries, and solutions for imprecise location-dependent queries[11] have been proposed. There has been a growing interest in uncertain data mining. In [12], the well-known k-means clustering algorithm is extended to the UK-means algorithm for clustering uncertain data. As we have explained, data uncertainty is usually captured by pdf¿½s, which are generally represented by sets of sample values. Mining uncertain data is therefore computationally costly due to information explosion (sets of samples vs. single values). To improve the performance of UK-means, pruning techniques have been proposed. Examples include min-maxdist pruning[15] and CK-means[16]. Apart from studies in partition-based uncertain data clustering, other directions in uncertain data mining include density-based clustering (e.g., FDBSCAN[17]), frequent itemset ining[18] and densitybased classification[19]. Density-based classification requires that the joint probability distribution of the data attributes be known. In [19], each data point is given an error model. Upon testing, each test tuple is a point-valued data. These are very different from our data model, as we do not require the knowledge of the joint probability distribution of the data attributes. Each attribute is handled independently and may have its own error model. Further, the test tuples, like the training tuples, may contain uncertainty in our model. Decision tree classification on uncertain data has been addressed for decades in the form of missing values[2], [3].Missing values appear when some attribute values are not available during data collection or due to data entry errors. Solutions include approximating missing values with the majority value or inferring the missing value (either by exact or probabilistic values) using a classifier on the attribute (e.g., ordered attribute tree[20] and probabilistic attribute tree[21]). In C4.5[3] and probabilistic decision trees[22], missing values in training data are handled by using fractional tuples. During testing, each missing value is replaced by multiple values with probabilities based on the training tuples, thus allowing probabilistic classification results.

The advantage of our approach is that the tuple splitting is based on probability values, giving a natural interpretation to the splitting as well as the result of classification. Building a decision tree on tuples with numerical, point valued data is computationally demanding [27]. A numerical attribute usually has a possibly infinite domain of real or integral numbers, inducing a large search space for the Best ¿½split point¿½. Given a set of n training tuples with a numerical attribute, there are as many as n ?? 1 binary split points or ways to partition the set of tuples into two non-empty groups. Finding the best split point is thus computationally expensive. To improve efficiency, many techniques have been proposed to reduce the number of candidate split points[28], [27], [29]. These techniques utilize the convex property of well-known evaluation functions like Information Gain[2] and Gini Index[30]. For the evaluation function TSE (Training Set Error), which is convex but not strictly convex, one only needs to consider the ¿½alternation points¿½ as candidate split points.[31] An alternation point is a point at which the ranking of the classes (according to frequency) changes. In this paper, we consider only strictly convex evaluation functions. (See Section VII-D for a brief discussion on how non-convex functions can be handled.) Compared to those works, ours can be considered an extension of their optimization techniques for handling uncertain data.


In this section, we propose two approaches for handling uncertain data. The first one is called ¿½Averaging¿½, converts an uncertain dataset to a point-valued one by replacing each pdf with its equal mean value. More specifically, for each tuple ti and attribute Aj , we take the mean value vi;j = R bi;j ai;j xfi;j(x) dx as its representative value. The feature vector of ti is thus transformed to (vi;1; : : : ; vi;k). A decision tree can then be built by applying a traditional tree construction algorithm. To replace the full information carried by the pdf¿½s, our second technique is called ¿½Distribution-based¿½ , considers all the sample points that constitute each pdf. The challenge here is that a training tuple can now ¿½pass¿½ a test at a tree node probabilistically when its pdf properly contains the split point of the test. Also, a slight change of the split point modifies that probability, potentially altering the tree structure. We present details of the tree-construction algorithm as follows


AVG is a greedy algorithm that builds a tree top-down. When processing a node, we examine a set of tuples S. The algorithm starts with the root node and with S being the set of all training tuples. At each node n, we first check if all the tuples in S have the same class label c. If so, we make n a leaf node and set Pn(c) = 1, Pn(c0) = 0 8c0 6= c. Otherwise, we select an attribute Ajn and a split point zn and divide the tuples into two subsets: ¿½left¿½ and ¿½right¿½. All tuples with vi;jn _ zn are put in the ¿½left¿½ subset L; the rest go to the ¿½right¿½ subset R. If either L or R is empty (even after exhausting all possible choices of Ajn and zn), it is impossible to use the available attributes to further discern the tuples in S. In that case, we make n a leaf node. Moreover, the population of the tuples in S for each class label induces the probability distribution Pn. In particular, for each class label c 2 C, we assign to Pn(c) the fraction of tuples in S that are labelled c. If neither L nor R is empty, we make n an internal node and create child nodes for it. We recursively invoke the algorithm on the ¿½left¿½ child and the ¿½right¿½ child, passing to them the sets L and R, respectively.


In this section we present the detail working of the GUI tool which uses Averaging and UDT concept and C4.5 algorithm. Below figure shows GUI tool for construct decision trees for uncertain data

Fig.1 GUI Tool For Constructing decision trees


All the experiments were done on 2.50GHz Intel Core i5 machine with 4GB main memory running Window 7 operating system. We implemented the C4.5 Algorithm using Java. Below experiment results shows that Decision Trees construction using UDT is more accurate than constructing Decision Trees using Averaging technique.

Fig.Time chart for Averaging vs UDT Processes

Fig. Accuracy chart for Avering vs UDT Processes


We have extended the model of decision-tree classification to accommodate data tuples having categorical attributes with uncertainty described by arbitrary pdf¿½s. We have modified classical decision tree building algorithms (based on the framework of C4.5[3]) to build decision trees for classifying such data. We have found empirically that when suitable pdf¿½s are used, exploiting data uncertainty leads to decision trees with remarkably higher accuracies. We therefore advocate that data be collected and stored with the pdf information intact. Performance is an issue, though, because of the increased amount of information to be processed, as well as the more complicated entropy computations involved. Therefore, we have devised a series of pruning techniques to improve tree construction efficiency. Our algorithms have been experimentally verified to be highly effective. Their execution times are of an order of magnitude comparable to classical algorithms. Some of these pruning techniques are generalisations of analogous techniques for handling point-valued data. Other techniques, namely pruning by bounding and end-point sampling are novel. Although our novel techniques are primarily designed to handle uncertain data, they are also useful for building decision trees using classical algorithms when there are tremendous amounts of data tuples