Data Classification Deals With Process Computer Science Essay

Published: Last Edited:

This essay has been submitted by a student. This is not an example of the work written by our professional essay writers.

The first dimension refers to the distribution of data. Some of the approaches have been developed for centralized data, while others refer to a distributed data scenario. Distributed data scenarios can also be classified as horizontal data distribution and vertical data distribution. Horizontal distribution refers to these cases where different database records reside in different places, while vertical data distribution, refers to the cases where all the values for different attributes reside in different places.

The second dimension refers to the data modification scheme. In general, data modification is used in order to modify the original values of a database that needs to be released to the public and in this way to ensure high privacy protection. It is important that a data modification technique should be in concert with the privacy policy adopted by an organization. Methods of modification include:

• Perturbation, which is accomplished by the alteration of an attribute value by a new value (i.e., changing a 1-value to a 0-value, or adding noise),

• blocking, which is the replacement of an existing attribute value with a "?",

• Aggregation or merging which is the combination of several values into a coarser category,

• swapping that refers to interchanging values of individual records, and

• Sampling, this refers to releasing data for only a sample of a population.

The third dimension refers to the data mining algorithm, for which the data modification is taking place. This is actually something that is not known beforehand, but it facilitates the analysis and design of the data hiding algorithm. We have included the problem of hiding data for a combination of data mining algorithms, into our future research agenda. For the time being, various data mining algorithms have been considered in isolation of each other. Among them, the most important ideas have been developed for classification data mining algorithms, like decision tree inducers, association rule mining algorithms, clustering algorithms, rough sets and Bayesian networks.

The fourth dimension refers to whether raw data or aggregated data should be hidden. The complexity for hiding aggregated data in the form of rules is of course higher, and for this reason, mostly heuristics have been developed. The lessening of the amount of public information causes the data miner to produce weaker inference rules that will not allow the inference of confidential values. This process is also known as "rule confusion".

The last dimension which is the most important refers to the privacy preservation technique used for the selective modification of the data. Selective modification is required in order to achieve higher utility for the modified data given that the privacy is not jeopardized. The techniques that have been applied for this reason are:

• Heuristic-based techniques like adaptive modification that modifies only selected values that minimize the utility loss rather than all available values

• Cryptography-based techniques like secure multiparty computation where a computation is secure if at the end of the computation, no party knows anything except its own input and the results, and

• Reconstruction-based techniques where the original distribution of the data is reconstructed from the randomized data.

It is important to realize that data modification results in degradation of the database performance. In order to quantify the degradation of the data, we mainly use two metrics. The first one, measures the confidential data protection, while the second measures the loss of functionality.

Association rule mining is one of the most important tasks of data mining to find patterns in data. Association rules can be briefly expressed in the form of XÞY, where X and Y are sets of items. Association rule mining stems from the analysis of market-basket datasets. We take Naive Bayes algorithm as an example.

The Naive Bayes algorithm gives us a way of combining the prior probability and conditional probabilities in a single formula, which we can use to calculate the probability of each of the possible classifications in turn. Having done this we choose the classification with the largest value.

4.11.1 Naïve Bayes Classification

Given a set of k mutually exclusive and exhaustive classifications c1, c2… ck which have prior probabilities P(c1), P(c2), …, P(ck), respectively, and n Attributes a1, a2, …, an which for a given instance have values v1, v2, …, vn respectively, the posterior probability of class ci occurring for the specified instance can be shown to be proportional to P(ci) ´ P(a1 = v1 and a2 = v2 … and an = vn| ci) with the assumption that the attributes are independent. Then the value of this expression can be calculated using the product P(ci) ´ P(a1 = v1 | ci) ´ P(a2 = v2 | ci) ´ ... ´ P(an = vn | ci) We calculate this product for each value of i from 1 to k and choose the classification that has the largest value.

The association rule mining problem can be formally described as follows: let I = {i1, i2… im} be a set of literals, called items. Let D be a set of transactions, where each transaction T is a set of items such that TÍI. A unique identifier, called TID is linked to each transaction. A transaction T is said to contain X, a set of some items in I, if XÍT. An association rule is an implication of the form XÞY, where XÌI, YÌI, and XÇY=O. The rule XÞY holds in the transaction set D with confidence c if c% of transactions in D that contain X also contain Y. The rule XÞY has support s in D if s% of the transactions in D contains XÈY. Association rule mining finds interesting associations and/or correlation relationships among large sets of data items [119].

Association rules show attributes value conditions that occur frequently together in a given dataset. A typical and widely-used example of association rule mining is Market Basket Analysis [120]. For example, data are collected using bar-code scanners in supermarkets. Such market basket databases consist of a large number of transaction records. Each record lists all items bought by a customer on a single purchase transaction. Managers would be interested to know if certain groups of items are consistently purchased together. They could use this data for adjusting store layouts (placing items optimally with respect to each other), for cross-selling, for promotions, for catalog design and to identify customer segments based on buying patterns. Association rules provide information of this type in the form of "if-then" statements. These rules are computed from the data and, unlike the if-then rules of logic, association rules are probabilistic in nature.

4.12 Cryptographic Technique:

The common definition of privacy in the cryptographic community limits the information that is leaked by the distributed computation to be the information that can be learned from the designated output of the computation. Privacy preserving protocols are designed in order to preserve privacy even in the presence of adversarial participants that attempt to gather information about the inputs of their peers. There are, however, different levels of adversarial behavior. Cryptographic research typically considers two types of adversaries: A semi-honest adversary (also known as a passive, or honest but curious adversary) is a party that correctly follows the protocol specification, yet attempts to learn additional information by analyzing the messages received during the protocol execution. On the other hand, a malicious adversary may arbitrarily deviate from the protocol specification. (For example, consider a step in the protocol where one of the parties is required to choose a random number and broadcast it. If the party is semi-honest then we can assume that this number is indeed random.

On the other hand, if the party is malicious, then he might choose the number in a sophisticated way that enables him to gain additional information.) It is of course easier to design a solution that is secured against semi-honest adversaries, than it is to design a solution for malicious adversaries. A common approach is therefore to first design a secure protocol for the semi-honest case, and then transform it into a protocol that is secure against malicious adversaries. This transformation can be done by requiring each party to use zero-knowledge proofs to prove that each step that it is taking follows the specification of the protocol. More efficient transformations are often required, since this generic approach might be rather inefficient and add considerable overhead to each step of the protocol.

We remark that the semi-honest adversarial model is often a realistic one. This is because deviating from a specified program which may be buried in a complex application is a non-trivial task, and because a semi-honest adversarial behavior can model a scenario in which the parties that participate in the protocol are honest, but following the protocol execution an adversary may obtain a transcript of the protocol execution by breaking into a machine used by one of the participants.

4.12.1 Distributed data mining in Cryptographic technique:

Privacy-preserving data mining provides methods that can compute or approximate the output of a data mining algorithm without revealing at least part of the sensitive information about the data. Existing solutions can primarily be categorized into two approaches. One approach adopts cryptographic techniques to provide secure solutions in distributed settings [121]. Another approach randomizes the original data so that certain underlying patterns, such as the distribution of values, are retained in the randomized data [122].

Generally, the cryptographic approach can provide solutions with perfect accuracy and perfect privacy. In contrast, the randomization approach is much more efficient than the cryptographic approach, but appears to suffer a tradeoff between privacy and accuracy. In principle, the elegant and powerful paradigm of secure multiparty computation provides general-purpose cryptographic solutions for any distributed computation [123], [124]. However, because the inputs of data mining algorithms are huge, the overheads of the general-purpose solutions are intolerable for most applications. Instead, research in these areas seeks more efficient solutions for specific functions. Privacy-preserving algorithms have been proposed for different data mining applications, including privacy-preserving collaborative filtering [125], decision trees on randomized data [122], association rules mining on randomized data [126] [127], association rules mining across multiple databases [128] [129], clustering [130], [131], [132], and naive Bayes classification [133] , [134]. Additionally, several solutions have been proposed for privacy-preserving versions of simple primitives that are very useful for designing privacy-preserving data mining algorithms. These include finding common elements [135], [136], computing scalar products [137], [138], [139], [140], [141], [142], and computing correlation matrices [143]. We define privacy by adapting the general privacy definition in secure multiparty computation [144], [145], and [146]. As usual, we make the distinction between semi-honest and malicious adversaries in the distributed setting. Semi-honest adversaries only gather information and do not modify the behavior of the parties. Such adversaries often model attacks that take place after the execution of the protocol has completed. Malicious adversaries can cause the corrupted parties to execute some arbitrary, malicious operations. Cryptographic techniques provide the tools to protect data privacy by exactly allowing the desired information to be shared while concealing everything else about the data. To illustrate how to use cryptographic techniques to design privacy-preserving solutions to enable mining across distributed parties, we describe a privacy-preserving solution for a particular data mining task: learning Bayesian networks on a dataset divided among two parties who want to carry out data mining algorithms on their joint data without sharing their data directly.

Bayesian networks

A Bayesian network (BN) is a graphical model that encodes probabilistic relationships among variables of interest. This model can be used for data analysis and is widely used in data mining applications. Formally, a Bayesian network for a set V of m variables is a pair (Bs, Bp). The network structure Bs = (V,E) is a directed acyclic graph whose nodes are the set of variables. The parameters Bp describe local probability distributions associated with each variable. There are two important issues in using Bayesian networks: (a) Learning Bayesian networks and (b) Bayesian inferences. Learning

Bayesian networks include learning the structure and the corresponding parameters. Bayesian networks can be constructed by expert knowledge, or from a set of data, or by combining those two methods together. Here, we address the problem of privacy-preserving learning of Bayesian networks from a database vertically partitioned between two parties; in vertically partitioned data, one party holds some of the variables and the other party holds the remaining variable.

The BN Learning Protocol

A value x is secret shared (or simply shared) between two parties if the parties have values (shares) such that neither party knows (anything about) x, but given both parties' shares of x, it is easy to compute x. Our protocol for BN learning uses composition of privacy-preserving sub protocols in which all intermediate outputs from one sub protocol that are inputs to the next sub protocol are computed as secret shares. In this way, it can be shown that if each sub protocol is privacy-preserving, then the resulting composition is also privacy-preserving. Our solution is a modified version of the well known K2 protocol of Cooper and Herskovits [147]. That protocol uses a score function to determine which edges to add to the network. To modify the protocol to be privacy-preserving, we seek to divide the problem into several smaller sub problems that we know how to solve in a privacy-preserving way. Specifically, noting that only the relative score values are important, we use a new score function g that approximates the relative order of the original score function. This is obtained by taking the logarithm of the original score function and dropping some lower order terms.

4.12.2 Partitioning data mining in cryptographic techniques:

The problem of privately computing association rules in vertically partitioned distributed data has also been addressed. The vertically partitioned problem occurs when each transaction is split across multiple sites, with each site having a different set of attributes for the entire set of transactions. With horizontal partitioning each site has a set of complete transactions. In relational terms, with horizontal portioning the relation to be mined is the union of the relations at the sites. In vertical partitioning, the relations at the individual sites must be joined to get the relation to be mined. The change in the way the data is distributed makes this a much different problem from the one we address here, resulting in a very different solution. Cryptographic tools can enable data mining that would otherwise be prevented due to security concerns. Observing the above it can be predicted that distributed association rule mining can be done efficiently under reasonable security assumptions.

4.12.3 Secure Multiparty Computation data mining in cryptographic techniques:

The basic idea of Secure Multiparty Computation is that a computation is secure if at the end of the computation, no party knows anything except its own input and the results. One way to view this is to imagine a trusted third party everyone gives their input to the trusted party, who performs the computation and sends the results to the participants. Obviously, some communication between the parties is required for any interesting computation how can it be ensured that this communication doesn't disclose anything? The answer is to allow non-determinism in the exact values sent in the intermediate communication (e.g., encrypt with a randomly chosen key), and demonstrate that a party with just its own input and the result can generate a \predicted" intermediate computation that is as likely as the actual values. This has been shown possible however the general method given does not scale well to data mining sized problems.

There are two distinct problems that arise in the setting of privacy-preserving data mining. The first is deciding which functions can be safely computed ("safely" meaning that the privacy of individuals is preserved). For example, is it safe to compute a decision tree on confidential medical data in a hospital, and publicize the resulting tree? For the most part, it can be assumed that the result of the data mining algorithm is either safe or deemed essential. Thus, the question becomes how to compute the results while minimizing the damage to privacy. For example, it is always possible to pool all of the data in one place and run the data mining algorithm on the pooled data. However, this is exactly what we don't want to do (hospitals are not allowed to hand their raw data out, security agencies cannot afford the risk, and governments risk citizen outcry if they do). Thus, the question we address is how to compute the results without pooling the data in a way that reveals nothing but the final results of the data mining computation.

This question of privacy-preserving data mining is actually a special case of a long-studied problem in cryptography called secure multiparty computation. This problem deals with a setting where a set of parties with private inputs wishes to jointly compute some function of their inputs. The aim of secure multiparty computation is to enable parties to carry out such distributed computing tasks in a secure manner. Whereas distributed computing classically deals with questions of computing under the threat of machine crashes and other inadvertent faults, secure multiparty computation is concerned with the possibility of deliberately malicious behavior by some adversarial entity. That is, it is assumed that a protocol execution may come under "attack" by an external entity, or even by a subset of the participating parties.

The aim of this attack may be to learn private information or cause the result of the computation to be incorrect. Thus, two important requirements on any secure computation protocol are privacy and correctness. The privacy requirement states that nothing should be learned beyond what is absolutely necessary; more exactly, parties should learn their output and nothing else. The correctness requirement states that each party should receive its correct output. Therefore, the adversary must not be able to cause the result of the computation to deviate from the function that the parties had set out to compute. The setting of secure multiparty computation encompasses tasks as simple as coin-tossing and broadcast, and as complex as electronic voting, electronic auctions, electronic cash schemes, contract signing, anonymous transactions, and private information retrieval schemes. Consider for a moment the tasks of voting and auctions. The privacy requirement for an election protocol ensures that no parties learn anything about the individual votes of other parties; the correctness requirement ensures that no coalition of parties has the ability to influence the outcome of the election beyond simply voting for their preferred candidate. Likewise, in an auction protocol, the privacy requirement ensures that only the winning bid is revealed (if this is desired); the correctness requirement ensures that the highest bidder is indeed the winning party (and so the auctioneer, or any other party, cannot bias the outcome). Due to its generality, the setting of secure multiparty computation can model almost every cryptographic problem. To be even more concrete, let us consider the two-party problem of securely computing the median. Here, we have two parties with separate input sets X and Y. The aim of the parties is to jointly compute the median of the union of their sets X and Y, without revealing anything about each other's set that cannot be derived from the output itself. Here, the parties' private inputs are X and Y, respectively, and their output is the median of X and Y. In order to obtain this output, they run an interactive protocol which involves those sending messages to each other according to some prescribed specification, which in turn should result in them learning the output as desired.

4.13 k-means algorithm:

K-means clustering is a simple and very commonly used clustering algorithm in data mining. It starts with an unclustered dataset with n elements and one attributes and outputs cluster assignments of each data element in the set. It requires prior knowledge of the number of clusters (k). There are some numbers of recent on privacy-preserving k-means clustering algorithm in the semi-honest model, where the input dataset is partitioned among two parties. The protocol described in Jagannathan and Wright (2005) uses arbitrarily partitioned data where any data element and/or attribute may be partitioned among parties. The attribute names and size of the data set are known to both parties. The k-means algorithm involves mainly two sub protocols (Jagannathan and Wright, 2005): secure dot product and secure equality. The privacy-preserving algorithm utilizes secure dot product protocol while calculating cluster centers and making cluster assignments. For each dot product, input sets of size 1 are used where 1 is the number of attributes, which is usually less than 20. Thus we would expect running times much lower than the ones displayed in Figure 1. In addition to this, the secure equality protocol is run only once per iteration of the k-means algorithm. We can argue that the secure dot product and equality protocols that we defined in Section 3 can be used as sub protocols in a privacy-preserving k-means algorithm in the malicious model efficiently. Conversion from semi-honest model to malicious model may bring additional complexities for the k-means algorithm, which will not be discussed here. However in an ordinary setting with ten attributes and five clusters k-means clustering in the malicious model takes more than 300 times longer than it takes in the semi-honest model.

The task is to group similar items in a given data set into clusters with the goal of minimizing an objective function. The error-sum-of-squares (ESS) objective function is defined as the sum of the squares of the distances between points in the database to their nearest cluster centers. The k-clustering problem requires the partitioning of the data into k clusters with the objective of minimizing the ESS. Lloyd's (k-means) algorithm for k-clustering and Ward's algorithm for hierarchical agglomerative clustering make use of the notion of ESS. Although Ward's algorithm has been observed to work well in practice, it is rather slow (O (n3)) and does not scale well to large databases. Recently there have been a number of data mining algorithms (e.g., BIRCH and STREAMLS) designed for input that is too large to fit entirely in main memory.

A privacy-preserving version of the Re-cluster algorithm, for two-party horizontally partitioned databases. This protocol is communication efficient and it reveals the cluster centers (or the cluster assignments to data, if both parties desire) to both parties only at the end of the protocol. Unlike existing privacy-preserving protocols based on the k-means algorithm, this protocol does not reveal intermediate candidate cluster centers. These existing solutions can be made more secure but only at the cost of a high communication complexity. An alternate solution would be to develop privacy-preserving versions of other k-clustering algorithms. However, these algorithms do not scale well to large databases, involve complicated data structures, or can be complicated to transform into a privacy-preserving protocol. In comparison, our privacy-preserving version of Re-cluster is simple and communication efficient, and produces good clusters.

Consider the vertically partitioned data: The data for a single entity are split across multiple sites, and each site has information for all the entities for a specific subset of the attributes. It can be considered that the existence of an entity in a particular site's database may be revealed, it is the values associated with an entity that are private.

In order to cluster the known set of common entities without revealing any of the values that the clustering is based on. K-means clustering is a simple technique to group items into k clusters. The k-means algorithm also requires an initial assignment (approximation) for the values/positions of the k means. This is an important issue, as the choice of initial points determines the final solution. Research has led to mechanisms producing a good initial assignment. Their technique uses classic k-means clustering done over multiple subsamples of the data, followed by clustering the results to get the initial points. For simplicity, we assume that the k means are selected arbitrarily. Since the underlying operations in involve k-means clustering, it is quite easy to extend our algorithm to search for and start off with good initial means.

Decision tree

Definition: Decision Tree

A decision tree is a tree in which each branch node represents a choice between a number of alternatives, and each leaf node represents a decision.

Decision tree are commonly used for gaining information for the purpose of decision-making. Decision tree starts with a root node on which it is for users to take actions. From this node, users split each node recursively according to decision tree learning algorithm. The final result is a decision tree in which each branch represents a possible scenario of decision and its outcome.

Decision tree learning is a method for approximating discrete-valued target functions, in which the learned function is represented by a decision tree. Decision tree learning is one of the most widely used and practical methods for inductive inference'.

Decision tree learning algorithm has been successfully used in expert systems in capturing knowledge. The main task performed in these systems is using inductive methods to the given values of attributes of an unknown object to determine appropriate classification according to decision tree rules. Decision trees classify instances by traverse from root node to leaf node. We start from root node of decision tree, testing the attribute specified by this node, and then moving down the tree branch according to the attribute value in the given set. This process is the repeated at the sub-tree level.

The decision tree learning algorithm suited for the following:

1. Instance is represented as attribute-value pairs. For example, attribute temperature and its value 'hot', 'mild', 'cool'. We are also concerning to extend attribute -value to continuous-valued data (numeric attribute value) in r project.

2. The target function has discrete output values. It can easily deal with instance which is assigned to a Boolean decision, such as 'true' and 'false', 'p(positive)' and 'n(negative)'. Although it is possible to extend target to real valued outputs, we will cover the issue in the later part of this report.

4.14.1 Decision Tree formation

Decision trees classify examples according to the values of their attributes. They are constructed by recursively partitioning training examples based each time on the remaining attribute that has the highest information gain. Attributes become nodes in the constructed tree and their possible values determine the paths of the tree. The process of partitioning the data continues until the data is divided into subsets that contain a single class, or until some stopping condition is met (this corresponds to a leaf in the tree). Typically, decision trees are pruned after construction by merging children of nodes and giving the parent node the majority class. The preceding describes in specifically how decision trees, in particular C5.0, operate and are constructed.

Decision tree Techniques:

In order to enhance the overall productivity and for decreasing the system complexity, a number of algorithms were developed. Few of them were effective but could not fulfill every aspects of optimized and efficient data mining operation. Even for bulk data those techniques were found ineffective. Therefore in order to achieve some better options and techniques a continuous researches have been continuing. In the way of obtaining better techniques for data classification, which ultimately reduces the data complexity and decreases the time for data computations; algorithms like ID3, C4.5, and C5.0 have been developed. Here in this research work the till optimized classification algorithm called C5.0 has been implemented.

In the preceding section a brief of these classification algorithms have been discussed.

4.15.1 ID3 Basic

ID3 is a simple decision tree learning algorithm developed by Ross Quinlan (1983). The basic idea of ID3 algorithm is to construct the decision tree by employing a top-down, greedy search through the given sets to test each attribute at every tree node. In order to select the attribute that is most useful for classifying a given sets, we introduce a metric---information gain.

To find an optimal way to classify a learning set, what we need to do is to minimize the questions asked (i.e. minimizing the depth of the tree). Thus, we need some function which can measure which questions provide the most balanced splitting.

The ID3 technique builds decision tree using information theory. The fundamental strategy used by ID3 is to choose splitting attributes from a data set with the highest information gain. The amount of information associated with an attribute value is related to the probability of occurrence. The concept used to quantify information is called entropy. This is used to measure the amount of randomness from a data set. When all the data in a set belong to a single class, there is no uncertainty then the entropy is zero. The intention of decision tree classification is to iteratively partition the given data set into subsets where all the elements in each final subset belong to the same class. In order to choose the splitting attribute ID3 uses information gain measure. Decision tree Construction using ID3

The following algorithm has been used for the construction of the decision tree


Data partition, D (training dataset) Attribute list

Attribute selection method Output: Decision tree


Create a node N;

If tuples in D are all of the same class, C then

Return N as a leaf node labeled with the class C; If attribute list is empty then

Return N as a leaf node labeled with the majority class in D;

Apply attribute selection method to find the best splitting rule;

Label node N with splitting criterion;

Attribute list = attribute list- splitting attribute; For each outcome j of splitting criterion

Let Dj be the set of data tuples in D satisfying outcome j;

If Dj is empty then

Attach a leaf labeled with the majority class in D to node N;

Else attach the node returned by generating decision tree to node N;

Endfor Return N;

Entropy measures the amount of information in an attribute.

Given a collection S of c outcomes

Entropy (S) = ∑-p (I) log2 p (I)

where p(I) is the proportion of S belonging to class I. ∑ is over c. Log2 is log base 2.

Note that S is not an attribute but the entire sample set.

where p(I) is the proportion of S belonging to class I. ∑ is over c. Log2 is log base 2.

Note that S is not an attribute but the entire sample set. ID3 Algorithm for Decision Trees Creation

ID3 (Examples, Target_Attribute, Attributes)

Create a root node for the tree

IF all examples are positive, Return the single-node tree Root, with label = +

If all examples are negative, Return the single-node tree Root, with label = -

If number of predicting attributes is empty, then Return the single node tree Root, with label = most common value of the target attribute in the examples

Otherwise Begin

A ƒŸ The Attribute that best classifies examples

Decision Tree attribute for Root ƒŸ A

For each positive value, vi, of A,

Add a new tree branch below Root, corresponding to the test A = vi

Let Examples(vi), be the subset of examples that have the value vi for A

If Examples(vi) is empty

Then below this new branch add a leaf node with label = most common target value in the examples

Else below this new branch add the subtree ID3 (Examples(vi), Target Attribute, Attributes - {A})


Return Root

4.15.2 C4.5 Algorithm

For rule base classification, C4.5 is one of the most popular algorithms. This algorithm is an extension to ID3 developed by Quinlan Ross. For the purpose of building a decision tree C4.5 handles both categorical and continuous attributes. Continuous attributes are handled by C4.5 in such a way that it splits the attribute values into two partitions depending upon the chosen threshold such that all the values above the threshold as one child and the remaining as another child. It also handles missing attribute values. The C4.5 algorithm improves ID3 through Gain Ratio. Largest Gain Ratio is used by C4.5 for the purpose of splitting which ensures a larger than average information gain. Handle the data file is very complex when dimensionality expands enormously during the process of rule generation. Select the attribute with the highest information gain. Let pi be the probability that an arbitrary tuple in D belongs to class Ci, estimated by Expected information (entropy) needed to classify a tuple in D.

Gain (D) =1- ∑(p12, p22 … pn2) where p1 …n are the frequency ratio of class 1 ..n in D

4.15.3 C5.0 Algorithm

The research work presented here considers the C5.0 Algorithm for data mining. The enhancement and the optimization of the C4.5 emerge as algorithm C5.0, which exhibits the better performance as compared to the other existing mining algorithms. C5.0 algorithm to build either a decision tree or a rule set. A C5.0 model works by splitting the sample based on the field that provides the maximum information gain. Each subsample defined by the first split is then split again, usually based on a different field, and the process repeats until the subsamples cannot be split any further. Finally, the lowest-level splits are reexamined, and those that do not contribute significantly to the value of the model are removed or pruned.

C5.0 can produce two kinds of models. A decision tree is a straightforward description of the splits found by the algorithm. Each terminal (or "leaf") node describes a particular subset of the training data, and each case in the training data belongs to exactly one terminal node in the tree. In other words, exactly one prediction is possible for any particular data record presented to a decision tree.

In the preceding chapter the algorithm C5.0 has been implemented for data classification and for tree formation. This algorithm (C5.0) has been discussed in details in the next chapter.

4.16 Information Gain and the Entropy Measure

Information gain is used to determine how well an attribute separates the training data according to the target concept. It is based on a measure commonly used in information theory known as entropy. Defined over a collection of training data, S, with a Boolean target concept, the entropy of S is defined as:

Where p(+) is the proportion of positive examples in S and p(-) the proportion of negative examples. The function of the entropy measure is easily described with an example. Assume that there is a set of data S containing ten examples. Seven of the examples have a positive class and three of the examples have a negative class [7+, 3-]. The entropy measure for this data set S would be calculated as:

Note that if the number of positive and negative examples in the set were even (p(+) = p(-) = 0.5), then the entropy function would equal 1. If all the examples in the set were of the same class, then the entropy of the set would be 0. If the set being measured contains an unequal number of positive and negative examples then the entropy measure will be between 0 and 1.

Entropy can be interpreted as the minimum number of bits needed to encode the classification of an arbitrary member of S. Consider two people passing messages back and forth that are either positive or negative. If the receiver of the message knows that the message being sent is always going to be positive, then no message needs to be sent. Therefore, there needs to be no encoding and no bits are sent. If on the other hand, half the messages are negative, then one bit needs to be used to indicate that the message being sent is either positive or negative. For cases where there are more examples of one class than the other, on average, less than one bit needs to be sent by assigning shorter codes to more likely collections of examples and longer codes to less likely collections of examples. In a case where p(+) = 0.9 shorter codes could be assigned to collections of positive messages being sent, with longer codes being assigned to collections of negative messages being sent.

Information gain is the expected reduction in entropy when partitioning the examples of a set S, according to an attribute A. It is defined as:

Where Values (A) is the set of all possible values for an attribute A and Sv is the subset of examples in S which have the value v for attribute A. On a Boolean data set having only positive and negative examples, Values(A) would be defined over [+,-]. The first term in the equation is the entropy of the original data set. The second term describes the entropy of the data set after it is partitioned using the attribute A. It is nothing more than a sum of the entropies of each subset Sv weighted by the number of examples that belong to the subset. The following is an example of how Gain(S, A) would be calculated on a fictitious data set. Given a data set S with ten examples (7 positive and 3 negative), each containing an attribute Temperature, Gain(S, A) where S= Temperature and Values (Temperature) = {Warm, Freezing} would be calculated as follows:

S = [7+, 3-]

SWarm = [3+, 1-]

SFreezing = [4+, 2-]

Information gain is the measure used by ID3 to select the best attribute at each step in the creation of a decision tree. Using this method, ID3 searches a hypothesis space for one that fits the training data. In its search, shorter decision trees are preferred over longer decision trees because the algorithm places nodes with a higher information gain near the top of the tree. In its purest form ID3 performs no backtracking. The fact that no backtracking is performed can result in a solution that is only locally optimal. A locally optimal solution is known as overfitting.

4.17 Over fitting and Decision Trees

Over fitting is not a problem that is inherent to decision tree learners alone. It can occur with any learning algorithm that encounters noisy data or data in which one class, or both classes, are underrepresented. A decision tree, or any learned hypothesis h, is said to overfit training data if there exists another hypothesis h¢ that has a larger error than h when tested on the training data, but a smaller error than h when tested on the entire data set. At this point the discussion of overfitting will focus on the extension of ID3 that is used by decision trees algorithms such as C4.5 and C5.0 in an attempt and avoid overfitting data.

There are two common approaches that decision tree induction algorithms can use to avoid overfitting training data. They are:

Stop the training algorithm before it reaches a point in which it perfectly fits the training data, and,

Prune the induced decision tree.

The most commonly used is the latter approach Mitchell [149]. Decision tree learners normally employ post-pruning techniques that evaluate the performance of decision trees as they are pruned using a validation set of examples that are not used during training. The goal of pruning is to improve a learner's accuracy on the validation set of data.

In its simplest form post-pruning operates by considering each node in the decision tree as a candidate for pruning. Any node can be removed and assigned the most common class of the training examples that are sorted to the node in question. A node is pruned if removing it does not make the decision tree perform any worse on the validation set than before the node was removed. By using a validation set of examples it is hoped that the regularities in the data used for training do not occur in the validation set. In this way pruning nodes created on regularities occurring in the training data will not hurt the performance of the decision tree over the validation set.

Pruning techniques do not always use additional data such as the following pruning technique used by C4.5.

C4.5 begins pruning by taking a decision tree to be and converting it into a set of rules; one for each path from the root node to a leaf. Each rule is then generalized by removing any of its conditions that will improve the estimated accuracy of the rule. The rules are then sorted by this estimated accuracy and are considered in the sorted sequence when classifying newly encountered examples. The estimated accuracy of each rule is calculated on the training data used to create the classifier (i.e., it is a measure of how well the rule classifies the training examples). The estimate is a pessimistic one and is calculated by taking the accuracy of the rule over the training examples it covers and then calculating the standard deviation assuming a binomial distribution. For a given confidence level, the lower-bound estimate is taken as a measure of the rules performance. A more detailed discussion of C4.5's pruning technique can be found in Quinlan, [148].