# Classification Techniques In Data Mining Computer Science Essay


Data mining is the process of extracting hidden information from large databases. Its applications are highly visible in commerce, education, multimedia, medicine, and other fields. Classifying data is essential for making predictions, and an appropriate model must be designed to attain accurate results. A classifier learns from a training set, whose class labels are known, and then assigns records from a test set, whose class labels are unknown, to the appropriate classes. This article presents a review of classification techniques in data mining. Such a review is timely given the exponentially increasing number of research works on data mining classification in recent years. It is intended to guide researchers toward interesting research directions and theoretical issues in classification algorithms, and to help in choosing a specific algorithm for a given problem.

Keywords: Classification; data analysis; data mining; decision tree; perceptron; predictive model; rule based classifiers; support vector machine.

## Introduction

Machine learning incorporates many applications, and data mining is among the most striking. It is hard for human beings to analyze enormous amounts of data quickly and without mistakes, or to handle data with many characteristics; data mining techniques can be applied efficiently to these problems. Data mining tasks fall into two categories: descriptive (clustering, or unsupervised) and predictive (classification, or supervised) methods. Clustering deals with problems in which one seeks to determine how the data are organized. Classification is the machine learning technique of inducing a function from training data, which consists of pairs of input objects and desired outputs [1]. It is used to partition data into disjoint groups and to predict future outcomes. The dataset contains a decision attribute that is considered for classification. This study shows the importance of modern data mining classification techniques. A dataset may have any number of instances and attributes, and may contain discrete, nominal, ordinal, categorical or continuous data; the decision attribute (class variable) is discrete or categorical. Table 1 shows the format of a dataset for classification. The data mining process consists of several steps: data selection, pre-processing, data transformation, data mining, and evaluation/interpretation [2]. Section 2 discusses the selection of the dataset.

| Instance No. | Attribute 1 | Attribute 2 | ... | Attribute n | Decision Attribute (Class Variable) |
| --- | --- | --- | --- | --- | --- |
| 1 | a11 | a12 | ... | a1n | Yes |
| 2 | a21 | a22 | ... | a2n | No |
| : | : | : | ... | : | : |
| m | am1 | am2 | ... | amn | Yes |

Table 1: Format of Dataset for Classification

## Selection of Dataset

Before applying any data mining technique, the target data must be assembled. The target data should be concise enough to be mined with minimum time complexity. Pre-processing is important when analyzing multivariate datasets. The target set is cleaned by removing noisy data and handling missing values. Parsing, data transformation, duplicate elimination, and statistical methods such as the mean, standard deviation and range are popular methods used to clean the data. The most challenging step is removing duplicates and invalid entries, since deleting such data leads to loss of information, and the loss is costly if a large amount of data is deleted [3]. Fig. 1 shows the task of cleaning data, listing handling methods for each type of problem.

The size of the data may be very large, and considering all of it for analysis may take a long time. Producing the same analytical results from a smaller set of data is another challenging job. Data reduction strategies are useful in reducing the volume of data; data cube aggregation, dimensionality reduction, numerosity reduction, and discretization with concept hierarchy generation are some of them [4]. Feature selection chooses a subset of relevant features from the large dataset; discarding irrelevant attributes reduces the database size, so the algorithm runs with minimum time complexity, and meaningful features yield more exact classifiers.

Decision tree classifiers, neural networks, support vector machines and fuzzy classifiers are different techniques applied to the classification problem. The main objective of these algorithms is to build a predictive model that accurately predicts the class labels of previously unseen records. The following sections discuss these classifiers.

Data to be cleaned falls into three categories, each with its own handling methods:

- Noisy data: binning, outlier detection, human inspection, regression
- Duplicate data: duplicate elimination algorithms
- Missing values: inference-based methods

The output of these steps is the cleaned data.

Figure 1 - Task of Cleaning of Data

## Decision Tree Classifiers

A decision tree classifier [5] generates a tree and a set of rules that represent a set of classes. Decision trees are used to gain information for the purpose of decision making. A decision tree is a flowchart-like tree structure [6]: each internal node tests an attribute and each leaf holds a class. It starts with a root node, from which each node is partitioned recursively according to the decision tree algorithm. It is one of the widely used practical methods for inductive inference [7]. C4.5 [8] is an extended version of the ID3 algorithm; CART [9] is another algorithm that follows the decision tree approach. Other algorithms include Hunt's algorithm and SPRINT. These algorithms are broadly used in image recognition, selection of therapy [10, 11], fraud detection and credit rating.

## Entropy and Information Gain

The algorithm mainly used for building decision trees is ID3, which employs a top-down greedy search with no backtracking.

Entropy: a measure of the variability in a random variable; it quantifies how pure or impure a variable is. From the training set S, the entropy is calculated from the proportion P(i) of S belonging to class i:

Entropy(S) = - Σi P(i) log2 P(i)    (1)

Information Gain: to minimize the depth of the decision tree, the optimal attribute for splitting a tree node should be selected. Information gain is the expected reduction of entropy obtained by splitting a node on a specified attribute A:

Gain(S, A) = Entropy(S) - Σv∈Values(A) (|Sv| / |S|) Entropy(Sv)    (2)

where Sv is the subset of S for which attribute A takes value v.
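As a minimal sketch, Eqs. (1) and (2) can be computed in Python as follows (the function and variable names are illustrative, not from any particular library):

```python
from math import log2
from collections import Counter

def entropy(labels):
    """Eq. (1): -sum over classes of P(i) * log2 P(i)."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(rows, attr_index, labels):
    """Eq. (2): entropy of S minus the weighted entropies of the subsets S_v."""
    total = len(labels)
    # Partition the class labels by the attribute's value.
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attr_index], []).append(label)
    gain = entropy(labels)
    for subset in partitions.values():
        gain -= (len(subset) / total) * entropy(subset)
    return gain

# Toy data in the spirit of Table 2: one attribute (Fuel Eco), class Fast.
rows = [("Good",), ("Average",), ("Bad",), ("Bad",)]
labels = ["No", "No", "Yes", "Yes"]
print(entropy(labels))                       # 1.0 for a 2/2 class split
print(information_gain(rows, 0, labels))     # 1.0: the split is perfectly pure
```

Here the split on Fuel Eco yields pure subsets, so the gain equals the full entropy of the set.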

The decision tree algorithm is presented below.

## Algorithm: Decision Tree Classifier

## Input: Learning set S, Attribute set A, Values V; Output: Decision Tree

Begin

Set up the classification attribute.

Compute the entropy of the learning set.

For each attribute A in S, calculate the information gain with respect to the classification attribute.

Assign the attribute with the highest gain as the root.

Repeat the gain computation on each branch, selecting the attribute with the next highest gain as the next node in the tree.

End
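The gain-driven attribute selection at the heart of the algorithm can be sketched in Python; the mini-dataset and helper names below are hypothetical, chosen only to illustrate the root choice:

```python
from math import log2
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def gain(rows, i, labels):
    """Information gain of splitting on attribute index i."""
    parts = {}
    for row, lab in zip(rows, labels):
        parts.setdefault(row[i], []).append(lab)
    return entropy(labels) - sum(len(p) / len(labels) * entropy(p)
                                 for p in parts.values())

def best_attribute(rows, labels, names):
    """ID3 root choice: the attribute with the highest information gain."""
    return max(range(len(names)), key=lambda i: gain(rows, i, labels))

# Hypothetical mini-dataset: (Engine, Fuel Eco) -> Fast
rows = [("Small", "Good"), ("Small", "Bad"), ("Large", "Bad"), ("Large", "Bad")]
labels = ["No", "Yes", "Yes", "Yes"]
names = ["Engine", "Fuel Eco"]
print(names[best_attribute(rows, labels, names)])  # Fuel Eco
```

In this toy set, splitting on Fuel Eco produces pure subsets while splitting on Engine does not, so Fuel Eco becomes the root.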

Fig. 2 depicts an example of a decision tree for the training set S built from the data in Table 2.

| S. No. | Model | Engine | SC / Turbo | Weight | Fuel Eco | Fast |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | Prius | Small | No | Average | Good | No |
| 2 | Civic | Small | No | Light | Average | No |
| 3 | WRX STI | Small | Yes | Average | Bad | Yes |
| 4 | M3 | Medium | No | Heavy | Bad | Yes |
| 5 | RS4 | Large | No | Average | Bad | Yes |
| 6 | GTI | Medium | No | Light | Bad | No |
| 7 | XJR | Large | Yes | Heavy | Bad | No |
| 8 | S500 | Large | No | Heavy | Bad | No |
| 9 | 911 | Medium | Yes | Light | Bad | Yes |
| 10 | Corvette | Large | No | Average | Bad | Yes |
| 11 | Insight | Small | No | Light | Good | No |
| 12 | RSX | Small | No | Average | Average | No |
| 13 | IS350 | Medium | No | Heavy | Bad | No |
| 14 | MR2 | Small | Yes | Average | Average | No |
| 15 | E320 | Medium | No | Heavy | Bad | No |

Table 2 - Car Dataset to Find Whether the Car is Fast or Slow

Figure 2 -Decision Tree

## 3.2 Rules

Decision trees are well-known and simple classifiers; the ease of understanding their rules is the reason for their popularity. These rules can be used to retrieve data from a database using SQL, and converting decision trees into rules also makes pruning decisions easier [12].

A classification rule is of the form

Rule: X (Condition) → Y (Decision)

written in IF...THEN form, where X is the condition and Y is the decision or class. X is called the antecedent and Y the consequent; the value of the antecedent implies the value of the consequent. The antecedent is the left-hand side of a rule and the consequent the right-hand side. These rules correspond to the various paths of the decision tree from the root to the leaf nodes, so a decision tree can easily be converted to a set of rules by mapping each root-to-leaf path. Sample rules generated from the decision tree in Fig. 2:

R1: IF Fuel Eco = Good THEN Fast = NO

R2: IF Fuel Eco = Bad AND Weight = Average THEN Fast = YES
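The two sample rules above can be encoded directly as a predicate; this is a minimal Python sketch (attribute names follow Table 2, and returning `None` for uncovered cases is an illustrative choice):

```python
def classify(car):
    """Apply rules R1 and R2 from the decision tree; None if no rule fires."""
    if car["Fuel Eco"] == "Good":
        return "No"                                            # R1
    if car["Fuel Eco"] == "Bad" and car["Weight"] == "Average":
        return "Yes"                                           # R2
    return None

print(classify({"Fuel Eco": "Good", "Weight": "Heavy"}))   # No
print(classify({"Fuel Eco": "Bad", "Weight": "Average"}))  # Yes
```

A complete rule set derived from the full tree would cover every root-to-leaf path, so no input would fall through to `None`.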

Decision trees generate understandable rules, and classification is done without complex computation. They can handle both continuous and categorical variables, and they indicate which fields are most relevant for prediction and classification. On the other hand, although decision trees can be applied to prediction, they are less suitable when the target attribute is continuous; training and pruning are expensive, and non-rectangular regions are not handled well.

Finding the optimal decision tree is an NP-complete problem, so most tree algorithms use a greedy strategy or a heuristic-based approach to speed up the search. C4.5 has a time complexity of O(m·n²), where m is the size of the training data and n is the number of attributes in the database [13]. As database sizes grow enormously, the main focus in improving decision tree algorithms is handling large amounts of data well; by comparison, the naive Bayes algorithm works well on smaller datasets [14, 15]. A number of methods have been developed to enhance decision tree learning, such as parallelization, data partitioning and fast tree growing, and algorithms such as SLIQ and SPRINT concentrate on minimizing the computational cost. A fast tree-growing algorithm was developed in [16] for searching an unrestricted model space with an efficiently computable heuristic; it is based on the conditional independence assumption and runs with O(m·n) time complexity.

## Neural Networks

A neural network is a branch of artificial intelligence: an interconnected group of neurons that uses a mathematical or computational model for information processing, simulating the human brain. The term Artificial Neural Network (ANN) distinguishes it from a network of biological neurons [17]. The processing elements are called neurons or nodes. The main property of an ANN is the ability to learn from its surroundings; performance improves through learning, in accordance with prescribed rules. The artificial neuron was first proposed by Warren McCulloch, a neurophysiologist, and Walter Pitts, a logician, in 1943 [18]. Applications of ANNs include decision making and classification, medical diagnosis, system identification and control, game playing, financial applications and data mining.

An ANN contains three categories of computational neurons: input, output and hidden. An input node is the entry point to the network and receives information from the environment; an output node transmits the answer of the ANN to the environment; the hidden units lie inside the net and make no contact with the outside. A sample ANN model is depicted in Fig. 3. ANNs are generally used to predict future events based on patterns observed in the training data, to classify unseen data into predefined groups, and to cluster the training data into groups based on similarity [19]. The following subsections deal with how ANNs manage the classification problem.

Figure 3 - Sample ANN model

## 4.1 Perceptron based methods

## 4.1.1 Single layer perceptron

The perceptron is a supervised classification algorithm. It is a linear classifier whose predictions are based on a linear predictor function combining a set of input weights with the feature vector. The algorithm was invented in 1957 by Frank Rosenblatt [20].

A single-layer perceptron contains a single layer of output nodes. For each node, the sum of the products of the weights and inputs is calculated; if the value is above the threshold (0) the neuron is activated, otherwise it is deactivated.

This computation can be written as

f(x) = 1 if Σi wi xi > 0, and -1 otherwise,

where wi is the connection weight (prediction vector), xi is the input feature value, and i ranges from 1 to n. The output takes values in {1, -1}. The perceptron does not terminate if the learning set is not linearly separable, since learning ends only after all vectors are classified properly. The multilayer perceptron was created to solve this problem.
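The threshold computation and Rosenblatt's learning rule can be sketched in Python; the toy dataset, epoch count and learning rate below are illustrative choices:

```python
def predict(weights, bias, x):
    """Threshold unit: output +1 if w . x + b > 0, else -1."""
    s = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if s > 0 else -1

def train(samples, epochs=20, lr=1.0):
    """Rosenblatt update: on a mistake, nudge w toward the misclassified point."""
    n = len(samples[0][0])
    w, b = [0.0] * n, 0.0
    for _ in range(epochs):
        for x, y in samples:
            if predict(w, b, x) != y:
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b

# Linearly separable toy set (an AND-like function on +/-1 inputs).
data = [((1, 1), 1), ((1, -1), -1), ((-1, 1), -1), ((-1, -1), -1)]
w, b = train(data)
print([predict(w, b, x) for x, _ in data])  # [1, -1, -1, -1]
```

Because this set is linearly separable, the updates converge and every training vector ends up correctly classified; on a non-separable set the mistakes would never stop, as the text notes.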

## 4.1.2 Multi layer perceptron (MLP)

An MLP is a feed-forward ANN that maps sets of input data onto the appropriate outputs. It consists of several layers of nodes in a directed graph and employs a supervised learning technique called back-propagation to train the network [21]. It can classify data that are not linearly separable [22]. It contains three or more layers (input, output and one or more hidden layers) of non-linearly activating nodes. A four-layer MLP (two hidden layers) is illustrated in Fig. 4. Learning takes place by changing the connection weights after each data item is processed: the error in the output is calculated by comparison with the expected result and propagated backwards through the network.


Figure 4- MLP with 2 Hidden Layers h and k

## 4.1.3 Radial Basis Function Networks (RBF)

An RBF network is a feed-forward network that uses a radial function as its activation function and has only one hidden layer. It can have any number of inputs and a hidden layer with any number of units. It is theoretically similar to k-nearest-neighbour models and is applied in various science and engineering fields [23]. Since it has the same layer-by-layer topology as MLP networks, RBF networks are sometimes considered to belong to the MLP family; it has been shown that RBF networks can be implemented by MLP networks with increased input dimensions [24]. Because of the simpler layer structure, RBF networks are easier to work with than MLPs and training is faster.

Neural networks can be applied wherever useful information must be extracted from large amounts of data, and they have been applied successfully in numerous real-world arenas, among them pattern recognition, image compression, medicine, prediction and character recognition. Significant development has also happened in classification-related areas, and many types of neural networks can be used for classification purposes [25]. The MLP is the most widely studied and used neural network classifier. Bayesian classification theory is the basis of all statistical probability models for classification procedures [26], and neural networks provide direct estimation of the posterior probabilities [27]. A neural network for a classification problem can be viewed as a mapping function F: R^d → R^M, where d is the number of dimensions of the input x and M is the number of dimensions of the output vector y.

Dimension reduction plays a significant role in classifiers, and in recent times much attention has been given to feature selection and dimension reduction. The most popular method is Principal Component Analysis (PCA), which reduces the number of dimensions without losing the intrinsic information contained in the original database. PCA works as a pre-processing step for neural classifiers and is a linear dimension reduction technique; this linearity limitation can be overcome by neural networks themselves [28].

Misclassification costs affect the optimal classification. It is difficult to assign classification costs for real-world problems, yet they have a severe impact on classification. Neural classifiers minimize classification errors, but they may not handle misclassification errors with uneven consequences efficiently [29]. Recently, developments have been made on classification with missing data and on handling misclassification costs.

## K Nearest Neighbor Classifier (KNN)

k-NN is a supervised learning algorithm in which the class of a new instance is decided by the majority class among its k nearest neighbours. It classifies a new object based on its attributes and the training set. It is very simple: the k nearest neighbours are found by minimum distance from the new instance to the training samples, and the majority class among them is taken as the prediction for the new instance. The choice of k depends on the data: a larger value of k reduces noise in the classification, but makes the boundaries between classes less distinct. Various heuristic techniques are used to select the value of k. The accuracy of the algorithm is degraded by the presence of irrelevant or noisy features, and much research work has been carried out on feature selection for efficient classification.

## 5.1 Distance measurement

The distance is measured by a distance metric. The chosen metric should minimize the distance between two similarly classified instances and maximize the distance between instances of different classes. The algorithm is as follows.

## Algorithm: K-Nearest Neighbor Algorithm

## Input: Learning set S, Attribute set A, Values V

## Output: Class of New Instance

Begin

Determine parameter K = Number of nearest neighbors

Calculate the distance between the new instance and all the training samples.

Sort the distance

Determine the nearest neighbors based on the K-th minimum distance.

Find the category of the result class.

Use the majority category of the nearest neighbors as the prediction for the new instance.

End
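The steps above can be sketched in Python; Euclidean distance is used as the metric, and the toy training set is invented for illustration:

```python
from math import sqrt
from collections import Counter

def euclidean(a, b):
    """Straight-line distance between two feature vectors."""
    return sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def knn_classify(training, query, k=3):
    """Majority vote among the k training points nearest to the query."""
    nearest = sorted(training, key=lambda item: euclidean(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

training = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((0.9, 1.1), "A"),
            ((5.0, 5.0), "B"), ((5.2, 4.8), "B")]
print(knn_classify(training, (1.1, 1.0)))  # A
print(knn_classify(training, (5.1, 5.0)))  # B
```

Note the cost structure the text describes: there is no training phase at all, but every query sorts the whole training set, which is why run-time performance degrades on large training sets.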

Minkowski, Manhattan, Chebyshev, Euclidean, Canberra and Kendall's rank correlation are popular distance metrics for k-NN. A survey of feature weighting methods was given by Wettschereck [30]. Dimension reduction is an important problem for large datasets, and k-NN assists in selecting a set of prototypes for classification. A simple k-NN classifier using the Minkowski distance metric has a time complexity of O(|D||F|), where D is the training set and F the set of features, since the distance computation is linear in the number of features. There has been substantial research on reducing the training data by reducing the number of features, as well as on search strategies; among them are case-based retrieval nets [31, 32], footprint-based retrieval [33], Fish and Shrink [34] and cross trees for nearest neighbour search [35].

k-NN has shown its influence in many fields and has many constructive values: i) the method is transparent, so it is easy to implement and debug; ii) some noise reduction techniques work effectively with k-NN; iii) run-time performance can be greatly improved with suitable data structures. Some important shortcomings are: i) if the training set is large, run-time performance is poor; ii) it is very sensitive to irrelevant and redundant features; iii) on difficult classification tasks it can be outperformed by support vector machines (SVM) or neural networks.

## Association Rule Classifiers

Association rule mining is a well-known method for finding interesting relations among attributes in a large database. The main idea is to discover rules from the database [36, 37]. An association rule is of the form X → Y, where X and Y are sets of attributes. To select interesting rules from the set of all possible rules, constraints are applied; the best-known constraints are thresholds on support and confidence. The Apriori algorithm [38] is primarily used to mine rules under these constraints. It follows a breadth-first search strategy to count the support of itemsets. Though it is easy to implement, the transaction database is memory resident and many database scans are needed to find the frequent itemsets. Most of the earlier studies concentrated on binary data [39].

## 6.1 Process

Association rule mining is described as follows. Let I = {i1, i2, ..., in} be a set of n items (binary attributes) and let D = {t1, t2, ..., tm} be the database containing the set of all transactions, each with a unique ID. A rule is defined as X → Y, where X, Y ⊆ I and X ∩ Y = ∅; X is called the antecedent and Y the consequent. A number of measures are used to find the interesting rules among the possible rules; the best known are support and confidence.

Support: supp(X) is the percentage of transactions in the dataset that contain the itemset X.

Confidence: conf(X → Y) = supp(X ∪ Y) / supp(X). It is the percentage of transactions containing X that also contain Y.
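These two measures can be sketched directly in Python; the toy transactions are invented for illustration:

```python
def support(transactions, itemset):
    """Fraction of transactions containing every item in the itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(transactions, X, Y):
    """conf(X -> Y) = supp(X union Y) / supp(X)."""
    return support(transactions, set(X) | set(Y)) / support(transactions, X)

transactions = [{"bread", "milk"}, {"bread", "butter"},
                {"bread", "milk", "butter"}, {"milk"}]
print(support(transactions, {"bread", "milk"}))       # 0.5 (2 of 4 transactions)
print(confidence(transactions, {"bread"}, {"milk"}))  # 2/3 of bread buyers take milk
```

Mining then amounts to enumerating itemsets whose support clears a threshold (step one) and keeping the rules among them whose confidence clears a second threshold (step two).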

The process of mining association rules contains two steps: first the frequent itemsets are identified, and second the association rules are generated from them. Apriori is the best-known algorithm. The Eclat and FP-Growth algorithms produce only the result of the first step, so another procedure is needed to complete the second. GUHA and OPUS search are well-known alternative approaches that do not rely on frequent-itemset mining: GUHA is applied for exploratory data analysis, and OPUS is an efficient search method for finding rules. Some of the important algorithms are listed in Table 3.

Table 3 - Association Rule Mining Algorithms

| Algorithm | Description |
| --- | --- |
| AIS, 1993 [37] | The first algorithm introduced for generating association rules. It uses candidate generation to detect frequent itemsets, but produces an excessive number of candidate itemsets. |
| Apriori, 1994 [38] | Uses a breadth-first search strategy to count the support of itemsets, with a candidate generation function that exploits the downward closure property of support. |
| DHP, 1995 [40] | Applies hash techniques to generate candidate itemsets efficiently; greatly reduces the bottleneck of the large candidate 2-itemsets. |
| Eclat, 1997 [41] | Determines frequent itemsets using simple transaction-id list intersections in a depth-first search; more efficient for large itemsets and less efficient for smaller ones. |
| Max-Miner, 1998 [42] | Extracts maximal frequent itemsets, of which all frequent itemsets are subsets; discovers long frequent patterns quickly by combining a level-wise bottom-up traversal with a top-down traversal. |
| FP-Growth, 2000 [43] | Provides a way to find frequent patterns without candidate generation. |
| H-Mine, 2001 [44] | Introduces a hyperlinked data structure (H-struct) that dynamically adjusts links during the mining process. |

## Support Vector Machines

The SVM was invented by Vladimir N. Vapnik, and the soft-margin concept was proposed by Corinna Cortes and Vapnik [45]. Support vector machines are supervised learning models with associated learning algorithms used for classification and regression analysis. The basic binary classifier takes an input and predicts which of two classes it belongs to, as shown in Fig. 5. A model is built by the algorithm and used to classify new data. SVMs are also capable of non-linear classification using the kernel trick. An SVM constructs a hyperplane, or set of hyperplanes, that can be used for classification; a good separation is attained by the hyperplane with the largest distance to the nearest data point of each class (the maximum-margin hyperplane). It is not always possible to separate the dataset linearly in the original space, so the original space is mapped into a high-dimensional space, and a kernel function is defined to compute the inner products in terms of the variables of the original space [46]. The selection of the kernel function is important: Genton [47] defined many classes of kernels, but did not address which class suits a particular problem. Some popular kernels are:

- Linear kernel: K(x, y) = x · y
- Polynomial kernel: K(x, y) = (x · y + c)^d
- RBF kernel: K(x, y) = exp(-γ ‖x - y‖²)

A linear SVM is trained on a set D of n points (xi, yi), where xi is a real vector with p dimensions and yi ∈ {1, -1} is its class label. The separating hyperplane can be written as w · x - b = 0, where '·' is the dot product and w is the normal vector of the hyperplane. The margin is bounded by the two parallel hyperplanes w · x - b = 1 and w · x - b = -1, and the distance between them is 2 / ‖w‖.

To keep data points out of the margin, the constraint w · xi - b ≥ 1 is imposed for each point of the first class and w · xi - b ≤ -1 for each point of the second class. SVMs also support multi-class classification, implemented by combining multiple binary classification problems [48]. SVMs have strong points in many respects: they can automatically select the model size by choosing the right support vectors [49], and convexity is a significant property of SVM classifiers, so training has a unique optimum. On the other hand, the parameters of a solved model are difficult to interpret.
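To make the hyperplane arithmetic concrete, here is a minimal Python sketch; the weight vector `w` and offset `b` are hypothetical, hand-picked values, not the result of actual SVM training:

```python
def svm_predict(w, b, x):
    """Sign of w . x - b decides the class (hyperplane w . x - b = 0)."""
    s = sum(wi * xi for wi, xi in zip(w, x)) - b
    return 1 if s >= 0 else -1

def margin_width(w):
    """Distance between w . x - b = 1 and w . x - b = -1 is 2 / ||w||."""
    return 2 / sum(wi ** 2 for wi in w) ** 0.5

# Hand-picked parameters separating two toy point clouds.
w, b = [1.0, 1.0], 3.0
print(svm_predict(w, b, (2.5, 2.5)))  # +1 (first class side)
print(svm_predict(w, b, (0.0, 0.5)))  # -1 (second class side)
print(margin_width(w))                # 2 / sqrt(2), about 1.414
```

A real SVM solver chooses w and b to maximize margin_width(w) subject to the two class constraints above; this sketch only evaluates a given hyperplane.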

Figure 5 - Support Vector Machines: class A (w·x + b > +1) and class B (w·x + b < -1) are separated by the hyperplane w·x + b = 0, with the support vectors defining the margin

## Fuzzy Classifier

Any classifier that uses fuzzy sets is called a fuzzy classifier. Fuzzy classifiers group elements into different fuzzy sets; a fuzzy set is a set whose elements have degrees of membership. The concept was introduced by Lotfi A. Zadeh [50] and Dieter Klaua [51]. The membership function of a fuzzy set corresponds to the truth values of a fuzzy propositional function. The goal of a fuzzy classifier is to create fuzzy category membership functions that convert objectively measurable parameters into category memberships, which are then used for classification or for capturing expert knowledge. Expert knowledge can be expressed in the form of linguistic variables, which are described by fuzzy sets, and is built up with linguistic rules. A linguistic rule contains two parts, an antecedent and a consequent; the consequent may be a linguistic value or a function. A rule-based classifier that takes linguistic labels as the consequent is a Mamdani-type fuzzy system [52], while Takagi-Sugeno fuzzy systems [53] take a function as the consequent. The following two-feature fuzzy rule base with four classes depicts the model of a fuzzy rule-based classifier:

IF a is low AND b is medium THEN class is 1

IF a is medium AND b is medium THEN class is 2

IF a is medium AND b is large THEN class is 3

IF a is large AND b is low THEN class is 4

The two features a and b are numerical, but the rules contain linguistic values. If there are X possible linguistic values for each feature and n features in the problem, the number of possible conjunctions is X^n. A fuzzy classifier P generating soft labels can be defined as a function P: F → [0, 1]^c, where F is the feature space and c is the number of classes.
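A minimal sketch of such a rule base in Python, assuming hypothetical triangular membership functions on a 0-10 scale and min() as the AND operator (Mamdani-style rule firing):

```python
def tri(x, a, b, c):
    """Triangular membership function rising from a, peaking at b, falling to c."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

# Hypothetical linguistic terms for both features on a 0..10 scale.
low    = lambda x: tri(x, -1, 0, 5)
medium = lambda x: tri(x, 0, 5, 10)
large  = lambda x: tri(x, 5, 10, 11)

def classify(a, b):
    """Fire the four rules with min() as AND; return the class of the strongest."""
    rules = {
        1: min(low(a),    medium(b)),   # IF a is low    AND b is medium THEN 1
        2: min(medium(a), medium(b)),   # IF a is medium AND b is medium THEN 2
        3: min(medium(a), large(b)),    # IF a is medium AND b is large  THEN 3
        4: min(large(a),  low(b)),      # IF a is large  AND b is low    THEN 4
    }
    return max(rules, key=rules.get)

print(classify(1.0, 5.0))  # rule 1 dominates: a is low, b is medium
print(classify(9.0, 1.0))  # rule 4 dominates: a is large, b is low
```

Returning the full dictionary of rule strengths instead of the winning class would yield the soft-label function P: F → [0, 1]^c described above.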

Many classifier models are inspired by fuzzifying traditional classifiers. A simple example is k-NN: fuzzy k-NN uses the distances to the neighbouring points to produce soft labels. Fuzzy classifiers can also be related to popular classifiers such as Learning Vector Quantization (LVQ), the Parzen classifier and RBF networks [54]. In fuzzy classifiers, human knowledge can be combined with the training data: experts can assign soft labels and build prototypes that do not exist in the training data, whereas the training of conventional classifiers takes no account of human intuition.

## Conclusion

This review surveys the well-known supervised techniques, presenting the key ideas from a list of publications. The question in classification is not whether one algorithm is superior to the others, but under what circumstances a particular technique considerably surpasses the others on a given problem. Once the merits and demerits of each technique are well understood, the possibility of combining two or more algorithms should be investigated, the main aim of such integration being to balance the weakness of one with the strength of the other(s). Algorithms should also be designed for good classification accuracy without increased time complexity. Many current algorithms are computationally costly and need to keep all the data in main memory; algorithms should be able to handle bulk data collected from distributed data sources.