# Different Types Of Classifier Computer Science Essay

A classifier is a function that maps a set of data instances to predetermined classes, where each instance is described by the values of its attributes. More formally, a classifier is a function that maps input feature vectors to output class labels, where X is the feature space [1]. From a large number of attributes, a classifier takes a selected set of attributes as input and produces a class label as output [2]. For example, if we want to build a classifier to detect spam email, the feature space consists of the emails (described by their features) and each label is either "Spam" or "Non-Spam". Assigning a class label to an object according to its description is thus the central task of a classifier. An object description is, in essence, a vector containing the values of the object's attributes.

Typically, a classifier is built by deriving a regression formula from a set of labelled examples; it learns to predict class labels from a training dataset by means of a training algorithm [3]. Prior knowledge and expertise are used when the training data are inadequate. Once the training phase is complete, the classifier is ready to identify the classes of instances in test datasets.

Classifiers play a very important role in artificial intelligence and data mining. Researchers use them in work on reject options, changing utility functions, compensating for class imbalance, combining models and many other core research areas. Practical applications include pattern classification [4-8], weather forecasting [9-10], evaluating sets of feature weights [11], offline signature identification [12], time series prediction [13-15], texture classification [16], language identification [17] and image classification [18-19].

The following section briefly describes different types of classifiers and their characteristics.

## Different types of Classifier

Researchers have developed several types of classifiers for data mining and related fields, such as the Naïve Bayes classifier, Support Vector Machine, the KNN (K-nearest neighbour) classifier, the LM (Levenberg-Marquardt) classifier, Modular Neural Network, fuzzy classifiers, the Parzen classifier, the Fisher linear classifier, the backpropagation classifier and the tree classifier, to name a few. These classifiers are generally derived from well-known theories such as probability theory, regression theory and other statistical theory. Some popular classifiers are discussed briefly below.

## Naïve Bayes Classifier

This classifier applies Bayes' theorem [20] of probability together with a "naïve" assumption of independence between features, which gives it its name. It is one of the most admired classifiers because of its computational efficiency and simplicity, and it often performs competitively, although large empirical comparisons such as [21] report that boosted trees and random forests usually achieve higher accuracy. The basic strategy of the Naïve Bayes classifier is the independence assumption. In simple terms, the classifier assumes that, within a class, the presence (or absence) of a specific feature is unrelated to the presence (or absence) of any other feature; that is, it treats all features as fully independent. For example, consider a Naïve Bayes classifier that recognises a fruit as an orange. The features are colour (orange), shape (round) and diameter (around 12 cm). In reality these features depend on each other. However, the classifier treats them as independent, so each one contributes individually to the probability that the fruit is an orange. The Naïve Bayes classifier is widely used in real-world applications; for example, Mozilla Thunderbird and Microsoft Outlook use it to filter out spam emails.
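The independence assumption can be illustrated with a minimal sketch. The training records, the feature values and the Laplace smoothing constant `alpha` below are all assumptions made for illustration; they are not taken from any cited work. Each feature contributes its own factor to the class score, exactly as the fruit example describes.

```python
from collections import Counter, defaultdict

# Hypothetical training records: (colour, shape, size) -> fruit label.
TRAINING = [
    (("orange", "round", "medium"), "orange"),
    (("orange", "round", "medium"), "orange"),
    (("orange", "round", "small"), "orange"),
    (("yellow", "long", "medium"), "banana"),
    (("yellow", "long", "small"), "banana"),
    (("yellow", "round", "medium"), "lemon"),
]


def train(records):
    """Count how often each class and each (class, feature, value) occurs."""
    class_counts = Counter(label for _, label in records)
    feature_counts = defaultdict(Counter)  # key: (label, feature index)
    values_seen = defaultdict(set)         # key: feature index
    for features, label in records:
        for i, value in enumerate(features):
            feature_counts[(label, i)][value] += 1
            values_seen[i].add(value)
    return class_counts, feature_counts, values_seen


def predict(features, model, alpha=1.0):
    """Return the class with the highest unnormalised posterior score."""
    class_counts, feature_counts, values_seen = model
    total = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        score = count / total  # prior P(class)
        for i, value in enumerate(features):
            # One independent factor per feature, with Laplace smoothing.
            seen = feature_counts[(label, i)][value]
            score *= (seen + alpha) / (count + alpha * len(values_seen[i]))
        scores[label] = score
    return max(scores, key=scores.get)


MODEL = train(TRAINING)
```

Calling `predict(("orange", "round", "medium"), MODEL)` selects the class whose product of prior and per-feature factors is largest. The smoothing keeps a single unseen feature value from forcing a class score to zero.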

## Support Vector Machine

Support Vector Machine (SVM), or kernel machine, is another popular classifier that is used for statistical classification and regression analysis. In many ways, it takes the opposite approach to the Naïve Bayes classifier. The basic idea of an SVM is to construct one or more hyperplanes [22], in a space whose dimension depends on the data, and to use these hyperplanes for classification, regression or other related purposes. A good hyperplane is one with the largest distance to the closest training data points of any class (the so-called functional margin), since, as a general rule, the larger the margin, the lower the generalization error of the classifier [23]. Another technique used by SVMs is to increase the dimensionality of the data, which makes a dataset easier to separate. To do so, the classifier employs an n-dimensional space (where n denotes the number of samples in the training set). The downsides of this approach are that outliers can damage the generalization ability of the classifier and that it can require a large amount of memory. SVM is a very powerful non-linear classifier, but it is cumbersome to train and slow to evaluate on large datasets. It is also sensitive to noisy data, and the manual selection of parameters and kernel function has a great impact on the classification result.

## Modular Neural Network
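The effect of increasing dimensionality can be shown with a small sketch. The four XOR-style points, the mapping `lift` and the hand-picked hyperplane `W`, `B` are assumptions chosen for illustration; nothing here is learned by an SVM solver. The points cannot be split by a straight line in two dimensions, but they become linearly separable once a third coordinate is added.

```python
import math

# XOR-style data: no straight line separates the classes in the 2-D space.
POINTS = [((0, 0), -1), ((1, 1), -1), ((0, 1), +1), ((1, 0), +1)]


def lift(x):
    """Map a 2-D point to 3-D by appending the product of its coordinates."""
    x1, x2 = x
    return (x1, x2, x1 * x2)


# Hand-picked (not learned) hyperplane w.z + b = 0 in the lifted space.
W = (1.0, 1.0, -2.0)
B = -0.5


def decision(x):
    """Signed value of the hyperplane equation for the lifted point."""
    return sum(wi * zi for wi, zi in zip(W, lift(x))) + B


def predict(x):
    return 1 if decision(x) > 0 else -1


def geometric_margin():
    """Smallest distance from any training point to the hyperplane."""
    norm = math.sqrt(sum(wi * wi for wi in W))
    return min(label * decision(x) / norm for x, label in POINTS)
```

A positive value from `geometric_margin()` confirms that every point lies on the correct side of the hyperplane. An actual SVM would search for the hyperplane that makes this margin as large as possible.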

A neural network is a powerful non-linear classifier whose design is derived from the neural structure of the human brain. It generally 'learns' one record at a time, by comparing its classification of the record with the known actual classification [24]. Any errors in this initial classification are fed back into the network and used to modify the network's algorithm [25]. This procedure is repeated until the network reaches the required accuracy. To adjust its parameters, the classifier uses a gradient-based optimization algorithm [26]. For classification, it combines two techniques: increasing dimensionality (as used in SVM) and recursive superposition of basis functions [27]. A neural network classifier needs more time to train, but once the training process is complete it works faster than many other classifiers. It also handles noisy data extremely well.

## Fuzzy classifiers
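The record-by-record learning loop can be sketched with a single artificial neuron (a perceptron), used here as a simplified stand-in for a full network. The training data (the logical AND function), the learning rate and the epoch limit are assumptions chosen for illustration, and the update rule is the classic perceptron rule rather than a general gradient-based optimizer.

```python
# Training records for the logical AND function: (inputs, known class).
RECORDS = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]


def predict(weights, bias, x):
    """Output 1 when the weighted sum of the inputs exceeds zero."""
    activation = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if activation > 0 else 0


def train(records, learning_rate=1.0, max_epochs=100):
    """Present one record at a time and correct the weights after each error."""
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(max_epochs):
        errors = 0
        for x, target in records:
            error = target - predict(weights, bias, x)
            if error != 0:
                # Feed the error back: nudge weights towards the known class.
                errors += 1
                weights = [w + learning_rate * error * xi
                           for w, xi in zip(weights, x)]
                bias += learning_rate * error
        if errors == 0:  # every record classified correctly: stop
            break
    return weights, bias


WEIGHTS, BIAS = train(RECORDS)
```

Training stops as soon as a full pass over the records produces no errors, which mirrors the "repeat until the required accuracy is reached" description above.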

The fuzzy classifier is a well-accepted classifier in artificial intelligence. A general definition of a fuzzy classifier is 'any classifier that uses fuzzy sets or fuzzy logic in the course of its training or operation'; it is also defined as a probabilistic classifier based on the following formula [28].

… (5.1)

The primary rules of this classifier are extracted from linguistic if-then rules. These rules relate all the potential values of the input variables to the associated values of the output variables. To boost performance and generate better rules, the classifier can also use a genetic algorithm. Because of its usability, the classifier has been applied in a remarkably wide range of areas. Remote sensing [29-31], satellite image analysis [32], medical image analysis [33], environmental studies [34], speech, signature and face recognition [35-37] and traffic control [38] are a few interesting examples.
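A minimal sketch of linguistic if-then rules is given below. The triangular membership functions, the 0 to 10 input scale, the three linguistic terms and the rule base are all hypothetical; they only illustrate how rule strengths can be combined, and they do not reproduce formula (5.1).

```python
def triangular(x, left, peak, right):
    """Triangular membership: 0 outside (left, right), rising to 1 at peak."""
    if x <= left or x >= right:
        return 0.0
    if x <= peak:
        return (x - left) / (peak - left)
    return (right - x) / (right - peak)


def term(left, peak, right):
    """Build the membership function of one linguistic term."""
    return lambda x: triangular(x, left, peak, right)


# Hypothetical linguistic terms for inputs measured on a 0-10 scale.
LOW = term(-5.0, 0.0, 5.0)
MEDIUM = term(0.0, 5.0, 10.0)
HIGH = term(5.0, 10.0, 15.0)

# Each rule reads: IF input1 is A AND input2 is B THEN class C.
RULES = [
    (LOW, LOW, "good"),
    (MEDIUM, MEDIUM, "fair"),
    (HIGH, HIGH, "poor"),
    (LOW, HIGH, "fair"),
    (HIGH, LOW, "fair"),
]


def classify(x1, x2):
    """Fire every rule (AND as min) and return the best-supported class."""
    support = {}
    for term1, term2, label in RULES:
        strength = min(term1(x1), term2(x2))
        support[label] = max(support.get(label, 0.0), strength)
    return max(support, key=support.get)
```

Using `min` for AND and `max` to combine rules with the same conclusion is one common choice; other fuzzy operators could be substituted without changing the overall structure.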

## Parzen classifier

The Parzen classifier is a very flexible and adaptable classifier derived from nonparametric estimation, a statistical method that fits the observed data without supervision and without the constraints of an assumed theoretical model [39]. The class-conditional probability is another component of this classifier; it is given by [40]

… (5.2)

In this expression, one symbol denotes an individual class, and the value c is obtained by evaluating the kernel densities. Parzen classifiers are widely used for image classification, handwriting recognition, pattern detection and so on. A shortcoming of this classifier is that its learning process is time-consuming.
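The idea of estimating class-conditional densities from kernels can be sketched as follows. This uses the standard textbook Parzen window estimate with a Gaussian kernel, which is not necessarily the exact form of equation (5.2); the one-dimensional data and the window width `h` are hypothetical.

```python
import math


def gaussian_kernel(u):
    """Standard normal density, used as the window function."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def parzen_density(x, samples, h):
    """Average of kernels of width h centred on each training sample."""
    return sum(gaussian_kernel((x - s) / h) for s in samples) / (len(samples) * h)


def classify(x, samples_by_class, h=1.0):
    """Pick the class maximising prior times estimated class-conditional density."""
    total = sum(len(samples) for samples in samples_by_class.values())
    scores = {
        label: (len(samples) / total) * parzen_density(x, samples, h)
        for label, samples in samples_by_class.items()
    }
    return max(scores, key=scores.get)


# Hypothetical one-dimensional training data for two classes.
DATA = {"A": [1.0, 1.5, 2.0, 2.5], "B": [6.0, 6.5, 7.0, 8.0]}
```

Every training sample is visited each time a density is evaluated, which is one reason this kind of classifier becomes expensive on large datasets.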

## The Fisher linear classifier

This is another probabilistic classifier, which takes advantage of the posterior class probabilities [41]. Using these probabilities, a class is estimated and a classifier is trained for that specific class. Classifiers for the remaining classes are constructed in the same way and then combined. The following formula is derived from Fisher linear discriminant analysis [40].

$$f = W_{opt} \cdot F \qquad (5.3)$$

where $f$ is a scalar feature, $F$ is the combination of features, and $W_{opt}$ is the optimal weight, obtained as a compromise between maximizing the between-class variance and minimizing the within-class variance.

Data classification and pattern recognition [42] are among the many applications of the Fisher classifier.
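As a rough illustration of equation (5.3), the sketch below computes a weight vector for two classes in two dimensions using the standard two-class Fisher solution, in which the weights are the inverse within-class scatter matrix applied to the difference of the class means. The data points are hypothetical, and the function names are chosen only for this example.

```python
def mean(points):
    """Component-wise mean of a list of 2-D points."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


def scatter(points, m):
    """Entries (a, b, d) of the symmetric 2x2 scatter matrix [[a, b], [b, d]]."""
    a = sum((p[0] - m[0]) ** 2 for p in points)
    b = sum((p[0] - m[0]) * (p[1] - m[1]) for p in points)
    d = sum((p[1] - m[1]) ** 2 for p in points)
    return a, b, d


def fisher_weights(class1, class2):
    """Return W_opt = S_W^{-1} (m1 - m2) for two classes of 2-D points."""
    m1, m2 = mean(class1), mean(class2)
    a1, b1, d1 = scatter(class1, m1)
    a2, b2, d2 = scatter(class2, m2)
    a, b, d = a1 + a2, b1 + b2, d1 + d2  # within-class scatter S_W
    det = a * d - b * b
    dx, dy = m1[0] - m2[0], m1[1] - m2[1]
    # Inverse of [[a, b], [b, d]] is (1/det) * [[d, -b], [-b, a]].
    return ((d * dx - b * dy) / det, (a * dy - b * dx) / det)


def project(w, p):
    """Scalar feature f = W_opt . F for one feature vector."""
    return w[0] * p[0] + w[1] * p[1]


# Hypothetical training points for two classes.
CLASS_1 = [(1, 2), (2, 3), (3, 3), (2, 1)]
CLASS_2 = [(6, 5), (7, 8), (8, 7), (7, 6)]
W_OPT = fisher_weights(CLASS_1, CLASS_2)
```

After projection, every point is reduced to a single number, and a threshold placed between the two groups of projected values separates the classes.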

Because of their efficiency and usability, a number of classifier algorithms have been developed over the last two decades. The following section briefly discusses a few of them.

## Classifier based algorithms

See also: An Empirical Comparison of Supervised Learning Algorithms [21].

With the growing volume of data held in repositories across the IT sector and the need to analyse data to make smart decisions, data mining is becoming increasingly popular. Among the various types of data mining algorithm, classifier-based algorithms have attracted significant attention from researchers because of their efficiency and robustness. Hence a huge number of algorithms have been developed under the banner of classifier algorithms. This section very briefly introduces a few of them.

Naïve Bayes

Complement Naïve Bayes was developed by Rennie et al. [43] to improve the original Naïve Bayes algorithm, especially for handling text data. They showed that their proposed amendments aligned Naïve Bayes with the realities of bag-of-words textual data and that it performed significantly better on several data sets. They also claimed that the modified algorithm is fast and easy to implement.

Naïve Bayes multinomial, proposed by McCallum and Nigam [44], is a further modification of the generic Naïve Bayes algorithm. They explored two first-order probabilistic models for classification, namely the multinomial model and the multi-variate Bernoulli model, both of which originate from the Naïve Bayes assumption. In their empirical results, the multinomial model was almost uniformly better, reducing the error rate by 27% on average and by up to 50% in some cases.

In the Naïve Bayes simple algorithm, Duda et al. [45] simplified the classifier to make classification easier. A basic characteristic of this classifier is that numeric attributes are modelled with a normal distribution.
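Modelling a numeric attribute with a normal distribution can be sketched as follows. The fruit diameters are hypothetical, and only one attribute is used; with several attributes, the per-attribute densities would be multiplied together under the Naïve Bayes assumption.

```python
import math


def fit_normal(values):
    """Mean and (population) variance of one numeric attribute."""
    mu = sum(values) / len(values)
    var = sum((v - mu) ** 2 for v in values) / len(values)
    return mu, var


def normal_pdf(x, mu, var):
    """Density of a normal distribution with mean mu and variance var."""
    return math.exp(-((x - mu) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


# Hypothetical fruit diameters in centimetres, grouped by class.
DIAMETERS = {"orange": [7.0, 7.5, 8.0, 8.5], "lemon": [5.0, 5.5, 6.0, 5.5]}

MODELS = {label: fit_normal(values) for label, values in DIAMETERS.items()}
N_TOTAL = sum(len(values) for values in DIAMETERS.values())


def classify(diameter):
    """Choose the class with the largest prior times normal likelihood."""
    scores = {
        label: (len(DIAMETERS[label]) / N_TOTAL)
        * normal_pdf(diameter, *MODELS[label])
        for label in DIAMETERS
    }
    return max(scores, key=scores.get)
```

Each class is summarised by just two numbers per attribute (mean and variance), which is what keeps this variant so cheap to train.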

The remaining algorithms are listed below, each with the title of the work associated with it.

Pace Regression: "Modeling for optimal probability prediction"

Winnow: "Learning quickly when irrelevant attributes abound: A new linear threshold algorithm"

VotedPerceptron: "Large margin classification using the perceptron algorithm"

IBk: "Instance-based learning algorithms"

LWL: "Locally Weighted Naive Bayes"

Bagging: "Bagging predictors"

LogitBoost: "Additive Logistic Regression: a Statistical View of Boosting"

Stacking: "Stacked generalization"

VFI: "Classification by voting feature intervals"

ID3: "Induction of decision trees"

NBTree: "Scaling up the accuracy of naive-Bayes classifiers: a decision tree hybrid"

LMT: "Logistic Model Trees" (ECML 2003)

JRip: Repeated Incremental Pruning to Produce Error Reduction (RIPPER)

Prism: "PRISM: An algorithm for inducing modular rules". PRISM was proposed as an algorithm for inducing modular rules under very restrictive assumptions.

ZeroR: "Very simple classification rules perform well on most commonly used datasets"

## Algorithm selection

The previous section discussed a few classifier-based algorithms out of the huge number available. In fact, picking a subset is a very complicated task, as most of these algorithms are well developed and perform well in different domains. Moreover, because different algorithms are evaluated with different performance metrics, choosing the best one is a dilemma. In this experiment, I have chosen Naïve Bayes, SMO and PART because of the simplicity of the algorithms, their robustness in data handling and their efficiency in classification. The PART algorithm was discussed in Chapter 4, Section 4.3.3. The following two sections briefly discuss the other two algorithms, Naïve Bayes and SMO.

## Naïve Bayes

The Naïve Bayes classifier is based on Bayes' theorem and is particularly suited to inputs of high dimensionality. Despite its simplicity, Naïve Bayes can often outperform more sophisticated classification methods.

Figure 5.1: A simple example of Naïve Bayes classification

To demonstrate the concept of Naïve Bayes classification, consider the example in Figure 5.1. As indicated, the objects can be classified as either GREEN or RED. The task is to classify new cases as they arrive, i.e., to decide which class label they belong to, based on the currently existing objects. Since there are twice as many GREEN objects as RED, it is reasonable to believe that a new case (which has not been observed yet) is twice as likely to be GREEN as RED. In Bayesian analysis, this belief is known as the prior probability. Prior probabilities are based on previous experience, in this case the proportions of GREEN and RED objects, and are often used to predict outcomes before they actually happen.
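The prior probabilities in this example can be computed directly from class counts. The counts below (40 GREEN and 20 RED) are assumed for illustration; the text only states that GREEN objects are twice as numerous as RED ones.

```python
from fractions import Fraction

# Assumed counts: the text only says GREEN is twice as common as RED.
COUNTS = {"GREEN": 40, "RED": 20}


def prior_probabilities(counts):
    """Prior of each class = number of objects in the class / total objects."""
    total = sum(counts.values())
    return {label: Fraction(n, total) for label, n in counts.items()}


PRIORS = prior_probabilities(COUNTS)
```

With these counts the prior for GREEN is 2/3 and the prior for RED is 1/3, so a new case is twice as likely to be GREEN before any of its features are examined.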

## Bayes Theorem:

Let X be the data record (case) whose class label is unknown. Let H be some hypothesis, such as "data record X belongs to a specified class C." For classification, we want to determine P (H|X) -- the probability that the hypothesis H holds, given the observed data record X.

P(H|X) is the posterior probability of H conditioned on X: for example, the probability that a fruit is an apple, given that it is red and round. In contrast, P(H) is the prior probability, or a priori probability, of H. In this example, P(H) is the probability that any given data record is an apple, regardless of how the data record looks. The posterior probability, P(H|X), is based on more information (such as background knowledge) than the prior probability, P(H), which is independent of X. Similarly, P(X|H) is the posterior probability of X conditioned on H; that is, it is the probability that X is red and round given that we know that X is an apple. P(X) is the prior probability of X, i.e., the probability that a data record from our set of fruits is red and round. Bayes' theorem is useful because it provides a way of calculating the posterior probability, P(H|X), from P(H), P(X) and P(X|H). Bayes' theorem is

$$P(H \mid X) = \frac{P(X \mid H)\,P(H)}{P(X)} \qquad (5.4)$$
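A short worked instance of equation (5.4) for the apple example is sketched below. The three input probabilities are invented purely for illustration; only the formula itself comes from the text.

```python
from fractions import Fraction


def posterior(p_x_given_h, p_h, p_x):
    """Bayes' theorem: P(H|X) = P(X|H) * P(H) / P(X)."""
    return p_x_given_h * p_h / p_x


# Hypothetical values for the fruit example.
P_H = Fraction(1, 5)           # P(H): a record is an apple
P_X_GIVEN_H = Fraction(7, 10)  # P(X|H): an apple is red and round
P_X = Fraction(1, 4)           # P(X): any record is red and round

P_H_GIVEN_X = posterior(P_X_GIVEN_H, P_H, P_X)
```

Under these assumed numbers, observing that a fruit is red and round raises the probability that it is an apple from the prior of 1/5 to a posterior of 14/25.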

## Some characteristics of Naïve Bayes:

Naïve Bayes classifiers can be built with real-valued inputs.

Bayes classifiers do not try to be maximally discriminative; they simply try to model honestly what is going on.

Naïve Bayes is computationally very cheap and copes comfortably with 10,000 attributes.

## SMO

SMO (Sequential Minimal Optimization) was proposed by Platt [46] in 1999. The algorithm contains many optimizations designed to speed it up on large datasets and to ensure that it converges even under degenerate conditions. SMO solves the large quadratic programming (QP) optimization problem that arises in the training of support vector machines. It does not need any extra matrix storage and does not require numerical QP optimization steps at all. SMO chooses to solve the smallest possible optimization problem at every step: it chooses two multipliers to jointly optimize, finds the optimal values for these multipliers, and updates the SVM to reflect the new optimal values. There are two components to SMO: an analytic method for solving for the two multipliers, and a heuristic for choosing which multipliers to optimize.

The advantage of SMO lies in the fact that solving for two multipliers can be done analytically. Thus, numerical QP optimization is avoided entirely. The inner loop of the algorithm can be expressed in a short amount of C code, rather than invoking an entire QP library routine. Even though more optimization sub-problems are solved in the course of the algorithm, each sub-problem is so fast that the overall QP problem is solved quickly. In addition, SMO requires no extra matrix storage at all. Thus, very large SVM training problems can fit inside of the memory of an ordinary personal computer or workstation. Because no matrix algorithms are used in SMO, it is less susceptible to numerical precision problems.
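The analytic step for two multipliers can be sketched as a single function. This is a simplified illustration of the pair update described by Platt, not a full SMO implementation: the heuristic for choosing the pair, the threshold update and the handling of degenerate cases are omitted, and the function and parameter names are chosen for this example.

```python
def smo_pair_update(a1, a2, y1, y2, e1, e2, k11, k12, k22, c):
    """Analytically optimise two Lagrange multipliers.

    a1, a2: current multipliers; y1, y2: class labels (+1 or -1);
    e1, e2: prediction errors f(x_i) - y_i; k11, k12, k22: kernel values;
    c: upper bound of the box constraint 0 <= a <= c.
    Returns the inputs unchanged when no progress can be made.
    """
    # Ends of the feasible segment for a2 (box plus linear constraint).
    if y1 != y2:
        low, high = max(0.0, a2 - a1), min(c, c + a2 - a1)
    else:
        low, high = max(0.0, a1 + a2 - c), min(c, a1 + a2)
    eta = k11 + k22 - 2.0 * k12  # curvature along the constraint line
    if low >= high or eta <= 0.0:
        return a1, a2  # simplified: degenerate cases are skipped
    a2_new = a2 + y2 * (e1 - e2) / eta    # unconstrained optimum
    a2_new = min(high, max(low, a2_new))  # clip to the feasible segment
    a1_new = a1 + y1 * y2 * (a2 - a2_new)  # keep y1*a1 + y2*a2 constant
    return a1_new, a2_new
```

No matrix is stored and no numerical QP routine is called; the update needs only three kernel values and a handful of arithmetic operations, which is the property the paragraph above emphasises.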

## Performance measuring tools

Accuracy

In the Weka Experiment Environment, accuracy is found under the 'Analyse' tab, in the 'Comparison field' section. Accuracy compares the two models on the percentage of test cases each model classified correctly and on whether there was any statistically significant difference in their performance. We shall then rank them by their training time. Of course, many other tests can be conducted. However, since the information presented for each test is the same, these examples suffice to explain how to conduct any of the other possible tests.

To begin the analysis, we first load the experiment results by pressing the Experiment button (in the Source area). This loads the results of the last experiment. Continuing our example with the birth weight data, 200 results will be loaded. To compare the two models on their percentage-correct score, we set the Comparison field to Percentage_correct and the baseline model to the J48 model. We also tick the 'Show std. deviations' checkbox. We run the test by clicking 'Perform test', and the results are displayed in the test output area on the right.

Confusion Matrix

Statistical Model

Modelling time

Learning algorithms are now used in many domains, and different performance metrics are appropriate for each domain. For example, precision/recall measures are used in information retrieval; medicine prefers ROC area; lift is appropriate for some marketing tasks, etc. The different performance metrics measure different trade-offs in the predictions made by a classifier, and it is possible for learning methods to perform well on one metric but be suboptimal on other metrics. Because of this, it is important to evaluate algorithms on a broad set of performance metrics.
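To make the difference between metrics concrete, the sketch below derives accuracy, precision and recall from the confusion matrix of a binary classifier. The actual and predicted labels are hypothetical, and this is a plain Python illustration rather than the computation performed inside Weka.

```python
def confusion_matrix(actual, predicted, positive):
    """Return (tp, fp, fn, tn) counts for a binary problem."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1
        elif p == positive:
            fp += 1
        elif a == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def metrics(tp, fp, fn, tn):
    """Accuracy, precision and recall from confusion-matrix counts."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# Hypothetical test set: 4 spam and 6 non-spam ("ham") emails.
ACTUAL = ["spam"] * 4 + ["ham"] * 6
PREDICTED = ["spam", "spam", "spam", "ham"] + ["ham"] * 5 + ["spam"]

COUNTS = confusion_matrix(ACTUAL, PREDICTED, positive="spam")
SCORES = metrics(*COUNTS)
```

Here the classifier is right on 8 of 10 emails, yet only 3 of the 4 emails it flags are spam and only 3 of the 4 spam emails are caught, so accuracy alone gives a more favourable picture than precision or recall.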

See also: An Empirical Comparison of Supervised Learning Algorithms [21].

See also: Evaluating the Performance of Different Classification Algorithms for Fabricated Semiconductor Wafers.

Algorithm Selection

## Classifier

What is classifier?

How classifier works?

Advantages of classifiers

Some disadvantages

Application of classifiers

## Brief description of few algorithms

Literature review (most recent)

(use of classifiers in different algorithm)

Survey on Multiclass Classification Methods by Mohamed Aly

Survey of Classification Techniques in Data Mining by Thair Nu Phyu

Classification Algorithms in Comparing Classifier Categories to Predict the Accuracy of the Network Intrusion Detection - A Machine Learning Approach by G. Meera Gandhi and S.K. Srivatsa

Comparison of different classification algorithms for weed detection from images based on shape parameters by Martin Weis, Till Rumpf, Roland Gerhards, Lutz Plümer

[1] K. Murphy, "Naive Bayes classifiers," The University of British Columbia, Vancouver, Canada, 2006.

[2] J. Wu, et al., "Automatic Collocation Suggestion in Academic Writing," in the 48th Annual Meeting of the Association for Computational Linguistics, Uppsala, Sweden, 2010, pp. 115-119.

[3] L. Kuncheva, "Fuzzy classifiers," Scholarpedia, vol. 3, p. 2925, 2008.

[4] R. Lippmann, "Pattern classification using neural networks," IEEE communications magazine, vol. 27, pp. 47-50, 1989.

[5] G. Ou and Y. Murphey, "Multi-class pattern classification using neural networks," Pattern Recognition, vol. 40, pp. 4-18, 2007.

[6] U. Yoon, et al., "Pattern classification using principal components of cortical thickness and its discriminative pattern in schizophrenia," Neuroimage, vol. 34, pp. 1405-1415, 2007.

[7] B. Biggio, et al., "Adversarial pattern classification using multiple classifiers and randomisation," Structural, Syntactic, and Statistical Pattern Recognition, pp. 500-509, 2010.

[8] C. Ecker, et al., "Investigating the predictive value of whole-brain structural MR scans in autism: A pattern classification approach," Neuroimage, vol. 49, pp. 44-56, 2010.

[9] A. Dutot, et al., "A 24-h forecast of ozone peaks and exceedance levels using neural classifiers and weather predictions," Environmental Modelling & Software, vol. 22, pp. 1261-1269, 2007.

[10] A. D'onofrio, et al., "CHAC: a weather pattern classification system for regional climate downscaling of daily precipitation," Climatic change, vol. 98, pp. 405-427, 2010.

[11] M. Analoui and M. Amiri, "Feature reduction of nearest neighbor classifiers using genetic algorithm," Proceedings of World Academy of Science, Engineering and Technology, 2006.

[12] D. Kisku, et al., "Offline Signature Identification by Fusion of Multiple Classifiers using Statistical Learning Theory," Arxiv preprint arXiv:1003.5865, 2010.

[13] A. Chitra and S. Uma, "An Ensemble Model of Multiple Classifiers for Time Series Prediction," International Journal of Computer Theory and Engineering, vol. 2, pp. 455-458, 2010.

[14] R. Yadav, et al., "Time series prediction with single multiplicative neuron model," Applied soft computing, vol. 7, pp. 1157-1163, 2007.

[15] I. Beliaev and R. Kozma, "Time series prediction using chaotic neural networks on the CATS benchmark," Neurocomputing, vol. 70, pp. 2426-2439, 2007.

[16] M. Gangeh, et al., "Scale-space texture classification using combined classifiers," in 15th Scandinavian conference on Image analysis Aalborg, Denmark 2007, pp. 324-333.

[17] H. Suo, et al., "Using SVM as back-end classifier for language identification," EURASIP Journal on Audio, Speech, and Music Processing, vol. 2008, p. 2, 2008.

[18] S. Kim, et al., "DisIClass: discriminative frequent pattern-based image classification," in Proceedings of the Tenth International Workshop on Multimedia Data Mining Washington, D.C., 2010, pp. 1-10.

[19] P. Chou, et al., "Kernel-Based Nonlinear Feature Extraction for Image Classification," in Geoscience and Remote Sensing Symposium (IGARSS-2008. IEEE International) Boston, MA 2008, pp. 931-934.

[20] G. Barnard and T. Bayes, "Studies in the History of Probability and Statistics: IX. Thomas Bayes's Essay Towards Solving a Problem in the Doctrine of Chances," Biometrika, vol. 45, pp. 293-315, 1958.

[21] R. Caruana and A. Niculescu-Mizil, "An empirical comparison of supervised learning algorithms," in The 23rd international conference on Machine learning Pittsburgh, Pennsylvania 2006, pp. 161-168.

[22] V. Vapnik and S. Kotz, Estimation of dependences based on empirical data: Empirical inference science: afterword of 2006, 2nd ed.: Springer-Verlag New York Inc, 2006.

[23] C. Park, "Generalization error rates for margin-based classifiers," PhD Thesis, Department of Statistics, The Ohio State University, Ohio, 2005.

[24] J. Anderson and J. Davis, An introduction to neural networks: MIT Press, 1995.

[25] G. Panchal, et al., "Forecasting Employee Retention Probability Using Back Propagation Neural Network Algorithm," in 2010 Second International Conference on Machine Learning and Computing, Bangalore, India 2010, pp. 248-251.

[26] Y. Shang and B. Wah, "Global optimization for neural network training," Computer, vol. 29, pp. 45-54, 1996.

[27] S. Volkov, "Generating some classes of recursive functions by superpositions of simple arithmetic functions," Springer, vol. 76, pp. 566-567, 2007.

[28] L. Kuncheva, Fuzzy classifier design. Heidelberg: Springer-Verlag, 2000.

[29] F. Wang, "Fuzzy supervised classification of remote sensing images," IEEE Transactions on Geoscience and Remote Sensing, vol. 28, pp. 194-201, 1990.

[30] U. Benz, et al., "Multi-resolution, object-oriented fuzzy analysis of remote sensing data for GIS-ready information," ISPRS Journal of Photogrammetry and Remote Sensing, vol. 58, pp. 239-258, 2004.

[31] M. Pepe, et al., "Accuracy benefits of a fuzzy classifier in remote sensing data classification of snow," in Fuzzy Systems Conference, 2007(FUZZ-IEEE 2007), London, 2007, pp. 1-6.

[32] L. Wendling, et al., "Fuzzy segmentation and structural knowledge for satellite image analysis," Springer, vol. 974/1995, pp. 703-708, 1995.

[33] A. Fadi and A. Ikhlas, "Hybrid mammogram classification using rough set and fuzzy classifier," International Journal of Biomedical Imaging, vol. 2009, p. 12, 2009.

[34] N. Chang, et al., "Identification of river water quality using the fuzzy synthetic evaluation approach," Journal of Environmental Management, vol. 63, pp. 293-305, 2001.

[35] K. Kwak and W. Pedrycz, "Face recognition using a fuzzy fisherface classifier," Pattern Recognition, vol. 38, pp. 1717-1732, 2005.

[36] A. Amano, et al., "On the use of neural networks and fuzzy logic in speech recognition," in International Joint Conference on Neural Networks (IJCNN 1989), Washington, DC , USA 1989, pp. 301-305.

[37] M. Hanmandlu and M. Yusof, "Off-line signature verification and forgery detection using fuzzy modeling," Pattern Recognition, vol. 38, pp. 341-356, 2005.

[38] J. Clymer, "Simulation of a vehicle traffic control network using a fuzzy classifier system," in Annual Simulation Symposium, 2002, 2002, pp. 285-291.

[39] M. An and R. Ayala, "Nonparametric estimation of a survivor function with across-interval-censored data."

[40] M. Corona-Nakamura, et al., "Improving classification with combined classifiers," in 2002 WSEAS Int.Conf. on Signal Processing, Robotics and Automation (ISPRA '02), Cadiz, Spain, 2002, pp. 1531-1534.

[41] A. Guerrero-Curieses, et al., "Local estimation of posterior class probabilities to minimize classification errors," IEEE Transactions on Neural Networks, vol. 15, p. 309, 2004.

[42] A. Webb, Statistical pattern recognition. New York, USA: J Wiley and Sons Inc., 2002.

[43] J. Rennie, et al., "Tackling the poor assumptions of naive bayes text classifiers," in International Conference on Machine Learning (ICML-2003), 2003, p. 616.

[44] A. McCallum and K. Nigam, "A comparison of event models for naive bayes text classification," in The 15th National Conference on Artificial Intelligence (AAAI-98) workshop on Learning for Text Categorization, Madison, Wisconsin, 1998.

[45] R. Duda, et al., Pattern Classification and Scene Analysis. New York: A Wiley-Interscience Publication, 1994.

[46] J. Platt, "Fast training of support vector machines using sequential minimal optimization," Advances in kernel methods: support vector learning, pp. 185-208, 1999.