Support Vector Machine An Intelligent System Computer Science Essay

Published: Last Edited:

This essay has been submitted by a student. This is not an example of the work written by our professional essay writers.

Machine Learning is an area of Artificial Intelligence concerned with the study of computer algorithms that improve automatically through experience. The major focus of Machine Learning research is to extract information from data automatically by computational and statistical methods. Supervised Learning is a Machine Learning technique whereby the algorithm is first presented with training data based on examples which include both the inputs and the desired outputs; thus enabling it to learn a function. The learner should then be able to generalize from the presented data to unseen examples. This article aims to introduce a comparatively new statistical Supervised Machine Learning algorithm called Support Vector Machine , which is a modern outgrowth of Artificial Neural Networks . We explain two useful areas of Support Vector Machine, that is, classification and regression. Prospects and challenges of future research in this emerging area are also described.

Keywords : Machine Learning, Supervised Machine Learning, Support Vector Machine, Statistical Learning , Support Vector Classification, Support Vector Regression, Support Vectors, Feature Space, Kernel Functions.

1. Intelligent Systems(ISs) : From the beginning, Machine Learning(ML) methodology, which is the origin of Artificial Intelligence (AI), has been rapidly spreading in the different research communities with successful outcomes. ISs are systems which employ methods and techniques from AI which is an intelligence of machines. In our ever-increasing information age, advanced ISs are required to manage and understand enormous amounts of data, and to learn their salient characteristics.

The area of ISs has expanded phenomenally over the years since the 1940s; both in terms of the range of techniques and also in terms of the number of applications wherein they have often provided a competitive edge when compared with other approaches. IS includes a range of techniques that work synergistically and provides, in one form or another, flexible data/information processing capabilities for handling real life situations. IS , unlike conventional techniques, can exploit the tolerance for imprecision, uncertainty/ambiguities, approximate reasoning and partial truth in order to achieve tractability, robustness, and low cost solutions.

The techniques are in general based on biologically inspired strategies for solving problems. At the current time the major categories of IS include Neural Networks(NNs), Fuzzy Logic/Systems(FL/Ss), Evolutionary Computation/Algorithms(EC/As)( including Genetic Algorithms(Gas),Genetic Programming(GP), Evolutionary Strategies(ES)), Support Vector Machines(SVMs),Particle Swarm Optimization(PSO), Memetic Algorithms(MAs), and Ant Colony Optimization(ACO). In addition hybrid combinations also play a major role. These include Neuro-Fuzzy, Neuro-Genetic, Fuzzy-Genetic systems and so on.

2.Goal of the Article : The main objective of this article is to introduce the basic concepts of SVM and its advantages over the other machine learning algorithms. However before coming to the actual point it is nice to tell why this technique became more and more necessary and popular as compared to its counterparts. For example , the problem of empirical data modelling is germane to many engineering applications. In empirical data modelling a process of induction is used to build up a model of the system, from which it is hoped to deduce responses of the system that have yet to be observed. Ultimately the quantity and quality of the observations govern the performance of this empirical model. By its observational nature data obtained is finite and sampled; typically this sampling is non_uniform and due to the high dimensional nature of the problem the data will form only a sparse distribution in the input space. Consequently the problem is nearly always ill posed. Traditional NN approaches have suffered difficulties with generalisation, producing models that can overfit the data. This is a consequence of the optimisation algorithms used for parameter selection and the statistical measures used to select the 'best' model. SVMs are gaining popularity due to many attractive features, and promising empirical performance. The formulation embodies the Structural Risk Minimisation(SRM) principle, which has been shown to be superior, to traditional Empirical Risk Minimisation(ERM) principle, employed by conventional NNs. SRM minimises an upper bound on the expected risk, as opposed to ERM that minimises the error on the training data. It is this difference which equips SVM with a greater ability to generalise, which is the goal in Statistical Learning.

3. SVM : SVM is a supervised learning technique from the field of machine learning which was originally developed to solve the classification problem, but later on extended to the domain of regression problems. SVMs are rooted in Statistical Learning Theory and have been developed by Vladimir Vapnik and co-workers in 1995. SVMs are a method that can perform binary classification(pattern recognition), real valued function approximation(regression estimation) and time series prediction tasks . They quickly gained attention from the pattern recognition community due to a number of theoretical and computational merits. These include, for example, the simple geometrical interpretation of the margin, uniqueness of the solution, statistical robustness of the loss function, modularity of the kernel function, and overfit control through the choice of a single regularization parameter. A special characteristic of SVM is that the solution to a classification problem is represented by the support vectors that determine the maximum margin hyperplane. The time needed to predict a pattern with an SVM model is proportional to the number of support vectors. This makes prediction slow when the SVM has a large number of support vectors. SVMs have been applied to many real-world problems such as text( and hypertext ) categorisation, image classification, biosequence analysis and biological data mining, hand-written character recognition ,face recognition etc .

3.1.Support Vector Classification(SVC): SVM is primarily a classfier method that performs classification tasks by constructing hyperplanes in a multidimensional space that separates cases of different class labels. SVM models were originally defined for the classification of linearly separable classes of objects. A schematic example is shown in Fig. 1. In this example, the objects belong either to class BLACK or GREY. The separating line defines a boundary on the right side of which all objects are GREY and to the left of which all objects are BLACK. Any new object (say white circle) falling to the right is labeled, i.e., classified, as GREY or classified as BLACK should it fall to the left of the separating line.

C:\Users\Hardeep\AppData\Local\Microsoft\Windows\Temporary Internet Files\Content.Word\revised SVMIntro1.gif

Fig. 1 - Linear Classifier.

The above is a classic example of a linear classifier, i.e., a classifier that separates a set of objects into their respective groups (BLACK and GREY in this case) with a line. Due to the limited computational power of linear learning machines, all the classes cannot be separated with a linear classifier and often more complex structures are needed in order to make an optimal separation, i.e., correctly classify new objects (test cases) on the basis of the examples that are available (train cases). In general, complex real-world applications require more expressive hypothesis spaces than linear functions. Typically the data will only be linearly separable in some, possibly very high dimensional feature space. It may not make sense to try and separate the data exactly, particularly when only a finite amount of training data is available which is potentially corrupted by noise. Hence in practice it will be necessary to employ the non-separable approach. This situation has been illustrated in Fig. 2 . Compared to the previous schematic, it is clear that a full separation of the BLACK and GREY objects would require a curve (which is more complex than a line). Classification tasks based on drawing separating lines to distinguish between objects of different class memberships are known as hyperplane classifiers. SVMs are particularly suited to handle such tasks.

C:\Users\Hardeep\AppData\Local\Microsoft\Windows\Temporary Internet Files\Content.Word\revised SVMIntro2.gif

Fig. 2 - Non-Linear Classifier.

When SVMs are used to separate classes that cannot be separated with a linear classifier, the coordinates of the objects are mapped into a feature space using nonlinear functions called feature functions Φ. The feature space is a high-dimensional space in which the two classes can be separated with a linear classifier. The nonlinear feature classifier function Φ combines the input space( the original coordinates of the objects) into the feature space, which can even have an infinite dimension. Because the feature space is high dimensional, it is not practical to use directly feature functions Φ in computing the classification hyperplane. Instead, the nonlinear mapping induced by the feature functions is computed with special nonlinear functions called kernels.

The illustration given in Fig. 3 shows the basic idea behind SVMs. Here we see the original objects (left side of the schematic) mapped, i.e., rearranged, using kernels. Note that in this new setting, the mapped objects (right side of the schematic) is linearly separable and, thus, instead of constructing the complex curve (left schematic), all we have to do is to find an optimal line that can separate the BLACK and GREY objects.

C:\Users\Hardeep\AppData\Local\Microsoft\Windows\Temporary Internet Files\Content.Word\revised SVMIntro3.gif

Fig.3 - Mapping of Input Space into a High Dimensional Feature Space using Feature Functions.

Kernels have the advantage of operating in the input space(rather than the potentially high dimensional feature space), where the solution of the classification problem is a weighted sum of kernel functions evaluated at the support vectors. Hence the inner product does not need to be evaluated in the feature space. This provides a way of addressing the curse of dimensionality. However, the computation is still critically dependent upon the number of training patterns and to provide a good data distribution for a high dimensional problem will generally require a large training set. An inner product in feature space has an equivalent kernel in input space, provided certain conditions hold. If the kernel function is symmetric positive definite ,which satisfies Mercer's Conditions, then the kernel represents a legitimate inner product in feature space. Valid functions that satisfy Mercer's Conditions are: Polynomial, Gaussian Radial Basis Function, Exponential Radial Basis Function, Multi-Layer Perceptron, Fourier Series, Splines, B Splines, Additive Kernels, Tensor Product.

3.1.1.Kernel Selection : The obvious question that arises is that with so many different mappings to choose from, which is the best for a particular problem? This is not a new question, but with the inclusion of many mappings within one framework it is easier to make a comparison. The upper bound on the Vapnik-Chervonenkis(VC) dimension is a potential avenue to provide a means of comparing the kernels. However, it requires the estimation of the radius of the hypersphere enclosing the data in the non-linear feature space. As a final caution, even if a strong theoretical method for selecting a kernel is developed, unless this can be validated using independent test sets on a large number of problems, methods such as bootstrapping and cross-validation will remain the preferred method for kernel selection.

Studies investigating the universal approximation capabilities of SVMs have demonstrated that SVM with usual kernels (such as polynomial, Gaussian RBF, or dot product kernels) can approximate any measurable or continuous functions up to any desired accuracy . Any set of patterns can therefore be modelled perfectly if the appropriate kernel and parameters are used.

3.2.Support Vector Regression (SVR) : In 1996 , SVMs were extended by Vapnik for regression by using an ε-insensitive loss function as shown in Fig. 4. The learning set of patterns is used to obtain a regression model that can be represented as a tube with radius ε fitted to the data. In this ideal case, SVM regression finds a function that maps all input data with a maximum deviation ε from the target(experimental) values. In this case, all training points are located inside the regression tube. However, for datasets affected by errors, it is not possible to fit all the patterns inside the tube and still have a meaningful model. For the general case, SVM regression considers that the error for patterns inside the tube is zero, whereas patterns situated outside the regression tube have an error that increases when the distance to the tube margin increases.

C:\Users\Hardeep\AppData\Local\Microsoft\Windows\Temporary Internet Files\Content.Word\regression1 001.jpg

Fig 4 : Support vector machines regression determines a tube with radius ε fitted to the data.

The model produced by SVC depends only on a subset of the training data, because the cost function for building the model does not care about training points that lie beyond the margin. Analogously, the model produced by SVR also depends only on a subset of the training data, because the cost function for building the model ignores any training data close to the model prediction( within a threshold ε ).

Similarly to classification problems, a non-linear model is usually required in SVR to adequately model data. In the same manner as the non-linear SVC approach, a non-linear mapping can be used to map the data into a high dimensional feature space where linear regression is performed. The kernel approach is again employed to address the curse of dimensionality.

4.Discussion : SVMs are an attractive approach to data modelling. They combine generalisation control with a technique to address the curse of dimensionality. The kernel mapping provides a unifying framework for most of the commonly employed model architectures, enabling comparisons to be performed. In classification problems generalisation control is obtained by maximising the margin, which corresponds to minimisation of the weight vector in a canonical framework. The solution is obtained as a set of support vectors that can be sparse. These lie on the boundary and as such summarise the information required to separate the data. The minimisation of the weight vector can be used as a criterion in regression problems, with a modified loss function.

For SVM classification problems , highly nonlinear kernels can eventually separate perfectly the classes of patterns with intricate hyperplanes. This is what the universal approximation capabilities of SVM guarantees. However, these capabilities cannot promise that the resulting SVM will be optimally predictive. In fact, only empirical comparison with other classification algorithms such as kNN, Linear Discriminant Analysis, ANNs etc. can demonstrate, for a particular problem, that SVM is better or worse than other classification methods. In many cases, the statistical difference between methods is not significant, and due to the limited number of samples used in those studies, one cannot prefer a method against others. In some reports it is claimed that , due to their derivation from SRM , SVMs do not overfit. However, there are many cases where the SVM solution is overfitted for simple datasets.

Based on the discussion, it is inferred that no single learning algorithm can uniformly outperform other algorithms over all databases and the only way left to identify the best algorithm for a given data set is to experiment with different algorithms.

Future directions include : A technique for choosing the kernel function and additional capacity control; Development of kernels with invariances. Thus in real applications, one must carefully select the nonlinear kernel function needed to generate a classification hyperplane that is topologically appropriate and has optimum predictive power. Future research of SVM will provide improved and quality access to the users. Therefore, developing an automated SVM system with state-of-the-art technologies is of paramount importance.