Data mining is a technology that has been around for years. Many companies use its techniques to analyze market trends. It allows users and companies to evaluate data from a number of different perspectives and to sort it into various categories, looking for correlations among shopping trends in order to sharpen their consumer focus. At its core, it is the search for patterns in a huge relational database. Data mining can provide companies with all sorts of consumer-related factors such as demographics, pricing, etc. This can be very helpful, as it can be used to target each customer with specific deals and product advertisements. For example, credit card companies often offer mortgage deals and loan ads based on customers' expenses, account balances and credit history. Amazon recommends products based on what a customer purchased recently and what items they have been searching for. Similarly, Netflix and Blockbuster recommend rentals to each individual based on their rental history.
Support vector machines are very useful for supervised classification and regression. They can perform data fitting and classification on conventional data sets as well as on more abstract information. SVMs perform classification by taking the inputs and constructing hyperplanes, which makes it easy to sort the data points into categories. The simplest form is binary linear classification, where the data are separated into two classes. The greater the distance between the two classes, the better the hyperplane. A support vector machine defines boundaries using a decision plane: it splits the objects belonging to different classes and draws a decision boundary to classify them clearly.
The goal of our project is to understand supervised learning methods and to achieve text classification using Support Vector Machines (SVMs). To achieve this, we have used SVM algorithms via the LIBSVM wrapper in the Weka environment. LIBSVM makes it easy to use its various tools through function calls/APIs that give control over classification parameters. Advanced capabilities are also provided by the GUI-rich Weka environment. We have developed two frames to accept a range of inputs and provide classified output in graphical form, displaying correctly classified values among other results. We use pre-processed training and test data sets in ARFF format.
Support Vector Machine (SVM)
In formal terms, a support vector machine creates a hyperplane (or a set of hyperplanes) for the purpose of classification of data. It can also be used for regression or other purposes.
Figure 1 Author: Cyc (Wikipedia)
An SVM model in a finite space is a set of points mapped in such a way that the separate categories are split by a gap that is as wide as possible. The most effective hyperplane is the one that divides the categories such that the distance between the hyperplane and the data point closest to it is maximized. A larger margin usually yields better generalization by the classifier.
For example, in Figure 1, H3 does not divide the categories at all. H2 divides them correctly, but its margin is too small. H1, on the other hand, is the best hyperplane: it provides the maximum margin and hence a better classification. Such a plane is called the maximum-margin hyperplane, and the linear classifier it defines is called a maximum-margin classifier.
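As a small standalone illustration (not part of the project code), the decision rule of a linear classifier can be sketched in Java: a point is assigned to a class according to the sign of w·x + b, where the weight vector w and bias b below are made-up values rather than learned ones.

```java
// Minimal sketch of a linear SVM decision rule: class = sign(w.x + b).
// The weight vector w and bias b here are illustrative, not learned values.
public class LinearDecision {
    // Dot product of two vectors of equal length
    static double dot(double[] a, double[] b) {
        double s = 0.0;
        for (int i = 0; i < a.length; i++) s += a[i] * b[i];
        return s;
    }

    // Returns +1 or -1 depending on which side of the hyperplane x lies
    static int classify(double[] w, double b, double[] x) {
        return dot(w, x) + b >= 0 ? 1 : -1;
    }

    public static void main(String[] args) {
        double[] w = {1.0, -1.0}; // hypothetical separating hyperplane x1 - x2 = 0
        double b = 0.0;
        System.out.println(classify(w, b, new double[]{2.0, 1.0})); // prints 1
        System.out.println(classify(w, b, new double[]{1.0, 3.0})); // prints -1
    }
}
```

A real SVM chooses w and b so that the margin between the two classes is maximized; this sketch only shows how a fixed hyperplane assigns classes.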
It's not so straightforward however. Notice the following figure:
Figure 2 Kernel Function mapping
In the original input space (left), the data points often cannot be separated by a straight line. This space can be mapped to a significantly higher-dimensional space so that the separation becomes much easier. The data points on the left need to be rearranged so that they become linearly separable (right). This process of rearrangement is called mapping, and it is done using kernel functions. The SVM needs only dot products between data points, and these can be computed in terms of the variables in the original input space.
These dot products are defined in terms of a function known as a kernel function, K(x, y). Different kernel functions use different methods to achieve effective classification. In simple terms, all the SVM has to do is find an optimal boundary that separates the data classes well and categorizes them effectively.
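For concreteness, the widely used RBF kernel K(x, y) = exp(-γ·||x − y||²) can be computed directly from two input vectors. The following self-contained Java sketch (illustrative, not the project code) shows the calculation:

```java
// Sketch of the RBF (Gaussian) kernel: K(x, y) = exp(-gamma * ||x - y||^2).
public class RbfKernel {
    static double rbf(double[] x, double[] y, double gamma) {
        double sq = 0.0;
        for (int i = 0; i < x.length; i++) {
            double d = x[i] - y[i];
            sq += d * d; // squared Euclidean distance
        }
        return Math.exp(-gamma * sq);
    }

    public static void main(String[] args) {
        double[] a = {1.0, 2.0};
        // Identical points always give K = 1, regardless of gamma
        System.out.println(rbf(a, new double[]{1.0, 2.0}, 0.5)); // prints 1.0
        // Distant points give values close to 0
        System.out.println(rbf(a, new double[]{5.0, 6.0}, 0.5));
    }
}
```

The kernel value acts as a similarity measure: nearby points score close to 1, distant points close to 0, and gamma controls how quickly the similarity falls off.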
Types of Kernel Functions:
The figure below lists the commonly used kernel functions:
Figure 3 Kernel Functions (Apr 2010, Chih-Wei Hsu, Chih-Chung Chang, and Chih-Jen Lin)
The most widely used kernel is the RBF kernel. The linear kernel is a special case of the RBF kernel for specific parameter values. When the number of instances is smaller than the number of features, the linear kernel is the best choice. When there is no linear relationship between attributes and instances, the RBF kernel is used to map the values to a higher-dimensional space. The sigmoid and polynomial kernels have very specific use cases.
LIBSVM
LIBSVM is a library for Support Vector Machines (SVMs). It can also be described as a wrapper that exposes the SVM algorithms through a set of functions, letting users apply an SVM as a tool.
The LIBSVM library provides various functions in the form of commands. Each command takes a certain number of parameters. Depending on the parameters and the command being executed, different types of classification can be achieved. The library is not restricted to classification-related functions, however; e.g. svm-scale is a tool in the LIBSVM library used for scaling an input data file. The following is a detailed example of the svm-predict command, which predicts the target values of the test data:
Usage: svm-predict [options] test_file model_file output_file
-b probability_estimates: whether to predict probability estimates, 0 or 1 (default 0); for one-class SVM only 0 is supported
model_file is the model file generated by svm-train.
test_file is the test data you want to predict.
svm-predict will produce output in the output_file.
In our project, we have used various LIBSVM functions in order to make predictions and achieve text classification in the test data. The major functions are svm_train and svm_predict.
Weka
Weka stands for Waikato Environment for Knowledge Analysis and is a suite of machine learning software. It provides a collection of visualization tools and algorithms for data analysis and predictive modeling. These tools and algorithms are available through a graphical user interface (GUI), which is also part of Weka.
Weka provides support for data mining activities including classification, visualization, preprocessing of data, clustering, regression and feature selection. The Weka Explorer is the primary GUI for interacting with the user. In simple terms, Weka provides an environment in which to load the LIBSVM tools and utilize their power for text classification. The rich GUI capabilities of the Weka workbench also let us display the output of our text classification in a clear, visual manner.
In our project, we have loaded the LIBSVM tools in Weka through the jar file. This file consists of all the tools required for text classification. Then we use the API/function calls (e.g. svm-predict) to process our text as per the requirements. We have created two frames. One is for uploading the training data set and test data set along with a screen to display various outputs and results of our text classification. The other frame is used to change the different input parameters like kernel function, capacity etc. and provide a graphical output that displays the various classes, their attributes and the classifications.
The inputs to our application are ARFF (Attribute Relation File Format) files. We upload a training data set and a test data set as input to our classifier. Based on the training data set, the LIBSVM classifier generates a model and then analyzes the test data file according to this model. It makes predictions based on what it has "learnt" and generates the corresponding output.
ARFF files are plain text files that contain the description of various relations, their attributes and all their instances. Each attribute has a name, data type and a value range.
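To illustrate the format, a minimal hypothetical ARFF file with nominal and numeric attributes might look like the following (the relation and attribute names are made up for this example):

```
@relation weather

@attribute outlook {sunny, windy, rainy}
@attribute temperature numeric
@attribute play {yes, no}

@data
sunny, 30, yes
rainy, 18, no
```

The header declares each attribute with its name and type (here a nominal value set or numeric), and each line after @data is one instance.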
Figure 4 Sample ARFF file (Holmes, G., Donkin, A., and Witten, I. H. Weka: A Machine Learning Workbench)
In the first pass, our application goes through all the instances of the training data set ARFF file and "learns". In the second pass, when it encounters an unseen instance in the test data set, it makes a judgment based on the model developed in the first pass. Depending on the kernel function, cost, gamma and other factors, it may classify that instance correctly or incorrectly.
The output of our application comprises two graphs:
SVM accuracy classification
In this graph, we show the accuracy of each individual class in terms of correctly and incorrectly classified attributes.
Fig. 5 shows the Results Viewer frame, which has a graph containing values for each class. The incorrectly classified instances are shown in red and the correctly classified instances in blue.
Fig. 6 shows the Accuracy Viewer frame, which has a graph consisting of True Positives (TP), False Positives (FP), Precision and Recall for individual classes. Graphs for different classes can be viewed by clicking on the left-hand table.
Figure 5 Result Viewer - la1s
Figure 6 Accuracy Viewer - la1s
Classes and attributes
This graph has two sub-types:
Nominal (categorical) value graph:
A nominal value means that a class can serve as an identifier for a set of data. E.g. A class named "weather" can consist of values: "sunny", "windy", "rainy" etc.
This graph displays the attributes, values and counts of every instance of a class.
Figure 7 Instance Visualizer - Nominal Attributes
The left-hand table displays the attributes for the selected instance, the total number of instances and the total number of attributes. The right-hand side contains a table consisting of all the labels for the selected attribute and a graph containing the counts for each.
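The per-label counts shown in this graph amount to a simple frequency tally. The following self-contained Java sketch (illustrative, not the project code) counts how often each nominal label occurs for one attribute:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch: tally how often each nominal label occurs for one attribute --
// the same counts that the nominal-value graph displays per label.
public class LabelCounts {
    static Map<String, Integer> count(String[] values) {
        // LinkedHashMap preserves first-seen label order for readable output
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String v : values) counts.merge(v, 1, Integer::sum);
        return counts;
    }

    public static void main(String[] args) {
        String[] weather = {"sunny", "windy", "rainy", "sunny", "sunny"};
        System.out.println(count(weather)); // prints {sunny=3, windy=1, rainy=1}
    }
}
```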
Non-nominal (numeric) value graph:
A non-nominal value means that the data class is absolute and does not represent a set of other values.
This graph displays the mean, standard deviation, minimum value and maximum value of a particular class.
Figure 8 Instance Visualizer - Non Nominal
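The summary statistics displayed in this graph follow directly from the instance values. A small self-contained Java sketch (illustrative; it uses the population standard deviation, and whether a tool divides by n or n−1 is an implementation detail):

```java
// Sketch: compute mean, standard deviation, min and max of a numeric attribute,
// i.e. the four summary statistics the non-nominal value graph displays.
public class NumericStats {
    // Returns {mean, stddev, min, max} for a non-empty array
    static double[] stats(double[] v) {
        double min = v[0], max = v[0], sum = 0.0;
        for (double x : v) {
            if (x < min) min = x;
            if (x > max) max = x;
            sum += x;
        }
        double mean = sum / v.length;
        double sq = 0.0;
        for (double x : v) sq += (x - mean) * (x - mean);
        double std = Math.sqrt(sq / v.length); // population standard deviation
        return new double[]{mean, std, min, max};
    }

    public static void main(String[] args) {
        double[] temps = {10.0, 20.0, 30.0};
        double[] s = stats(temps);
        System.out.printf("mean=%.1f std=%.2f min=%.1f max=%.1f%n", s[0], s[1], s[2], s[3]);
    }
}
```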
The training and test data sets are uploaded to Weka through one of the developed frames. The LIBSVM tools are accessed via our code (API/function calls). The classifier generates a model after processing the training data set. The SVM then makes predictions based on this model. Finally, the real value of each class is compared with the classified value. A graph based on these statistics is generated, which highlights the classified values and shows the accuracy of the SVM. Fig. 9 below shows the process flow.
Text Classification Tweaks:
As mentioned earlier, certain parameters affect the accuracy of the text classification. Two important factors that can be tweaked to obtain varying levels of accuracy are the cost (C) and the kernel parameter gamma (γ).
Figure 9 Process flow diagram
These parameters directly affect the kernel functions, so tweaking their values changes the kernel function and consequently the mapping of the dataset values.
We experimented with these values and observed that the accuracy level changes substantially with changes in these two values. We have tabulated the results for a range of input values of Cost (C) and Gamma (γ), together with the resulting Accuracy (A), for two of the datasets under consideration:
Dataset: La1s.wc Table 5-1 - LA Times
Dataset: Oh05.wc Table 5-2 - OHSUMED
We followed a very simple approach to find the most suitable values of Cost and Gamma. As observed by the LIBSVM authors, good values of Cost and Gamma are usually powers of 2. We therefore used a grid search over different combinations of Cost and Gamma for a given dataset and checked the accuracy for each pair. The results of that grid search are tabulated above.
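The grid search described above can be sketched as follows. The evaluate() function here is a placeholder (in the real project the score would come from training and testing the LIBSVM classifier), and the exponent ranges are the ones commonly suggested for C and gamma, so everything in this block is illustrative:

```java
// Sketch of a grid search over powers of 2 for Cost (C) and Gamma.
// evaluate() is a stand-in scoring function that peaks at C = 2^5, gamma = 2^-11;
// a real run would train an SVM and measure accuracy on test data instead.
public class GridSearch {
    static double evaluate(double c, double gamma) {
        double dc = Math.log(c) / Math.log(2) - 5;
        double dg = Math.log(gamma) / Math.log(2) + 11;
        return 100.0 - (dc * dc + dg * dg); // highest at the fake optimum
    }

    // Try every (C, gamma) pair on the grid and keep the best-scoring one
    static double[] search() {
        double bestC = 0, bestGamma = 0, bestAcc = Double.NEGATIVE_INFINITY;
        for (int ce = -5; ce <= 15; ce += 2) {       // C in 2^-5 .. 2^15
            for (int ge = -15; ge <= 3; ge += 2) {   // gamma in 2^-15 .. 2^3
                double c = Math.pow(2, ce), g = Math.pow(2, ge);
                double acc = evaluate(c, g);
                if (acc > bestAcc) { bestAcc = acc; bestC = c; bestGamma = g; }
            }
        }
        return new double[]{bestC, bestGamma};
    }

    public static void main(String[] args) {
        double[] best = search();
        System.out.println("best C = " + best[0] + ", best gamma = " + best[1]);
    }
}
```

Note how small gamma values such as 2^-11 = 4.8828125E-04 and 2^-15 = 3.0517578E-05 arise naturally from this power-of-2 grid.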
Table 5-1 La1s.wc (parameter values shown: 3.0517578E-05, 3.0517578E-05, 4.8828125E-04)
Table 5-2 Oh05.wc (parameter values shown: 3.0517578E-05, 4.8828125E-04, 4.8828125E-04)
The datasets used for this project are from George Forman at HP Labs. The collection contains 19 multi-class text datasets whose word-count feature vectors have been extracted. The problems come from the LA Times, TREC, OHSUMED, etc., and the data were originally converted to word counts by George Forman.
The datasets range from newspaper articles to medical text documents. Almost all of them have one feature in common: the number of attributes is far higher than the number of instances. As mentioned in the kernel section above, we used the linear kernel to create the model because the number of attributes is far greater than the number of instances.
One of the most important conclusions we would like to draw from our experiment is that the generalized model developed from the training data captured the sensitivity of the data, as portrayed in the graphical charts. We also identified significant changes in the SVM output when the input values were changed.
We developed a GUI application based on Java Swing to show the graphical outputs after performing SVM training and classification. A few of the kernel functions have also been included in this study.
Data mining is a widely used concept today, and companies of every size need to implement its algorithms, on systems ranging from a single local computer to client-server deployments spanning huge platforms. Support vector machines have proven to be powerful classification algorithms.