Preprocessing Methods With Clustering And Classification Computer Science Essay



This chapter gives an overview of the methodology developed in this thesis to build the hybrid model, which combines two preprocessing methods with clustering and classification. It begins with an overall analysis of the thesis. It then presents the data analysis, the two data transformations used with the self-organizing map (SOM), the backpropagation implementation in MATLAB, and the evaluation of the experiments, comparing the performance of the two resulting models.

MIT Lincoln Labs managed the 1998 DARPA intrusion detection evaluation program, which produced this data set. The data were captured on a LAN simulating a typical U.S. Air Force LAN. The data set was used at the international Knowledge Discovery and Data Mining (KDD) conference held in 1999, in a competition between different intrusion detectors. It is publicly available for research on and evaluation of intrusion detection problems. It consists of separate training and testing data: the training data contain 4,898,431 connection records and the testing data contain 311,027. It includes both seen and unseen data, in labeled and unlabeled form; the labels indicate whether a record is normal or belongs to an attack. There are 24 attack types in the training data, and the testing data contain 14 further attack types that do not appear in the training data. The attacks are divided into four main categories: denial of service, user to root, remote to local, and probe [Stolfo, 2000], [Carolina, 2007], [Tavallaee, 2009], [MIT, 2007]. The individual attack types and the training and testing data files are described in [MIT, 2007].

A new data set, NSL-KDD, is used in this research. It consists of selected records from the complete KDD CUP data set and was proposed by Tavallaee [Tavallaee, 2009]. The old KDD CUP data set has many deficiencies that affect the evaluation of anomaly detection classifiers, as discussed in [McHugh, 2000]. The new version resolves two of these problems, although research on removing the remaining deficiencies is still ongoing. According to statistical observation, the most important problem in the old data set is the large number of redundant connection records. Removing them reduces the training set by 78.05% and the testing set by 75.15%. NSL-KDD contains the records that remain after this redundancy is removed from the old KDD CUP data, as shown in table 3.2. The resulting data set is of a reasonable size and improves the evaluation, giving more consistent and comparable results [Tavallaee, 2009], [ACM SIGKDD, 2010]. It is publicly available from its site for evaluating anomaly detection classifiers [NSL-KDD, 2009]. A sample of the NSL-KDD data set is provided in Annex "A".
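The reduction rate mentioned above can be illustrated with a minimal sketch. It assumes that a "redundant" record is an exact duplicate of an earlier one, which may not match the exact procedure used to build NSL-KDD. The function name `reduction_rate` and the toy records are hypothetical and are not taken from the data set.

```python
def reduction_rate(records):
    """Drop exact duplicate records and report the percentage removed.

    Assumption: a record is redundant if every field matches an
    earlier record. The first occurrence is kept, in original order.
    """
    seen = set()
    unique = []
    for rec in records:
        key = tuple(rec)
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    removed = len(records) - len(unique)
    rate = 100.0 * removed / len(records) if records else 0.0
    return unique, rate


# Hypothetical toy records: (protocol, service, flag, label)
sample = [
    ("tcp", "http", "SF", "normal"),
    ("tcp", "http", "SF", "normal"),
    ("udp", "domain_u", "SF", "normal"),
    ("tcp", "private", "S0", "neptune"),
    ("tcp", "private", "S0", "neptune"),
    ("tcp", "private", "S0", "neptune"),
]

unique, rate = reduction_rate(sample)
print(len(unique), "unique records,", rate, "percent removed")
```

In this toy sample, half of the records are duplicates, so the reduction rate is 50%.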

The trained network is then simulated to obtain a normal or anomaly class output for each sample of the winning neurons. Classification training completes after 28 epochs. Once training is complete, two error measures are reported: the mean square error, which is the average squared difference between output and target, and the percent error, which is the fraction of samples that are misclassified. A summary of the classification training parameters and outputs is given in table 3.8.
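The two error measures can be sketched as follows. This is an illustration only, not the MATLAB toolbox computation. It assumes a single network output per sample, a target encoding of 1 for anomaly and 0 for normal, and a decision threshold of 0.5. None of these choices is stated in the thesis, and the sample values are made up.

```python
def mean_squared_error(outputs, targets):
    """Average squared difference between network outputs and targets."""
    return sum((o - t) ** 2 for o, t in zip(outputs, targets)) / len(targets)


def percent_error(outputs, targets, threshold=0.5):
    """Percentage of samples whose thresholded output disagrees with the target.

    Assumption: outputs at or above `threshold` are read as class 1 (anomaly).
    """
    wrong = sum(
        (o >= threshold) != (t >= threshold) for o, t in zip(outputs, targets)
    )
    return 100.0 * wrong / len(targets)


# Hypothetical network outputs and their targets (1 = anomaly, 0 = normal)
outputs = [0.9, 0.2, 0.6, 0.1]
targets = [1, 0, 0, 0]

mse = mean_squared_error(outputs, targets)
pe = percent_error(outputs, targets)
print("MSE:", round(mse, 3), "percent error:", pe)
```

Here only the third sample falls on the wrong side of the threshold, so one of the four samples is misclassified.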

In this research, the performance of the classifiers built with the two methods is measured by analyzing the confusion matrix, the accuracy, and the ROC curve. The balanced error rate is also used to compare the two methods on the unseen data set.

The confusion matrix gives a complete view of the classification result. It shows how many samples the classifier or model classified correctly and incorrectly.

Accuracy gives the overall performance of the classifier. It is the sum of the true positives and true negatives divided by the sum of all values in the confusion matrix. The formula is as follows:

Overall accuracy of classifier = (Sum of true positives and true negatives) / (Sum of all values in the confusion matrix)
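As a small worked example of the confusion matrix and the accuracy formula, the sketch below counts the four cells for a two-class problem and then applies the formula. Treating "anomaly" as the positive class is an assumption made for illustration, and the labels are made-up values.

```python
def confusion_counts(predicted, actual, positive="anomaly"):
    """Return (tp, fp, tn, fn) for a two-class problem.

    Assumption: `positive` names the class counted as a hit (here, anomaly).
    """
    tp = fp = tn = fn = 0
    for p, a in zip(predicted, actual):
        if p == positive and a == positive:
            tp += 1
        elif p == positive and a != positive:
            fp += 1
        elif p != positive and a != positive:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def accuracy(tp, fp, tn, fn):
    """(TP + TN) divided by the sum of all confusion matrix values."""
    return (tp + tn) / (tp + fp + tn + fn)


# Hypothetical labels for five connection records
actual = ["anomaly", "anomaly", "normal", "normal", "normal"]
predicted = ["anomaly", "normal", "normal", "anomaly", "normal"]

tp, fp, tn, fn = confusion_counts(predicted, actual)
acc = accuracy(tp, fp, tn, fn)
print("TP, FP, TN, FN =", (tp, fp, tn, fn), "accuracy =", acc)
```

Three of the five records are classified correctly, so the accuracy is 3/5.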

ROC stands for receiver operating characteristic. It is related to, but not the same as, the precision-recall curve. The ROC plot also shows the performance of the classifier. It is a graph comparing the false alarm rate (false positive rate) with the hit rate (true positive rate): the x-axis shows the false alarm rate and the y-axis shows the hit rate [Andrea, 2008], [AC wikipedia, 2010].
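The points of an ROC curve can be obtained by sweeping a decision threshold over the classifier scores. The sketch below is a minimal illustration, not the plotting routine used in the experiments. It assumes labels of 1 (anomaly) and 0 (normal), that both classes are present, and that a higher score means "more likely anomaly". The scores are made up.

```python
def roc_points(scores, labels):
    """Return (false alarm rate, hit rate) pairs, sweeping the threshold
    from the highest score to the lowest.

    Assumptions: labels are 1 (positive) or 0 (negative), and both
    classes occur at least once. Tied scores yield a single point.
    """
    pos = sum(labels)
    neg = len(labels) - pos
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        if label == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only once every sample with this score is counted.
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            points.append((fp / neg, tp / pos))
    return points


# Hypothetical classifier scores and true labels
scores = [0.9, 0.8, 0.4, 0.3]
labels = [1, 0, 1, 0]

points = roc_points(scores, labels)
for far, hit in points:
    print("false alarm rate:", far, "hit rate:", hit)
```

Each pair gives an x-axis value (false alarm rate) and a y-axis value (hit rate). The curve always starts at (0, 0) and ends at (1, 1).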

One performance measure is used during the development phase, namely the validation set ranking, while another is used to check the final result, namely the test set ranking. The balanced error rate is usually used to check test set performance [WCCI, 2006].
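The balanced error rate is commonly defined as the average of the error rates of the two classes, which keeps a large class from dominating the result. The sketch below uses that common definition. The confusion matrix counts are hypothetical and are not results from this thesis.

```python
def balanced_error_rate(tp, fp, tn, fn):
    """Average of the per-class error rates.

    Assumption: both classes are present, so neither denominator is zero.
    """
    miss_rate = fn / (tp + fn)          # error rate on the positive class
    false_alarm_rate = fp / (tn + fp)   # error rate on the negative class
    return 0.5 * (miss_rate + false_alarm_rate)


def plain_error_rate(tp, fp, tn, fn):
    """Fraction of all samples that are misclassified."""
    return (fp + fn) / (tp + fp + tn + fn)


# Hypothetical counts: 100 positive samples and 50 negative samples
tp, fn, tn, fp = 90, 10, 40, 10

ber = balanced_error_rate(tp, fp, tn, fn)
err = plain_error_rate(tp, fp, tn, fn)
print("balanced error rate:", round(ber, 4), "plain error rate:", round(err, 4))
```

With unequal class sizes the two measures differ, because the balanced error rate weights the smaller class as heavily as the larger one.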


1. MATLAB 7.9 (R2009b) is used for the experiments with both methods, through the SOM clustering and backpropagation classification algorithms.

2. SPSS 13.0 software is used for data analysis and examination of results.