Network Intrusion Detection Using Machine Techniques Information Technology Essay

Published: Last Edited:

This essay has been submitted by a student. This is not an example of the work written by our professional essay writers.

The Internet and the corporate intranets have become a part of daily life and an essential tool Internet, today, plays a major role in creating and advancing new avenues. It aids people in many areas, such as business, entertainment and education. Business needs have motivated enterprises and governments across the globe to develop sophisticated, complex information networks. Such networks incorporate a diverse array of technologies, including distributed data storage systems, encryption and authentication techniques, voice and video over IP, remote and wireless access, and web services. Moreover, corporate networks are providing many services to the management at remote locations over fixed-mobile networks. Therefore, information security of using Internet as the media needs to be carefully concerned. Intrusion detection is one major research problem for business and personal networks.

There has been much discussion on explaining the intrusion detection. A good source for the readers to get familiar with IDS is [15]. We are going to focus the machine learning based techniques used in constructing Intrusion detection systems. The type of IDS built using ML techniques is called A-IDS or Anomaly based Intrusion Detection Systems. These IDS can work at the host level as well at the network level. The network-level IDS are preferred because of lower cost of ownership and overall protection of network. The Fig 1 explains the taxonomy of IDS.

Figure 1 Generic Anomaly based Intrusion Detection System

Anomaly based IDS work in the principle of learning from experience or learning by teacher. The anomaly detection algorithms have the advantage that they can detect new types of intrusions as deviations from normal usage. However, their weakness is the high false alarm rate. This is because previously unseen, yet legitimate, system behaviors may also be recognized as anomalies.

Machine learning techniques

The day by day increase in network traffic has virtually eliminated the possibility of human supervision of network security. Even a system, assisting human administrators in taking decision over intrusion diction is not worthwhile as thousands of intrusion attempts may be made per minute. Giving due care to all such alarms is not possible for human administrator. Therefore, need for an automated, highly accurate reliable intrusion detection system is more than ever. Machine learning is the area, capable of producing such desired system with required output. Even research is going on, a number of machine learning techniques have been used for intrusion detection and their summary is outlined on Table 1

Table 1 Summary of Machine Learning Techniques



Underlying principle

Markov Modeling

Markov Chain

The markov chains consist of sets of states and associated probabilities for transition to different states

Hidden Markov Models

In hidden markov models, the underlying states are not visible

Principle Component Analysis

Statistical Variance

Components of data are formed based on the variance of data instances

Genetic Algorithm

Evolutionary Computing

Genetic algorithm work on the principal of evolutionary biological techniques

Inductive Logic Programming (ILP)

Logic Induction

Try to find an hypothesis so that all positive examples are true and negative examples are not negated

Decision Tree

Construct a tree based on the feature selection


Soft Computing

The neighborhood is computed for each instance and depending on the neighbors; instance is classified as normal or anomaly. This technique can be used both as supervised and un-supervised way


Two or more techniques are applied simultaneously

Rule Based Classification

Rules are inferred directly from the training data and used to classify the instance

Support Vector Machines

The SVM marks the region as normal where most of the normal instances are located.


A collection of classifiers are used and their output is combined using a decision combination function

Naïve Bayes

These classifiers work a generative framework in which each document is generated by a parametric distribution governed by a set of hidden parameters

Neural Network


The network is required to self organize depending on some structure in the input data. Typically this structure is may be some form of redundancy in the input data or clusters in the data. Self Organizing Maps use unsupervised learning rule


The code vectors in a region are trained and test instances are approximated against the code vectors



Markov Modeling

Within this category, we may distinguish two main approaches: Markov chains and hidden Markov models. A Markov chain is a set of states that are interconnected through Certain transition probabilities, which determine the topology and the capabilities of the model. During first training phase, the probabilities associated to the transitions are estimated from the normal behavior of the target system. The detection of anomalies is then carried out by comparing the anomaly score (associated probability) obtained for the observed sequences with a fix threshold.

HMM is a double stochastic process. The upper layer is a Markov process whose states are not observable. The lower layer is a normal Markov process where emitted outputs can be observed. HMM is a powerful tool in modeling and analyzing complicated stochastic process. In the case of a hidden Markov model, the system of interest is assumed to be a Markov process in which states and transitions are hidden. Only the so-called productions are observable. [1] Proposes a simple data preprocessing approach to speed up a hidden Markov model (HMM) training for system-call-based anomaly intrusion detection. Experiments based on a public database demonstrate that this data preprocessing approach can reduce training time by up to 50 percent with unnoticeable intrusion detection performance degradation, compared to a conventional batch HMM training scheme. More than 58 percent data reduction has been observed compared to prior incremental HMM training scheme. Although this maximum gain incurs more degradation of false alarm rate performance, the resulting performance is still reasonable.

Another Markov Model i.e. Service Specification and Stochastic Markovian modeling" (S3M) is also applied to IDS [2]. S3M makes use of the Markov theory to provide a production model consisting of a FSA and the associated probabilities for the observed events or transitions between states.

Principle Component Analysis

PCA is a dimensionality reduction technique based on the statistical variance among data instances. In mathematical terms, PCA is a technique where n correlated random variables are transformed into d uncorrelated variables. The uncorrelated variables are linear combinations of the original variables and can be used to express the data in a reduced form. Typically, the first principal component of the transformation is the linear combination of the original variables with the largest variance. In other words, the first principal component is the projection on the direction in which the variance of the projection is maximized. The second principal component is the linear combination of the original variables with the second largest variance and orthogonal to the first principal component, and so on. In many data sets, the first several principal components contribute most of the variance in the original data set, so that the rest can be disregarded with minimal loss of the variance for dimension reduction of the dataset.

This paper proposes a novel scheme that uses robust principal component classifier in intrusion detection problem where the training data may be unsupervised. Assuming that anomalies can be treated as outliers, an intrusion predictive model is constructed from the major and minor principal components of normal instances. A measure of the difference of an anomaly from the normal instance is the distance in the principal component space. The distance based on the major components that account for 50% of the total variation and the minor components with eigenvalues less than 0.20 is shown to work well. The experiments with KDD Cup 1999 data demonstrate that the proposed method achieves 98.94% in recall and 97.89% in precision with the false alarm rate 0.92% and outperforms the nearest neighbor method, density-based local outliers (LOF) approach, and the outlier detection algorithms based on Canberra metric. [ 3 ]

Most current intrusion detection systems are signature based ones or machine learning based methods. Despite the number of machine learning algorithms applied to KDD 99 cup, none of them have introduced a pre-model to reduce the huge information quantity present in the different KDD 99 datasets. Authors introduce a method that applies to the different datasets before performing any of the different machine learning algorithms applied to KDD 99 intrusion detection cup. This method enables us to significantly reduce the information quantity in the different datasets without loss of information. This method is based on Principal Component Analysis (PCA). It works by projecting data elements onto a feature space, which is actually a vector space Rd, that spans the significant variations among known data elements. The work present two well known algorithms, decision trees and nearest neighbor, and the authors show the contribution of their approach to alleviate the decision process. The experimental work is done over network records from the KDD 99 dataset, first by a direct application of these two algorithms on the rough data, second after projection of the different datasets on the new feature space. [4]

In this paper, architecture for principal component analysis (PCA) to be used as an outlier detection method for high-speed network intrusion detection systems (NIDS). PCA is a common statistical method used in multivariate optimization problems in order to reduce the dimensionality of data while retaining a large fraction of the data characteristic. First, PCA is used to project the training set onto eigenspace vectors representing the mean of the data. These eigenspace vectors are then used to predict malicious connections in a workload containing normal and attack behavior. The simulations show that this architecture correctly classies attacks with detection rates exceeding 99% and false alarms rates as low as 1.95%. For next generation NIDS, anomaly detection methods must satisfy the demands of Gigabit Ethernet. FPGAs are an attractive medium to handle both high throughput and adaptability to the dynamic nature of intrusion detection. Using hardware parallelism and extensive pipelining, the architecture is implemented on FPGAs to achieve Gigabit link speeds. [5]

Inductive Login Programming

Inductive rule generation algorithms typically involve the application of a set of association rules and frequent episode patterns to classify the audit data. In this context, if a rule states that ''if event X occurs, then event Y is likely to occur'', then events X and Y can be described as sets of (variable, value)- pairs where the aim is to find the sets X and Y such that X ''implies'' Y. In the domain of classification, The Y is fixed and attempt to find sets of X which are good predictors for the right classification. While supervised classification typically only derives rules relating to a single attribute, general rule induction techniques, which are typically unsupervised in nature, derive rules relating to any or all the attributes. The advantage of using rules is that they tend to be simple and intuitive, unstructured and less rigid. As the drawbacks they are difficult to maintain, and in some cases, are inadequate to represent many types of information. A number of inductive rule generation algorithms have been proposed in literature. Some of them first construct a decision tree and then extract a set of classification rules from the decision tree.

In order to enhance the availability and practicality of intelligent intrusion detection system based on machine learning in high-speed network, an improved fast inductive learning method for intrusion detection (FILMID) is designed and implemented. Accordingly, an efficient intrusion detection model based on FILMID algorithm is presented. The experiment results on the standard testing dataset validate the effectiveness of the FILMID based intrusion detection model. [14]

Genetic Algorithms

Genetic algorithms are categorized as global search heuristics, and are a particular class of evolutionary algorithms (also known as evolutionary computation) that use techniques inspired by evolutionary biology such as inheritance, mutation, selection and recombination. Thus, genetic algorithms constitute another type of machine learning-based technique, capable of deriving classification rules and/or selecting appropriate features or optimal parameters for the detection process The main advantage of this subtype of machine learning A- NIDS is the use of a flexible and robust global search method that converges to a solution from multiple directions, whilst no prior knowledge about the system behavior is assumed. Its main disadvantage is the high resource consumption involved.

In this paper, authors explore the use of Genetic Programming (GP) for such a purpose. The approach is not new in some aspects, as GP has already been partially explored in the past. Here GP can offer at least two advantages over other classical mechanisms: it can produce very lightweight detection rules (something of extreme importance for high-speed networks or resource-constrained applications) and the simplicity of the patterns generated allows to easily understand the semantics of the underlying attack. [9]

In [10], the authors present a serial combination of two genetic algorithm-based intrusion detection systems. Many solutions for intrusion detection based on machine learning techniques have been proposed, but most of them introduce significant computational overhead, which makes them time-consuming and thus increases their period of adapting to the environmental changing. In the first step, features are extracted in order to reduce the amount of data that the system needs to process. Hence, the system is simple enough not to introduce significant computational overhead, but at the same time is accurate, adaptive and fast due to genetic algorithms. Furthermore, on account of its inherent parallelism, the solution offers a possibility of implementation using reconfigurable hardware with the implementation cost much lower than the one of the traditional systems. The model is verified on KDD99 benchmark dataset and it has been proven that it is comparable to the solutions of the state-of-the-art, while exhibiting the mentioned advantaged.

Hybrid A-NIDS

Many of the above-discussed techniques are applied in combination to achiever better results and to overcome the deficiencies of other technique.

An Intrusion Detection System (IDS) based on Principal Component Analysis (PCA) and Grey Neural Networks (GNN) is presented to improve the performance of BP neural networks in the field of intrusion detection. In this technique, the pre-processed data set is normalized and the features of them are extracted by PCA. Next, five layers of the grey neural networks is designed based on BP neural networks and Grey theory, then the IDS composed of sniffer module, data processing module, grey neural network module and intrusion detection module is presented. The presented system was tested on the data set of DARPA 1999 and the results demonstrate that the feature extraction reduced the dimensionality of feature space greatly without degrading the systems' performance [6]. In [7], An IDS combining with GA and BP is put forward.

Decision Tree

Decision trees are powerful and popular tools for classification and prediction. The attractiveness of tree-based methods is due in large part to the fact that, in contrast to neural networks, decision trees represent rules. A decision tree is a tree that has three main components: nodes, arcs, and leaves. Each node is labeled with a feature attribute which is most informative among the attributes not yet considered in the path from the root, each arc out of a node is labeled with a feature value for the node's feature and each leaf is labeled with a category or class. A decision tree can then be used to classify a data point by starting at the root of the tree and moving through it until a leaf node is reached. The leaf node would then provide the classification of the data point.

Traditional intrusion detection technology exists a lot of problems, such as low performance, low intelligent level, high false alarm rate, high false negative rate and so on. In this paper, C4.5 decision tree classification method is used to build an effective decision tree for intrusion detection, then convert the decision tree into rules and save them into the knowledge base of intrusion detection system. These rules are used to judge whether the new network behavior is normal or abnormal. Experiments show that: the detection accuracy rate of intrusion detection algorithm based on C4.5 decision tree is over 90%, and the process of constructing rules is easy to understand, so it is an effective method for intrusion detection. [11]

How to find the intrusion behaviors is a problem that troubled the intrusion detection field for years. Until now, there is not a good method to solve it, epically in a realistic context. Most methods are effective on small data sets, but when used to the massive data of IDS, the effectiveness seems to be unsatisfactory. In this paper, a new method based on decision tree is discussed to solve the problem of low detection rate of massive data. [12]

Nearest Neighbor

The k-nearest neighbors algorithm (k-NN) is a method for classifying objects based on closest training examples in the feature space. k-NN is a type of instance-based learning, or lazy learning where the function is only approximated locally and all computation is deferred until classification. The k-nearest neighbor algorithm is amongst the simplest of all machine learning algorithms: an object is classified by a majority vote of its neighbors, with the object being assigned to the class most common amongst its k nearest neighbors (k is a positive integer, typically small). If k = 1, then the object is simply assigned to the class of its nearest neighbor

This paper describes feature selection and model selection simultaneously for k-nearest neighbor (k-NN) classifiers. In order to reduce the optimization effort, various techniques are integrated that accelerate and improve the classifier significantly: hybrid k-NN, comparative cross validation. The feasibility and the benefits of the proposed approach are demonstrated by means of data mining problem: intrusion detection in computer networks. It is shown that, compared to earlier k-NN technique, the run time is reduced by up to 0.01 % and 0.06 % while error rates are lowered by up to 0.002 % and 0.03 % for normal and abnormal behavior respectively. The algorithm is independent of specific applications so that many ideas and solutions can be transferred to other classifier paradigms. [13]

Neural Networks

Rule Based Classification

The Association Rules are canonical data mining taking aim at discovering relationships between items in the dataset. The association-rule-based classifiers model the text documents as a collection of transactions where each transaction represents a document, and the items in the transaction are the terms selected from the document and the categories documents is assigned to. The most popular algorithm use to compute association rules effectively are apriori algorithm [20] and FP-tree algorithm [21]. The [22] uses a similar approach to construct a rule-base classifier with apriori-based algorithm but the results obtained on the Reuter-21578 collection are not promising as for five categories out of ten; the precision/recall breakeven point is around 60 %. For a relatively difficult category the breakeven point is 25.8 %, which is not acceptable for practical classification.

Support Vector Machine

SVM is learning methods introduced by [23]. SVM are based on the structural risk minimization principal from the computational theory. SVM use the Vapnik-Chervonenkis (VC) dimensions of a problem to characterize its complexity, which can be independent of the dimensionality of the problem. The basic idea is to find decision surfaces between use a hyper plane to the classes of data, positive and negative with maximum margins from the both sides. Kernel functions are used for nonlinear separation. The groups of vectors that lie near the separating hyperplane are called support vectors. Once the separating hyper plane is found the new examples can be classifies by simply checking that on which side of the hyperplane they fall. SVM not only has a rigorous theoretical foundation, but also performs classification more accurately than most other methods in applications, especially for high dimensional data. In classifiers using SVM, term selection is often not needed, as SVMs tend to be fairly robust to over fitting and can scale up to considerable dimensionalities. Also there is no human and machine effort in parameter tuning on a validation set is needed, as there is a theoretically calculated "default" choice of parameter settings.

SVM have shown superb performance for text classification tasks. The reasons that SVMs work well for TC is that during learning classifiers, one has to deal with many features such as more than 10,000. Since SVM use over fitting protection that does not depend on the number of features and have the potential to deal with the large number of attributes. Most of the document vectors are sparse and contained very few non-zero entries. It is shown in [24] that additive algorithms having inductive base like SVM work very well for problems with dense concepts and sparse instances. Most of the text categorization problems are linearly separable such as the reuters-21578. SVMs are accurate, robust, and quick to apply to test instances. Their only potential drawback is their training time and memory requirement. For n training instances held in memory, the best-known SVM implementations take time proportional to na, where a is typically between 1.8 and 2.1.

Naïve Bayes

These classifiers work a generative framework in which each document is generated by a parametric distribution governed by a set of hidden parameters. Naïve Bayes method assumes that all attributes of the examples independent of each other given the context of a single class while this assumption is clearly wring in the real world, the Naïve Bayes often works well. Because of the independence assumption the parameters for every attribute can be learned separately. This makes the learning process very simple especially when the attributes are large.

Fuzzy Logic

Fuzzy logic (or fuzzy set theory) is based on the concept of the fuzzy phenomenon to occur frequently in real world. Fuzzy set theory considers the set membership values for reasoning and the values range between 0 and 1. That is, in fuzzy logic the degree of truth of a statement can range between 0 and 1 and it is not constrained to the two truth values (i.e. true, false). For examples ''rain" is a commonly natural phenomenon, and it may have very fierce change. Raining may be able to convert the circumstances from slight to violent


Conclusion & Future Works

This paper discussed the up to date work on intrusion detection using machine learning techniques. We have also discussed less-reviewed technique such as ILP, GP and PCA and provided a summary of work done in each of the technique. The future work includes in-depth analysis of promising intrusion detection technique with possible improvements. In next review, the techniques may be tested in a controlled environment to obtain quantitative results. Then a (nearly) fool-proof system will be used to design an intrusion detection system for peer-2-peer communications.