# A Multi-Layered Framework to Detect Outliers

A network intrusion detection system serves as a second line of defense behind intrusion prevention. The anomaly detection approach is important for detecting new attacks. In this thesis, statistical and distance-based approaches have been adapted to detect anomalous behavior for network intrusion detection. The model has been evaluated with the KDD Cup 99 dataset. The results show that the connectivity-based scheme outperforms current intrusion detection approaches in its capability to detect attacks with a low false alarm rate.

Definition 1: An outlier is an observation that lies an abnormal distance from other values in a random sample from a population. In a sense, this definition leaves it up to the analyst (or a consensus process) to decide what will be considered abnormal. Before abnormal observations can be singled out, it is necessary to characterize normal observations.

Definition 2: An anomaly is a pattern in the data that does not conform to the expected behaviour; anomalies are also referred to as outliers, exceptions, peculiarities, surprises, etc. Anomalies translate to significant real-life entities such as network intrusions. Outlier detection is an active research problem in machine learning that aims to find objects that are dissimilar, exceptional, and inconsistent with respect to the majority of normal objects in an input database. This is one of the approaches used in intrusion detection systems.

Figure 4.1 Examples of outlier objects (regions N1 and N2 contain normal objects; points o1 and o2 and region O3 are outlying, plotted over axes X and Y)

A simple example of outlier objects is shown in Figure 4.1. Points o1 and o2 are outliers, and points in region O3 are anomalies, while regions N1 and N2 contain normal objects. The outliers lie far away from the normal objects; most objects are normal, and only a few objects, a small percentage of the total, are outliers.

First, outliers can be classified as point outliers or collective outliers, based on the number of data instances involved.

## 4.1.1 Point Outliers

In a given set of data instances, an individual outlying instance is termed a point outlier. This is the simplest type of outlier and is the focus of the majority of existing outlier detection schemes [60, 61]. A data point is detected as a point outlier because it displays outlier-ness in its own right, rather than together with other data points. In most cases, data are represented as vectors, as in relational databases, where each tuple contains a specific number of attributes.

The principled method for detecting point outliers in vector-type data sets is to quantify, through some outlier-ness metric, the extent to which each single data point deviates from the other data in the data set.

## 4.1.2 Collective Outliers

A collective outlier [62] represents a collection of data instances that is outlying with respect to the entire data set. The individual data instances in a collective outlier may not be outliers by themselves, but their joint occurrence as a collection is anomalous [60]. Usually, the data instances in a collective outlier are related to each other. A typical type of collective outlier is the sequence outlier [63, 64], where the data are in the form of an ordered sequence.

Outliers should be investigated carefully. They may contain valuable information about the process under investigation or the data gathering and recording process. Before eliminating these points from the data, one should try to understand why they appeared and whether it is likely that similar values will continue to appear. Of course, outliers are often bad data points.

## 4.2 Challenges in Detecting Outlier Objects

There are many challenges in detecting outlier objects:

Defining a representative normal region is challenging; the boundary between normal and outlying behaviour is often not precise

The exact notion of an outlier is different for different application domains

Availability of labelled data for training/validation

Malicious adversaries

Normal behaviour keeps evolving

## 4.3 Aspects of Anomaly Detection Problem

The main aspects of anomaly detection problems are

Nature of input data

Availability of supervision

Type of anomaly: point, contextual, structural

Output of anomaly detection

Evaluation of anomaly detection techniques

A key observation that motivates this research is that outliers existing in high-dimensional data streams [65] are embedded in some lower-dimensional subspaces [66, 67]. The existence of projected outliers is due to the fact that, as the dimensionality of data goes up, data tend to become equally distant from each other. As a result, the differences in data points' outlier-ness [68] become increasingly weak and thus indistinguishable.

## 4.4 Multistage Framework to Detect Outliers (MFDO)

This framework has four layers for IDS and uses both supervised and unsupervised learning. The layers are explained below.

## Existing Statistical Methods

Statistical outlier detection methods [69-71] rely on the statistical approaches that assume a distribution or probability model to fit the given dataset. Under the distribution assumed to fit the dataset, the outliers are those points that do not agree with or conform to the underlying model of the data.

The statistical outlier detection methods can be broadly classified into two categories, i.e., the parametric methods and the non-parametric methods. The major differences between these two classes of methods lie in that the parametric methods assume the underlying distribution of the given data and estimate the parameters of the distribution model from the given data [41] while the non-parametric methods do not assume any knowledge of distribution characteristics [72, 73].

Statistical outlier detection methods (parametric and non-parametric)[74] typically take two stages for detecting outliers, i.e., the training stage and test stage.

## Training phase

The training stage mainly involves fitting a statistical model or building data profiles based on the given data. Statistical techniques can be performed in a supervised, semi-supervised, or unsupervised manner. Supervised techniques estimate the probability density for both normal instances and outliers. Semi-supervised techniques [75, 76] estimate the probability density for either normal instances or outliers, depending on the availability of labels. Unsupervised techniques determine a statistical model or profile which fits all or the majority of the instances in the given data set.

## Test phase

Once the probabilistic model or profile is constructed, the next step is to determine whether a given data instance is an outlier with respect to the model/profile. This involves computing the posterior probability of the test instance being generated by the constructed model, or its deviation from the constructed data profile. For example, one can find the distance of the data instance from the estimated mean and declare any point above a threshold an outlier [77].

Significant outlier-ness of data can be observed only in moderate- or low-dimensional subspaces. This phenomenon is commonly referred to as the curse of dimensionality. In this research, a framework has been designed for detecting outliers, shown in Figure 4.2. The architecture consists of four layers; in each layer some processing is done towards detecting the outliers. These layers are explained below:

## 4.4.1 Layer One Description

The first step in this layer is reading the dataset; the next step is to preprocess it. The preprocessing steps are given below:

sample the dataset if it is large

remove any missing values

normalize samples

select feature subset

## 4.4.1.1 Sampling

Sampling can be used as a data reduction technique because it allows a large data set to be represented by a much smaller random sample or subset of the data. Suppose that a large dataset, D, contains N tuples. The simple random sample without replacement (SRSWOR) method has been used for sampling the records.

An SRSWOR of size s is created by drawing s of the N tuples from D (s < N), where the probability of drawing any tuple in D is 1/N; that is, all tuples are equally likely to be sampled.
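As an illustrative sketch (the record format and helper name are assumptions, not part of the framework's implementation), SRSWOR can be written with Python's standard library:

```python
import random

def srswor(dataset, s):
    """Simple random sample without replacement: draw s of the N tuples,
    so every tuple is equally likely to be included."""
    if s >= len(dataset):
        raise ValueError("sample size s must be smaller than N")
    return random.sample(dataset, s)  # sampling without replacement

# Toy example: draw 3 of 10 records
D = [{"id": i} for i in range(10)]
sample = srswor(D, 3)
```

`random.sample` already guarantees distinct selections, so no explicit bookkeeping of previously drawn tuples is needed.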

[Figure 4.2 flowchart: Layer-1 reads and preprocesses the dataset (sample dataset, normalize data, select features subset). Layer-2 calculates a z-score for each continuous/categorical attribute; if zscore >= threshold_val the object goes to the outlier objects set, otherwise to the normal objects set. Layer-3 feeds the outliers to a multi-class classifier (Bayesian network classifier), producing outlier types O_Type1 ... O_TypeN, and computes the mean for each class. Layer-4 applies the k-nearest-neighbor distance-based method to each test sample, assigning it to Normal or to one of O_Type1 ... O_TypeN, looping while any test record remains, then stops.]

Figure 4.2 Multistage Framework for IDS

An advantage of sampling for data reduction is that the cost of obtaining a sample is proportional to the size of the sample, s, as opposed to N, the data set size. Hence, sampling complexity is potentially sublinear in the size of the data. Other data reduction techniques can require at least one complete pass through D. For a fixed sample size, sampling complexity increases only linearly as the number of data dimensions, n, increases, whereas techniques using histograms, for example, increase exponentially in n.

When applied to data reduction, sampling is most commonly used to estimate the answer to an aggregate query.

It is possible to determine a sufficient sample size for estimating a given function within a specified degree of error. This sample size, s, may be extremely small in comparison to N. Sampling is a natural choice for the progressive refinement of a reduced dataset; such a set can be further refined by simply increasing the sample size.

## 4.4.1.2 Removing Missing Values

A missing value can signify a number of different things in data. Perhaps the field was not applicable, the event did not happen, or the data was not available. It could be that the person who entered the data did not know the right value, or did not care if a field was not filled in. Therefore, Analysis Services provides two distinctly different mechanisms for managing and calculating these missing values, also known as null values.

If classification modeling specifies that a column must never have missing values, the NOT_NULL modeling flag should be used when defining the mining structure. This ensures that processing will fail if a case does not have an appropriate value. If an error occurs when processing a model, the error can be logged and steps taken to correct the data supplied to the model.

There are a variety of tools that can be used to infer and fill in appropriate values, such as the Lookup transformation or the Data Profiler task in SQL Server Integration Services, or the Fill by Example tool provided in the Data Mining Add-Ins for Excel. In this work, the rows which have missing values were simply removed.
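Since the rows with missing values were simply removed, a minimal sketch of that step (assuming records are dictionaries; the helper name is hypothetical) could look like:

```python
def drop_missing(rows):
    """Remove any record containing a missing (None) or empty value."""
    return [r for r in rows
            if all(v is not None and v != "" for v in r.values())]

records = [
    {"duration": 0, "protocol": "tcp"},
    {"duration": None, "protocol": "udp"},  # missing value: dropped
    {"duration": 5, "protocol": ""},        # empty field: dropped
]
clean = drop_missing(records)  # keeps only the first record
```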

## 4.4.1.3 Normalize samples

An attribute is normalized by scaling its values so that they fall within a small specified range, such as 0.0 to 1.0. Normalization is particularly useful for classification algorithms involving neural networks, or for distance measurements such as nearest-neighbor classification and clustering. For distance-based methods, data normalization prevents attributes with initially large ranges from outweighing attributes with initially smaller ranges.

There are many methods for data normalization; z-score normalization has been implemented here. In z-score normalization (or zero-mean normalization), the values for an attribute, A, are normalized based on the mean and standard deviation of A. A value, v, of A is normalized to v′ by computing

v′ = (v − μA) / σA (4.1)

where μA and σA are the mean and standard deviation, respectively, of attribute A. This method of normalization is useful when the actual minimum and maximum of attribute A are unknown, or when there are outliers that dominate min-max normalization.
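Equation (4.1) can be sketched directly with Python's standard library (the function name is illustrative):

```python
import statistics

def z_normalize(values):
    """z-score normalization (equation 4.1): v' = (v - mean) / std."""
    mu = statistics.mean(values)
    sigma = statistics.stdev(values)
    return [(v - mu) / sigma for v in values]

col = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
normalized = z_normalize(col)  # resulting values have mean 0
```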

## 4.4.1.4 Select features subset

Data sets for analysis may contain hundreds of attributes, many of which may be irrelevant to the mining task or redundant. Leaving out relevant attributes or keeping irrelevant attributes may be detrimental, causing confusion for the mining algorithm employed. This can result in discovered patterns of poor quality. In addition, the added volume of irrelevant or redundant attributes can slow down the mining process.

Attribute subset selection reduces the data set size by removing irrelevant or redundant attributes (dimensions). The goal of attribute subset selection is to find a minimum set of attributes such that the resulting probability distribution of the data classes is as close as possible to the original distribution obtained using all attributes. Mining on a reduced set of attributes has an additional benefit: it reduces the number of attributes appearing in the discovered patterns, helping to make the patterns easier to understand.

For n attributes, there are 2^n possible subsets. An exhaustive search for the optimal subset of attributes can be prohibitively expensive, especially as n and the number of data classes increase. Therefore, heuristic methods that explore a reduced search space are commonly used for attribute subset selection.

These methods are typically greedy in that, while searching through the attribute space, they always make what looks to be the best choice at the time; their strategy is to make a locally optimal choice in the hope that this will lead to a globally optimal solution. Such greedy methods are effective in practice and may come close to estimating an optimal solution.

The best (and worst) attributes are typically determined using tests of statistical significance, which assume that the attributes are independent of one another. Many other attribute evaluation measures can be used, such as the information gain measure used in building decision trees for classification.

Most previous work on feature selection emphasized only the reduction of the high dimensionality of the feature space. But in cases where many features are highly redundant with each other, other means must be utilized, for example more complex dependence models such as Bayesian network classifiers.

In this thesis, a new information gain and divergence-based feature selection method has been introduced. The feature selection method strives to reduce redundancy between features while maintaining information gain in selecting appropriate features for classification.

Entropy and the conventional information gain measure, which is commonly used in feature selection for classification or clustering models [78], form the basis of this greedy feature selection method. Moreover, this feature selection method sometimes produces larger improvements for conventional machine learning algorithms than support vector machines, which are known to give the best classification accuracy. For a dataset D in which class i occurs with probability pi, entropy and information gain are defined as

Info(D) = − Σi pi log2(pi), Gain(A) = Info(D) − InfoA(D)

where InfoA(D) is the expected information still required after partitioning D on attribute A.
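A small worked example of these measures (a sketch with toy data, not the thesis implementation):

```python
import math
from collections import Counter

def entropy(labels):
    """Info(D) = -sum_i p_i * log2(p_i) over the class proportions."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(values, labels):
    """Gain(A) = Info(D) - sum_j (|D_j|/|D|) * Info(D_j),
    where the D_j are the partitions of D induced by attribute A."""
    n = len(labels)
    remainder = 0.0
    for v in set(values):
        subset = [l for x, l in zip(values, labels) if x == v]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder

# A perfectly predictive attribute recovers the full entropy of the labels
labels = ["normal", "normal", "attack", "attack"]
attr = ["tcp", "tcp", "udp", "udp"]
gain = info_gain(attr, labels)  # 1.0, since Info(D) is 1 bit here
```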

## 4.4.2 Layer Two Description

Here the separation of objects into normal and outliers is done using the z-score measure, which is explained below.

There are many statistical methods to detect outlier data objects; the z-score method has been used here. The z-score measure is defined as follows:

z = (v − μA) / σA

where μA and σA are the mean and standard deviation, respectively, of attribute A. The z-score is calculated for the selected feature subset. For each attribute a threshold value is set to determine whether a particular instance is an outlier; if the majority of the z-scores indicate that an instance is an outlier, then that object is declared an outlier.

In this research, the problem of detecting outliers from high-dimensional data streams has been studied. This problem can be formulated as follows: given a dataset D with a potentially unbounded size of n-dimensional data points, for each data point xi = {xi1, xi2, . . . , xin} in D, outlier detection method performs a mapping as

f : xi → (b, Si, Z-Scorei)

where each data point xi is mapped to a triplet (b, Si, Z-Scorei). b is a Boolean variable indicating whether or not xi is a projected outlier. If xi is a projected outlier, then b is true, Si is the set of outlying subspaces of xi, and Z-Scorei is the corresponding outlier-ness score of xi in each subspace of Si. If xi is a normal data point, then b is false, Si = ∅, and Z-Scorei is not applicable.

In my work the z-score threshold has been set to ±1.5 for object separation. Once the objects are separated into normal and outliers, the outliers can be classified further using a multi-class classifier.
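The layer-two separation can be sketched as follows (a simplified illustration with toy numeric records; the helper names are assumptions):

```python
import statistics

THRESHOLD = 1.5  # |z| >= 1.5 marks an attribute value as outlying

def separate(records):
    """Flag a record as an outlier when the majority of its attributes'
    z-scores exceed the threshold; otherwise treat it as normal."""
    cols = list(zip(*records))
    means = [statistics.mean(c) for c in cols]
    stds = [statistics.stdev(c) for c in cols]
    normals, outliers = [], []
    for rec in records:
        votes = sum(
            1
            for v, mu, sd in zip(rec, means, stds)
            if sd > 0 and abs((v - mu) / sd) >= THRESHOLD
        )
        (outliers if votes > len(rec) / 2 else normals).append(rec)
    return normals, outliers

data = [(1.0, 10.0), (1.1, 10.2), (0.9, 9.9), (1.0, 10.1), (9.0, 50.0)]
normals, outliers = separate(data)  # only the last record is flagged
```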

Here a Bayesian classifier [79-81] has been used for classifying the outliers. After classifying the outliers, the mean value for each class is calculated, and the k-nearest neighbor method is applied between the class mean values and the test tuples to determine the class of unseen test records. The Bayesian network is briefly explained below.

## 4.4.3 Layer Three: Bayesian Network Classifier

Bayesian network classifiers are directed acyclic graphs that allow efficient and effective representation of the joint probability distribution over a set of random variables. Each vertex in the graph represents a random variable, and edges represent direct correlations between the variables.

More precisely, the network encodes the following conditional independence statement: each variable is independent of its non-descendants in the graph, given the state of its parents.

These independencies are then exploited to reduce the number of parameters needed to characterize a probability distribution, and to efficiently compute posterior probabilities given evidence. Probabilistic parameters are encoded in a set of tables, one for each variable, in the form of local conditional distributions of a variable given its parents. Using the independence statements encoded in the network, the joint distribution is uniquely determined by these local conditional distributions.

Let U = {x1, x2 . . . , xn}, n ≥ 1 be a set of variables. A Bayesian network B over a set of variables U is a network structure BS, which is a directed acyclic graph (DAG) over U and a set of probability tables in BS.

To use a Bayesian network as a classifier, one simply calculates argmax_y P(y|x) using the distribution P(U) represented by the Bayesian network:

P(y|x) = P(U) / P(x) ∝ P(U) = ∏u∈U P(u | parents(u))

Since all variables in x are known, no complicated inference algorithm is needed; the product above is simply evaluated for all class values.
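As an illustrative special case, when the class node is the only parent of every attribute, the Bayesian network reduces to a naive Bayes classifier, and the argmax computation can be sketched as follows (a simplified stand-in for the full network used in this chapter; the names and add-one smoothing are assumptions):

```python
from collections import Counter, defaultdict

def train(rows, labels):
    """Estimate P(y) and P(x_i | y) by frequency counts."""
    prior = Counter(labels)
    cond = defaultdict(Counter)  # (attribute index, class) -> value counts
    for row, y in zip(rows, labels):
        for i, v in enumerate(row):
            cond[(i, y)][v] += 1
    return prior, cond

def classify(prior, cond, row):
    """argmax_y P(y) * prod_i P(x_i | y), with add-one smoothing."""
    n = sum(prior.values())
    best, best_p = None, -1.0
    for y, cy in prior.items():
        p = cy / n
        for i, v in enumerate(row):
            p *= (cond[(i, y)][v] + 1) / (cy + 2)
        if p > best_p:
            best, best_p = y, p
    return best

rows = [("tcp", "http"), ("tcp", "http"), ("udp", "dns"), ("icmp", "echo")]
labels = ["normal", "normal", "normal", "attack"]
prior, cond = train(rows, labels)
pred = classify(prior, cond, ("tcp", "http"))  # "normal"
```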

## 4.4.3.1 Bayesian Network learning algorithm

The dual nature of a Bayesian network makes a two-stage learning process natural: first learn the network structure, and then learn the probability tables.

There are various approaches to structure learning; here, conditional independence testing has been used as the learning method, which is briefly described below.

## 4.4.3.2 Learning Network Structure

Learning a BN from the data is unsupervised learning, in the sense that the learner does not distinguish the class variable from the attribute variables in the data. The objective is to induce a network that best describes the probability distribution over the training data. This optimization is implemented by using heuristic search techniques to find the best candidate over the space of possible networks. The search relies on a scoring function that assesses the merits of each candidate network.

## 4.4.3.3 Learning based on conditional independence tests

This method mainly stems from the goal of uncovering causal structure. The assumption is that there is a network structure that exactly represents the independencies in the distribution that generated the data. It then follows that if a (conditional) independency can be identified in the data between two variables, there is no arrow between those two variables. Once the locations of the edges are identified, the directions of the edges are assigned such that the conditional independencies in the data are properly represented. The results of the model at this stage are shown below:

| Measure | Value |
| --- | --- |
| Correctly Classified Instances | 21969 (98.6972 %) |
| Incorrectly Classified Instances | 290 (1.3028 %) |
| Mean absolute error | 0.0062 |
| Root mean squared error | 0.0618 |
| Relative absolute error | 3.0386 % |
| Root relative squared error | 19.3229 % |
| Total Number of Instances | 22259 |

## 4.4.4 Layer Four Task Description

Here the classification of new records is done using the distance based measures, which are explained below.

## 4.4.4.1 kNN Distance Based Method

After separating the records into normal and attacks, the classes of test tuples can be predicted using the k-NN distance-based method. There have been a number of different ways of defining outliers from the perspective of distance-related metrics.

Most existing metrics used for distance-based outlier detection techniques are defined based upon the concepts of local neighborhood or k-nearest neighbors (kNN) of the data points. The notion of distance-based outliers does not assume any underlying data distribution and generalizes many concepts from distribution-based methods. Moreover, distance-based methods scale better to multi-dimensional spaces and can be computed much more efficiently than statistical methods.

In distance-based methods [82, 83], the distance between data points needs to be computed. One can use any of the Lp metrics, such as the Manhattan or Euclidean distance [84], for measuring the distance between a pair of points.

Alternatively, for other application domains with categorical data (e.g., text documents), non-metric distance functions can also be used, making the distance-based definition of outliers very general. Data normalization is normally carried out to normalize the different scales of data features before outlier detection is performed.
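A common instantiation of the kNN distance-based idea scores each point by the distance to its k-th nearest neighbor; this sketch (Euclidean distance, toy data) is an illustration rather than the exact metric used in the framework:

```python
import math

def knn_outlier_scores(points, k):
    """Score each point by the Euclidean distance to its k-th nearest
    neighbor; larger scores indicate stronger outlier-ness."""
    scores = []
    for p in points:
        dists = sorted(math.dist(p, q) for q in points if q is not p)
        scores.append(dists[k - 1])
    return scores

pts = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
scores = knn_outlier_scores(pts, k=2)  # the isolated (10, 10) scores highest
```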

## 4.4.4.2 Local Neighborhood Methods

The first notion of distance-based outliers, called the DB(k, λ)-Outlier [85], is defined as follows. A point p in a data set is a DB(k, λ)-Outlier [86], with respect to the parameters k and λ, if no more than k points in the data set are at a distance λ or less (i.e., within the λ-neighborhood) from p. This definition of outliers is intuitively simple and straightforward. The major disadvantage of this method, however, is its sensitivity to the parameter λ, which is difficult to specify a priori.

When the data dimensionality increases, it becomes increasingly difficult to specify an appropriate circular local neighborhood (delimited by λ) for evaluating the outlier-ness of each point, since most of the points are likely to lie in a thin shell about any point [87].

Thus, a very small λ will cause the algorithm to detect all points as outliers, whereas no point will be detected as an outlier if too large a λ is picked. In other words, one needs to choose an appropriate λ with a very high degree of accuracy in order to find a modest number of points that can then be defined as outliers.
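The DB(k, λ) definition itself is simple to state in code; this sketch (Euclidean distance, hypothetical parameter values) shows the λ-neighborhood count in action:

```python
import math

def db_outliers(points, k, lam):
    """DB(k, lam)-Outlier: p is an outlier if no more than k other points
    lie within distance lam of p."""
    flagged = []
    for p in points:
        neighbors = sum(
            1 for q in points if q is not p and math.dist(p, q) <= lam
        )
        if neighbors <= k:
            flagged.append(p)
    return flagged

pts = [(0, 0), (0.5, 0), (0, 0.5), (0.5, 0.5), (5, 5)]
outliers = db_outliers(pts, k=1, lam=1.0)  # only (5, 5) is flagged
```

The sensitivity to λ discussed above is visible here: with lam = 20.0 no point is flagged, while with a tiny lam every point is.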

To facilitate the choice of parameter values, this first neighborhood distance-based outlier definition was extended, and the so-called DB(pct, dmin)-Outlier was proposed, which defines an object in a dataset as a DB(pct, dmin)-Outlier if at least pct% of the objects in the dataset lie at a distance larger than dmin from this object.

Similar to the DB(k, λ)-Outlier, this method essentially delimits the neighborhood of data points using the parameter dmin and measures the outlier-ness of a data point based on the percentage, instead of the absolute number, of data points falling into this specified local neighborhood. DB(pct, dmin) is quite general and is able to unify the existing statistical detection methods using discordancy tests for outlier detection. The specification of pct is obviously more intuitive and easier than the specification of k.

## 4.4.4.3 Extension to the kNN-distance Method: e-kNN

An extension from the k nearest objects to the k nearest dense regions has been proposed. This method employs the sum of the distances between each data point and its k nearest dense regions to rank data points, which enables the algorithm to measure the outlier-ness of data points from a more global perspective.

From the local perspective, one examines the point's immediate neighborhood and considers the point an outlier if its neighborhood density is low. The global view considers the dense regions where data points are densely populated in the data space.

Specifically, the neighboring density of the point serves as a good indicator of its outlying degree from the local perspective. In the left sub-figure of Figure 4.3, two square boxes of equal size delimit the neighborhoods of points p1 and p2. Because the neighboring density of p1 is less than that of p2, the outlying degree of p1 is larger than that of p2. On the other hand, the distance between a point and the dense regions reflects the similarity between that point and the dense regions.

Intuitively, the larger this distance is, the more markedly p deviates from the main population of the data points, and therefore the higher its outlying degree. The right sub-figure of Figure 4.3 shows the dense region and two outlying points, p1 and p2. Because the distance between p1 and the dense region is larger than that between p2 and the dense region, the outlying degree of p1 is larger than that of p2.

Figure 4.3: Local and global perspectives of outlier-ness of p1 and p2

## 4.4.4.4 Global Outlier-ness Metrics

The global outlier-ness metrics refer to those metrics that measure the outlier-ness of data points from the perspective of how far away they are from the dense regions (or clusters) of data in the data space [88, 89]. The most intuitive global outlier-ness metric would be the distance between a data point and the centroid of its nearest cluster.

This metric can easily be extended to consider the k nearest clusters (k > 1) [90]. For easy referral, these metrics are termed kNN Clusters metrics. The Largest Cluster metric used in [90] is also a global outlier-ness metric.
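A minimal sketch of this global metric (the cluster centroids are assumed to have been precomputed by a clustering step; names are illustrative):

```python
import math

def global_outlier_score(point, centroids):
    """Global outlier-ness: distance from the point to the centroid of its
    nearest cluster (extendable to the k nearest clusters for k > 1)."""
    return min(math.dist(point, c) for c in centroids)

centroids = [(0.0, 0.0), (10.0, 10.0)]  # assumed output of a clustering step
score_inlier = global_outlier_score((0.5, 0.5), centroids)
score_outlier = global_outlier_score((5.0, 5.0), centroids)  # larger score
```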
