# Optimize Fuzzy Time-Interval Sequential Patterns

Published:

This essay has been submitted by a student. This is not an example of the work written by our professional essay writers.

Abstract. Sequential pattern mining, which discovers frequent subsequences as patterns in a sequence database, is an important data-mining problem with broad applications. From these discovered sequential patterns, we can discover the order of the patterns; however, they cannot tell us the time intervals between successive patterns. Accordingly, Chen et al. have proposed a fuzzy time-interval (FTI) sequential pattern mining algorithms, which reveals the time intervals between successive patterns [12][13]. In this paper, we contributed to the ongoing research on FTI sequential pattern mining by proposing a multi objective Genetic Algorithm (GA) based method. Fuzzy solves the sharp boundary problem and the refinement ability of GA helps to find the global optimum FTI sequential patterns. Our approach uses two measures as objectives, namely: Confidence and Coverage to prune the traditional Apriori algorithm. The main objective is to achieve maximum confidence and maximum coverage in the FTI sequential patterns. The paper defines the confidence of the FTI sequences, which is not yet defined in the previous researches. The main advantage of the proposed algorithm is the use of fuzzy genetic approach to discover optimized sequences in the network traffic data to classify and detect intrusion.

Keywords: Data mining, fuzzy sets, sequence data, time interval, genetic algorithm, intrusion detection system.

1 Introduction

Data mining extracts implicit, previously unknown and potentially useful information from databases. The discovered information and knowledge are useful for various applications, including market analysis, decision support, fraud detection, intrusion detection and business management. Many approaches have been proposed to extract information, and mining sequential patterns is one of the most important ones [1][2][3]. An example of such a pattern is that customers typically rent videos, "Star Wars", then "Empire Strikes Back", and then "Return of the Jedi". Note that these rentals need not be consecutive. Customers who rent some other videos in between also support this sequential pattern.

The security of computer network plays a strategic role in modern computer systems with the widespread use of network. Intrusion Detection Systems (IDS) are effective security tools, placing inside a protected network and looking for known or potential threats in network traffic and/or audit data recorded by hosts. Basically, IDS analyzes information about user's behaviors from various sources such as audit trail, system table, and network usage data. The problem of intrusion detection has been studied extensively in computer security [4][5][6], and has received a lot of attention in machine learning and data mining [7].Intrusion detection techniques can be categorized into misuse detection, which uses Patterns of well-known attacks or weak spots of the system to identify intrusions, misuse detection system, for example, IDIOT [8] and STAT [9], use patterns of well-known attacks or weak spots of the system to match and identify intrusion; and anomaly detection, which tries to determine whether deviation from the established normal usage patterns can be flagged as intrusion, anomaly detection system, for example, IDES [10], flags observed activities that deviates significantly from the established normal (statistical based) usage profiles as anomalies (i.e., possible intrusions). Lee and Stolfo [11] discuss data mining approaches for intrusion detection. With the improvement of intrusion detection means, sometime it is difficult to judge whether an isolated sequence of audit event belongs to intrusion or not, but if a series of events are ranged in time sequencing, we may find anomalies. So the sequence pattern algorithms of data mining techniques are applied in intrusion detection and intrusion pattern rules are found by learning the frequent episodes. A frequent episode is a set of events that occur frequently within a time window. Sequential pattern mining algorithms can help us understand what (time-based) sequence of audit events are frequently encountered together. These frequent event patterns are important elements of the behavior profile of a user or program.

From the discovered sequential patterns, we can know what patterns are frequently brought together and in what order they appear. However, they cannot tell us the time gaps between successive patterns. Accordingly, Chen et al. have proposed a generalization of sequential patterns, called time-interval sequential patterns, which reveals not only the order of patterns, but also the time intervals between successive patterns [12]. An example of time-interval sequential pattern has a form like (A, I2, B, I1, C), meaning that pattern B is followed by pattern A and pattern C is followed by pattern B with the predetermined time interval of I2 and I1 respectively. Although sequential patterns extended with time intervals can offer more information than those without time intervals, the approach may cause the sharp boundary problem. That is, when a time interval is near the boundary of two adjacent ranges, we either ignore or overemphasize it. For example, let the interval of I1 be 5 â‰¤ t < 10 and that of I2 be 10 â‰¤ t < 20, where is the time gap between two successive patterns. Then if the time gap between patterns A and B is near 10, either a little larger or smaller, it is difficult to say the time gap is in I1 or in I2. However, according to the original definition of Chen et al. it can only be one hundred percent in I1 or in I2. This difficulty can be adequately tackled by using fuzzy techniques, for fuzzy set theory allows this time gap to be 50% in I1 and at the same time 50% in I2. This simple example indicates that the fuzzy sets provide a smooth transition between member and nonmember of a set. Two efficient algorithms, FTI-Apriori algorithm and the FTI-PrefixSpan algorithm, are developed by Chen et al for mining FTI sequential patterns [13].

It has been a great challenge to improve the efficiency of Apriori algorithm. Since all the frequent sequential patterns are including in the maximum frequent sequential patterns, the task of mining frequent sequential patterns can be converted as mining maximum frequent sequential patterns. Mining maximum frequent sequential patterns is more important for data mining FTI sequential patterns. Pruning and parallelization can be two efficient technologies to deal with it. In this paper, we have proposed an algorithm which uses GA to prune FTI-Apriori algorithm. Maximum confidence and maximum coverage is used as criteria for pruning the FTI sequential patterns. The main motivation for using GA in the discovery of high-level prediction rules is that they perform a global search and cope better with attribute interaction than the greedy rule induction algorithms often used in data mining.

In this paper, an algorithm is proposed to prune the traditional Apriori algorithm for pattern mining, to mine frequent sequential pattern using FTI sequential pattern in network audit data to classify and detect intrusion.

2 Related Work

The problem of mining sequential patterns was first introduced by Agarwal and Srikant [1] which discovers patterns that occur frequently in a sequence database.

A sequence database is formed by a set of data sequences. Each data sequence includes a series of transactions, ordered by transaction times. After mid 1990's,

following Agrawal and Srikant [1], many scholar provided more efficient algorithms[15][16][17][18]. Besides these, works have been done to extend the mining of sequential patterns to other time-related patterns.

Existing approaches to find appropriate sequential patterns in time related data are mainly classified into two approaches. In the first approach developed by Agarwal and Srikant [14], the algorithm extends the well-known Apriori algorithm. This type of algorithms is based on the characteristic of Apriori-that any subpattern of a frequent pattern is also frequent [1]. The later, uses a pattern growth approach [15], employs the same idea used by the Prefix-Span algorithm.

This algorithm divides the original database into smaller subdatabases and solve them recursively. Previous research addresses time intervals in two typical ways, first by the time-window approach, and second by completely ignoring the time interval. First, the time window approach requires the length of the time window to be specified in advance. A sequential pattern mined from the database is thus a sequence of windows, each of which includes a set of patterns. Patterns in the same time window are bought in the same time period. In the algorithm [17], Shrikant and Agrawal, specified the maximum interval (max-interval), the minimum interval (min-interval) and the sliding time window size (windowsize). Moreover, they cannot find a pattern whose interval between any two sequences is not in the range of the window-size. Agrawal and Srikant [1], introduced mining traditional sequential mining, by ignoring the time interval and including only the temporal order of the patterns.

To address the intervals between successive patterns in sequence database, Chen et al. have proposed a generalization of sequential patterns, called time-interval sequential patterns, which reveals not only the order of patterns, but also the time intervals between successive patterns [12]. Chen et al. developed algorithms to find sequential patterns using both the approaches [12]. Their work, by assuming the partition of time interval as fixed, developed two efficient algorithms -I-Apriori and I- PrefixSpan. The first algorithm is based on the conventional Apriori algorithm, while the second one is based on the PrefixSpan algorithm. An extension of the algorithm developed by Chen et al [12], to solve the problem of sharp boundaries to provide a smooth transition between members and non-members of a set, is addressed in Chen et al [13]. The sharp boundary problems can be solved by the concept of fuzzy sets. Two efficient algorithms, the FTI-Apriori algorithm and the FTI-PrefixSpan algorithm, were developed for mining FTI sequential patterns. There are several other reasons that support the use of FTI in place of crisp interval. First, the human knowledge can be easily represented by fuzzy logic. Second, it is widely recognized that many real world situations are intrinsically fuzzy, and the partition of time interval is one of them. Third, FTI is simple and easy for users.

Anrong et al [20], addresses application of sequential pattern in intrusion detection by refining the pattern rules and reducing redundant rules. Their work implements PrefixSpan algorithm in the data mining module of network intrusion detection system (NIDS). Fuzzy logic addresses the formal principles of approximate reasoning. It provides a sound foundation to handle imprecision and vagueness as well as mature inference mechanisms by varying degrees of truth. As boundaries are not always clearly defined, fuzzy logic can be used to identify complex pattern or behavior variations. And it can be accomplished by building an intrusion detection system that combines fuzzy logic rules with an expert system in charge of evaluating rule truthfulness. Zhou et al [21], incorporates fuzzy logic and GA into the classifying system based on fuzzy association rule to extract both accurate and interpretable fuzzy IF-THEN rules from network traffic data for classification, and utilize GA to optimize the classifier. Zhou et al [21], could detect high accuracy for intrusions and low false alarm rate for a reliable system. Prasad et al [22], proposed a model for intrusion detection systems for anomaly detection based on fuzzy association rules which use genetic programming. Yunwu [23], developed a Fuzzy Expert System Based on GA for Intrusion Detection System, firstly the algorithm draws initial rules by using fuzzy algorithm, and in then, the membership function is optimized by the GA, with simplification of fuzzy rules, to build a high performance fuzzy system.

Existing algorithms, try to measure the quality of generated sequential patterns by considering only one evaluation criterion, i.e., support count. This criterion evaluates the sequential patterns depending on the number of occurrence of the sequences in the entire database. Hence, optimization of the Apriori algorithm is to be addressed. Saggar et al [19], has addressed optimization by using GA. GA performs a global search and copes better with attribute interaction than the greedy rule induction algorithms.

This paper focuses on pruning the traditional Apriori algorithm for pattern mining, to mine frequent sequential pattern using FTI sequential pattern in network audit data to classify and detect intrusion. A pruning algorithm is proposed with the use of GA's global search to prune the FTI sequential pattern mining, Apriori based algorithm to detect high accuracy for intrusions and optimize the network intrusion classifier.

3 Theory

In all the definitions for n patterns in S sequences with sid as the sequence-id in a network traffic pattern T is represented as <sid, S>.

3.1 Sequence Pattern

A pattern-set is a non-empty set of patterns. A sequence is an ordered list of pattern-set. Without loss of generality, we assume that the set of patterns is mapped to a set of contiguous integers. We denote a pattern-set a as (a1a2â€¦an), where aj is a pattern. We denote a sequence S by < s1s2â€¦sn>, where sj is a pattern-set.

A sequence < a1a2â€¦an > is contained in another sequence < b1b2â€¦bm > if there exist integers i1 < i2 < â€¦ < in such that a1 \subseteq \!\, bi1, a2 \subseteq \!\, bi2, â€¦, an \subseteq \!\, bin. For example, the sequence < (3) (4 5) (8) > is contained in < (7) (3 8) (9) (4 5 6) (8) >, since (3) \subseteq \!\, (3 8), (4 5) \subseteq \!\, (4 5 6) and (8) \subseteq \!\, (8). However, the sequence <(3) (5) > is not contained in < (3 5) > (and vice versa). The former represents patterns 3 and 5 occurred one after the other, while the latter represents pattern 3 and 5 occurred together.

3.2 Time interval Sequence Pattern

A sequence ST is represented as ((a1,t1),( a2,t2), (a3,t3), â€¦, (an,tn)), where aj is an pattern and tj stands for the time at which aj occurs, 1 â‰¤ j â‰¤ n, and tj-1 â‰¤ tj, for 2 â‰¤ j â‰¤ n. In the sequence, if patterns occur at the same time, they are ordered alphabetically. The time interval values can be calculated as tij = | tj+1 - tj| , where j = 1,2,â€¦,n-1.

For example, in a sequence S, ((a, 4), (d, 10), (e.28)) the time interval values are 6 and 18. Here a, d and e are the patterns of the network traffic pattern.

3.3 Fuzzy Logic

Fuzzy logic has been a powerful tool for decision making to handle imprecise and uncertain data. In contrast to classical set, a fuzzy set is a set without crisp boundaries; the transition from "belong to a set" to "not belong to a set" is gradual. Membership function is utilized to reflect a degree of membership and indicated by a value in the range [0.0, 1.0].

A Fuzzy set can be defined as - If U is a collection of objects denoted generically by x, then a fuzzy set A in U is defined as a set of ordered pairs:

A= {(x, ÂµA(x))| x Ð„ U},

where, ÂµA(x) is the membership function and U is the universe of discourse.

For example, A one year old baby will clearly be a member of the set, and a 100 years old person will not be a member of this set, but what about people at the age of 20, 30, or 40 years? U = {0, 20, 40, 60, 80},

A = { (0,1),(20,0.6),(40,0.1),(60,0),(80,0)}. Alternative representation can be A = 1/0 + 0.6/20 + 0.1/40 + 0/60 + 0/80. Fig 1 shows the graphical representation.

Fig 1. Graphical representation of a Fuzzy set.

Just like an algebraic variable takes numbers as values, a Linguistic/ Fuzzy Variables takes words or sentences as values. The set of values that it can take is called its "term set". Each value in the term set is a "Fuzzy variable" defined over a "Base variable". The base variable defines the universe of discourse for all the fuzzy variables in the term set. The fuzzy variables themselves are adjectives that modify the variable (e.g. "large positive" error, "small positive" error ,"zero" error, "small negative" error, and "large negative" error). As a minimum, one could simply have "positive", "zero", and "negative" variables for each of the parameters. Additional ranges such as "very large" and "very small" could also be added to extend the responsiveness to exceptional. For example, Let x be a Linguistic/ Fuzzy Variable with the label "Age",

T = { Old, VeryOld,NotSoOld,MoreOrLessYoung, QuiteYoung, VeryYoung} or can be T = {Young, Middle Aged, Old} as depicted in the Fig 2.

Fig 2. Membership Function for a fuzzy set.

3.4 Fuzzy Time Interval Sequence Pattern

Two approaches have been used to determine linguistic terms and fuzzy membership functions. The first approach relies on domain experts to specify the functions based on their background knowledge and requirements. The second approach assumes that the functions are obtained by a preprocessing phase that learns the functions from the data, such as learning by neural-network, by GA, by clustering method, and by entropy measure. Since, in the current fuzzy mining researches, the first approach is more popular than the second one, we also adopt the same assumption that the fuzzy membership functions are as shown in Fig 3.

Fig 3. Membership function for time interval

Âµshort (tij) = (1)

Âµmiddle (tij) = (2)

Âµshort (tij) = (3)

Suppose we want to represent a time interval by using three linguistic terms: Short (S), Middle (M), and Long (L) within a month. Consider a month having 30 days, Short linguistic variable can be defined as, if the time interval between first two events is less than or equal to two days then they are definitely in sequence and has the membership value as 1and if time interval is within 15 days then membership values is calculated according to the slope of the line between 2 and 15. If the time interval is more than 15 then definitely it cannot be represented by variable Short and hence the membership value is 0. Similarly, we can define the membership value for linguistic variable Middle and Long. Their membership functions can be represented by equations (1), (2) and (3) respectively [13].

Let A={a1a2â€¦an}, be the set of patterns and LT = { ltj| j= 1,2,â€¦,l} be a set of all linguistic terms. A sequence Î± = (b1, lg1, b2, lg2,â€¦, br-1, lgr-1, br) is a FTI sequence if bi Ð„ A and lgi Ð„ LT for 1â‰¤ i â‰¤ r-1 and br Ð„ A.

If a sequence Î± is contained in with degree Î³, then we call Î± a FTI subsequence of S with degree Î³. The total number of patterns in a FTI sequence Î± is referred to as the length of the sequence. A FTI sequence whose length is k is referred to as a fuzzy k-time-interval sequence.

SupportS(Î±) = (4)

Support of a FTI sequence Î± is given by equation (4). Given a sequence database and min_sup, the goal of FTI sequential pattern mining is to determine in the sequence database all the FTI subsequences whose supports are more than or equal to min_sup.

support_countS(Î±) = {(sid,s) | (sid,s) Ð„ S Ë„ Î± is contained in s} (5)

ConfidenceS(Î±1 => Î±2) = (6)

Fig 4. Sequence Database

Confidence of FTI sequence Î±1 on Î±2 is given by equation (6) uses the support_count defined in equation (5). For example, consider the sequence database in Fig 4, with the linguistic terms defined: Short (S), Middle (M), and Long (L). For 1-sequence, L1 is given as, a (8), b (7), and e (5) with 8, 7 and 5 as their support respectively. Here confidence of the sequences is calculated according to the equation (5). Hence we get, Confidence (a => b) = 0.625 and Confidence (a => e) = 0.5. Now for 2-sequence, Supp (a short b) = 0.392, Supp (a long e) = 0.361 and Supp (b long e) = 0.308. Accordingly, the confidence ((a short b) => e) = 0.8.

3.5 Genetic Algorithms

Genetic algorithm is a domain-independent method that genetically breeds a population of computer programs to solve a problem. Specifically, GA iteratively transforms a population of computer programs into a new generation of programs by applying analogs of naturally occurring genetic operations. The genetic operations include crossover (sexual recombination), mutation, reproduction, gene duplication, and gene deletion. Steps for GA are shown in Fig 5.

Fig 5. GA algorithm

4 Proposed Algorithm

A pruning algorithm is proposed, to prune the FTI sequential pattern mining. The proposed algorithm uses two measures as objectives, namely: confidence and coverage to prune the traditional Apriori algorithm. The main advantage of the proposed algorithm is the use of fuzzy genetic approach to discover optimized sequences in the network traffic data to classify and detect intrusion.

In the algorithm defined,

Lk - the set of all frequent k-FTI sequences,

Ck - the set of candidate k-FTI sequences.

The algorithm proceeds in phases, in the first phase, for k=1, L1 is found from C1: Clearly, C1 can be generated by listing all distinct patterns in databases. To determine the supports of all patterns in Ck, a tree structure, called fuzzy candidate tree, is used as a basis. Basically, the candidate tree is similar to the prefix tree adopted in previous research [14]. The major difference lies in that the traditional approach connects each tree branch with an item name, whereas in the new approach two components are attached- an pattern name and a linguistic term. In the second phase, having obtained Lk, population is generated randomly of the frequent FTI sequential patterns. On the individuals, GA operators are applied, for example, crossover, mutation, reproduction, gene duplication, and gene deletion. A multi objective fitness function is defined with the parameters to maximum confidence and maximum coverage of the individual.

Maximum coverage as given in equation (9), is considered to get the largest frequent FTI sequence and to achieve accuracy maximum confidence is taken as a parameter as given by equation (8).

The fitness function is defined as in equation (7).

Fitness (I) = Fconfidence(I) * Fcoverage(I) (7)

where,

Fconfidence(I) = ConfidenceS(Î±1 => Î±2) (8)

Fcoverage(I) = (9)

## Fuzzy time interval sequential pattern mining algorithm:

Input: Sequence Database S, Minimum Support min_sup, and Linguistic Terms LT

Output: The complete set of FTI sequential patterns.

C1= find all patterns in S.

L1= {c Ð„ C1| â‰¥ min_sup}

for (k=2; Lk-1 k++) {

Ck= new candidates generated from Lk-1

for each p1 Ð„ Lk-1 {

for each p2 Ð„ Lk-1 {

If (k=2) {

for each ltd Ð„ LT {

c= p1* ltd* p2;

## }

if (k>2)

Build the fuzzy candidate tree from Ck;

for each sequence s Ð„ S

Traverse the fuzzy candidate tree and accumulate the supports;

Lk={c Ð„ Ck | â‰¥ min_sup}

## }

P = randomly generate population from Lk;

While (not termination condition)

Apply GA operators on individuals of P.

Evaluate fitness criteria on individuals of P.

return L'k;

Thus the above algorithm prunes the traditional Apriori algorithm for pattern mining, to mine frequent sequential pattern using FTI sequential pattern in network audit data to classify and detect intrusion.

5 Conclusion

In this paper, we contributed to the ongoing research on FTI sequential pattern mining by proposing a multi objective GA based method for mining FTI sequential patterns. Our approach uses two measures as objectives, namely: confidence and coverage to prune the traditional Apriori algorithm. The main objective is to achieve maximum confidence and maximum coverage in the FTI sequential patterns. The paper defines the confidence of the FTI sequences, which is not yet defined in the previous researches. The main advantage of the proposed algorithm is the use of fuzzy genetic approach to discover optimized sequences in the network traffic data to classify and detect intrusion. Fuzzy solves the sharp boundary problem and the refinement ability of GA helps to find the global optimum FTI sequential patterns.