Intrusion Detection Is A Test Computer Science Essay

Published:

Nowadays, intrusion detection is an essential part of the network systems like routers. Pattern matching is a core operation of intrusion detection. Some of the patterns are specified as regular expression. The common way to perform regular expression matching is using finite automaton (FA). There are two kinds of FAs, one is Deterministic Finite Automaton (DFA) and the other is Non-deterministic Finite Automaton (NFA). DFA can provide constant throughput however it has the state explosion problem. NFA doesn't have state explosion problem but it is relatively slow compared to DFA. A hardware accelerator based on NFA is thus feasible. In this report, management software is built for a hardware regular expression match engine. This match engine is a hardware accelerator which is based on NFA methods. As the hardware circuit cannot process regular expression directly, the software is used to convert the regular expressions to the internal hardware readable format. The software converts or rewrites the regular expression if necessary.

Lady using a tablet
Lady using a tablet

Professional

Essay Writers

Lady Using Tablet

Get your grade
or your money back

using our Essay Writing Service!

Essay Writing Service

1. Introduction

1.1 Intrusion Detection

Intrusion Detection is a test on the intrusion behavior. It collects and analyses network traffic, security log and other available information on the computer network. It collects information on some critical nodes and checks whether there is any behavior that violates security strategies. Intrusion detection is an active security protection technology. It provides real time protection on both internal attack and external attack. It interrupts and responds to the network attack before the attack becomes effective. Intrusion detection is a reasonable additional protection to the computer network systems. It is considered the second security gate after the firewall.

There are two categories of intrusion detection that are used in intrusion detection system. One is anomaly detection and the other is signature-based detection. Anomaly detection monitors the network traffic. If any user's behavior is different from what it should be, it generates an alert or takes some actions on it. The advantage of anomaly detection is that it has a high detection rate [1]. However, in some cases, the attack might take effect already when it detected it. It is also very hard to detect slow-attacks that the attacker gradually slows down the network traffic.

Signature based detection inspect the payload of the network traffic. However, the signature data set should be known in advance. Detecting the signature is also a computation intensive task. In this project, the software is used on a hardware accelerator which is used on a signature-based intrusion detection system.

1.2 Snort Intrusion Detection System

Snort [2] is an open source intrusion detection system. It combines both anomaly detection and signature-based detection. It is considered to be the de facto standard of intrusion detection. In snort signature based detection, a Snort rule specifies the signature. A Snort rule is consists of two parts, rule header and rule options. Two example of Snort rule are shown in Fig. 1. Part of the information in the rules is omitted.

Rule Number

Rule Header

Rule Options

1

alert tcp $EXTERNAL_NET any -> $HOME_NET 21064

(msg:"SQL Ingres Database uuid_from_char buffer overflow attempt"; pcre:"/uuid_from_char\s*\(\s*(\x22|\x27)[^\1]{37}/smi";)

2

alert tcp $EXTERNAL_NET any -> $SQL_SERVERS 3306

(msg:"MYSQL root login attempt"; content:"|0A 00 00 01 85 04 00 00 80|root|00|";)

Figure 1. Example of Snort rules

Rule header specifies the packet protocol and type of the traffic flow while the rule option defines the content. The pcre refers to the Perl Compatible Regular Expression. Pcre means that the inspected content is a regular expression. For those rules that are not regular expressions, they use "content" which means there is no regular expression property in the rule. The content in "content" is the specification of raw bytes.

1.3 Finite Automaton

The common way to detect regular expression is to use finite automaton (FA). There are two kinds of common finite automatons. They are Deterministic Finite Automaton and Non-deterministic Finite Automaton. DFA has following three properties. The first one is that only 1 state is active at any given time. The second one is that for each symbol, only one state transition is made. The last one is that for each state transition, it involves only one symbol. NFA has some different properties than DFA. It allows multiple active states at any time. For each input symbol, the system could make multiple state transitions. DFA offers a constant processing time but it suffers from exponential state explosion problem. NFA offers linear space requirement while the processing can be very slow. For the worst case of NFA, its processing time could be N Square for each input symbol.

Lady using a tablet
Lady using a tablet

Comprehensive

Writing Services

Lady Using Tablet

Plagiarism-free
Always on Time

Marked to Standard

Order Now

Intrusion detection system must have dynamic updates on the pattern set. For NFA, the conventional lookup table method could offer a linear space but the processing speed is slow. Some researchers used FPGA to speed up NFA so that NFA can provide constant speed [3 - 7]. But when the pattern is updates, the FPGA needs to be reconfigured. The reconfiguration takes time and sometimes the reconfiguration is not fully automatic.

The management software is used on a memory based NFA regular expression match engine [8] proposed by Dr. Derek PAO. The match engine is called MX-NFA and it meets the requirements on memory cost, speed and dynamic updates.

1.4 Signature Based MX-NFA

Here is some brief introduction to the MX-NFA designed by Dr. PAO.

The organization of MX-NFA rule table is different from conventional way. Conventional way is to maintain lists of current state and next state. In MX-NFA, it maintains the transition edges but not states. Each transition rule is associated with an input symbol and a set of control flags. Each rule can be enabled or disabled. Only when the transition rule is enabled and the input symbol matches the symbol in the rule, the rule will then fire associated rule. After each cycle, an enabled rule will be disabled automatically unless some control flags are set. The following table is a set of control flags and their corresponding functions.

Control flag

Description

Enable (E)

The rule is active if E=1. By default the E bit is reset automatically at the end of the clock cycle.

Initial (I)

If I=1, the rule is active after initialization (or system reset).

Hold (H)

If H=1, the E bit will not be reset automatically.

Self-activation (S0)

If S0=1, the rule will activate itself in the next cycle.

Sequential activation (S2S1)

The two control bits specify the number of sequential rules to be activated if the rule is fired. Let the current rule be rule i.

S2S1=00 : activate rule i+1

S2S1=01 : activate rules i+1 to i+2

S2S1=10 : activate rules i+1 to i+3

S2S1=11 : activate rules i+1 to i+4

Output (O)

If O=1, an output signal is generated when the rule is fired.

Figure 2 Control Flags in MX-NFA. [8]

When the E (Enable) bit of a rule is set, it means that the rule is enabled. When the input character matches the symbol of the rule, this rule will fire next few rules according to S2S1. Normally, this rule will be disabled in the next cycle until H (Hold) bit is set or by some other events.

Here is an example of using MX-NFA to detect regular expression. The regular expression is "\d.[\t]*a". The input sequence is "abc12a"

MX-NFA rule table

Changes of the E-bit when process the inputs "abc12a"

Entry

Symbol

Control flags

t=0

t=1

t=2

t=3

t=4

t=5

t=6

I

H

O

S2S1

S0

reset

'a'

'b'

'c'

'1'

'2'

'a'

E1

\d

1

1

0

00

0

1

1

1

1

1(F)

1(F)

1

E2

.

0

0

0

01

0

0

0

0

0

0

1(F)

1

E3

[\t]

0

0

0

00

1

0

0

0

0

0

0

1

E4

a

0

0

1

00

0

0

0

0

0

0

0

1(M)

E5

null

0

0

0

00

0

0

0

0

0

0

0

0

Figure 3 Example of E bit change for an example of regular expression "\d.[\t]*a" when the input sequence is "abc12a". When this rule is fired, it is in italic and there is an 'F' besides it. When the rule is a match, there is an 'M' besides it.

When t = 0, it is the reset cycle. E1 is E bit is set to one because I bit is set. I bit means that its E bit set during the initializing cycle. When t = 1, the input is 'a', it does not match any enabled rule. E1 is not disabled automatically because the H bit is set. When t = 2 and t = 3, the situation is the same as t = 1. When t = 4, E1 is fired as 1 is a digit ('\d' refers to the digit [0-9].) E2 will be enabled in the next cycle. When t = 5, the input character is 2, it matches both '\d' and '.'. '.' Is the wildcard character. E1 will fire E2 in the next cycle. E2 will fire E3 and E4 in the next cycle according to S2S1. The reason for set S2S1 to 01 is that the rule E3 is a zero to infinity match, so next character could be the rule after it. As a result E2 will fire both E3 and E4 in the next cyle. When t = 6, the input character is a, it matches E4 and it will generate the output signal. For the last entry, it is called null entry which means no input symbol would match this rule. Its usage is to block this rule from influencing next regular expression when it is in actual usage.

Lady using a tablet
Lady using a tablet

This Essay is

a Student's Work

Lady Using Tablet

This essay has been submitted by a student. This is not an example of the work written by our professional essay writers.

Examples of our work

The following figure shows a hardware structure of the MX-NFA. Since this MX-NFA supports byte oriented packets, each input symbol has 256 possibilities. 256 * 36 bit array is used to store the character symbol. The comparison result is stored in the memory register. The memory register outputs the 'm' to the control logic.

The Boolean equation for the control logic is as follows.

A rule's E bit is set according to the following situation.

(1) It is enabled and its H bit is set.

(2) It is enabled, its S0 is set and input symbol matches this rule.

(3) It is activated by any 4 rules before this rule.

Figure 4 Block diagram for MX-NFA basic match unit.

Then it comes to the count module. There are some requirements on the regular expression that allows it to use the count module. The most important one is that the start counting should be precisely defined. The management software takes care of these cases. These cases will be discussed in the later section. The following diagram is the block diagram for the MX-NFA with count module. Using the count module is to handle the repetition count like {n, m}

Figure 5 MX-NFA with count module

The control logic has additional R bit and C bit for count module. It also has 3 input signals CEL, CEU and Br to interact with count module. It has AC output signal to activate the count module. The count module has 2 control flags, N bit and U bit. It has 2 counters that are used to store lower bound value and upper bound value respectively. The values of the internal registers are preloaded. The U bit stands for unbounded. If U bit is set, the initial upper bound value is set to non-zero. Setting U bit is to handle the {n,} cases which is an infinity upper bound case. If U is set, the upper bound will be frozen during counting process and the upper bound will never be counted to zero. N bit is control the counter action when dealing with AC signal. N bit stands for not renewable. If N bit is set, the count module will not be activated again when an AC signal is regenerated. CEL and CEU signals are stands for lower bound counter event and upper bound counter event. CEL means lower bound counter counts down to zero and CEU means upper bound counter counts down to zero. Br signal means break signal and it is sent when the input symbol does not match the symbol in the count module. The following table shows how the match unit reacts to the count event signal.

Control flag

Description

Respond (R)

If R=1, respond to count-event and break signals:

set E-bit upon the count event of the lower-bound counter CEL

clear E-bit upon the count event of the upper-bound counter CEU

clear E-bit upon the break signal Br

Activate Counter (C)

If C=1, sends an activate signal AC to the count module when the rule is fired.

Figure 6 Control Flag R and C's action to the count event signals

Usually, the R bit of the rule after the count module is set. If R is set and CEL is generated, the rule sets the E bit as the lower bound counter is fulfilled, it can be activated. If CEU is generated, the E bit is cleared as the upper bound is fulfilled; the task of count module is finished. If Br signal is generated, E bit is cleared as count condition if not fulfilled; there is no need to activate this rule. If C bit is set, it activates and sends AC signal to the count module. Usually, the C bit of the rule before the count module is set.

Here is an example of using count module. The regular expression is "/^ab[^\n]{3,5}cd/m". /^…/m means that this regular expression can appear at the beginning or start with new line. Suppose the input stream is "\nab1234cd"

MX-NFA rule table

Changes of the E-bit when process the inputs

"\nab1234cd"

Rule

Symbol

Control flags

t=0

t=1

t=2

t=3

t=4

t=5

t=6

t=7

t=8

t=9

I

H

O

S2S1

S0

R

C

reset

'\n'

'a'

'b'

'1'

'2'

'3'

'4'

'c'

'd'

cm

[^\n]

--

--

--

--

--

--

--

--

--

--

--

--

--

--

--

--

--

E1

\n

1

1

0

00

0

0

0

1

1(F)

1

1

1

1

1

1

1

1

E2

a

1

0

0

00

0

0

0

1

0

1(F)

0

0

0

0

0

0

0

E3

b

0

0

0

00

0

0

1

0

0

0

1(F)

0

0

0

0

0

0

E4

null

0

0

0

00

0

0

0

0

0

0

0

1

0

0

0

0

0

E5

c

0

1

0

00

0

1

0

0

0

0

0

0

0

0

1

1(F)

1

E6

d

0

0

1

00

0

0

0

0

0

0

0

0

0

0

0

0

1(M)

E7

null

0

0

0

00

0

0

0

0

0

0

0

0

0

0

0

0

0

Figure 7 Example of using count module

The rule before E1 is the count module rule. This rule is to detect whether input symbol matches the count module. This regular expression has "/^…/m" can appear at the beginning or start with a new line. As a result, E1 is "\n" and H bit is set. The second rule is E2 and I bit is also set. When t = 0, it the system reset. E1 and E2 is enabled. When t = 1, the input symbol is "\n", E1 fires E2. When t = 2, the input symbol is 'a', E2 fires E3, E1 remains active due to H bit is set. When t = 3, E3 fires E4 which is a null rule. Null rule is used to separate subsequent fire. If without null rule, E3 will fire E5 which is not correct. E5 will handle according to the C bit and count module. E3 activate the counter as its C bit is set. The counter starts counting in t = 4, the initial value for lower and upper bound counter is 3 and 6. When t = 4, 5, 6 the counting process is going on. When t = 6, the lower bound counter generates a count-event signal CEL. R bit of E5 is set, so it responses to the CEL. It sets the E bit. When t = 7, the counting is still going on. The input symbol is '4' which only matches count module but not E5. E5 is not disabled as H bit is set. When t = 8, the input symbol is 'C', E5 fires E6. When t = 9, E6 generates an output signal. At this time the upper bound counter counts down to zero. If there is a next cycle t = 10, the enable bit of E5 will be cleared.

2. Management Software

The major task for this management software is to convert the regular expression (regex) to the internal rule entries and their corresponding control flags. During the conversion process, some of the regular expressions need to be rewritten to be compatible with the hardware.

2.1 Supported Regular Expression Features

Here is a table of supported features in the MX-NFA hardware. So the software also supports the same list of features.

Regex symbol

Feature

*

Kleene star. Matches zero or more times.

+

Plus-closure. Matches one or more times.

?

Match zero or one time.

^

Match pattern at start of input, or start of a line (with /m option)

.

Wildcard character

(a|b)

Alternate character, e.g. 'a' or 'b' (a special case of char. subclass)

[a-z], \d, \s, etc.

Character subclass, e.g. 'a' to 'z', decimal digits, white-space char.

[^a]

Negation of character class, e.g. not 'a'

{n}, {n,m}, {n,}, {,m}

Discrete repetitions of a character (repetition of substring with 2 or more characters is not supported)

Figure 8 Table of supported regex features [8]

2.2 Background Information of Management Software

This software is written in Java. No external libraries are referenced. Currently, this software runs on PC. However, some parts of the program take hardware runnable program into consideration, these parts can be modified easily to be adapted to the hardware.

2.3 Data flow of the program

The input of the program is the regular expression that contains features in the table above. The output of the program is the setup of the regular expression in the hardware. A typical output is shown below.

C:\Users\Robert Yang\Documents\CityU\Final Year\Final Year Project\Code\Demo\Ra.jpg

Figure 9 Hardware setup for the regex "\x5Csv\s+[^\x7D]*\x3B[^\x7D]*\x3B[^\x7B]{12}" a dash stands for 0 at particular position. If there is an alphabet in this position, it means that this is set.

The program process the regular expression into six steps:

1. Read in the regular expression (Read)

2. Preprocess the regular expression in "Java String" format (Preprocess)

3. Convert the preprocessed String into a collection of "Pair" (Conversion 1)

4. Rewrite the regular expression according to the hardware (Further Process)

5. Convert the collection of "Pair" into a collection of "Rule" (Conversion 2)

6. Output the "Rule" in the human readable format (Output)

2.3.1 Read

The first stage is quite simple. It reads from a file in the current version and it put into a Java String. Then it handles for the preprocessing stage.

2.3.2 Preprocess

The preprocessing stage operates on a String. It is divided into the following sub stages.

2.3.2.1 Adjacent Special Character

In some cases, there are some formats like this "(a|b)?*" or "(a|b)?+". Actually, these kind of adjacent special characters can be simplified into one "*". "?" means zero or one time, "*" means zero to infinity time, the combination of "*" and "?" can be replace by one single "*". This sub stage also guaranteed an assumption used in the following stage. There is at most one special for each character sub class. The actual program traverse the whole string, if there is two adjacent special characters, change these two into one "*".

2.3.2.1 Question Mark after Bracket

In some case, there are some formats like "(abcf)?". Although MX-NFA is not able to process "(abcf){2,8}" such case, it is able to process a single question mark after a bracket that contain multiple characters. The way to process single question is to expand the original regular expression into two. For example "ab(abcf)?de" will be expanded into two regular expression, "abde" and ababcfde". Any string match one of these two regular expressions will be considered to match original one. The way to handle this case in the program is to convert this into "ab(abcf|)de". The meaning of the regular expression is the same as previous one. Then the next preprocessing stage will take care of this situation.

2.3.2.2 Preprocess Brackets and Reference Blocks

This stage is to dereference the reference blocks and eliminate the brackets by expanding the regular expression. There is a kind of feature in the regular expression called "reference block". For example, "(ab|cd) ef(gh|ij)kl\2", in this regular expression, there is a "\2" in it, it refers to the bracket of "(gh|ij)". Some regular expression is even more complex which have multiple levels of brackets. In such cases, reference block is to reference the brackets in the same level. As a result, in the example "(ab|cd) ef(gh|ij)kl\2", the original regular expression will be "(ab|cd) ef(gh|ij)kl(gh|ij)". So final result of expansion of "(ab|cd) ef(gh|ij)kl\2" would be:

1. "abefghklgh"

2. "abefijklij"

3. "cdefghklgh"

4. "cdefijklij"

There is only 4 cases instead of 8 cases. The reason behind is that there is one more rule for reference blocks is that the reference blocks should have same pattern with previous one. Thus "abefghklij" is not acceptable.

The example above also demonstrates the expected output of this sub stage. The idea of processing brackets is to use "General Tree" because the level of brackets can be very large. The program constructs a general tree on the brackets and then performs pre-order traversal to expand the regular expression. During this process, some unnecessary brackets are eliminated and "(a|b|c)" such case is converted into [abc]. The detailed algorithm will be discussed in the following section about class design.

After processing these brackets, the output regex does not contain the bracket.

2.3.3 Conversion 1

As String is not easy to process in this application, this stage converts the String into an internal format called "Pair". Detailed implementation of class "Pair" will be discussed in the following section about class design. "Pair" is basically a character sub class plus two counter values. You will find any regular expression can be generalized into pairs of character plus its quantifier. As required in the hardware, the character sub class is a bit vector. In this program, the character sub class is an array of integers. For the quantifier, two values are used to store the quantifier, lower bound and upper bound. For a single character, lower and upper bound is set to 1. For question mark, lower bound is set to 0 and upper bound is set to 1. There are also supporting methods to determine the exact type of this Pair.

During the conversion, all kinds of format representing the character sub class will be generalized into one single bit vector.

After this stage, further process is done only on the collection of pairs.

2.3.4 Further Process

In this stage, the further process is divided into 5 sub stages including some important counter handling.

2.3.4.1 Multiple Question Marks

In some cases, the regular expression would be like "a?a?a?a?". In such case, it is converted into "a{0-4}"

2.3.4.2 Special Character before Counter

Here is an example for the problem "\s+[^\s]{200}". During actual hardware, when "\s+" is fired, it will be self-activated if the input symbol matches. If this case happens, it will keep send ac signals to the counter which will cause some problems. So the solution is to unroll the counter once. The regex would be like "\s+\S[^\s]{199}". Thus this problem is solved. However, in this example, \s has no overlap with \S and unrolling once can solve the problem. If the character before the "+" and the character in the counter have come overlap, it would be difficult to determine what is the start time of the counter. Time to start counting is critical in this hardware. So under this situation, fully unrolling the counter is a way.

2.3.4.3 Counter Handler

Back to the regular expression, the regex can be considered in this format. "prefixString[character sub class]{n,m}suffixString ". There are in total 5 situations:

1. Suffix String is empty. There is no need to change.

2. n is equal to zero. Under this situation, no further processing Is required.

3. Prefix String cannot be contained in the counter. For example, "ab[^\s]{3,5}cd", under this situation, no further processing is required as length of "ab" is smaller than the lower bound count 3.

4. Prefix String can be contained in the counter and the counter is a fixed count. For example, "cd.{7}ef", under this situation, cd can be contained in the counter, so the counter has to be unrolled so that cd and the unrolled part cannot be contained in the counter. This time, it should be "cd….{4}ef". This situation requires minimum of unroll 3 times. Length of "cd…" is larger than the counter value 4.

5. Prefix String can be contained in the counter and the counter is a range count. The reason is similar to the condition 4. The unrolling is done on the lower bound part only. For example, "ab.{4-9}cd" is changed into "ab…..{0-5}cd".

2.3.4.4 Back to Back counter

After unrolling the counter, if 2 counters are back to back, the hardware circuit cannot support this kind of situation. In the hardware implementation, for each single rule, it cannot be connected to 2 counters which means R bit and C bit cannot be set on the same rule. As a result the counter should be unrolled some time to "create" some rules between the counters. For example "ab.{20}[^\n]{30}cd" should be changed to "ab.{18}..[^\n]{30}".

This is acceptable only at the software level. However, at the hardware level, in order to save the hardware resources, every 36 rules is equipped with one counter. So the unrolling may be done more in order to eliminate back to back situation. If this is the case, the regex will be unrolled to "ab………………..[^\n]{30}" in order to fulfill the requirement.

2.3.4.5 Counter Appearance at the End

Some special handling needs to be done if the counter appears at the end. If the counter is a range and count to infinity counter, it can be changed to a fixed count since fulfilling the lower bound is already fulfilling the output requirement.

If there is a counter at the end, it has to be unrolled once. There must be some rules whose R bit is set and then output the result. For example, "abc{10}" will be changed to "abc{9}c".

2.3.5 Conversion 2

This process is to convert the processed collection of "Pair" into "Rule". Rule is another class. It represents a rule in the hardware. It contains the control flags and if R or C is set, it stores which count module to activate or respond.

Here are some conditions that a particular control flag is set.

Let's see the function table again.

Control flag

Description

Set Condition

Enable (E)

The rule is active if E=1. By default the E bit is reset automatically at the end of the clock cycle.

Enabled during the matching process

Initial (I)

If I=1, the rule is active after initialization (or system reset).

Set if it is the first pair or the second pair and the first pair is start with new line(/^…/m case).

Hold (H)

If H=1, the E bit will not be reset automatically.

1. First rule

2. The rule after counter and the counter's upper bound is not infinity.

Self-activation (S0)

If S0=1, the rule will activate itself in the next cycle.

1. "+" or "*"

2. The rule after counter and upper bound is infinity or fixed count

Sequential activation (S2S1)

The two control bits specify the number of sequential rules to be activated if the rule is fired. Let the current rule be rule i.

S2S1=00 : activate rule i+1

S2S1=01 : activate rules i+1 to i+2

S2S1=10 : activate rules i+1 to i+3

S2S1=11 : activate rules i+1 to i+4

See how many rules after it that has "?" or "*"

Output (O)

If O=1, an output signal is generated when the rule is fired.

Set if it is the last pair

Respond (R)

If R=1, respond to count-event and break signals:

set E-bit upon the count event of the lower-bound counter CEL

clear E-bit upon the count event of the upper-bound counter CEU

clear E-bit upon the break signal Br

Set if it is the rule after the counter

Activate Counter (C)

If C=1, sends an activate signal AC to the count module when the rule is fired.

Set if it is the rule before the counter

U (Unbounded)

If U = 1, the upper bound counter is frozen

Set if upper bound is infinity

N (Not renewable)

If N = 1, it will ignore the activation signal if it is counting

Set if lower bound is zero or it is the last counter

Figure 10 Conditions that corresponding control flags are set

This stage will output 2 arrays. One is an array of count module rules and the other is an array of regular rules.

2.3.6 Output

The output is to print the rules out. Besides, it prints the resources information such as how many rules, count modules and how many cases it is expanded.

2.4 Class Design and Algorithms Used

2.4.1 Class Design

The major classes used in MX-NFA is "MXNFA", "Pair", "Rule", "GTree", "Decision" and "MutableInt".

"MXNFA" works as the main class. It reads the input and be responsible for the output.

"Pair" is to generalize the String regex.

"Rule" is to represent the rule in the hardware.

"GTree" is used to handle the brackets.

"Decision" and "MutableInt" are used in the recursive call when expanding the regex.

2.4.1.1 Pair

Pair has 4 data fileds.

Instance Variables

subClass is an integer array to store the character sub class bit vector. lowerBound and upperBound are store the lower and upper bound value respectively. originalCharacter is to store the original human readable symbol.

Constructors

There are 2 constructors. The first one is to accept a String as the character subclass. This constructor will call preprocess to convert this String into bit vector. The other constructor is to use another pair's character subclass as its own sub class.

Methods

Preprocess is to convert the character sub class into its bit vector. Escape is to deal with escape characters such as "\d", "\s" and so on.

SameSymbol, overlap and match are 3 comparison methods which compare the character sub class with another pair. SameSymbol means the same symbol, overlap means 2 character sub classes has overlap and match means that this is a match.

getCounterType, isCounter, isFixedCount, isRangeCount, isInfinityRangeCount, isStartWithNewLine. These 5 methods are used to determine the type of this pair. It can be a regular one or a counter or with special characters ("?", "+", "*") or it is the first pair.

getUpperBound and getLowerBound are 2 getters for the lower upper bound. setUpperLowerBound, decreaseLowerUpperBound and decreaseLowerBoundToZero are 3 methods to manipulate the lower and upper bound. It is mainly in unrolling. For setters, these methods may throw IllegalArgumentException to indicate misuse of these methods.

getSymbol, getBinarySymbol and getNumberOfSymbols are 3 methods to get the information on the character sub class.

Print is for debugging purpose and provides information of this pair to some extent.

2.4.1.2 Rule

Instance Variables

It has a full set of corresponding control flags used in the hardware. It has a pair which is able to get the character sub class. It also has activation and respond. These 2 variables are to be used when it is a counter rule. In a regular rule, they are set to null.

Constructor

There is only one constructor for Rule. They input variables are the full set of control flags used in the hardware.

Methods

A large amount of methods in this class are used as getter and setter only. Only these 2 print methods are working for the final output. The print without counterNumber and activation is for the regular rule and the other is for counter rule output.

2.4.1.3 MutableInt

This class is designed to be used in the recursive call. As in Java, primitive data type is pass by value and object is pass by reference. It should use object in the recursive call if this variable is to be modified and tracked during the recursive call. However, the Integer object provided by Java is immutable. So I have to create a MutableInt for my purpose.

Constructor

Default constructor is to construct a int with value 0. Another constructor is to construct an int with value provided.

Methods

Increment is to increase 1. Another is to increase delta.

getValue is to return the value.

2.4.1.4 Decision

This is a support class used in the recursion when processing brackets.

Here is an example for using this class.

Create an integer array for example [2,3]. The elements in this array means there is 2 options for the first one the 3 options for the second one.

The hasNext variable is true now.

The variable current is [0,0].

Call nextDecision

Current changed to [0,1]

Call nextDecision

Current changed to [0,2]

Call nextDecision

Current changed to [1,0]

Call nextDecision

Current changed to [1,1]

Call nextDecision

Current changed to [1,2]

Call nextDecision

Current changed to [0,0]

When it changes to [0,0] the second time, hasNext changed to false.

When expanding the brackets, some brackets has alternations, this array stores the current decision I should make and advance the decision after expansion of 1 case is done.

2.4.1.5 GTree

This class is designed to store the hierarchical structure of the brackets.

Instance Variables

Start and end are 2 integers to store the left and right brackets. If it is the Root node, it is the start and end position of the regex.

Parent means the parent of this node. If it is Root, the parent is set to null.

Children is a list of children under this node. In some cases, brackets can contain brackets.

Alternation is to store the positions of "|" under this node or brackets. Those "|" which in the next level is not included.

levelNumber indicates which level is the node in.

nodeNumber indicates the creation sequence when reading the regex.

Constructor

The construct requires only straight forward information. The reason end is not in the constructor is because this is used in the reading process of regex. When I read a left bracket, it is time to create one node although end is not sure. However, there is a method called setEnd for complete this node.

Methods

Most methods are standard getter and setters. There is only 1 method left that is not a getter or setter. It is the split.

Consider this case ((No.1)(No.2)|(No.3)|)

If call split(0), it will return an array of No.1 and 2.

If call split(1), it will return an array of No.2

If call split(2), it will return an array which size if zero.

If call other number, Exception will be thrown.

Split is basically a method that returns the children inside particular partition.

2.4.1.6 MXNFA

This class does not have constructors. It has only static methods.

The method converRegex accept an String input as the file name and output the converted regex table.

isSpecialChar examines whether the input character is one of "?", "+" or "*".

IsTrueChar determine whether a char at this particular position of a regex is a true char. True char means that it isn't an escape character. For example the string "\(". If call isTrueChar(1, "\("), it will return false.

The ALLOWABLE_DISTANCE is used for the back to back counter. Currently it is 3, it can be modified to larger number according to the hardware.

2.4.2 Algorithms used

The most complex algorithm used is in the bracket processing. Other algorithms are just traverse the whole array or collection and then it is done.

2.4.2.1 Bracket Processing