Sequential Pattern Mining Methods A Snap Shot Computer Science Essay

Published:

This essay has been submitted by a student. This is not an example of the work written by our professional essay writers.

Increased application of sequential pattern mining requires a perfect understanding of the problem and a clear identification of the advantages and disadvantages of existing algorithms.Mining sequential patterns has been focused theme in data mining research for over a decade. Sequential Pattern Mining (SPM) is promising approach to analysed potential sequential pattern. SPM is mainly deal with finding the behaviour of a sequential pattern that can help in many analyzing applications like market basket analysis, recent trends, buying patterns of customer, predicting next event etc.SPM algorithms are broadly categorize into two basic approaches: Apriori based and Pattern growth. Most of the sequential pattern mining methods follow the Apriori based methods, which leads to too many scanning of database and very large amount of candidate sequences generation-testing, which decrease the performance of the algorithms. Pattern growth based methods solve all above problems and in addition, it works on projected database which minimize the search space. Paper reviews the existing SPM techniques, compares various SPM techniques theoretically and practically. It highlights performance evaluation of each of the techniques. Finally, a discussion of the current research challenges and pointed out future research direction in the field of SPM.

I. Introduction:

Data mining problem, discovering sequential patterns, was introduced in [3]. The input data is a set of sequences, called data-sequences. Each data- sequence is a list of transactions, where each transaction is a set of literals, called items. Typically, there is a transaction-time associated with each transaction. A sequential pattern also consists of a list of sets of items. A sequence is maximal if it is not contained in any other sequence. A sequence with k items is called a k-sequence.

In addition to introducing the problem of sequential patterns, [3] presented three algorithms for solving this problem, but these algorithms do not handle time constraints, sliding windows, or taxonomies. Two of these algorithms were designed to solve only maximal sequential patterns; however, many applications require all patterns and their supports. The third algorithm, AprioriAll, find all patterns; its performance was better than or comparable to the other two algorithms which are introduced in [3]. AprioriAll is a three-phase algorithm.

It first finds all item- sets with minimum support (frequent itemsets),

transforms the database so that each transaction is replaced by the set of all frequent itemsets contained in the transaction,

Then finds sequential patterns.

There are two problems with this approach: First, it is computationally expensive to do the data transformation. Second, while it is possible to extend this algorithm to handle time constraints and taxonomies, it does not appear feasible to incorporate sliding windows.

Srikant and Agrawal [3*] generalized their problem to include time constraints, sliding time window, and user-defined taxonomy, and presented apriori-based improved algorithm GSP (i.e., generalized sequential patterns). It also work on heuristic, Any super pattern of a non frequent pattern ca not be frequent.GSP [3], adopts a multiple-pass, candidate generation-and-test approach.

SPIRIT algorithm is to use regular expressions as flexible constraint specification tool [4].

For frequent pattern mining, a frequent pattern growth method called FP-growth [9] has been developed for efficient mining of frequent patterns without candidate generation.

FreeSpan (Frequent pattern-projected Sequential pattern mining) [8], which reduce the efforts of candidate subsequence generation.

Another and more efficient method, called PrefixSpan [10] (Prefix-projected Sequential pattern mining), which offers ordered growth and reduced projected databases. To further improve the performance, a pseudoprojection technique is developed in PrefixSpan.

In the last decade, a number of algorithms and techniques have been proposed to deal with the problem of sequential pattern mining. The main approaches to sequential pattern mining are - AprioriAll [1], GSP [7], SPADE [8], PrefixSpan [5] and Spam [3]. From these, GSP and PrefixSpan are the best-known algorithms. A comprehensive performance study shows that PrefixSpan, in most cases, outperforms the apriori-based algorithm GSP, FreeSpan, and SPADE [5] (a sequential pattern mining algorithm that adopts vertical data format) and PrefixSpan integrated with pseudoprojection, is the fastest among all the tested algorithms.

The Sequential pattern mining can be further extended to mining multidimensional sequential patterns; time-interval sequential pattern, closed sequential pattern and constraint based sequential patterns.

This survey paper mainly focuses on SPM based on Association rule mining (ARM). Basically there are two main methods to find the association of data items: (1) Apriori based method which is work on Generate and Test (2) Frequent pattern Growth (FP-Growth) which is Graph-based method, which overcome problems of Apriori based methods. Both the methods are worked on frequency (minimum support).

II. Justification of Area

Data Mining is task which is finding Interesting and useful information from large data amount, which can be used in numerous areas. It can be applicable in many domains like, web-log analysis, medical record analysis, retail marketing , stock analysis, telecommunication field etc. Lot of work already been done on SPM. Environment may vary constantly. So, it is necessary to understand up-coming trend. Different sets of rules are used to identify sequence pattern but rules may change over a time period. So, It is necessary to indentify and incorporate novel rules in algorithm and design more efficient sequential pattern mining methods which is capable enough to identify new trends.

III. Apriori based mining algorithm

The Apriori [1] [Agrawal and Srikant 1994] and AprioriAll [Agrawal and Srikant 1995] worked on “All nonempty subsets of a frequent itemset must also be frequent.” It’s worked on basic strategy of Generate-Test.

Which follows below steps :

Generate candidate

Scan DB for each candidate

Test candidate support count with minimum support count

Technique suffers from repeated scanning of database and huge sequence of candidate generation

which decreases the efficiency.

Apriori-based SPM Algorithms:

The sequential pattern mining problem was first proposed by Agrawal and Srikant in [2], and the same authors further developed a generalized and refined algorithm, GSP [20], based on the Apriori property [1]. Since then, many sequential pattern mining algorithms have also been proposed for performance improvements. Among those, SPADE [24], and SPAM [4] are quite interesting ones. SPADE is based on a vertical id-list format and uses a lattice-theoretic approach to decompose the original search space into smaller spaces. Both SPADE outerforms GSP. SPAM is a recently developed algorithm for mining long sequential patterns and adopts a vertical bitmap representation. Its performance study shows that SPAM is more efficient in mining long patterns than SPADE .however, it consumes more space in comparison with SPADE.

Apriori-based Methods are mainly categorized into following:

Apriori-based, horizontal formatting method, with GSP Srikant and Agrawal

(1996) as its representative;

Apriori-based, vertical formatting method, such as SPADE (Zaki, 2001); and

Projection-based pattern growth method, such as SPAM (Ayres et al., 2002).

Table 1: Comparative study of Apriori-based Algorithm

Apriori-based Algorithm

Algorithm

GSP

(Generalized Sequential Pattern)

SPADE (Sequential PAttern Discovery using Equivalent Class )

SPAM ( Sequential Pattern Mining )

Initial Idea Given by

Agrawal and Shrikant [3]

M. Zaki [5]

Ayres et al. [6]

Key features

Generate & Test

-A vertical format

-Reduce the costs for computing support counts

-Improvement of SPADE

- Reduce cost of merging

- Lattice search techniques

-Sequences are discovered in only three database scans

Working

Scan DB for frequent item/candidate

If the candidates do not fit in memory, generates only those candidates will fit in memory.

If sequence is frequent are written to disk; else removed

-Divide the candidate sequences into groups by items.

-ID-List technique to reduce the costs for computing support counts.

-Represent each ID-list as a vertical bitmap

-data set stored by <CID,TID,Itemsets>

where,CID: customer-id, TID: transaction-id based on the transaction time

Location Memory wise

Not a main-memory algorithm

ID-List completely stored in the main memory

<CID,TID,Itemsets> Completely stored in the main memory

Data Structure

candidate sequences are stored in a hash-tree

Hash-tree (ID â€"list)

vertical bitmap

Limitation

-Multiple scanning

-Multiple passes over the database

-Same pair is recorded more times when it/they appear(s) more than once in the same customer sequence

-repeatedly merge the ID-list

(Customer id list,transaction id list and itemset ) Information triplet should be in main memory.

IV. Frequent pattern Growth (FP-Growth) mining algorithm:

Pattern growth-method [7] is the solution method of limitations of the apriori-based methods. It come up with solution of the problem of generate-and-test. Its work on following key features:

(1)Avoid the candidate generation step.(2) focus the search on a restricted portion of the initial database.

Work on following:

Scan DB once, find frequent 1-itemset (single item pattern)

Order frequent items in frequency descending order

Scan DB again, construct FP-tree

It is faster than Apriori because candidate generation-testing is not performed, which uses Compact data structure and it eliminate repeated database scanning. Basic operation performed is counting and FP-tree building.

Frequent pattern Growth based SPM algorithm:

FreeSpan [8] was developed to substantially reduce the expensive candidate generation and testing of Apriori. FreeSpan uses frequent items to recursively project the sequence database into projected databases while growing subsequence fragments in each projected database. While PrefixSpan adopts a horizontal format dataset representation and mines the sequential patterns under the pattern-growth paradigm: grow a prefix pattern to get longer sequential patterns by building and scanning its projected database. PrefixSpan out performs almost all existing algorithms.

Table 2: Comparative study of FP-Growth based Algorithm

FP-Growth based algorithm

Algorithm

FreeSpan (Frequent pattern-Projected Sequential pattern mining )

Prefixspan (Prefix-Projected Sequential Pattern Mining)

Initial Idea Given by

Key features

Reduce candidate generation(basic feature of FP-growth)

Work on projected database

Reduce candidate generation(basic feature of FP-growth)

Work on projected prefix database (Less projection)

Core idea of projection

projected sequence database on based of frequent item

Scan DB & find frequent items

Recursively &Database projection based on frequent Prefix

Optimization

--

(1) Bi-level:

Partition search space based on length-2 sequential patterns

(2)pseudo-projection:

Pointer refer pseudo-projection in sequence DB

Projection information: pointer to the sequence in database and offset of the postfix in the sequence.

Limitation

Database projection cost

Prefix Database projection cost (lower than Database projection cost for frequent item )

advantage

Reduce search space

Projection is based only on frequent prefixes which minimize search space

V. Experimental Results:

In this section we have performed a simulation study to compare the performances of the algorithms: the Apriori [1], PrefixSpan [8] and SPAM [10], Comparison is based on runtime, frequent sequence patterns, memory utilization on various (10 % to 60%.) minimum support threshold.

These algorithms were implemented in Sun Java language and tested on an Intel Core Duo Processor with 2GB main memory under Windows XP operating system. Following is the description of Dataset:

Table: Description of Dataset

Dataset

contextPrefixSpan.txt

Number of distinct items

7

Average number of itemsets per sequence

5.0

Average number of distinct item per sequence

5.5

Average number of occurences in a sequence for each item appearing in a sequence

1.4090909090909092

Average number of items per itemset

1.55

Figure 1. Execution Times of algorithms

Figure 2. No. of patterns verses Support count

Figure 3. Memory utilization of algorithm

Fig.1 shows that when support threshold is very low Apriori based algorithm takes more time for execution. In same state Prefixspan is taking very less time for execution. PrefixSpan consumes a much smaller memory space in comparison with Apriori and SPAM. In case of lower threshold, Apriori and SPAM algorithms extract large no. of sequential pattern but for those algorithms generates small no. of patterns in case of higher threshold value.

VI. Conclusion and Future Scope:

From the theoretical and simulation study of various sequential pattern mining algorithms, we can say that PrefixSpan [9] is an efficient pattern growth method because it outperforms GSP [3], FreeSpan [7] and SPADE [5]. It is clear that PrefixSpan Algorithm is more efficient with respect to running time, space utilization and scalability then Apriori based algorithms

This is still an active research area with many unsolved challenges. Much more remains to be discovered in this young research field, regarding general concepts, techniques, and applications.

Most of the existing SPM algorithms work on objective measures:(i) support (ii) confidence

Support is use to eliminate uninteresting rule. Confidence measures reliability of the inference made by the rules.A rule (pattern) is interesting if (1) Unexpected: pattern which is extracted is surprising to the user (2) Actionable: user can utilize resultant pattern further. Researchers can identify novel objective measure which can make the pattern more interesting and can be helpful to identify emerging patterns.

Research can make in such a direction where algorithm should handle large search space by modification of existing algorithm or designing novel approach.

Algorithm should avoid repeated scanning of database during mining process which can improve efficiency of algorithm.

To design such a SPM algorithm which should be efficiently perform in distributed/parallel environment

Support

Time(ms)

Frequent sequence extracted

memory

PrefixSpan

SPAM

Apriori

PrefixSpan

SPAM

Apriori

PrefixSpan

SPAM

Apriori

0.1

8

47

41

40

812

812

6.4648

7.1279

6.9267

0.2

12

10

3

40

812

812

6.96739

5.4260

3.2847

0.3

9

7

2

9

812

812

5.43109

7.1868

3.7588

0.4

22

6

2

9

812

812

6.94976

4.4516

4.1296

0.5

6

2

1

9

53

53

4.3545

4.8832

4.5836

0.6

15

7

1

8

53

53

3.6764

2.8229

5.0346

Writing Services

Essay Writing
Service

Find out how the very best essay writing service can help you accomplish more and achieve higher marks today.

Assignment Writing Service

From complicated assignments to tricky tasks, our experts can tackle virtually any question thrown at them.

Dissertation Writing Service

A dissertation (also known as a thesis or research project) is probably the most important piece of work for any student! From full dissertations to individual chapters, we’re on hand to support you.

Coursework Writing Service

Our expert qualified writers can help you get your coursework right first time, every time.

Dissertation Proposal Service

The first step to completing a dissertation is to create a proposal that talks about what you wish to do. Our experts can design suitable methodologies - perfect to help you get started with a dissertation.

Report Writing
Service

Reports for any audience. Perfectly structured, professionally written, and tailored to suit your exact requirements.

Essay Skeleton Answer Service

If you’re just looking for some help to get started on an essay, our outline service provides you with a perfect essay plan.

Marking & Proofreading Service

Not sure if your work is hitting the mark? Struggling to get feedback from your lecturer? Our premium marking service was created just for you - get the feedback you deserve now.

Exam Revision
Service

Exams can be one of the most stressful experiences you’ll ever have! Revision is key, and we’re here to help. With custom created revision notes and exam answers, you’ll never feel underprepared again.