# Frequent Pattern Mining Techniques Computer Science Essay

Published:

This essay has been submitted by a student. This is not an example of the work written by our professional essay writers.

Abstract-Frequent pattern mining has been a focused theme in data mining research for over a decade. Lots of techniques have been proposed to improve the performance of frequent pattern mining algorithms. This paper presents review of different frequent mining techniques. With each technique, we provided brief description of the technique. At the end, we compare different frequent pattern mining techniques.

Keywords-Data mining, Frequent pattern mining, Association rules, Dynamic itemset counting

## Introduction

Frequent patterns are itemsets, subsequences, or substructures that appear in a data set with frequency no less than a user-specified threshold. Frequent pattern mining is a first step in association rule mining. One of the major uses with association rules is to analyze large amount of supermarket basket transactions [1-3]. Recently, association rules have been applied to other areas like outliers detection,

classification, clustering etc [4, 6, 8].

Association rules mining can formally be defined as follows. Let I = {i1, i2, i3, …, im} be a set of attributes called items. Let D be a set of transactions. Each transaction t Ð„ D consists of a set of items such that t ⊆ I. A transaction t is said to contain an item set X if and only if all items within X are also contained in t. Each transaction also contains a unique identifier called transaction identification

(TID). Support of an item set is normalized number of occurrences of the item set within the dataset. An item set is considered as frequent or large, if the item set has a support that is greater or equal to the user specified minimum support [25-27].

The most common form of association rules is implication rule which is in the form of X ⇒ Y, where X ⊂ I, Y ⊂ I and X ∩ Y = ∅. The support of the rule X ⇒ Y is equal to the percentage of transactions in D containing X ∪ Y. The confidence of the rule X ⇒ Y is equal to the percentage of transactions in D containing X also containing Y. Once the required minimum support and confidence are specified, association rule mining becomes finding all association rules that satisfy the minimum support requirements. The problem can be further broken down into two steps: mining of frequent item sets and generating association rules.

The number of possible combinations of Item sets increases exponentially with I and the average transaction length. Therefore it is infeasible to determine the support of all possible item sets. When counting the supports of item sets, there are two strategies. The first strategy is to count the occurrences directly, whenever an item set is contained in a transaction, the occurrence of the item set is increased. The second strategy is to count the occurrences indirectly by intersecting TID set of each component of the item set. The TID set of a component X, where X can be either item or item set, is denoted as X.TID. The support of an item set S = X⊂Y is obtained by intersecting X.TID ∩ï€ Y.TID =S.TID and the support of S equals S.TID [28, 29].

## Apriori-based Algorithms

The first published frequent item set mining algorithm is Apriori [1]. Apriori uses breadth first search (BFS). At each level, Apriori reduces the search space by using downward closure property of item set. If an item set of length k is not frequent, none of its superset patterns can be frequent. Candidate frequent item sets, Ck where k is the length of the item set, are generated before each data scan. The supports of candidate frequent item sets are counted. Candidate k item sets, Ck, are generated with frequent k - 1 item sets. Apriori algorithm achieves good performance by reducing the candidate item sets iteratively. The problem, however, associated with Apriori is it requires k data scans to find all frequent k-item sets. It is very much expensive to scan the

large data base.

Dynamic Itemset Counting, (DIC) relaxes the strict separation between generating and counting of itemsets [4]. DIC starts counting the support of candidate frequent itemsets as soon as they are being generated. By overlapping counting and candidate itemset generation, DIC reduces the overall data scans required. Orlando et al. [13] proposed analgorithm that combines transaction reduction and direct data access. At the end of each scan, transactions that are potentially useful are used for the next iteration. A technique called scan reduction uses candidate 2 itemsets to generate subsequent candidate itemsets [12]. If all intermediate data can be held in the main memory, only one scan is required to generate all candidate frequent itemsets. Another data scan is required to verify whether the candidate frequent itemsets are frequent or not.

With all of those improvements, the number of data scans required by Apriori based algorithms has been reduced significantly. However, the cost of generating candidate frequent itemsets has not been fully addressed by Apriori based algorithms. This problem becomes visible when there are huge numbers of frequent 1 or 2 itemsets.

## Partition-based Algorithms

Partition-based Algorithms [15] solves the problem of high number of database scans, associated with Apriori-based algorithm. It requires two complete data scan to mine frequent itemsets. The Partition algorithm divides the dataset into many subsets so that each subset can be fitted into the main memory. The basic idea of the Partition-based algorithm is that a frequent itemset must be frequent in at least one subset. Partition-based algorithm generates local frequent itemsets for each partition during the first data scan. Since the whole partition can be fitted into the main memory, the complete local frequent itemsets can be mined without any disk I/O operations. The local frequent itemsets are added to the global candidate frequent itemsets. In the second data scan, false candidates are removed from the global candidate frequent itemsets. In a special case where each subset contains identical local frequent itemsets, Partition algorithm can mine all frequent itemsets with a single data scan. However, when the data is distributed unevenly across different partitions, this algorithm may generate a lot of false candidates from a small number of partitions. By employing the knowledge collected during the

mining process, false global candidate frequent itemsets are pruned when they are found that they cannot be frequent. In addition, those algorithms reduce the number of scans in the worse case to (2b- 1)/b where b is the number of partitions.

## DFS and Hybrid Algorithms

Eclat and Clique [16] combine both depth first search (DFS) and intersection counting. Since intersection counting is used, no complicated data structure is required. These hybrid algorithms reduces the memory requirement, since only the TID sets of the path itemsets from the root to the leaves have to be kept in the memory simultaneously. Intersection of TID sets can be stopped as soon as the remaining length of the shortest TID set is shorter than the required support minus the counted support value. The intersection of TID sets of 1-itemset to generate frequent 2 itemsets is expensive.

The maximal hypergraph clique clustering is applied to 2-frequent itemsets to generate a refined set of maximal itemsets. Hipp et al. [10] pointed out that DFS cannot prune candidate k itemsets by checking frequent k - 1 itemsets, because DFS searches from the root to the leaves of the tree without using any subsets relationship. A hybrid approach of BFS and DFS is proposed in [11]. It is cheaper to use itemset counting with BFS to determine the supports, when the number of candidate frequent itemsets is small. When the number of candidate frequent itemsets is relatively large, the hybrid algorithm switches to TID set intersection with DFS, since simple TID set intersection is more efficient than occurrence counting when the number of candidate frequent

itemsets is relatively large. This results in additional costs to generate TID sets. The authors proposed [11] to use hash-tree-like structure to minimize the cost of transition. However, the authors do not provide an algorithm to determine the best condition to switch the strategy. In the evaluation, the authors provide parameters to change in strategy. However, those parameters may not be generalized enough for all kinds of datasets. Incorrect timing of changing strategy may decrease the performance of hybrid algorithm.

## Pattern-Growth Algorithms

Two major costs of Apriori based algorithms are the cost to generate candidate frequent itemsets and the cost associated with I/O operations. The issues related to I/O have been addressed, but the issues related to candidate frequent itemsets generation remain open. If there are n frequent 1 itemsets, Apriori based algorithms would require to generate approximately n2/2 candidate frequent itemsets. Secondly, the memory required to hold the candidate frequent itemsets and their supports could be substantial. For example, when n equals 10,000, there would be more than 108 length 2 candidate frequent itemsets. If we assume that it requires 4 bytes to hold the support and 4 bytes to hold the itemsets, approximately 0.5 gigabytes of main memory would be needed to store the information [18]. Furthermore, the memory required does not include the overhead associated with the data structure. Also the cost required to count the support of candidate itemsets may be large. As far as Apriori-based algorithms are concerned, the run time increases as the support value decreases. Therefore, the cost of candidate frequent itemsets generation of Apriori based algorithms will exceeds than the cost of I/O [24]. Han et al. [9] proposed a data structure called frequent pattern tree or FPTree. FP-growth mines frequent itemsets from FP-Tree without generating candidate frequent itemsets. FP-Tree is an extension of prefix tree structure.

Only frequent items get stored in the tree. Each node contains the item's label along with its frequency. The paths from the root to the leaves are arranged according to the support value of the items with the frequency of each parent is greater than or equal to the sum of its children's frequency. The construction of FP-Tree requires two data scans. In the first scan, the support value of each item is found. This calculated support values are used in the second

scan to sort the items within transactions in descending order. If two transactions share a common prefix, the shared portion is merged and the frequencies of the nodes are incremented accordingly. Nodes with the same label are connected with an item link. The item link is used to facilitate frequent pattern mining. In addition, each FP-Tree has a header that contains all frequent items and pointers to the beginning of their respective item links. FP-growth partitions the FP-Tree based on the prefixes. FP-growth traverses the paths of FP-Tree recursively to generate frequent itemsets. Pattern fragments are concatenated to ensure all frequent itemsets are generated properly. Thus FP-growth avoids the costly operations for generation and testing operations of candidate itemsets.

When the data is sparse, the compression achieved by the FP-Tree is small and the FPTree is bushy. As a result, FP-growth would spend a lot of effort to concatenate fragmented patterns with no frequent itemsets being found.

A new data structure called H-struct is introduced in [14]. In this, transactions are sorted with an arbitrary ordering scheme. Only frequent items are projected in the H-struct. H-struct consists of projected transactions and each node in the projected transactions contains item label and a hyper link pointing to the next occurrence of the item. A header table is created for H-struct. The header contains frequencies of all items, their supports and hyper link to the first transaction containing given item. H-mine mines the H-struct recursively by building a new header table for each item in the original header with subsequent headers omitting

items that have been mined previously. For each subheader, H-mine traverses the H-struct according to the hyper links and finds frequent itemsets for the local header. At the same time, H-mine builds links for items that have not been mined in the local header. Those links are used to find conditional frequent patterns within the local header. The process is repeated until all frequent itemsets have been mined. In case of a dense dataset, H-struct is not as efficient as FP-Tree because FP-Tree allows compression.

## Incremental Update with Apriori-based Algorithms

Complete dataset is normally huge and the incremental portion is relatively small compared to the complete dataset. In many cases, it is not feasible to perform a complete data mining process while transactions are being added continuously. Therefore, incremental data mining algorithms have to reuse the existing information as much as possible, so that either computational cost and/or I/O cost can be reduced.

A general incremental mining algorithm called Fast Update 2 (FUP2), that allows both addition and deletion of transactions was proposed in [7]. The major idea of FUP2 is to reduce the cost of candidate frequent itemsets generation. Incremental portion of the dataset is scanned; frequent patterns in the incremental data are compared with the existing frequent itemsets in the original dataset. Previous frequent itemsets are removed if they are no longer frequent after the incremental portion of the data is added or removed. The supports of previous frequent itemsets that are still frequent are updated to reflect the changes. In those ways, previous frequent itemsets that are still frequent are not required to be checked for their supports again. New k + 1 candidate frequent itemsets are generated from frequent k itemsets. The entire updated dataset is scanned to verify those newly added candidate itemsets if they are indeed frequent. The process is repeated until the set of candidate frequent itemset becomes empty. FUP2 offers some benefits over the original Apriori algorithm. However, it still requires multiple scans of the dataset.

Another incremental Apriori based algorithm is called Sliding Window Filtering (SWF) [12]. SWF incorporates the main idea of Partition algorithm with Apriori to allow incremental mining. SWF divides the dataset into several partitions. During the scan of partitions, a filtering threshold is employed in each partition to generate candidate frequent 2 itemsets. When a candidate 2 itemset is found to be frequent in the newly scanned partition, the partition number and the frequency of the itemset are stored. Cumulative information about candidate frequent 2 itemsets is selectively carried over toward subsequence partition scans. Cumulative frequencies of previous generated candidate frequent 2 itemsets are maintained as new partitions are being scanned. False candidate frequent itemsets are pruned when the cumulative support of the candidate frequent itemsets fall below required proportional support since they have become frequent. Once incremental portion of the dataset is scanned, scan reduction techniques are used to generate all subsequence candidate frequent itemsets [5]. Another data scan over the whole dataset is required to confirm the frequent itemsets. In the case of data removal, the partition to be removed are scanned, the cumulative count and the start partition number of candidate length 2 itemsets are modified accordingly. Although SWF achieves better performance than pervious algorithms, the performance of SWF still depends on the selection of partition size and removal of data can only be done at partition level.

## SQL-based algorithms

DBMS can facilitate data mining to become an online, robust, scalable and concurrent process by complementing the existing querying and analytical functions. The first attempt to the particular problem of integrated frequent itemset mining was the SETM algorithm [10, 17], expressed as SQL queries working on relational tables. The Apriori algorithm [1] opened up new prospects for FIM. The databasecoupled variations of the Apriori algorithm were carefully examined in [19]. The SQL-92 based implementations were too slow, but the SQL implementations enhanced with object-relational extensions (SQL-OR) performed acceptable. The so called Cache-Mine implementation had the best overall performance, where the database-independent mining algorithm cached the relevant data in a local disk cache [21-23].

SQL based frequent mining using FP-tree provide best performance than other SQL based techniques [20]. Although an FP-tree is rather compact, it is unrealistic to construct a main memory based FP-tree when the database is large. However using RDBMSs provides us the benefits of using their buffer management systems specially developed for freeing the user applications from the size considerations of the data. And moreover, there are several potential advantages of building mining algorithms to work on RDBMSs.

An interesting alternative is to store a FP-tree in a table. There are two approaches in this category - FP, EFP (Expand Frequent Pattern). They are different in the construction of frequent pattern tree table, named FP. FP approach checks each frequent item whether it should be inserted into a table FP or not one by one to construct FP. EFP approach introduces a temporary table EFP, thus table FP can generate from EFP. According to the properties of FP-tree, FP-tree can be presented by a table FP with three column attributes: item identifier (item), the number of transactions that contain this item in a subpath (count), and item prefix sub-tree (path). The field path is beneficial not only to construct the table FP but also to find all frequent patterns from FP. In the construction of table FP, the field path is an important condition to judge if an item in frequent

item table F should be insert into the table FP or update the table FP by incrementing the item's count by 1. If an item does not exist in the table FP or there exist the same items as this item in the table FP but their corresponding path are different, insert the item into table FP. In the process of constructing conditional pattern base for each frequent item, only need to derive all its path in the table FP as a set of conditional paths, which co-occurs with it.

## COMPARISON OF VARIOUS FREQUENT PATTERN MINING TECHNIQUES

Comparison of different FPM techniques is given in Table 1, where A is length of maximal frequent itemset and B is number of partitions. As Shown In the table, various algorithms are compared against four parameters, number of database scans required for the generation of frequent itemset, the candidate generation technique used, whether the frequent item generation approach is incremental or not, and how the algorithm is sensitive to the change in user parameters.

Apriopi-based methods use efficient technique for pruning the candidate itemsets, but they require lots of computational time as well as multiple database scans to generate candidate itemsets. Partition-based methods limit the size of candidate itemsets. Partition algorithm may generate a lot of false candidates from a small number of partitions. FP-Tree based methods require only two database scans in order to generate frequent patterns. These methods use a compact treestructure to represent the entire database. They do not require candidate generation, reducing the computational cost

Conclusion

Frequent pattern mining is first step for association rule mining. There is large number of applications in which frequent pattern mining is necessary like market basket analysis, click stream analysis, DNA analysis etc. Various techniques have been found to mine frequent patterns. Each technique has it's own characteristics. Performance of particular technique depends on input data and available resources. Technique which performs better for all frequent itemset mining using less memory and less time is the best technique for frequent pattern mining. Among all of the techniques mentioned above, FPTree based approach achieves better performance by requiring only two database scans hence reducing the computational time. It takes less memory by representing large database in compact tree-structure.