An Algorithm For Frequent Item Set Based English Language Essay

Published: Last Edited:

This essay has been submitted by a student. This is not an example of the work written by our professional essay writers.

Association rules are the main technique for data mining. The Apriori algorithm is a classical algorithm in mining association rules. With the time a number of changes proposed in Apriori to enhance the performance in term of time and number of database passes. For the two bottlenecks of frequent item sets mining: the large multitude of candidate 2-itemsets, the poor efficiency of counting their support. This paper main focus lies in the generation of frequent patterns which is the most important task in explanation of the fundamentals of association rule mining. This is done by analyzing the implementations of the well known association rule mining algorithms Apriori and Proposed algorithm Set operation for Frequent Item using Transaction database .

Keywords: Association rule; Frequent Itemset Mining ; date mining; KDD


Data mining is the core of knowledge discovery in databases. It's a procedure to find the useful and potentially knowledge in database. Association rules are one of the most important knowledge of data mining's result which can be defined as the relation and dependency between the itemsets by given support and confidence in database [1].

The mining method of association rule in the transaction databases is suitable for analytical databases [2]. Finding frequent item sets is one of the most investigated fields of data mining. The problem was first presented in [1]. The subsequent paper [3] is considered as one of the most important contributions to the subject. Its main algorithm, APRIORI, not only influenced the association rule mining community, but it affected other data mining fields as well.

APRIORI is not only an appreciated member of the FIMI community and regarded as a baseline algorithm, but its variants to find frequent sequences of itemsets [2], episodes, [12], Boolean formulas [9] and labeled graphs [10, 11] have proven to be efficient algorithms as well[3].

Two new algorithms for Association Rule Mining, Apriori and AprioriTid, along with a hybrid of the two algorithms, are described in the paper. The performance of these algorithms is shown to be from many times better for smaller datasets to many orders of magnitude better than the then current algorithms. The algorithms described in the paper represent a huge improvement over the state of the art in Association Rule Mining at the time. The new algorithms improve upon the existing algorithms by employing the following. Apriori and AprioriTid reduces the number of itemsets to be generated each pass by reducing the number of candidate itemsets.

AprioriTid uses an encoding of the database for each pass instead of using the complete database. This greatly reduces the overhead cost associated with making passes over the dataset especially in later passes.

Apriori Hybrid combines the features of the two aforementioned algorithms. It employs the Apriori algorithm for mining for earlier passes but switches to AprioriTid when the size of the encoding is expected to fit into memory.[4]


Problem Decomposition and paper organization

The problem of discovering all association rules can be decomposed into two sub problems [4]:

Find all sets of items (itemsets) that have transaction support above minimum support. The support for an itemset is the number of transactions that contain the itemset. Itemsets with minimum support are called large itemsets, and all others small itemsets. In section V we present experimental overview of modified Apriori.

Use the large itemsets to generate the desired rules. Here is a straightforward algorithm for this task. For every large itemset 1, find all non-empty subsets of 1. For every such subset a, output a rule of the form a (1 - a) if the ratio of support (l) to support (a) is at least minconf .We need to consider all subsets of 1 to generate rules with multiple consequents. Due to lack of space, we do not discuss this sub problem further, but refer the reader to [10] for a fast algorithm.

In Section 3, we show the discovery of association rule in large database. We also describe the classical Apriori, AprioriTid. We also describe how the Apriori and AprioriTid algorithms can be combined into a hybrid algorithm, AprioriHybrid, in section 4. In section VI we show the relative performance of the proposed improved-Apriori algorithms against the Apriori [5] and AprioriTid [4] algorithms, which is a combination of Apriori algorithm and transaction space with frequent Itemset tree. We conclude by pointing out some related problems in Section VII.

Association Rule Mining

The discovery of association rules in large databases was first described by Agrawal et al. (1993). The initial motivation for association rules was to aid in the analysis of large transaction databases, such as those collected by supermarkets. The discovery of associations between the purchases of various line items can potentially aid decision making within retail organizations. Transaction databases are therefore the primary domain targeted for association rule discovery. [4]

The initial algorithm, AIS, proposed by Agrawal et al, operates by looking for itemsets with the required support within the database, called large itemsets. Trying to find all of the large itemsets in a single pass over the data is inefficient because most of the itemsets that would be measured are not large. The AIS technique reduces this inefficiency by making several passes over the data and using the results of the current pass to produce candidate large itemsets to evaluate in the next pass. This process of discovering large itemsets continues until no more large itemsets are found. This approach however has proven to produce too many small itemsets as candidates for large itemsets. This is because it produces candidate itemsets by extending the existing large itemsets using itemsets that are not large. Mannila et al. (1993, 1994) describe an algorithm that is faster than the Agrawal et algorithm, and has demonstrated a performance improvement of a factor of five in testing examples. The efficiency gain in their algorithm is based upon the observation that subsets of large sets must also be large. The number of candidate sets can therefore be dramatically reduced by using only large itemsets in the construction of other candidate large itemsets. They also point out the viability of sampling as an efficient method for finding large itemsets. However, as expected, sampling techniques result in a trade-off in accuracy. As a result of the same observation as Mannila et al, Agrawal and Srikant propose two alternative algorithms, Apriori and AprioriTid, both of which are more efficient than AIS (Agrawal and Srikant 1994)[5]

The association rule mining is a two-step process:

1. Find all the frequent itemsets in the transaction database. If support of itemset X,, support(X)≥minsup, then X is a frequent itemset. Otherwise, X is not a frequent itemset.

2. Generate strong association rules from frequent itemsets. For every frequent itemset A, if BA, B ≠ , and support (A)/support (B) ≥ minconf, then we have association rule B (A-B). The second step is relatively easy, and its algorithm generation can be found in the reference.

The present focuses in research are to find highly efficient algorithms in the first step[6]. Apriori and AprioriTid generate itemsets by using only the large itemsets found in the previous pass, without considering the transactions. AprioriTid improves Apriori by only using the database at the first pass. Counting in subsequent passes is done using encodings created in the first pass, which is much smaller then the database. This leads to a dramatic performance improvement of three times faster than AIS and four times faster than SETM in one of their experiments in [4].

AprioriTid uses the database only once and use apriori-gen to generate candidates. Database D is not used for counting support after the first pass. The set is Ck'used, for this purpose. Elements of Ck'are in the form <TID, {Xk}> where each Xk is a potentially large k-itemset present in the transaction with identifier TID. The member of Ck' corresponding to transaction t is <t.TID, {c Î Ck' | c contained in t}> [4] [7].

Apriori-hybrid algorithm

It is not necessary to use the same algorithm in all the passes over data. Figure 4.1 shows the execution times for Apriori and AprioriTid for different passes over the dataset T10.14.DlOOK. In the earlier passes, Apriori does better than AprioriTid. However, AprioriTid beats Apriori in later passes. We observed similar relative behaviour for the other datasets, the reason for which is as follows. Apriori and AprioriTid use the same candidate generation procedure and therefore count the same itemsets. In the later passes, the number of candidate itemsets reduces. However, Apriori still examines every transaction in the database. On the other hand, rather than scanning the database, AprioriTid scans Ck' for obtaining support counts, and the size of Ck' has become smaller than the size of the database. When the Ck' sets can fit in memory, we do not even incur the cost of writing them to disk.

Figure 4.1: Per pass execution times of Apriori and

AprioriTid (T10.14.DlOOK, minsup = 0.75%)

Based on these observations, we can design a hybrid algorithm, which we call AprioriHybrid, that uses Apriori in the initial passes and switches to AprioriTid when it expects that the set Ck' at the end of the pass will fit in memory. We use the following heuristic to estimate if Ck' would fit in memory in the next pass. At the end of the current pass, we have the counts of the candidates in Ck. From this, we estimate what the size of Ck' would have been if it had been generated. This Size, in words, is (c Ck Support(c) + number of transactions). If Ck' in this pass was small enough to fit in memory, and there were fewer large candidates in the current pass than the previous pass, we switch to AprioriTid. The latter condition is added to avoid switching when Ck' in the current pass fits in memory but Ck' in the next pass may not.

Switching from Apriori to AprioriTid does involve a cost. Assume that we decide to switch from Apriori to AprioriTid at the end of the kth pass. In the (k + 1)th pass, after finding the candidate itemsets contained in a transaction, we will also have to add their IDS to Ck+1. Thus there is an extra cost incurred in this pass relative to just running Apriori. It is only in the (k+2)th pass that we actually start running AprioriTid. Thus, if there are no large (k + 1)th itemsets, or no (k+ 2)-candidates, we will incur the cost of switching without getting any of the savings of using AprioriTid.[4]. Now, Fig 4.2 show the execution times for Apriori,AprioriTid and AprioriHybrid for different passes over the dataset T10.14.DlOOK [4]

In this section we describe our new approach for frequent Itemset mining is described. This approach is a combination of Apriori algorithm and frequent Itemset lattice.

Our algorithm is a combination of Apriori, which is the most popular mining algorithm, and set operations. Principles of set operations, which are intersection and union, are used. These principles are related to frequent Itemset tree. In frequent Itemset tree, there are nodes holding frequent itemsets and transactions containing related itemsets.

In order to construct k-itemsets, frequent (k-1)-itemsets are used. Their union is formed and for their support count and intersection operation is employed between the Tids of the itemsets.

Itemset {A} is in transactions with Tid 1,3,4,5 and {B} is in transactions with Tid 1,2,3,4,5,6, i.e. T(A)={1,3,4,5} and T(B)={1,2,3,4,5,6}. The Itemset {A, B} is the union of these two itemsets and intersection principles is used to find the tids for {A, B} as follows:

T(AB) = T(A) T(B) ={1,3,4,5}{1,2,3,4,5,6}

= {1, 3, 4, 5}

If the result is greater than minimum support, it will be joined to frequent Itemset tree. If the result is lower than minimum support, it will be pruned off.

Now, steps of the proposed Algorithm "Set Operation for frequent Itemset using Transcation Database" SFIT:

Scan database and find frequent 1-itemsets, at the same time obtain transaction sets, which includes the Itemset. (Assume that the frequent 1-itemsets and transaction sets, require no more memory that available and also there is space for generating candidate 2-itemsets from frequent 1-itemset.

Generate candidate 2-itemsets from frequent 1-itemset only.

Prune off the candidate 2-itemsets whose node count is lower than min support using their Tidset. Now frequent Itemset tree contains only frequent 2-itemsets at the second level.

Consequently, for each frequent 3, 4… n-Itemset, scan the database to approve the consistence of the Itemset.

Finally, Itemset are used to generate strong rules having minimum confidence in the frequent Itemset tree.

This is an example based on the transaction database, D, of Figure 5.1. There are four transactions in this database, that is, |D|=4. We use the improved Apriori algorithm for finding frequent itemsets in D, based on the set operation Intersection.

Below Fig shows the example implementation of Apriori and proposed Apriori SFIT.

Implementation and Analysis

The target of this section is implementing Apriori and DIC. Our concerns include comparison of frequent item sets obtained, and factors that affect the efficiency of the algorithms.

By this means, we intend to state the advantages of SFIT over Apriori.

Environment and Database

The programming language we use to implement

the algorithms is .NET C# 2008.

Result analysis

For the experiment we have used datasets of different application. These datasets was obtained from the UCI repository of machine learning databases [CC98]. The Table below portraits the characteristics of the datasets selected for the experiment.

To study the performance of the algorithms, data sets transactions and support threshold with 30% to 70% were used. The Table shows that the execution time of the algorithms decreases with varying support thresholds for adult data set, hepatitis and heart.


In this paper, the Modified Apriori algorithm SFIT, based on set operation is proposed to update the classical Apriori algorithm. Through pruning candidate itemsets by the infrequent itemsets, the present algorithm can reduce the number of database scanning and the redundancy while generating subtests and verifying them in the database. Validated by the experiments, it can obtain better efficiency. The presented new approach for mining frequent itemsets is a bottom-up level wise approach that utilizes both Item set space and transaction space.