Analysis of Association Rule Learning


Machine learning refers to a system capable of acquiring and integrating knowledge automatically. The capability of such systems to learn from experience, training, analytical observation, and other means results in a system that can continuously self-improve and thereby exhibit greater efficiency and effectiveness. A machine learning system usually starts with some knowledge and a corresponding knowledge organization so that it can interpret, analyze, and test the knowledge acquired.

A major focus of machine learning research is the design of algorithms that recognize complex patterns and make intelligent decisions based on input data. In this project we analyze and compare various algorithms used in association rule learning, performing the comparison based on execution time and memory usage for a large number of real and synthetic data samples.

Keywords: Machine Learning, Association Rule Learning, Knowledge.

1. Introduction

With the advancement of IT technologies, the amount of accumulated data is also increasing, resulting in large amounts of data stored in databases, warehouses, and other repositories. Learning algorithms are therefore needed to analyse these databases and extract interesting, previously unknown patterns and rules, a task known as association rule mining.

"Machine Learning" is a branch of artificial intelligence is a discipline geared toward the technical development of human knowledge.It generates rules based on the data that had fed into it.It provides systems with the ability to learn without being explicitly programmed. It search through the data and look for patterns.

"Unsupervised learning" is one of the categories of Machine Learning which models a set of inputs. It is a useful structure without labeled classes, optimization criterion, feedback signal, or any other information beyond the raw data. It refers to the problem of trying to find hidden structure in unlabeled data.Learning in which the system parameters are adapted using only the information of the input and are constrained by prespecified internal rules.

"Association rule learning" is a method for discovering interesting relations between variables in large databases and falls under "unsupervised learning". A major focus of machine learning research is the design of algorithms that recognize complex patterns and make intelligent decisions based on input data.

Association rules are employed today in many application areas, including web usage mining, intrusion detection, and bioinformatics. The agent-based approach to web mining involves the development of sophisticated AI systems that can act autonomously or semi-autonomously on behalf of a particular user to discover and organize web-based information.

Association rules are one of the major techniques for finding frequent patterns, associations, and correlations among sets of items or objects in transaction databases, relational databases, and other information repositories. The volume of data is increasing dramatically as data is generated by day-to-day activities. Learning association rules from massive amounts of data is therefore of interest to many industries, as it can support decision-making processes such as cross-marketing, market basket analysis, and promotion assortment.

1.1 General Structure

Association rules are usually required to satisfy a user-specified minimum support and a user-specified minimum confidence at the same time. Association rule generation is usually split up into two separate steps:

First, minimum support is applied to find all frequent itemsets in a database.

Second, these frequent itemsets and the minimum confidence constraint are used to form rules.

Let I=\{i_1, i_2,\ldots,i_n\} be a set of n binary attributes called items. Let D = \{t_1, t_2, \ldots, t_m\} be a set of transactions called the database. Each transaction in D has a unique transaction ID and contains a subset of the items in I. A rule is defined as an implication of the form X \Rightarrow Y where X, Y \subseteq I and X \cap Y = \emptyset. The sets of items (for short itemsets) X and Y are called antecedent (left-hand-side or LHS) and consequent (right-hand-side or RHS) of the rule respectively.

The support \mathrm{supp}(X) of an itemset X is defined as the proportion of transactions in the data set which contain the itemset.

The confidence of a rule is defined as \mathrm{conf}(X\Rightarrow Y) = \mathrm{supp}(X \cup Y) / \mathrm{supp}(X).
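For a small worked example (the transactions here are hypothetical, chosen only to illustrate the definitions), let D contain five transactions over I = \{milk, bread, butter\}: \{milk, bread\}, \{milk, bread, butter\}, \{milk\}, \{bread\}, and \{butter\}. Then \mathrm{supp}(\{milk, bread\}) = 2/5 = 0.4 and \mathrm{supp}(\{milk\}) = 3/5 = 0.6, so \mathrm{conf}(\{milk\} \Rightarrow \{bread\}) = 0.4/0.6 \approx 0.67; that is, about 67% of the transactions containing milk also contain bread.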

1.2 Methodology

The existing algorithms have various strengths and weaknesses and different performance levels for different types of data. In this project, the performance of these algorithms is investigated and compared. Pruning unimportant patterns is very important in an intelligent decision-making process.

2. Theory and Implementation

Machine learning, as a branch of artificial intelligence, should be able to find relationships, correlations, and association rules hidden within large amounts of data. When the database is very large, finding the association rules and relationships becomes a problem. For this purpose, many algorithms have been proposed to efficiently find the association rules among the data.

Frequent itemsets are central to association rules. The various approaches to generating frequent itemsets are divided into three basic categories:

Horizontal layout based techniques:
- Apriori algorithm
- DHP algorithm
- Partition algorithm
- Sampling algorithm
- Dynamic itemset counting (DIC)
- Improved Apriori algorithm

Vertical layout based techniques:
- Eclat algorithm

Projected layout based techniques:
- FP-tree algorithm
- COFI-tree algorithm
- CT-PRO algorithm
- H-mine algorithm

2.1 Algorithms from Horizontal Layout Database

Definition: In the horizontal layout, each row of the database represents a transaction with a transaction identifier (TID), followed by the set of items it contains.

2.1.1 Apriori Algorithm

Apriori is a classic algorithm for learning association rules. It uses breadth-first search and a tree structure to count candidate itemsets efficiently, and it is the seminal algorithm for finding frequent itemsets using candidate generation. By convention, Apriori assumes that items within a transaction or itemset are sorted in lexicographic order. It is characterized as a level-wise complete search algorithm using the anti-monotonicity of itemsets. Experimenting with an Apriori-like algorithm is often the first thing data miners try.

Let X, Y ⊆ I be any two itemsets. If X ⊆ Y, then sup(X) ≥ sup(Y), which leads to the following two corollaries:

If X is frequent, then any subset Y ⊆ X is also frequent.

If X is not frequent, then any superset Y ⊇ X cannot be frequent.

Start
L1 = {frequent 1-itemsets};
for (k = 2; L_{k-1} != ∅; k++) do begin
    C_k = candidates generated from L_{k-1}
          (join L_{k-1} with itself and eliminate any candidate
          that contains an infrequent (k-1)-itemset);
    for each transaction t in the database do
        increment the count of all candidates in C_k
        that are contained in t;
    L_k = candidates in C_k with support >= min_sup;
end
return ∪_k L_k;

2.2 Algorithms from Vertical Layout Database

Definition: In the vertical layout, each column corresponds to an item, followed by a TID list: the list of transactions (rows) in which the item appears.

2.2.1 Eclat Algorithm

Eclat uses a transaction-id-set (tid-set) intersection approach. Unlike Apriori and the FP-tree algorithm, Eclat uses a vertical database, i.e., it associates tids with each itemset. Frequent patterns are identified by joining itemsets and taking the intersection of their tid sets. It generates a larger number of candidates than Apriori because it joins only two sets at a time. A reordering step takes place at each recursion point to reduce the candidate itemsets. Eclat utilizes the aggregate memory of the system by partitioning the candidates into disjoint sets using equivalence-class partitioning.

Begin Eclat
/* Initialization Phase */
Scan local database partition
Compute local counts for all 2-itemsets
Construct global L2 counts

/* Transformation Phase */
Partition L2 into equivalence classes
Schedule L2 over the set of processors P
Transform local database into vertical form
Transmit relevant tid-lists to other processors
Local L2 = receive tid-lists from other processors

/* Asynchronous Phase */
For each equivalence class E2 in Local L2
    ComputeFrequent(E2)

/* Final Reduction Phase */
Aggregate results and output associations
End Eclat

Begin ComputeFrequent(E_{k-1})
For all itemsets I1 and I2 in E_{k-1}
    if (|I1.tidlist ∩ I2.tidlist| >= minsup)
        add (I1 ∪ I2) to L_k
Partition L_k into equivalence classes
For each equivalence class E_k in L_k
    ComputeFrequent(E_k)
End ComputeFrequent
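The recursive core of Eclat is the tid-list intersection. A minimal sequential Java sketch follows (a simplified illustration, not the parallel scheme above; the names EclatSketch and run, and the representation of the vertical database as a map from item to tid set, are assumptions for the example):

import java.util.*;

public class EclatSketch {

    // itemTids: vertical layout, item -> set of transaction ids (tid list).
    public static Map<List<Integer>, Integer> run(Map<Integer, Set<Integer>> itemTids,
                                                  int minSupport) {
        Map<List<Integer>, Integer> frequent = new HashMap<>();
        eclat(new ArrayList<>(), null, new ArrayList<>(itemTids.entrySet()),
              minSupport, frequent);
        return frequent;
    }

    // Recursively extend `prefix` with each remaining item whose tid-set
    // intersection still meets minSupport; record every frequent itemset found.
    private static void eclat(List<Integer> prefix, Set<Integer> prefixTids,
                              List<Map.Entry<Integer, Set<Integer>>> items,
                              int minSupport,
                              Map<List<Integer>, Integer> frequent) {
        for (int i = 0; i < items.size(); i++) {
            // Intersect the running tid set with this item's tid list.
            Set<Integer> tids = new HashSet<>(items.get(i).getValue());
            if (prefixTids != null) tids.retainAll(prefixTids);
            if (tids.size() < minSupport) continue;

            List<Integer> itemset = new ArrayList<>(prefix);
            itemset.add(items.get(i).getKey());
            frequent.put(itemset, tids.size());

            // Only items after position i may extend this itemset,
            // playing the role of the equivalence-class partitioning.
            eclat(itemset, tids, items.subList(i + 1, items.size()),
                  minSupport, frequent);
        }
    }
}

Restricting the recursion to items after position i ensures each itemset is generated exactly once, which is what the equivalence-class partitioning achieves in the parallel formulation.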

2.3 Algorithms from Projected Layout Database

Definition: In the projected layout, each record is stored as a record identifier followed by the items of that record, and mining proceeds over progressively smaller projections of these records.

2.3.1 H-mine Algorithm

H-mine is an improvement over the FP-tree algorithm: in H-mine, the projected database is created using in-memory pointers. H-mine uses a new data structure called H-struct, a hyperlinked structure, for mining. H-mine has high performance, is scalable for many kinds of data, and has a very limited and precisely predictable main-memory overhead.

Input: A transaction database TDB and a support threshold min_sup.

Output: The complete set of frequent patterns.

Method:

Step 1. Scan the transaction database TDB once to find L, the complete set of frequent items.

Step 2. Partition TDB into k parts, TDB_1, ..., TDB_k, such that, for each TDB_i (1 ≤ i ≤ k), the frequent-item projections in TDB_i can be held in main memory.

Step 3. For i = 1 to k, use H-mine(Mem) to mine frequent patterns in TDB_i with respect to the minimum support threshold min_sup_i = [min_sup × n_i / n], where n and n_i are the numbers of transactions in TDB and TDB_i, respectively. Let F_i be the set of frequent patterns in TDB_i.

Step 4. Let F = ∪_{i=1}^{k} F_i. Scan TDB one more time and collect the supports of the patterns in F. Output those patterns that pass the minimum support threshold min_sup.
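Steps 2-4 can be outlined in Java as follows (a hedged sketch of the partition-and-merge control flow only; the names HMineOutline and InMemoryMiner are assumptions, and the in-memory miner stands in for H-mine(Mem), whose H-struct internals are not reproduced here):

import java.util.*;

public class HMineOutline {

    // Stand-in for H-mine(Mem): any in-memory frequent-pattern miner over one
    // partition would do (e.g., the Apriori sketch above).
    interface InMemoryMiner {
        Set<List<Integer>> mine(List<Set<Integer>> partition, int minSupportCount);
    }

    // Mine each partition with a proportionally scaled threshold, take the
    // union F of the local results, then rescan the whole database once to
    // keep only the globally frequent patterns.
    public static Set<List<Integer>> minePartitioned(
            List<List<Set<Integer>>> partitions,
            double minSupportFraction,
            InMemoryMiner mineMem) {
        int n = 0;
        for (List<Set<Integer>> part : partitions) n += part.size();

        // Step 3: locally frequent patterns per partition, threshold min_sup * n_i.
        Set<List<Integer>> candidates = new HashSet<>();
        for (List<Set<Integer>> tdbI : partitions) {
            int localMinSup = (int) Math.ceil(minSupportFraction * tdbI.size());
            candidates.addAll(mineMem.mine(tdbI, localMinSup));
        }

        // Step 4: one more global scan collects exact supports for F.
        Map<List<Integer>, Integer> counts = new HashMap<>();
        for (List<Set<Integer>> tdbI : partitions)
            for (Set<Integer> t : tdbI)
                for (List<Integer> c : candidates)
                    if (t.containsAll(c))
                        counts.merge(c, 1, Integer::sum);

        int globalMinSup = (int) Math.ceil(minSupportFraction * n);
        Set<List<Integer>> result = new HashSet<>();
        for (Map.Entry<List<Integer>, Integer> e : counts.entrySet())
            if (e.getValue() >= globalMinSup)
                result.add(e.getKey());
        return result;
    }
}

The outline relies on the pigeonhole observation behind Step 3: any globally frequent pattern must be locally frequent, at the scaled threshold, in at least one partition, so the union F cannot miss a true answer and the final scan only filters out false positives.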

2.4 Inputs to the Project

2.4.1 Data Representation

'I' = {i1, i2, i3, i4, …, in} is a set of items, where 'n' is the dimensionality of the problem.

'D' is the database.

'T' is a transaction in the database 'D' and is a set of items such that T ⊆ I.

2.4.2 Inputs

The database 'D' and a minimum support threshold are fed to the algorithms as input. The aim is to find the frequent itemsets in this database, i.e., the itemsets whose frequency is above this threshold.

2.4.3 Representation of Data

The data is represented as either an 'array' or a 'tree structure' for the algorithms implemented in this project. An 'array' is used for the 'horizontal layout' and 'vertical layout' algorithms; a 'tree' is used for the 'projected layout' based algorithms.
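As a small illustration of the two array-based layouts (the class name Layouts and the toy transactions are assumptions for the example), the vertical form can be derived from the horizontal form in a single pass:

import java.util.*;

public class Layouts {
    public static void main(String[] args) {
        // Horizontal layout: one row per transaction; the TID is the row index.
        List<Set<Integer>> horizontal = List.of(
                Set.of(1, 2, 3),   // TID 0
                Set.of(1, 3),      // TID 1
                Set.of(2, 4));     // TID 2

        // Vertical layout: one tid list per item, derived from the rows above.
        Map<Integer, Set<Integer>> vertical = new TreeMap<>();
        for (int tid = 0; tid < horizontal.size(); tid++)
            for (int item : horizontal.get(tid))
                vertical.computeIfAbsent(item, k -> new TreeSet<>()).add(tid);

        System.out.println(vertical); // prints {1=[0, 1], 2=[0, 2], 3=[0, 1], 4=[2]}
    }
}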

2.5 Objectives of the Project

1. Implement the 12 algorithms (Java is used as the programming language).

2. Calculate the space and time taken by the algorithms.

3. Analyze and compare the memory usage and time taken by the algorithms.

3. Importance of the Project

Many pattern-finding algorithms, such as decision trees, classification rules, and clustering techniques, have been developed in the machine learning research community. Frequent pattern or association rule mining is one of the few exceptions to this tradition. The impact of this technique is tremendous. Some of the applications of frequent patterns are:

1. Market Basket Analysis: a more specific type of affinity analysis that discovers co-occurring relationships among activities performed by (or recorded about) specific individuals or groups. For example, in a supermarket, customers who buy cereal are more likely to buy milk as well; by studying these types of relationships, we can decide the physical locations of items and offer discounts to customers who buy both milk and cereal.

2. Web Usage Mining: by studying the patterns of users visiting web pages, we can decide the authority and rank of a web page. It is also sometimes used to differentiate between web crawlers and actual users, and to analyze the relevance of the advertisements displayed to users.

3. Statistical Anomaly-Based Intrusion Detection: a statistical anomaly-based IDS determines normal network activity (what bandwidth is generally used, which protocols are used, and which ports and devices generally connect to each other) and alerts the user when abnormal traffic is detected.

4. RESULTS

Algorithm | Method | Data Structure | Memory Usage | Execution Time
--------- | ------ | -------------- | ------------ | --------------
Apriori | Uses breadth-first search and a tree structure to find candidate itemsets, then prunes the infrequent candidates | Array based | 14.109375 MB; a large amount of space, because a large number of candidates are produced | 0.735 s; more time, because it uses a double recursive scheme to count the transactions
Partitioning | Partitions the database to find the locally frequent itemsets | Array based | Main memory is used for each partition | 0.45 s; more time, because both locally and globally frequent itemsets must be found
Direct Hashing and Pruning (DHP) | Uses a hashing technique to find frequent itemsets | Array based | Less space in the earlier stages | 0.039 s; less time
Sampling | Checks a random sample for the frequent itemsets of the whole database | Array based | Very little memory | 0.019 s; very little time
Dynamic Itemset Counting (DIC) | Depends on dynamic insertion of candidate itemsets | Array based | Memory utilization varies over time | 0.0618 s; less time, because candidate itemsets are added as the situation requires
Improved Apriori | Employs the same method as Apriori but replaces the database D by D', which contains only frequent transactions | Array based | Less space than Apriori | 0.56 s; less time than Apriori
FP-Tree | Builds the frequent pattern tree that satisfies the minimum support | Tree based | 0.69140625 MB; less memory, because there is no candidate generation | 0.022 s; overhead due to the complex data structure
COFI-Tree | Builds a bidirectional FP-tree, then constructs the COFI-tree locally for each item | Tree based | Main memory is utilized, because the tree is divided locally into parts | 0.036 s; more time to find the locally and globally frequent itemsets
CT-PRO | Builds a compact FP-tree, then mines frequent itemsets according to a projection index | Tree based | Main memory is utilized, as it uses the compressed FP-tree and makes projections separately | 0.045 s; more time due to the large number of projections
H-Mine | Uses hyperlink pointers to store the partitioned database | Tree based | 0.9208984375 MB; memory is used according to the partitions of the projected database | 0.056 s; less time than FP-tree due to the partitioning of the database
Eclat | Uses transaction ids to generate the candidate itemsets | Array based | Less memory than Apriori | 0.024 s; less time than Apriori
Relim | Step-by-step elimination of items with recursive processing of transaction subsets | Tree based | Less memory than Apriori | 0.021 s; less time than FP-tree and Apriori

5. COMPARISON OF TIME COMPLEXITY

(Time-complexity comparison charts and screen shots omitted.)

6. CONTRIBUTION

7. CONCLUSION AND FUTURE WORK
