A Process Oriented Perception Of Personalization Computer Science Essay

Published: Last Edited:

This essay has been submitted by a student. This is not an example of the work written by our professional essay writers.

Web personalization is a strategy, a marketing tool and an art. With the rapid development of Deep Web, a large number of web information often lead to information overload and information disorientated ", yet, personalized techniques can solve this problem. Personalized techniques are one such software tool used to help users obtain recommendations for unseen items based on their preferences. The commonly used personalized techniques are content based filtering, collaborative filtering and rule based filtering. In this paper, we present a survey on a personalized collaborative filtering method combining the association rule mining focusing on the problems that have been identifying and the solution that have been proposed.

Keywords- Web mining, web usage mining, personalization, collaborative filtering, association rule mining


Web mining uses many data mining techniques; it is not purely an application of traditional data mining due to the heterogeneity and semi-structured or unstructured nature of the Web data. Web mining process is similar to the data mining process. The difference is usually in the data collection. In traditional data mining, the data is often already collected and stored in a data warehouse. For Web mining, data collection can be a substantial task, especially for Web structure and content mining, which involves crawling a large number of target Web pages. Once the data is collected, we go through the same three-step process: data pre-processing, Web data mining and post-processing. However, the techniques used for each step can be quite different from those used in traditional data mining.

Web mining is the application of data mining techniques to extract knowledge from web data, i.e. web content, web structure, and web usage data mining.

Web usage mining

Web-usage mining (WUM), an emergent domain in web mining that has greatly concerned both academia and industry in recent years. WUM is the process of discovering and interpreting patterns of user access to web information systems by mining the data collected from user interactions with the system. A typical WUM system consists of two tiers: 1) tracking, in which user interactions are captured and acquired, and 2) analysis, in which user access patterns are discovered and interpreted by applying typical data-mining techniques to the acquired data. There are three main tasks for performing WUM- pre-processing, pattern discovery and pattern analysis. As below:-

Pre-processing: Commonly used as a preliminary data mining practice, data pre-processing transforms the data into a format that will be more easily and effectively processed for the purpose of the user. The different types of pre-processing in WUM are- usage, content, and structure pre-processing.

Pattern Discovery: WUM can be used to uncover patterns in server logs but is often carried out only on samples of data. The mining process will be ineffective if the samples are not a good representation of the larger body of data. The various pattern discovery methods are- Statistical Analysis, Association Rules, Clustering, Classification, Sequential Patterns, and Dependency Modeling.

Pattern Analysis: The need behind pattern analysis is to filter out uninteresting rules or patterns from the set found in the pattern discovery phase. Most common form of pattern analysis consists of a knowledge query mechanism such as SQL. Content and structure information can be used to filter out patterns containing pages of a certain usage type, content type, or pages that match a certain hyperlink structure.


Web personalization is a strategy, a marketing tool, and an art. Personalization requires implicitly or explicitly collecting visitor information and leveraging that knowledge in your content delivery framework to manipulate what information you present to your users and how you present it. Personalization for its own sake has the potential to increase the complexity of site interface and drive inefficiency into the architecture. It is the capability to customize customer communication based on knowledge preferences and behaviours at the time of interaction. Web personalization process include (a) The collection of Web data, (b) The modeling and categorization of these data (pre-processing phase), (c) The analysis of the collected data, and (d) The decision making/final recommendation phase that should be performed.

Proposed Architecture:

a) Collection of Web data - Implicit data includes past activities/clickstreams as recorded in Web server logs and/or via cookies or session tracking modules. Explicit data usually comes from registration forms and rating questionnaires.

b) Pre-processing of Web data - Data is frequently pre-processed to put it into a format that is compatible with the analysis technique to be used in the next step. Pre-processing may include cleaning data of inconsistencies, filtering out irrelevant information according to the goal of analysis (example: automatically generated requests to embedded graphics will be recorded in web server logs, even though they add little information about user interests), and completing the missing links (due to caching) in incomplete click through paths.

c) Analysis of Web data - Also known as Web Usage Mining, this step applies machine learning or Data Mining techniques to discover interesting usage patterns and statistical correlations between web pages and user groups. This step frequently results in automatic user profiling, and is typically applied offline, so that it does not add a burden on the web server.

d) Decision making/Final Recommendation Phase - The last phase in personalization makes use of the results of the previous analysis step to deliver recommendations to the user. The recommendation process typically involves generating dynamic Web content on the fly, such as adding hyperlinks to the last web page requested by the user.

Personalization Techniques:

Personalization techniques can be divided in three parts:

Content-Based Filtering

In this, the user model includes information about the content of items of interest- whether these are web pages, movies, music, or anything else. Using these items as a basis, the technique identifies similar items that are returned as recommendations. One of the limitations when using content-based techniques is that it leads to over - specialization.

Collaborative -Based filtering

In social or collaborative filtering, the system constructs rating profiles of its users, locates other users with similar rating profiles and returns items that the similar users rated highly. Scalability is a problem because computation grows linearly with the number of users and items. The advantage of social filtering, compared to content-based techniques, is that the pool from which recommendations originate is not restricted to items for which the active user has demonstrated interest. The pool will also include items that other users, users that are in some respect similar, have rated highly. This can prove to be instrumental in enhancing the user's model: social filtering systems give the user the opportunity to explore new topics and items.

Rule-Based filtering

It selects only the appropriate service in formations by comparing the query result produced from the Search Manager (Search Module) which is the user interface of the search engine and the rule of the user fetched from the user profile registry.

Collaborative Filtering Techniques

The fundamental assumption of CF is that if users X and Y rate n items similarly, or have similar behaviours(e.g., buying, watching, listening), and hence will rate or act on other items similarly.CF techniques use a database of preferences for items by users to predict additional topics or products a new user might like. In a typical CF scenario, there is a list of m users {u1, u2. . . um} and a list of n items {i1, i2, . . . , in}, and each user, ui has a list of items, Iui, which the user has rated, or about which their preferences have been inferred through their behaviours. CF algorithms are required to have the ability to deal with highly sparse data, to scale with the increasing numbers of users and items, to make satisfactory recommendations in a short time period, and to deal with other problems like synonymy (the tendency of the same or similar items to have different names), shilling attacks, data noise, and privacy protection problems. Early generation collaborative filtering systems, such as Group Lens, use the user rating data to calculate the similarity or weight between users or items and make predictions or recommendations according to those calculated similarity values.

Basically CF techniques divided into three parts:

Memory-based collaborative filtering

Method-based collaborative filtering

Hybrid recommenders

Memory-based collaborative filtering:

Memory-based CF algorithm uses the entire or a sample of the user-item database to generate a prediction. Every user is part of a group of people with similar interests. By identifying the so-called neighbours of a new user (or active user), a prediction of preferences on new items for him or her can be produced.

The neighbourhood-based CF algorithm, a prevalent memory-based CF algorithm, uses the following steps:

-Calculate the similarity or weight, wi, j, which reflects distance, correlation, or weight, between two users or two items, i and j.

- Produce a prediction for the active user by taking the weighted average of all the ratings of the user or item on a certain item or user, or using a simple weighted average.

-When the task is to generate a top-N recommendation, we need to find k most similar users or items (nearest neighbours) after computing the similarities, then aggregate the neighbours to get the top-N most frequent items as the recommendation.

Model based collaborative filtering:

The design and development of models (such as machine learning, data mining algorithms) can allow the system to learn to recognize complex patterns based on the training data, and then make intelligent predictions for the collaborative filtering tasks for test data or real-world data, based on the learned models. Model-based CF algorithms, such as Bayesian models, clustering models, and dependency networks, have been investigated to solve the shortcomings of memory-based CF algorithms. Usually, classification algorithms can be used as CF models if the user ratings are categorical, and regression models and SVD methods and be used for numerical ratings. Bayesian Belief Net CF Algorithms. A Bayesian belief net (BN) is a directed, acyclic graph (DAG) with a triplet N,A, Θ, where each node n ∈ N represents a random variable, each directed arc a ∈ A between nodes is a probabilistic association between variables, and Θ is a conditional probability table quantifying how much a node depends on its parents. Bayesian belief nets (BNs) are often used for classification tasks.

Hybrid recommenders

Hybrid CF systems combine CF with other recommendation techniques (typically with content-based systems) to make predictions or recommendations. The two major classes of CF approaches, memory-based and model-based CF approaches, can be combined to form hybrid CF approaches. The recommendation performances of these algorithms are generally better than some pure memory-based CF algorithms and model-based CF algorithms.

v. association rule mining

Association rule mining are one of the major techniques of data mining and it is the most common form of local-pattern discovery in unsupervised learning systems. It serves as a useful tool for finding correlations between items in large databases. The terms used in these rule are:

Support: The support supp(X) of an itemset X is defined as the proportion of transactions in the data set which contain the itemset.

supp(X) = no. of transactions which contain the itemset X / total no. of transactions

Confidence: The measure of certainty or trustworthiness associated with each discovered pattern. The confidence for an association rule X implies Y is the ratio of the number of transaction that contains X U Y to the number of transaction that contains X.

con f (X->Y) = supp (XUY) / supp(X)

Large Item Set: A large item set is an item set whose number of occurrences is above a threshold or support. Here Item is: article in the basket and Itemset: a group of items purchased together in a single transaction.

Association rule mining is the most used technique in Web Usage Mining generally applied to databases of transactions where each transaction consists of a set of items. When applied to Web Usage Mining, association rules are used to find associations among web pages that frequently appear together in users' sessions. When applied to Web Usage Mining, association rules are used to find associations among web pages that frequently appear together in users' sessions. The most common approach to finding association rules is to break up the problem into two parts:

1. Find all frequent itemsets: By definition, each of these itemsets will occur at least as frequently as a pre-determined minimum support count.

2. Generate strong association rules from the frequent itemsets: By definition, these rules must satisfy minimum support and minimum confidence.

Many algorithms have been proposed to solve the problem of detecting frequent itemsets in transaction database. Most of them can be classified into two categories, candidate generation and pattern growth.

Apriori represents the candidate generation approach. Apriori is a Breadth First Search Algorithm (BFS) which generates candidate k+1-itemsets based on frequent k-itemsets. The key idea of Apriori algorithm is to make multiple passes over the database.

The advantages of using apriori algorithm are

Uses large item set property.

Easily parallelized.

Easy to implement.

FP-growth is a representative pattern growth approach. It is a Depth First Approach (DFS) and uses a special data structure, FP-Tree, for compact representation of the original database. FP-growth detects the frequent itemsets by recursively finding all frequent 1-itemsets in the conditional pattern base that is efficiently constructed based on the node link structure associated with FP-Tree. FP-growth doesn't explicitly generate candidates; its detection of the item supports is equivalent to generating 1-itemset candidates implicitly.

The advantages of FP-Growth algorithm are

Uses compact data structure.

Eliminates repeated database scan.


Apriori is a seminal algorithm for finding frequent itemsets using candidate generation. It is characterized as a level-wise complete search algorithm using anti-monotonicity of itemsets, "if an itemset is not frequent, any of its superset is never frequent". By convention, Apriori assumes that items within a transaction or itemset are sorted in lexicographic order.

Let the set of frequent itemsets of size k be Fk and their candidates be Ck. Apriori first scans the database and searches for frequent itemsets of size 1 by accumulating the count for each item and collecting those that satisfy the minimum support requirement. It then iterates on the following three steps and extracts all the frequent itemsets:

1. Generate Ck+1, candidates of frequent itemsets of size k +1, from the frequent itemsets of size k.

2. Scan the database and calculate the support of each candidate of frequent itemsets.

3. Add those itemsets that satisfies the minimum support requirement to Fk+1.

The Apriori algorithm is shown below. Function apriori-gen in line 3 generates Ck+1 from Fk in the following two step process:

1. Join step: Generate Rk+1, the initial candidates of frequent itemsets of size k + 1 by taking the union of the two frequent itemsets of size k, Pk and Qk that have the first k−1 elements in common.

Rk+1 = Pk ∪ Qk = {iteml, . . ., itemk−1, itemk ,


Pk = {iteml, item2, . . . , itemk−1, itemk }

Qk = {iteml, item2, . . . , itemk−1, i temk' }

where, iteml < item2 < · · · < itemk < itemk'.

2. Prune step: Check if all the itemsets of size k in Rk+1 are frequent and generate Ck+1 by removing those that do not pass this requirement from Rk+1. This is because any subset of size k of Ck+1 that is not frequent cannot be a subset of a frequent itemset of size k + 1.

It is evident that Apriori scans the database at most kmax+1 times when the maximum size of frequent itemsets is set at kmax.


F1= (Frequent itemsets of cardinality 1);

For(k=1;Fk !=0;k++) do begin

Ck+1 = apriori-gen(Fk);//New candidates

for each transaction t in database do

increment the count of all candidates in

Ck+1 that are contained in t

Fk+1 = candidates in Ck+1 with min_support

return Uk Fk


Apriori algorithm, in spite of being simple, has some limitation. They are:

It is costly to handle a huge number of candidate sets.

It is tedious to repeatedly scan the database and check a large set of candidates by pattern matching, which is especially true for mining long patterns.

In order to overcome the drawback inherited in Apriori, an efficient FP-tree based mining method, FP-growth is used. It is an order of magnitude faster than Apriori, and is also faster than tree-projection which contains two phases, where the first phase constructs an FP tree, and the second phase recursively researches the FP tree and outputs all frequent patterns.


FP-growth algorithm is an efficient method of mining all frequent itemsets without candidate's generation. FP-growth utilizes a combination of the vertical and horizontal database layout to store the database in main memory. Instead of storing the cover for every item in the database, it stores the actual transactions from the database in a tree structure and every item has a linked list going through all transactions that contain that item.

The algorithm mines the frequent itemsets by using a divide and- conquer strategy as follows: FP-growth first compresses the database representing frequent itemset into a frequent-pattern tree, or FP-tree, which retains the itemset association information as well. The next step is to divide a compressed database into set of conditional databases (a special kind of projected database), each associated with one frequent item. Finally, mine each such database separately. Particularly, the construction of FP-tree and the mining of FP-tree are the main steps in FP-growth algorithm.

A frequent pattern tree is a tree structure defined below.

1. It consists of one root labeled as "root", a set of item prefix sub-trees as the children of the root, and a frequent-item header table.

2. Each node in the item prefix sub-tree consists of three fields: item-name, count, and node-link, where item-name registers which item this node represents, count registers the number of transactions represented by the portion of the path reaching this node, and node-link links to the next node in the FP-tree carrying the same item-name, or null if there is none.

3. Each entry in the frequent-item header table consists of two fields, (1) item-name and (2) head of node-link, which points to the first node in the FP-tree carrying the item-name.

Construction of FP-Tree:

•First, create the root of the tree, labeled with "null".

• Scan the database D to create 1-itemset and then L and scan again for second time.

• The items in each transaction are processed in L order (i.e. sorted order).

• A branch is created for each transaction with items having their support count separated by colon.

• Whenever the same node is encountered in another transaction, just increment the support count of the common node or Prefix.

• To facilitate tree traversal, an item header table is built so that each item points to its occurrences in the tree via a chain of node-links.

• Now, The problem of mining frequent patterns in database is transformed to that of mining the FP-Tree.


We have taken the view that Web personalization is an application of data mining and therefore must be supported during the various phases of a typical data mining cycle. We defined a framework for web personalization expert based on web mining and collaborative filtering techniques.

We adopted collaborative filtering technique combined with association rule mining technique, especially apriori algorithm respectively to associate the usage pattern of the clients in particular website.

The main drawback of Apriori algorithm is that the

candidate set generation is costly, especially if a large number of patterns and/or long patterns exist. The main drawback of FP-growth algorithm is the explosive quantity of lacks a good candidate generation method. Future research can combine FP-Tree with Apriori candidate generation method to solve the disadvantages of both apriori and FP-growth. The new algorithm will reduce the storage space, improves the efficiency and accuracy of the algorithm.