Personalize Web Search With User Diversity Prediction Computer Science Essay

Published: Last Edited:

This essay has been submitted by a student. This is not an example of the work written by our professional essay writers.

Abstract- As per the various investigations performed, many personalization algorithms have been identified. But it is still unclear whether personalization is consistently effective on different queries for different users and under different search contexts. Among the different techniques used to improve the web search a click based approach found to be more permissive if the user browse history is available. Here preliminarily a click based approach is implemented in the client side system by storing the client search information in the local system. Then from this information, user's topical interest pattern is mined and a hybrid approach is implemented. It is also found that most of the personalization techniques apply personalization uniformly irrespective of any consideration. A Click Entropy is used here as a technique to predict the need of personalization. It is found that the web search performance improves considerably with the click based personalization method if the user search history is available. Using a prediction model is also found to be a solution for this and it is better than using the personalization simply for all queries.

Keywords: ClickEntropy, ODP, P-Click, Topical Interest,


The World Wide Web is one of the greatest resources of information. The retrieval of proper information to fulfill the intention of the information seeker is one of the major challenges in the history of World Wide Web. In most of the search engines when queries are issued, most return the same results irrespective of the users. Different users may have completely different information needs and goals when using precisely the same query. Personalized search is considered a solution to address these problems, since it can provide different search results based upon the preferences of users. The effort to improve the effectiveness of web searching started with the history of Internet. Different techniques [2], [3], [4], [5], [8], [14], [15] were implemented by various search providers to satisfy the user with proper and user intended search results. Most of the approaches were based on collecting the interest of users in manual or an automated way. For each user a profile [12], [13], [17] was created and it depicts the interests of each users.

By personalizing the web search the results will be retrieved based on the interest of user. The results will be from various disciplines without considering the context of the user. For example the search for the term "mouse" will return the results related to the computer peripheral mouse, the rat, the cartoon character Mickey Mouse etc. In a personalized search the results will be exactly from the context of the user. Various personalization strategies have been proposed. However, they are far from optimal. One problem of current personalized search is that most proposed algorithms are uniformly applied to all users and queries [2], [3], [5]. The evaluation of these strategies reveals that this approach is not effective and therefore it is suggested that queries should not be handled in the same general manner [1].

In this proposed work, as a solution for the above problem, a prediction method [1], [6] is used to decide whether to personalize or not to personalize. A measure of Click entropy [1] is used to measure whether users have diverse information needs by issuing a query. Experimental results show that personalization brings significant search accuracy improvements on queries with large click entropy and has little effect on queries with small click entropy. Personalization algorithms can even harm search accuracy on some queries.

Based on a large scale personalized search evaluation framework [1] based on real-world query logs a click based personalized search method finds to be more effective compared to the topic based search method which was a common mean to employ personalized web search. In this method the user interest can be easily identified using the number of clicks they performed in a particular page.

At the end, although click-based methods outperform topic-interest-based ones on the average, topic interest-based methods perform best for some queries. Therefore a hybrid approach to combine strengths of these personalization methods intelligently, for a better ranking is adopted in a real-world personalized search engine.

The remaining sections are organized as follows: In Section 2 a detailed literature survey of related works is discussed. In Section 3 the proposed technique is explained in the aspect of understanding the problem and solution domain. In Section 4 the details of the partial implementation in the view of single system is explained. In Section 5 the system is viewed in an Intranet aspect. In Section 6 the integration of two methods are described. In Section 7 click entropy is proposed to predict whether a personalization algorithm should be used for given queries and experiment with predicting when Web search results can be improved using a personalization method based on a user's long-term interests. Then the work is concluded and the future work to attain the entire objectives is briefed in Section 8.


There are two categories of work related to this work; personalizing Web search methods, and predict the need of personalization.

A. Personalized Web Search Methods

The efforts to personalize the web search started with the history of search engines. Most of the techniques were based on the user profiles, where a user profile depicts the user interests and other details regarding the user and his search to improve his/her searching. Once the user profile is constructed, it is used to compare it with either the past result sets, or any topic categorical [15] group, or any other specific data sources to produce the relevant result to the user.

In an approach, user profile are created and categorizing the web pages, and comparing these concepts to obtain the result according to the interest of the user. This technique studies different ways to model a user's interests and shows how these models - also called profiles - can be deployed for more effective information retrieval and filtering. A user profile is created over time by analyzing surfed pages to identify their content and by associating that content with the length of the document and the time that was spent on it [2], [4]. When pages about certain subjects are visited again and again, this is taken as an indication of the user's interest in that subject. Websites are clustered into regions, regions are clustered into super regions, super regions into hyper regions, etc., thus forming a hierarchy of regions. User interests are inferred by analyzing the web pages the user visits. For this purpose, it is necessary to determine the content, or characterize these surfed pages. This is done by using a hierarchy of concepts, or rather ontology. This ontology is based on a publicly accessible browsing hierarchy. Each node of the browsing hierarchy is associated with a set of documents that are used to represent the content of this node. For each of the surfed pages a keyword vector is calculated. This page vector is compared with the keyword vectors associated with every node to calculate similarities. The nodes with the top matching vectors are assumed to be most related to the content of the surfed page.

In another approach which is known as Open Directory Project (ODP) [3] is one of the largest efforts to manually annotate web pages. This effort involves over 65,000 editors and resulted in metadata specifying topic and importance for more than 4 million web pages. The additional criterion for web page ranking, namely the distance between a user-profile defined using ODP topics and the sets of ODP topics covered by each URL returned in regular web search. Investigate the boundaries of biasing Page-Rank on subtopics of the ODP in order to automatically extend these metadata to the whole web. Each user has to select several topics from the ODP, which best fit her interests to generate the user profile. Then, at run-time, the output given by a search service (from Google, ODP Search, etc.) is re-sorted using a calculated distance from the user profile to each output URL. One Personalized Page-Rank Vector (PPV) [21] is computed for each user. The personalization aspect of this algorithm stems from a set of hubs, each user having to select her preferred pages from it. PPVs can be expressed as a linear combination of PPVs for preference vectors with a single non-zero entry corresponding to each of the pages from the preference set. The extension of the manual ODP classifications from its current 4 million entries to an 8 billion Web is performed based on an analysis of how topic classifications for subsets of large page collection can be extended to this large collection via topic-sensitive biasing of Page-Rank values.

Another approach utilizes clickthrough data in the form of a three order tensor on which a 3-mode analysis is performed using the higher-order singular value decomposition technique to automatically capture the latent factors that govern the relations among these multi-type objects: users, queries and Web pages to improve Web search. Web users' interest is discovered here by analyzing the clickthrough data objects like user, query, and web page and relationship among these objects. The clickthrough data may reflect web users' interests and may contain patterns that users found their information. This clickthrough data is represented by a three dimensional, multilinear extension of the matrix Singular Value Decomposition. When using a search engine to find information: a user(u) submit

s a query(q), the search engine returns a list of URLs and the corresponding descriptions of the target Web pages, then the user clicks on the pages(p) of interest. After some time of usage, the search engine accumulates a collection of clickthrough data, which can be represented by a set of triplets (u, q, p). This is given as the input to the CubeSVD algorithm and the output is the reconstructed tensor , which measures the associateds among the users (u), queries (q) and web pages (p). In the CubeSVD algorithm, the tensor value measures the preference of a (user, query) pair on a Web page. If the page click frequency is used as tensor value, the algorithm is inclined to biasing towards tensor elements with high frequency. Three other weighting approaches are tried here: The first is a Boolean model. That is, for each (user, query) pair, if a page is clicked on, then the tensor value associated with the three objects is 1, otherwise 0. The second is by re-weighting of click frequency. a method used in IR community is used here. For each clickthrough data triple (u, q, p), the weight of the corresponding tensor value is a re-weighting of the page click frequency. The third approach is similar with the second one. Here it is taken into account the Inverse Document Frequency (IDF) [10] of a Web page (that is, frequency of a page visited by different users).

B. Predict the Need of Personalization

In most of the approaches the personalization applied uniformly to all the search queries. But the experiments showed that this will considerably reduce the performance of the personalization method.

In the context of Web search, for some queries, everyone who issues the query is looking for the same thing. For other queries, different people want very different results even though they express their need in the same way. To have a proper personalization the variability in user intent using both explicit relevance judgments and large-scale log analysis of user behavior patterns is examined.

It focuses on how well variation in the implicit measures predicts variation in the explicit measures, and look at what other factors can account for variation in the implicit measures [6]. Queries are characterized using a variety of features of the query, the results returned for the query, and the query's interaction history [17]. Using these features the predictive models is built to identify the queries that will benefit most from personalization, and explore which features are the most valuable for prediction. Using the datasets, query ambiguity is measured [6], [9], [20] by calculating the variation in the explicit relevance judgments or the user behavior available from the log data. The major measure of ambiguity is the potential for personalization curve. Here the features used to predict query ambiguity are mainly query, and result set and sometimes the log for query history. This ambiguity has been understood by a comparison between explicit and implicit measures, correlating features and implicit measures and the influence on the implicit measures. Several reasons are identified, why a query could have high click entropy or a large gap in the potential for personalization curve, and yet not be a good candidate for personalization, including variation in the results presented, variation in the task, and variation in result quality.


Different technologies for personalizing the web search are emerging day by day. Even an optimum solution had not yet reached, many improved and promising solutions based on certain aspects are introduced. Among them few of the latest technologies are identified and analyzed and tried to find out the best solution among the existing and implement it with improvements [1]. These techniques are broadly grouped into click based and topic based. Both the technology depends on the past history of user searches. In this method a hybrid approach of the both are considered. A new and novel idea of predicting the need of personalization also evaluated to decide whether all the search process needs to be personalized or not.

System Overview

The overview of the experiment can be viewed as depicted in the flowchart given in the Fig 3.1. The system works with the query issued by the web searcher. Now based on the various aspects on the query and the results from the previous history, it is predicted whether the searching requires a personalization or not [1], [6], [16]. Once the decision over the prediction is over and if it decides the need of personalization the system chooses a personalization method and the personalized search result is returned to the user. Here the results of normal search are rearranged according to the ranks of the pages based on the user interest. Otherwise the system bypasses the personalization and proceeds with the normal WEB search.


Insert Query

Need to be personalized?



Use Personalization

Use Normal Search

Display Result


Fig 3.1 System Overview

Personalization Method

It is found that among the various personalization methods the click based methods perform much better. The interest of a user on a particular web page for a search query can be easily identified using the click-through information. The idea used in the implementation of click based personalization is given below.

Historical Click Based Algorithm

It is supposed that for a query q submitted by a user u, the web pages frequently clicked by u in the past are more relevant to u than those seldom clicked by u [1]. This method is called as P-Click Algorithm. Thus, a personalized score on page p can be computed by;

SP-Click(q,p,u)= (3.3) Here, |Clicks(q, p, u)| is the number of clicks on web page p by user u for query q in the past. |Clicks(q, •, u)| is the total number of clicks for query q by u, and β is a smoothing factor (β=0.5 in this work). Actually, |Clicks(q, •, u)| and β are used to normalize |Clicks(q, p, u)|. The disadvantages found in this approach are; first it is not applicable for new queries that the user has never asked, second it may hold back the discovery of newly available results because old clicked documents will be ranked to the top of the result list.

In the simplest form it can be implemented in a client machine. Here the search details and click-through information are collected from the web browser and stored in the client system itself. Here the personalization is done for each query. The personalization is implemented when the same search query is repeated. A log entry for each query is maintained in the system with the click-through information like the total number of clicks for a particular query, number of clicks on a particular page for the same query etc. From these informations the above said method is used to rank the page for the particular query and implement personalization.

The same method can be used in a small network like an intranet also. The idea is to personalize the interest of each member of a small group of people. This idea can be utilized in any kind of organizations. Here the log is maintained at a server through which the search service is provided. Each user is identified as each system, which can be identified in the network using the IP and MAC addresses of each system. The log entry from the search queries from each machine are maintained at the server and the search result is provided according to the click-through information collected from each machine from their previous searches.

The performance of personalization can be improved to a great extend by combining the click based method to a topic based approach. This hybrid approach, if implemented properly will give a better results compared to most of the other personalization methods. An approach planned to adopt is to extract user interest from the click based method for a long term and this can be used to create the user profile. This user profile can be depicted as a vector value. This can be compared with the topic vectors in internet like ODP. This topical interest based method is explained below.

User-Topical-Interest-Based Algorithms

Here an approach based on long term user topical interests (L-Topic), is considered. In L-Topic, when a user submits a query, each returned web page is first mapped to a category vector. Then, the similarity between the user profile vector and the page category vector is computed:

sim(cl(u),c(p))= (3.4)

Here, c(p) is the category vector of web page p. User profile cl(u) is computed based upon his/her past clicked web pages by the following equation:

cl(u)= (3.5)

Here, P(u) is the collection of web pages that were visited by user u in the past. P(p/u) can be thought of as the probability that user u clicks web page p. w(p) is an impact weight for page p when user profiles are generated.

The personalization score of document p is defined as;



Re-Ranking Evaluation Framework

The steps included in personalized re-ranking [1], [7], [19] are as follows;

First, download the top 50 search results from the Windows Live search engine for the query string. The downloaded result items are denoted with U and denote the ranking list with TOriginal.

Second, compute a personalized score for each item xi Ñ” U using a given personalization algorithm and then generate a new rank list TPersonalized with respect to U. Result items in TPersonalized are sorted by personalized scores in descending order.

Third, combine the two rank lists TOriginal and TPersonalized to generate a final rank list TRerank. TRerank will be returned to users in personalized search. This process is done by using a popula technique known as Rank Aggregation Method. In this method among the three classes of methods like distributional based, heuristic, and stochastic search, the heuristic and stochastic search classes are applicable in search result combination where the aggregation of a small number of long lists needed to be considered. In the current scenario a famous method, known as Borda's Ranking Aggregation Method, for aggregation is used. In this method proposed by Borda, aggregate ranks were computed based on arithmetic average for full ranked lists. Many other aggregation functions like ARM (arithmetic average), GEM (geometric mean), L2N (l−2 norm) and MED (median) and modifications are proposed and applied with Borda's Aggregation method successfully. Once the personalized result is given to the searcher the performance of the clicked URLs can be calculated using their ranks using the equation given in the following section.

Prediction Method

Here click entropy [1], [11] is proposed to measure the variation in user information needs and further build models to predict when a query will benefit from a specific personalization algorithm. Click entropy is defined as follows;

ClickEntropy(q)= (3.10)

Here, ClickEntroy(q) is the click entropy of query q. P(q) is the collection of web pages clicked on query q. P(p|q) is the percentage of clicks on web page p among all clicks on q, i.e,

P(p|q) = (3.7)

Click entropy is a direct indication of query click variation. If all users click only one identical page on query q, the ClickEntroy(q) = 0. A smaller click entropy means that the majority of users agree with each other on a small number of web pages. In such cases, there is no need to do any personalization. A large click entropy indicates that many web pages were clicked for the query. This may mean the following:

A user has to select several pages to satisfy his information need, which means that the query is most likely an informational query. In this case, personalization can help to filter the pages that are more relevant to users by making use of historical selections.

Different users have different selections on this query, which means that the query is an ambiguous query. In this case, personalization can be used to provide different web pages to different users

According to the evaluation the improvement of personalized search performance increases with click entropy being larger, especially when click entropy ≥1.5.

For click-based methods P-Click and G-Click, the improvement is limited for the queries with click entropy being between 0 and 0.5. In the case of low click entropy, even the best performer, method G-Click, has only 0.37 percent improvement, which is not statistically significant. It indicates that users have general consistent clicks for the queries with low click entropy, and thus, no personalization is necessary for the current search results. On the contrary, for the queries where click entropy>=2.5, the result is obviously different. It indicates that queries with higher click entropy benefit more from personalization.


In this work various aspects of personalizing web search identified. From the comparison made between the previous techniques and this technology, the current technology found to be more advantageous on solving the web search problems addressed. It is found that personalization method cannot be uniformly applied to all the queries. As a solution Click Entropy is applied here. From the click through data the prediction of whether to use the personalization strategy or follow the normal web search strategy, need to be implemented. Once the decision is taken from the clicked information, the P-Click algorithm need to be implemented to find the page score for each retrieved page for the specific query. After the pages browsed by the user for a query are ranked with the page score, the entire result list need to be re-ranked according to the page score identified. This list is given as the personalized search result to the user. Although click-based methods outperform topic-interest-based ones on the average, topic interest-based methods perform best for some queries. Therefore the strengths of these personalization methods are combined intelligently to have a better ranking in a real-world personalized method. In this work the Web Usage Pattern identification is the major task and based on which the entire tasks are framed.

This method can be utilized by a real world search engine by implementing it in the search engine server by identifying each user properly. This will enhance the performance of the search engine to a great extend.