Recommendation Algorithm For Web Pages Computer Science Essay

Published: Last Edited:

This essay has been submitted by a student. This is not an example of the work written by our professional essay writers.

The Output of recommendation can be a prediction or recommendation, based on the type of the input given to the system. Input may be either rating or data, where rating is defined as the suggestion by user on items and data which may be age, gender and education of users. Prediction is the opinion of users to any item which gives lower error rates and recommendation is the one which is most liked by users. Recommendation systems found their application in the field of e-commerce and internet.

Data mining is the process of analyzing data from different perspectives and summarizing it into useful information. This technique is used in recommendation for analyzing the data and uses them for finding patterns in a set of data. This paper focus on the various recommendation algorithms applies in Web Pages and Items separately. The motivation behind this paper is to analyze the various algorithms used for the recommendation of Web Pages and Items. The remaining part of this paper is organized as follows: Recommendation of Web Pages is described in section II, Section III defines the algorithms used for recommendation of Items, and finally section IV concluded the papers.

II. Recommendation for web pages

The recommendation of web pages can be done using several methods. Below from section A to F we describe each of the methods in brief:

A. Page ranking algorithm

Larry Page and Sergey Brin (CoFounder of Google)[3] proposed a page rank algorithm for computing the ranking of web pages. They proposed a page rank algorithm for their search engine. The proposed algorithm focused on the link structure of the web. The importance of any page is dependent on the appropriate rank score. The Rank score is based on the inlinks of a page, where inlinks means from where the links came. As compared to the traditional method, inlinks from good pages carry more weight , because as the number of sites on a page increases, the importance of a page is also increased. The page rank depends upon the back links, if the sum of the rank of back links on any page is more, than that page have more page rank , compare to others. The page rank method can be best demonstrated by the figure below:


Fig1: Demonstration of Page Rank Algorithm.

As seen from the figure there are fewer links to C still Page C has a higher Page Rank than Page E because one link to C comes from an important page (B) and hence is of high value.

Simplified version of page rank is defined by the equation

Pr (u) = c (1)

Where you represent a web page, B (u) is the set of pages that point to you, PR (u) and PR (v) are rank scores of page u and v respectively,denotes the number of outgoing links of pages v, c is a factor used for normalization.

Modified version of page rank algorithm is given by the equation

PR (u)=(1-d)+d (2)

Where d is the damping factor that is usually set to 0.85, d can be thought of as a probability of the user's following the direct link and (1-d) as the page rank distribution from non directly linked pages.

In matrix form it is

G=α S+ (1-α) D

Where S is stochastic matrix, D is damping factor, α=. 85.

Some of the most popular techniques which help in page rank computation are as follows first is the breadth first search, which is used to figure out the structure of the network. Second, the sparse matrix generally used for less memory storage without affecting the ranking of the matrix. Third is compressed row vector which is helpful to accelerate the process of multiplying the matrix. However by using compressed row vector only L addition and L multiplication is required in place of N addition and 2N multiplication, where L and N are the size of a row vector. Description of the methods proposed by various researchers regarding page rank computation is as follows:

Power iteration [16] is one of the most popular page rank computation method for computing the page rank values. For small set of pages it is easy to solve the equation system which determines the page rank values, but in the case of web pages, which contains billions of documents and in such cases it is not easy to compute the page rank score by an inspection method in such situation power iteration method is proving to be more promising one for calculating page rank. The page rank calculation is defined as follows, Initially the power iteration method assumes the page rank of every page as 1. These rank values are iteratively substituted in the page rank equations for finding the final values. Generally many iterations may be needed to normalize the page ranks and the obtained page rank is effective since it is calculated based on the previous scores .Google still uses this method for computing the page rank and Google update its page ranks at least once per month.

Another method for calculating the page rank is proposed by N. Osipova. et. al [2] named as the Monte Carlo method. The proposed algorithm considers the information on all pages during the simulation runs along with information of last visited pages. The Monte Carlo has several advantages beyond the power iteration method [16 ], it shows better performance after one iteration with higher accuracy for relatively important pages, in addition Continuous page rank update and parallel implementation are the two other advantages of Monte Carlo. There are several variants of Monte Carlo method proposed in the paper such as Monte Carlo end point with random start, Monte Carlo end point to cyclic start, Monte Carlo complete path, a Monte Carlo complete path stopping at dangling nodes and Monte Carlo complete path with random start .The following observation was observed by the author 1. Monte Carlo with cyclic start shows better results than the Monte Carlo with random start 2. Monte Carlo complete path stopping at dangling nodes outperforms Monte Carlo end point 3. The Monte Carlo complete path is more efficient than Monte Carlo end point. Although the Monte Carlo method is more efficient than power iteration method, still Google use the power iteration method for the page rank computation because of relative error as only after the first iteration Monte Carlo gives the 1%error. Another factor is that, In Monte Carlo algorithm relative error will reduced on average by the factor 1/√m while power iteration method decrease exponentially with m, where m is iteration. However there are several drawbacks of Monte Carlo but still in the case where high precision is needed Monte Carlo is preferred because Monte Carlo provides simplicity in implementation and allows parallel implementation.

Apart from Page rank algorithms there are many other algorithms for the recommendation of web pages which were described as follows.

B. Weighted Page Rank

Wenpu Xing and Ali Ghorbani [4] proposed a weighted page rank algorithm which is the extension of the page rank algorithm. In the proposed algorithm it assumes that the larger rank value is assigned to more popular pages. The Popularity is depending on the number of incoming and outgoing links. This algorithm does not divide the rank value based on the number of outgoing links as page rank algorithm did. Outgoing links get the weight according to their Popularity. The important parameter of this algorithm is back links and forward links. Weighted page rank algorithm classified the pages into four categories based on the relevancy to a given query. These are very relevant pages (VR), relevant pages (R), weak relevant pages (WR), irrelevant pages (IR). The weighted page rank algorithm applies the relevant rule which finds out the relevancy score of each page in the list of pages, which differentiate the WPR and PR. The experimental results concluded that weighted page rank algorithm gives more relevancy score and it has less Complexity (< O (log N)) comparing to page rank algorithm (O (log N)).

C. Page Content Algorithm

Page rank and weighted page rank algorithm give importance to links rather than content. So there an algorithm who gives importance to the content of the pages. So, Jaroslav. et. al [5] gives a new ranking algorithm called page content algorithm which mainly focus on content. This algorithm employs web content mining technique with page relevance ranking. The method computes the new score for top n pages and only relevant pages are returned. It combines that parameter which seems to be more important for analyzing the content of web pages. In page content algorithm important parameter is content while in weight page rank important parameter is back and forward links. However the relevancy of page content is high compared to page rank and weighted page rank, But still page content algorithm is not given any extra preference to important pages which is the major drawback of the content based algorithm.

D. Hits Algorithm

Kleinberg proposed a web structure mining based algorithm called an Hits algorithm [6] to rank the web pages. Hits algorithm is the one of the link based algorithm which is applied to sub graph after the search is done on the complete graph. Important parameters of this algorithm are back links, forward links, and content. The proposed method computes the page rank by computing the authority and the hub score for highly relevant pages. This algorithm returned the relevant and most important pages. It has less complexity than page content rank algorithm but high from the weighted page rank algorithm. The main drawback of hits algorithm is the poor efficiency because of the worst performance in real time and secondly the topic drifts, means some non relevant pages come for a given query due to the tightly connected arrangement of documents.

E. Salsa Algorithm

Lempel. et. al [7] proposed an approach for link-structure analysis and the TKC ( Tightly -Knit community Effect) [8] effect, which combines the random walk idea of Page Rank with the hub/authority idea of HITS called Stochastic Approach for Link Structure Analysis algorithm (SALSA). The TKC is a highly interconnected set of web pages. This effect mainly occurs in link analysis algorithm when the site are not authoritative on the topic or pertains to just one aspect of the topic as a result the scores gets high. A significant advantage of proposed algorithm is that the weightings can be computed explicitly without the iterative process.

F. Flexi rank algorithm

Hits algorithm used for ranking the web page have one drawback, that they do not consider textual content, So Debajyoti. et. al [1] proposed a Flexi Rank algorithm, which focused on the syntactic classification of web pages. The Classification used in FlexiRank is not dependent on the semantics of page content. It takes a proper class of pages for a given query based on the user requirement. Classification will be like an indexed page, home page, article, advertisement pages etc. In place of working on a number of pages, FlexiRank give a way to rank a few number of pages. In Flexi Rank changes in user query gives the changes in weighted, for example article type page authority weight is high, while for advertisement type pages hub weight is high. The main purpose of FlexiRank algorithm is to provide the flexibility to the user. Flexi Rank provides flexibility in two areas. Firstly in property selection and secondly in finding the weight of selected properties. Flexi Rank provides accurate results along with flexibility because of the ease of change in weight of the parameter which is used for ranking like relevance weight, hub and authority weight and link analysis of pages.

III. Recommendation for items.

The recommendation of items can be done using several methods. Below from section A to F we describe each of the methods in brief.

A. Collaborative Filtering

A collaborative filtering is a technique in which every user interacts with other user and providing an option to the set of product for establishing the quality of product. But the traditional collaborative filtering system has a drawback that they are not scalable to large data set and does not provide the quality of recommendation user. So , to overcome above mentioned problems a new recommendation algorithm is needed which produced high quality recommendation even for large scale data .To address these issues Sarwar. et. al [13] proposed a new algorithm named item based collaborating filtering recommendation algorithm. In the proposed algorithm author initially compute the user-item matrix, for identifying different relationship, and these relationships indirectly compute the recommendation for users. The author analyses the different technique for calculating item similarity and based on which recommendation is performed. The experimental result reveals the fact that item based algorithm gives better result that user based recommendation algorithm.

B. Adsorption algorithm

With the increase in the number of items (videos) in YouTube or any other site, the choice for the user to select items (videos) of their interest also increases. But, unfortunately gives rise to the problem for the user to select the items (videos) of their interest in it.So, for the removal of this problem. Baluja. et. al [12] proposed a novel method named Adsorption which provides personalized video suggestion for users based on the analysis of the user video graph with the help of random walk. The adsorption algorithm is used for classification and learning when of the labelled object and a graph structure defines the universe of labels and unlabeled objects. The proposed algorithm constructs a personalized page which provides user recommendations as per their viewing habits along with the latest and most popular videos. With the help of adsorption algorithm the author tries to improve the efficacy of suggestion in You Tube. The author also performs a recommendation test done in three months snapshots of the live data from YouTube.

C Item rank

Marco [9] proposed an algorithm named Item Rank for ranking the popular items, based on user preferences, which help in recommending highly ranked items to the interesting user. Item rank is the biased version of page rank algorithm [3] and uses the concept of collaborative filtering . In item rank algorithm bipartite graph is created between users and movies. User watched the movie and suggest same movie for similar user. The similar user is identified by the similarity measure, computed by a Pearson correlation coefficient (example in Sarwar et al. 2001).

The Item rank is calculated by:

IRui = α∙ C ∙ IRui + (1-α) ∙ dui

C is correlation matrix and dui is damping factor, α=0∙85

The main purpose of introducing the Item Rank algorithm is the time and space complexity needed by traditional algorithms (Average Commute Time (CT), Pseudoinverse of Laplacian Matrix (L+)) for ranking web pages is much higher. This complexity measure has no effect on small data but as the data set becomes large the time and space complexity both increases, so complexity play a vital role. The CT and L+ algorithm needed 2. │N│. │M│times to run the algorithm, while item rank algorithm needed only│ N│ times to run the algorithm, where N is no of users and M is the no of items in movie lens datasets.) Which is much less in comparison to CT and L+.

D. Ranking by Naive Bayes

Sometimes the problem in suggesting the true popular item to the users, can give more importance to some items, but may affect the ranking of the resulting items for other items , whose ranking is dependent on the resulting items. So, Gouthami. et. al [10] proposed an algorithm called Naive Bayes for ranking and suggesting popular items. It is basically a model based recommender and provide fast and highly scalable model. For ranking, the Naive Bayes algorithm uses the Bayes theory concept which is based on conditional probabilities. Naive Bayes algorithm work on every attribute contained in data treats as each is equally important and independent of each others. The algorithm constructs a decision tree, which helps in ranking the popular items.

E. Ranking by random walk

Noulas. et. al[14] proposed a random walk method for ranking the items (venues). Previously proposed collaborative filtering (CF) algorithm have problems in working with mobility data, because the result is not satisfactory while for online recommendation scenario collaborative algorithm is best. This random walk approaches overcome the problem of relationship between check-in, social and spatial data with others. It has linked structure of connected items in which every item has transitioned probabilities, which helps for the random walker to choose the nodes. Random walker moves from one node to another node according to transition probabilities. Random walk stays on every node for different amount of time, after that steady state comes for each node. This is the steady state probabilities of each node. These probabilities the output of the Random walk model .Than random walk start from a node and find out the constant probabilities for a node , if found then return to the target node. Node those are closer to target node rank high compare to other nodes. Rank each place in decreasing order of steady state probabilities. Random walk performs shows 5-18% improvement compared to other algorithms.

F. Contextual Walk

The Recommendation System is useful in solving the problem of information overloading, but most of the existing recommended system only focussed on rating and not on the context. The Context can be time and location of the recommendation, movie information like an actor, director, writer etc. So, Toine Bogers [11] proposed a context walk which is a recommendation algorithm that includes a different type of contextual information. Context walks easily allow adding the contextual feature like time, mood information and social network information. The main purpose of proposing this algorithm is to remove the two drawbacks. Firstly contextual information is difficult to collect and secondly it is difficult to produce a computable formalization of contextual information. Toine applies the context walk algorithm on the datasets of movie and context is connected by a link that gives a contextual graph on which random walk is applied. The advantage of the context walk is that it can support many recommendation tasks with the same random walk model without retaining the information such as recommending movie for the group of users or tag recommendation.

IV. Conclusion

Recommendation is the emerging technology that helps user to find data of their interest. Recommendation becoming important because as the information in the world increasing day by day and there should be some measures which helps in finding the most valuable information among them. So there are two basic entities appear on any recommendation system namely user and item. So Survey done by we mainly focus on these entities and research work done with it. Our paper is divided into two categories, and various algorithms were described. This article surveyed work done regarding the recommendation of web pages and items.