1 Introduction
As more and more data become available through the Internet, the average user often struggles to find what he is looking for beneath a pile of available and frequently misleading information. This phenomenon is known as the information overload problem, and it motivated the development of an interesting concept in information retrieval: recommender systems.
Information overload is defined as the problem that occurs when the amount of information to which a user is exposed exceeds his processing capacity. Recommender systems attempt to solve the overload problem by making personalized suggestions to the user about items he has not yet considered, based on features of the items themselves, on available information about the user's preference patterns, or on the patterns of other users (Montaner, 2003).
A recommender system can be conceived as an advisor that helps the user navigate the chaotic information space and eases the difficulty of choosing, among all the available options, the items most appropriate for him.
1.1 Motivation
My motivation for working in the field of recommender systems was the Netflix Prize competition (http://www.netflixprize.com/). Netflix is an online movie rental service, and in 2006 it announced a competition in which developers were invited to build a recommender system that would improve the accuracy of its existing system by 10% or more. Although I did not have the chance to take an active part in the competition, it became the seed that led to the idea for this project.
Witnessing the interest of the research community in response to the Netflix Prize, and the constructive competition that followed, inspired me to undertake this project in order to investigate this challenging field further and put my own ideas to the test.
1.2 Project aim and objectives
The aim of this project is to build a recommender system and make it competitive with existing approaches by incorporating appropriate techniques to improve its performance. To achieve this aim, the following objectives have to be accomplished:
· Investigate, through the literature review, the different methodologies suggested in the bibliography to enhance the performance of recommender systems.
· Identify techniques that, when applied to the developed system, can lead to improved performance.
· Evaluate the effectiveness of these techniques by designing an experimental setup that meets the standards set in the literature and testing the system's performance using appropriate metrics.
1.3 Document structure
The rest of this document is organized as follows:
Chapter 2, Literature Review, provides background information on recommender systems. It discusses how implementations differ according to their context, the common challenges faced, and the methods used to evaluate recommender systems. Finally, some of the most recent proposed approaches are presented in order to demonstrate the variety of techniques used.
Chapter 3, Experimental Design and Implementation, describes the proposed approach to the recommendation problem. The starting point is a base model that uses classification algorithms to predict the ratings over the entire item space. Having identified the weaknesses of the base model, we suggest an alternative approach. The chapter also presents the details of the experimental procedure followed to test the performance of both the base and the proposed systems, along with the experimental results for the base system, which set the benchmark for comparison.
Chapter 4, Results and Evaluation, describes in detail the implementation decisions taken during the development of our approach and the way they affect the performance of the system. The results of the experiments for the proposed system are presented, followed by a discussion of their interpretation. Finally, the proposed approach is compared with the base system in order to evaluate the effectiveness of our technique.
Chapter 5, Conclusions, concludes the project by discussing what was accomplished in the research conducted and by describing the focus of future work.
2 Literature Review
This chapter provides background information on key points pertaining to recommender systems. It discusses how the implementation context dictates variations in approach, which challenges are commonly met regardless of the implementation details, and which metrics are used to measure the performance of recommender systems. Finally, some of the most recent proposed approaches are presented in order to demonstrate the variety of techniques used.
2.1 Context of implementation
Recommender systems have been developed for, and operate in, very different implementation contexts, from recommending music to suggesting banking products. While the main working idea behind each of these systems remains the same, namely to filter the information space and present to the user the items that most closely satisfy his needs and taste, the different implementation contexts lead to a number of variations in the techniques used.
2.1.1 Recommending in a community of users
Recommending content within a group of users who share the same information space is considered the most traditional kind of recommender system. A number of users gather to form a virtual community around a common subject and express their opinions on different items of their interest space through explicit ratings. Representatives of this category are recommendation systems for movies such as MovieLens and IMDB, for music such as Last.fm and Pandora, or for books such as WhatShouldIReadNext.
The main characteristic of implementations in this area is that they take advantage of explicit ratings given by users to items. Based on these ratings, the system can form an opinion about the preferences of each individual and base its recommendations on a user profile built upon this information (Mukherzee, 2003). The recommendation can come from the user's own preferences, in the content-based approach (we suggest you will like item A because in the past you liked item B, and items A and B are similar); from the preferences of other users, in the collaborative filtering approach (we suggest you will like item A because other users with taste similar to yours liked item A, so we assume you will like it too); or from combinations of these two approaches, in hybrid systems (Burke, 2002).
2.1.2 Recommender systems in E-commerce
Recommender systems are today widely used in the field of e-commerce, with the best-known representative probably being Amazon.com. The objective of such systems is to recommend to the user the products that he may be most interested in buying among all the available alternatives.
A characteristic of e-commerce recommender systems is that explicit ratings from the user are usually not available. Because of this, such systems can either base their recommendations on purely content-based techniques or try to model the user's preferences from implicit information, with the latter being the most common approach (Mobasher, 2000). Such implicit information can be the user's buying history or his navigation patterns through the shop's site (Cho, 2002).
Another characteristic of e-commerce systems is their often ephemeral relationship with the customer. Unlike in the rating communities discussed above, the user here has no reason to return to the site unless he intends to buy something. The frequency of visits can vary from a few times a week for a supermarket portal, to once a year for a computer retailer, or even less often if the product is cars (Felfernig, 2008). This characteristic constrains e-commerce systems to work with very limited information about the user. In such cases, collaborative filtering techniques, and even content-based ones, cannot perform well. To solve this problem, a number of different techniques have been developed, such as constraint-based (Felfernig, 2008) and knowledge-based recommendation (Burke, 1999).
Constraint-based and knowledge-based recommendation techniques become even more important for more complex products, such as financial services and tourism packages, for which the use of recommender systems is becoming increasingly popular. Examples of such systems are Triplehop's TripMatcher and VacationCoach's MePrint for tourism (Berka, 2004) and FSAdvisor for financial services (Felfernig, 2005). Here, the only way to suggest a service to the user effectively is through specific expert domain knowledge and the satisfaction of certain needs and restrictions in a stepwise process (Jannach, 2009).
2.1.3 Recommending Web content
Recommender systems are also used as part of the wider effort towards a more personalized web experience. Web pages, news items or articles are suggested to the user according to his preferences. As in the e-commerce area, explicit ratings from the user are usually absent. The user's preferences are modeled by studying his web usage patterns: the web logs are analyzed, and information such as the pages visited, the path followed and the time spent on each page is exploited in order to construct the user profile implicitly (Nasraoui, 2003).
Two main points differentiate web content recommendation. The first is the continuously changing item space. In a news recommender system, for instance, the news pages are updated daily or even more often. There is no point in trying to build a preference model for specific content, as it will soon be outdated. The best we can do is to use the usage patterns to extract more general preference patterns for the user and use them in the recommendation procedure. For example, if a user read an article last night about the soccer game between Juventus and Milan, it would be very shortsighted to assume that the user is interested only in matches between these two specific teams. We could generalize by saying that he is a fan of one of the two clubs; or, generalizing further, that he is interested in the Italian soccer league, or perhaps in European soccer more widely, or in soccer in general as a sport. Here exactly lies the main challenge of web usage recommender systems: to find the happy medium between overspecialization and accurate recommendations.
The second challenge comes from the fact that web content lacks structured, specific features that can accurately characterize it. In a traditional movie recommendation application, for example, movies have a number of descriptive features, such as genre, year of production, director and cast, that help us place them in the user-item preference matrix and on which we can base the recommendation. Web recommendation systems, on the other hand, must extract such features from the content of the pages, and for this reason the use of web mining techniques is also necessary.
2.2 Types of Recommender Systems
Modern recommender systems can be classified into three broad categories: content-based recommender systems, collaborative filtering systems and hybrid systems. The following sections provide a brief description of these categories, accompanied by some of the most recent representative systems proposed in the literature.
2.2.1 Content based recommender systems
Content-based filtering approaches recommend items to the user based on the descriptions of previously evaluated items. In other words, they recommend items because they are similar to items the user has liked in the past (Montaner, 2003).
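The idea of recommending items that resemble previously liked ones can be sketched in a few lines. The following is only a minimal illustration under stated assumptions, not the method of any system cited here: items are described by small, hypothetical genre vectors, similarity is measured with cosine similarity, and an unseen item is scored by its best similarity to any liked item. The names `ITEM_FEATURES` and `recommend_content_based` are introduced purely for this example.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def recommend_content_based(item_features, liked, top_n=2):
    """Rank items the user has not liked yet by their highest
    similarity to any item the user liked in the past."""
    if not liked:
        return []  # no profile to compare against
    scores = {}
    for item, features in item_features.items():
        if item in liked:
            continue
        scores[item] = max(cosine(features, item_features[l]) for l in liked)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


# Hypothetical genre vectors: (action, comedy, drama)
ITEM_FEATURES = {
    "A": (1, 0, 0),
    "B": (1, 0, 1),
    "C": (0, 1, 0),
    "D": (1, 1, 1),
}

print(recommend_content_based(ITEM_FEATURES, liked={"A"}))
```

Because only item descriptions and the user's own history are used, such a scheme needs no ratings from other users, which is why content-based methods are often suggested as a complement to collaborative filtering.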
Examples of recent approaches include:
(Zenebe, 2009) Make use of fuzzy modeling techniques in the item feature description, the user feedback and the recommendation algorithm of a content-based recommender system platform.
(Felfernig, 2008) Explore the use of constraint-based recommendation, where the recommendation is viewed as a process of constraint satisfaction. The final recommendation comes from the gradual satisfaction of a given set of requirements.
2.2.2 Collaborative filtering recommender systems
The collaborative filtering technique matches people with similar interests and then makes recommendations on this basis. Recommendations are commonly derived from the statistical analysis of patterns and analogies in data obtained explicitly, from the evaluations of items given by different users, or implicitly, by monitoring the behavior of the different users in the system (Montaner, 2003).
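As a rough illustration of matching users with similar interests, the sketch below implements one common user-based variant. It is an assumption chosen for the example rather than a description of the systems cited in this chapter: similarity is the Pearson correlation over co-rated items, and a prediction is the active user's mean rating plus a similarity-weighted average of the neighbors' deviations from their own means. The `RATINGS` data are hypothetical.

```python
import math


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def pearson(ratings_a, ratings_b):
    """Pearson correlation computed over the items both users rated."""
    common = set(ratings_a) & set(ratings_b)
    if len(common) < 2:
        return 0.0
    mean_a = mean(ratings_a[i] for i in common)
    mean_b = mean(ratings_b[i] for i in common)
    num = sum((ratings_a[i] - mean_a) * (ratings_b[i] - mean_b) for i in common)
    den = math.sqrt(sum((ratings_a[i] - mean_a) ** 2 for i in common)) * \
        math.sqrt(sum((ratings_b[i] - mean_b) ** 2 for i in common))
    return num / den if den else 0.0


def predict(ratings, user, item):
    """Predict a rating as the user's mean plus the similarity-weighted
    deviations of the other users who rated the item."""
    user_mean = mean(ratings[user].values())
    num = den = 0.0
    for other, other_ratings in ratings.items():
        if other == user or item not in other_ratings:
            continue
        sim = pearson(ratings[user], other_ratings)
        num += sim * (other_ratings[item] - mean(other_ratings.values()))
        den += abs(sim)
    # Fall back to the user's own mean when no usable neighbor exists.
    return user_mean + num / den if den else user_mean


# Hypothetical explicit ratings on a 1-5 scale.
RATINGS = {
    "alice": {"m1": 5, "m2": 3, "m3": 4},
    "bob": {"m1": 4, "m2": 2, "m3": 3, "m4": 5},
    "carol": {"m1": 1, "m2": 5, "m3": 2, "m4": 1},
}

print(predict(RATINGS, "alice", "m4"))
```

In this toy data bob rates like alice and liked m4, while carol rates almost oppositely and disliked it, so both neighbors push the prediction for alice above her own mean. The raw prediction is not clipped to the rating scale; a practical system would usually do so.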
Examples of recent approaches include:
(Acilar, 2009) Propose a collaborative filtering model, constructed based on the Artificial Immune Network Algorithm (aiNet). Through the use of artificial immune network techniques, the system tries to address the data sparsity and scalability problems by describing the data structure, including their spatial distribution and cluster interrelations.
(Campos, 2008) Make use of fuzzy logic to deal with the ambiguity and vagueness of the ratings, while at the same time using the Bayesian network formalism to model the way the user's ratings are related.
(Shang, 2009) Use a multi-channel representation in which each object is mapped to several channels linked to certain ratings. The users are then connected to the channels according to the ratings they have given. The similarity measure of user pairs is obtained by applying a diffusion process to the user-channel bipartite graph.
(Chen, 2009) Propose the use of orthogonal nonnegative matrix tri-factorization in order to alleviate the sparsity problem and to solve the scalability problem by simultaneously clustering the rows and columns of the user-item matrix.
(Lee, 2009) Combine the two types of collaborative filtering techniques, the user-based and the item-based. The resulting predictions are then combined by weighted averaging.
(Jeong, 2009) Introduce the user credit as a new way to measure the similarity between users in a memory-based collaborative filtering environment. The user credit is the degree of a user's rating reliability, which measures how closely the user rates items as others do.
(Yang, 2009) Propose a collaborative filtering approach based on heuristic formulated inferences. The main idea behind this approach is that any two users may have some common interest genres as well as different ones. Based on this the similarity is calculated, by considering users' preferences and rating patterns.
(Bonnin, 2009) Use Markov models inspired by those used in language modeling and integrate skipping techniques to handle noise during navigation. Weighting schemes are used to reduce the importance of distant resources.
(Zhang, 2008) Suggest a Topical PageRank based algorithm, which considers item genre to rank items. The work attempts to correlate ranking algorithms for web search with recommender systems; specifically, it leverages Topical PageRank to rank items and then recommends the top-ranked items to users.
(Rendle, 2008) Propose the use of kernel matrix factorization, a generalized form of regularized matrix factorization. A generic method for learning regularized kernel matrix factorization models is suggested, from which an online update algorithm is derived that allows solving the new-user/new-item problem.
(Umyarov, 2009) Combine external aggregate information with individual ratings in a novel way.
(Takacs, 2009) Focus on the use of different matrix factorization techniques applied to the recommendation problem. The work proposes the use of an incremental gradient descent method for weight updates, the exploitation of the chronological order of ratings and the use of a semipositive version of the matrix factorization algorithm.
(Yildirim, 2008) Propose an item-based algorithm, which first infers transition probabilities between items based on their similarities and then computes the predictions by modeling finite-length random walks on the item space.
(Weimer, 2008) Suggest, as an extension to the maximum margin matrix factorization technique, the use of arbitrary loss functions, together with an algorithm for the optimization of the ordinal ranking loss.
(Koren, 2009) Introduce the tracking of temporal changes in the customer's preferences in order to improve the quality of the recommendations provided.
(Hijikata, 2009) Propose a discovery-oriented collaborative filtering algorithm that uses not only the profile of preference traditionally used in collaborative filtering approaches but also the so-called profile of acquaintance, which maps the user's knowledge, or lack of it, about items.
(Schclar, 2009) Propose the use of an ensemble regression method in which during iterations, interpolation weights for all nearest neighbors are simultaneously derived by minimizing the root mean squared error.
(Koren, 2010) Introduce a new neighborhood model based on the optimization of a global cost function. A second, factorized version of the neighborhood model is also suggested, aiming to improve the scalability of the algorithm.
(Kwon, 2008) Aim to find new recommendation approaches that can take into account the rating variance of an item in the procedure of selecting recommendations.
(Amatriain, 2009) Try to improve the system's accuracy by reducing the natural noise in the input data via a preprocessing step based on re-rating the items and calibrating the recommendations accordingly.
(Park, 2008) Propose clustering the items with low popularity together, using the EM algorithm in combination with classification rules, in order to improve the quality of the recommendations for items with few ratings.
(Ma, 2009) Propose a semi-nonnegative matrix factorization method with global statistical consistency, and at the same time suggest a method of imposing consistency between the statistics given by the predicted values and the statistics given by the data.
(Massa, 2009) Propose to replace the step of finding similar users on which the recommendation will be based, with the use of a trust metric, an algorithm able to propagate trust over a network of users in order to find peers that can be trusted by the active user.
(Lakiotaki, 2008) Propose a system that exploits multi-criteria ratings to improve the modeling of the user's preference behavior and enhance the accuracy of the recommendations.
2.2.3 Hybrid recommender systems
Hybrid recommender systems combine two or more recommendation techniques to achieve better performance and overcome the problems faced by their one-sided counterparts. The ways in which recommendation systems can be combined differ greatly. A good overview is given in (Burke, 2002).
Examples of recent approaches include:
(Albdavi, 2009) Suggest a recommendation technique for the online retail store context, called hybrid recommendation technique based on product category attributes, which extracts user preferences in each product category separately in order to provide more personalized recommendations.
(Porcel, 2009) Propose a fuzzy linguistic recommender system designed using a hybrid approach and assuming a multi-granular fuzzy linguistic modeling.
(Al-Shamri, 2008) Propose a hybrid, fuzzy-genetic approach to recommender systems. In order to improve scalability, the user model is employed to find a set of like-minded users. In the resulting reduced set, a memory-based search is then carried out to produce the recommendations.
(Givon, 2009) Propose a method that uses social tags, alone or in combination with collaborative filtering-based methods, to improve recommendations and to solve the cold-start problem in recommending books when few or no ratings are available. In their approach, tags are automatically generated from the text of a new book and are used to predict its similarity to other books.
(Nam, 2008) Focusing on solving the user-side cold-start problem, develop a hybrid model based on the analysis of two probabilistic aspect models, combining pure collaborative filtering with users' information.
(Gunawardana, 2009) Make use of unified Boltzmann machines, as probabilistic models that combine collaborative and content information in a coherent manner.
Table 1 contains a synopsis of the different approaches discussed in this chapter.
| Researcher / Year | Type | Main Techniques Used | Problem Focused | Data Sets Used |
|---|---|---|---|---|
| Zenebe 2009 | Content Based | Fuzzy sets | Accuracy | MovieLens |
| Felfernig 2008 | Content Based | Constraint driven recommendation | Use in domains with complex, rarely rated items | |
| Acilar 2009 | Collaborative Filtering | Artificial Immune Networks algorithm | Data sparsity, scalability | MovieLens |
| Campos 2008 | Collaborative Filtering | Bayesian networks, fuzzy logic | Process the uncertainty involved in the recommendation | MovieLens |
| Shang 2009 | Collaborative Filtering | Multi-channel representation, diffusion process on the user-channel bipartite graph | Accuracy | Netflix, MovieLens |
| Chen 2009 | Collaborative Filtering | Orthogonal nonnegative matrix tri-factorization | Data sparsity, scalability | MovieLens |
| Park 2008 | Collaborative Filtering | Combination of EM clustering and classification rules | Cold start, accuracy | MovieLens |
| Lee 2009 | Collaborative Filtering | Combination of user-based and item-based CF | Data sparsity, accuracy | EachMovie, MovieLens |
| Jeong 2009 | Collaborative Filtering | Use of “user credit” as degree of rating reliability | Cold start, accuracy | MovieLens |
| Yang 2009 | Collaborative Filtering | Heuristic formulated inferences | Accuracy | EachMovie, MovieLens |
| Bonnin 2009 | Collaborative Filtering | Markov model, skipping techniques to handle noise | Accuracy | Bank Intranet web logs |
| Zhang 2008 | Collaborative Filtering | Topical PageRank algorithm | Accuracy | MovieLens |
| Rendle 2008 | Collaborative Filtering | Regularized kernel matrix factorization | Cold start, scalability | Netflix, MovieLens |
| Umyarov 2009 | Collaborative Filtering | Combination of external aggregate information with user ratings | Accuracy | Netflix, MovieLens |
| Takacs 2009 | Collaborative Filtering | Matrix factorization | Scalability | Netflix, Jester, MovieLens |
| Yildirim 2008 | Collaborative Filtering | Random walk item-based algorithm | Data sparsity, scalability | MovieLens |
| Weimer 2008 | Collaborative Filtering | Maximum margin matrix factorization | Data privacy, cross-domain predictions | WikiLens |
| Koren 2009 | Collaborative Filtering | Tracking of temporal changes in the customer's preferences | Modeling drifting user preferences | Netflix |
| Hijikata 2009 | Collaborative Filtering | Discovery-oriented CF algorithm | Recommendation diversity | Music ratings dataset built for the experiment |
| Schclar 2009 | Collaborative Filtering | Ensemble regression method | Accuracy | MovieLens |
| Koren 2010 | Collaborative Filtering | Optimization of global cost function | Accuracy, scalability | Netflix |
| Kwon 2008 | Collaborative Filtering | Rating diversity consideration | Accuracy, diversity | MovieLens |
| Amatriain 2009 | Collaborative Filtering | Natural data noise reduction | Accuracy | Customized movie rating dataset |
| Massa 2009 | Collaborative Filtering | Use of trust in the neighbor finding | Cold start, data sparsity | Dataset from Epinions.com |
| Lakiotaki 2008 | Collaborative Filtering | Multi-criteria ratings | Accuracy | Dataset from Yahoo! movies |
| Ma 2009 | Collaborative Filtering | Semi-nonnegative matrix factorization | Accuracy, scalability | EachMovie |
| Albdavi 2009 | Hybrid | Hybrid recommendation based on product category attributes | More personalized recommendations | Web logs from online retail store |
| Porcel 2009 | Hybrid | Fuzzy linguistic modeling | Accuracy | Digital Library dataset |
| Al-Shamri 2008 | Hybrid | Fuzzy-genetic | Data sparsity, scalability | MovieLens |
| Gunawardana 2009 | Hybrid | Boltzmann machines | Cold start | MovieLens, TaFeng supermarket dataset |
| Nam 2008 | Hybrid | Combination of pure CF with users information | Cold start | MovieLens |
| Givon 2009 | Hybrid | Automatic tag generation from text | Cold start | Corpus of full text books |
Table 1 Different approaches proposed in the Recommender System literature after 2008
2.3 Challenges
Recommender systems suffer from some common problems. The most common ones, and those that have drawn most of the researchers' attention, are the cold-start and data sparsity problems, which can potentially lead to poor recommendations. In addition, due to the nature of their implementation, recommender systems often face scalability problems. Beyond these, there are a number of smaller problems that can also negatively affect the performance of the system and that have motivated the introduction of some of the more innovative techniques in the recommender systems landscape. Such problems are interest drift, noisy data and lack of diversity.
2.3.1 Cold Start
The cold-start problem occurs when items must be proposed to a new user without previous usage patterns to support these recommendations (new user problem), or when items are newly introduced to the dataset and thus lack ratings from any user (new item problem) (Rashid, 2002). Both faces of the cold-start problem are commonly met in the recommender system field, and both result in poor recommendation quality. Modern commercial systems are constantly expanding: new items are added on a constant basis, while new users join just as often. Collaborative filtering techniques are especially sensitive to the cold-start problem, and for this reason an often-suggested solution is to hybridize the system with a content-based method that can be used in the recommendation procedure for new items or users.
2.3.2 Data Sparsity
In large e-commerce systems there are millions of participating users and just as many items. Usually even the most active users have purchased or rated only a very small fraction of the whole collection. This leads to a sparse user-item matrix, which impairs the ability of the recommender system to form successful neighborhoods. Making recommendations from poorly formed neighborhoods results in poor overall recommendation quality (Acilar, 2009). This problem is known as the data sparsity problem and is one of the most challenging in the recommender system field.
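To make the notion of a sparse user-item matrix concrete, the short sketch below computes the fraction of empty cells in a small, hypothetical rating set and the overlap of co-rated items between two users. The data and the helper names `sparsity` and `co_rated` are assumptions made for illustration only.

```python
def sparsity(ratings, n_items):
    """Fraction of user-item cells that hold no rating."""
    n_users = len(ratings)
    observed = sum(len(user_ratings) for user_ratings in ratings.values())
    return 1.0 - observed / (n_users * n_items)


def co_rated(ratings, user_a, user_b):
    """Items rated by both users; neighborhood-based similarity
    can only be computed on this overlap."""
    return set(ratings[user_a]) & set(ratings[user_b])


# Hypothetical data: 4 users, a catalogue of 5 items, 6 ratings in total.
RATINGS = {
    "u1": {"i1": 5, "i2": 3},
    "u2": {"i3": 4},
    "u3": {"i1": 2, "i4": 1, "i5": 4},
    "u4": {},
}

print(sparsity(RATINGS, n_items=5))
print(co_rated(RATINGS, "u1", "u3"))
print(co_rated(RATINGS, "u1", "u2"))
```

Even in this tiny example most user pairs share at most one rated item, so a similarity measure has very little evidence to work with. The effect becomes far more pronounced at the scale described above.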
2.3.3 Scalability
Modern recommender systems are applied to very large datasets of both users and items. They therefore have to handle very high-dimensional profiles to form the neighborhood, while the computational cost of the algorithms used grows with both the number of items and the number of users (Acilar, 2009). Recommendation techniques that work well on small datasets under lab testing conditions may fail in practice because they cannot be applied effectively in real usage scenarios.
2.3.4 Other Potential Problems
Apart from the three commonly faced challenges mentioned above, researchers have tried to address a number of different problems. In this section we briefly present some of them.
· The interest drift problem. By the term interest drift in the recommender systems context, we refer to the phenomenon whereby the taste and interests of users may change over time or under changing circumstances, leading to inaccurate recommendation results (Ma, 2007). A once valid recommendation may no longer be accurate after the user has changed his preference patterns. To counter this, the recommendation model should not be static but must evolve and adapt itself to the changing preference environment in which it is called to work.
· The noisy data problem. In systems where the input data are explicit (e.g. ratings) rather than implicit (e.g. web logs), extra noise is added by the vagueness of the ratings themselves, as a product of human perception (Campos, 2008). The given ratings are only an approximation of the user's approval of the artifact he is rating and are restricted by the accuracy of the rating scale. For example, on a five-star rating scale a user may give a movie three stars, but if he had the opportunity to rate the same movie on a percentage scale he might give something other than 60%. The results may differ even more if the scale on which he was asked to rate the movie was something like “I hated it - it was average - I loved it”.
Moreover, fuzziness in the ratings is also introduced by the user himself, whose ratings may differ at another time, place or emotional state. It has been reported (Amatriain, 2009) that if users are asked to rate again movies that they have seen and rated in the past, their new ratings differ considerably from their original ratings. Cases in which users did not even remember seeing a movie they had rated in the past were also not uncommon.
Finally, there are deviations in the ratings that characterize the overall voting trends of either the users or the items. For example, a user may be strict and tend to give lower ratings than the average reviewer; or, from the item's point of view, there may be deviations that positively or negatively affect the ratings the item receives. For instance, a movie that is considered a “classic” may tend to receive higher ratings than it would receive if its reputation did not affect the audience. These observed trends may or may not be static over time (Koren, 2009). For example, a viewer may become stricter as he grows older, so that his ratings become more biased towards the lower end of the scale compared to his past ratings.
All these factors introduce noise in the data and can have a negative effect on the accuracy of the recommendations.
· The lack of diversity problem. Most of the researchers' efforts are focused on making the recommendations produced by a recommender system more accurate. Lately, though, arguments have been raised that accurate recommendations are not always what the user expects from a recommender system (Zhang, 2008). To start with, the logic of such a system is to help the user select items about which he has not yet formed an opinion of his own. If the system keeps suggesting items that are too similar to the ones he is already familiar with, then the system defeats, to a degree, its own purpose. We can assume that the user can guess the rating of an item that is very close to one he already knows about without the need for an elaborate recommender system. What the user is looking for is for the system to help him estimate the rating of an item that he could not rate solely on the basis of his own experience. This problem is also referred to as the overspecialization problem, describing the situation where items too close to the items already rated by the user are returned as recommendations (Abbasi, 2009).
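The user and item rating deviations described under the noisy data problem above (a strict reviewer, a movie with a "classic" reputation) are often handled by estimating them explicitly. The sketch below is a simplified, unregularized baseline chosen as an assumption for illustration; it is not the exact model of (Koren, 2009), and it ignores changes over time. It estimates a global mean, a per-user offset and a per-item offset, and predicts a rating as their sum. The ratings used are hypothetical.

```python
from collections import defaultdict


def fit_baseline(ratings):
    """Estimate global mean, user offsets and item offsets from
    a list of (user, item, rating) triples."""
    mu = sum(r for _, _, r in ratings) / len(ratings)

    # User offset: average deviation of the user's ratings from the global mean.
    by_user = defaultdict(list)
    for user, _, r in ratings:
        by_user[user].append(r - mu)
    b_user = {u: sum(d) / len(d) for u, d in by_user.items()}

    # Item offset: average residual once the user offset is removed.
    by_item = defaultdict(list)
    for user, item, r in ratings:
        by_item[item].append(r - mu - b_user[user])
    b_item = {i: sum(d) / len(d) for i, d in by_item.items()}

    return mu, b_user, b_item


def predict_baseline(model, user, item):
    """Baseline estimate; unknown users or items contribute no offset."""
    mu, b_user, b_item = model
    return mu + b_user.get(user, 0.0) + b_item.get(item, 0.0)


# Hypothetical ratings: one strict and one lenient reviewer.
RATINGS = [
    ("strict", "m1", 2), ("strict", "m2", 3),
    ("lenient", "m1", 4), ("lenient", "m2", 5),
]

model = fit_baseline(RATINGS)
print(model)
print(predict_baseline(model, "strict", "m1"))
```

Separating these offsets from the ratings lets a recommender compare users on what they prefer rather than on how generously they rate.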
2.4 Evaluation Metrics for Recommender Systems
Evaluating a recommender system can be a complex procedure, and many different metrics have been proposed to evaluate the success of recommender systems. The most commonly used are presented in the following sections.
2.4.1 Accuracy metrics
Accuracy is the most widely used metric for recommender systems (Burke, 2002). It measures how close the predicted by the system values are to the true values. It can be expressed as in equation (1).
We can more formally formulate the equation (1) as in (2)
Where P(u, i) is the predictions of a recommender system for every particular user u and item i, and p(u, i) is the real preferences, while R is the number of recommendations shown to the user. In the accuracy metric the P(u, i) and p(u, i) are considered binary functions and r(u, i) is 1 if the recommender presented the item to the user and 0 otherwise.
One common accuracy metric is the mean absolute error (MAE), defined by equation (3), which measures the average absolute deviation between each predicted rating P(u, i) and the user's real rating p(u, i). N is the total number of items observed (Breese, 1998).
$$\mathrm{MAE} = \frac{1}{N} \sum_{u,i} \lvert P(u,i) - p(u,i) \rvert \qquad (3)$$
Variations of MAE include the mean squared error, the root mean squared error and the normalized mean absolute error (Goldberg, 2001).
Of these, the most widely used, especially after it was chosen as the metric for judging the entries in the Netflix Prize contest, is the root mean squared error (RMSE), which is defined as in equation (4).
$$\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{u,i} \bigl(P(u,i) - p(u,i)\bigr)^2} \qquad (4)$$
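As a minimal sketch of the two error measures described above, the following Python functions compute MAE and RMSE for a list of predicted and real ratings. The function names and the sample ratings are hypothetical and serve only as an illustration.

```python
import math


def mae(predicted, actual):
    """Mean absolute error between two equal-length rating sequences."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


def rmse(predicted, actual):
    """Root mean squared error between two equal-length rating sequences."""
    squared = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    return math.sqrt(squared / len(actual))


# Hypothetical predictions and real ratings on a 1-5 scale.
predicted = [3.5, 4.0, 2.0, 5.0]
actual = [4, 4, 1, 3]
print(mae(predicted, actual))
print(rmse(predicted, actual))
```

Because RMSE squares each deviation before averaging, the single large error in the last pair weighs more heavily in RMSE than in MAE.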
2.4.2 Information Retrieval metrics
Since the logic and techniques of recommender systems are close to the Information Retrieval (IR) discipline, it comes as no surprise that some of the IR metrics are also used in the recommender systems field. Two of the most widely used metrics are precision and recall (Cleverdon, 1968).
The calculation of precision and recall is based on a table, such as the one below, that holds the different possible outcomes of any retrieval decision (Hernandez del Olmo, 2008).
 | Relevant | Non Relevant
Retrieved | a | b
Non Retrieved | c | d
Table 2 Confusion matrix of retrieval decision outcomes
In recommender system terminology, a relevant item is a useful one (close to the user's taste), while a non-relevant item is one that does not satisfy the user.
Precision (eq. 5) is defined as the ratio of relevant items selected to the number of items selected. Using the entries of Table 2:
$$\mathrm{Precision} = \frac{a}{a + b} \qquad (5)$$
Precision represents the probability that a selected item is relevant. It determines the capability of the system to present only useful items, excluding the non-relevant ones.
Recall (eq. 6) is defined as the ratio of relevant items selected to the total number of relevant items available. Recall represents the probability that a relevant item will be selected and is an indication of the coverage of useful items that the system can obtain.
$$\mathrm{Recall} = \frac{a}{a + c} \qquad (6)$$
Based on the precision and recall line of thought are the F-measure metrics (eq. 7), which attempt to combine the behavior of both metrics in a single equation.
$$F_\beta = \frac{(1 + \beta^2) \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\beta^2 \cdot \mathrm{Precision} + \mathrm{Recall}} \qquad (7)$$
The most commonly used F-measure metric is F1, where $\beta = 1$ (Hernandez del Olmo, 2008), and is defined as in eq. 8.
$$F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \qquad (8)$$
Another metric originating from the information retrieval field and often used in recommender system evaluation is the Receiver Operating Characteristic (ROC) analysis (Hanley, 1982). The ROC curve plots the recall against the fallout (eq. 9).
$$\mathrm{Fallout} = \frac{b}{b + d} \qquad (9)$$
The objective of the ROC analysis is to maximize the recall while at the same time minimizing the fallout.
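The metrics of this section can be illustrated with a short Python sketch that takes the four counts a, b, c and d of Table 2. The function name, the dictionary keys and the example counts are hypothetical, and the sketch assumes that none of the denominators is zero.

```python
def retrieval_metrics(a, b, c, d, beta=1.0):
    """Compute IR metrics from the confusion-matrix counts of Table 2.

    a: relevant and retrieved        b: non-relevant and retrieved
    c: relevant and not retrieved    d: non-relevant and not retrieved
    Assumes a + b, a + c and b + d are all non-zero.
    """
    precision = a / (a + b)
    recall = a / (a + c)
    fallout = b / (b + d)
    f_beta = ((1 + beta ** 2) * precision * recall
              / (beta ** 2 * precision + recall))
    return {"precision": precision, "recall": recall,
            "fallout": fallout, "f_beta": f_beta}


# Hypothetical counts: 40 items retrieved, 30 of them relevant.
print(retrieval_metrics(a=30, b=10, c=20, d=40))
```

With the default beta of 1 the last value is the F1 measure, the harmonic mean of precision and recall.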
2.4.3 Rank accuracy metrics
The output of the recommendation is often a list of suggestions presented to the user from the most relevant to the least relevant. To measure how successful the system is at this, a category of metrics called rank accuracy metrics was introduced. Rank accuracy metrics measure how accurately the recommender system can predict the ranking of a list of items presented to the user.
Two of the most commonly used rank accuracy metrics are the half-life utility metric and the Normalized Distance-based Performance Measure (NDPM) (Herlocker, 2004).
The half-life utility metric is used to evaluate the utility of a ranked list of recommendations, where the utility $R_a$ (eq. 10) is based on the difference between the user rating for an item and the rating baseline for this item. A half-life parameter describes the strength of the decay in an exponential decay function that models the likelihood of a user viewing each successive item in the list. In equation 10, $r_{a,j}$ represents the rating of user a on item j of the ranked list, d is the baseline rating, and α is the half-life. The half-life is the rank of the item on the list such that there is a 50% chance that the user will view this item.
$$R_a = \sum_j \frac{\max(r_{a,j} - d,\, 0)}{2^{(j-1)/(\alpha-1)}} \qquad (10)$$
The Normalized Distance-based Performance Measure (eq. 11) can be used to compare two different weakly ordered rankings (Balabanovic, 1997).
$$\mathrm{NDPM} = \frac{2C^{-} + C^{u}}{2C^{i}} \qquad (11)$$
In the above equation $C^{-}$ is the number of contradicting preference relations between the user and the system recommendation, where the system believes that an item “a” will be preferred over an item “b”, while the true preference of the user is the opposite. $C^{u}$ is the number of compatible preference relations, where the user rates item “a” higher than item “b” but the system ranks the two items equally, and $C^{i}$ is the total number of pairs of items rated by the user for which one is rated higher than the other.
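The two rank accuracy metrics can be sketched in Python as follows. This is an illustrative reading of the definitions above, not code from the cited works; the function names and the sample ratings are hypothetical. The half-life α is assumed to be greater than 1, and at least one pair of items is assumed to be strictly ordered by the user.

```python
from itertools import combinations


def half_life_utility(ranked_ratings, d, alpha):
    """Utility of a ranked list; ranked_ratings[0] is the top-ranked item.

    d is the baseline rating and alpha (> 1) is the half-life rank.
    """
    return sum(max(r - d, 0) / 2 ** (j / (alpha - 1))
               for j, r in enumerate(ranked_ratings))  # j is the rank minus 1


def ndpm(user_ratings, system_scores):
    """NDPM for the same items rated by the user and scored by the system."""
    contradictory = compatible = strict_pairs = 0
    for a, b in combinations(range(len(user_ratings)), 2):
        user_diff = user_ratings[a] - user_ratings[b]
        if user_diff == 0:
            continue  # the user expresses no strict preference for this pair
        strict_pairs += 1
        system_diff = system_scores[a] - system_scores[b]
        if system_diff == 0:
            compatible += 1      # system ties items the user separates
        elif (user_diff > 0) != (system_diff > 0):
            contradictory += 1   # system orders the pair the other way round
    return (2 * contradictory + compatible) / (2 * strict_pairs)


print(half_life_utility([5, 3, 4], d=3, alpha=2))
print(ndpm([5, 3, 1], [3, 3, 1]))
```

An NDPM of 0 means the system ordering agrees with every strict user preference, while 1 means it contradicts all of them.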
2.4.4 Suggesting the non-obvious
While the accuracy metrics provide a good indication of the recommender system's performance, a distinction must be made between accurate and useful results (Herlocker, 2004). For example, a recommendation algorithm may be adequately accurate by suggesting to the user popular items with high average ratings. But often this is not enough. To some extent such predictions are self-evident and offer no useful information to the user, as they concern the items that the user would least need help to discover by himself.
Coverage can be defined as a measure of the domain of items over which the system can make recommendations (Herlocker, 2004). In its simplest form, coverage is expressed as the percentage of items for which the system can form a prediction over the total number of items.
Along the same line of thought, other metrics such as novelty (Konstan, 2006) and serendipity (Murakami, 2008) have been proposed for measuring how effectively the system recommends interesting items that the user might not otherwise come across.
3 Experimental Design and Implementation
This chapter describes the proposed approach to the recommendation problem. Starting with a base model, we identify its weaknesses and try to improve its performance by incorporating our approach. Both the base and the proposed systems are tested in order to compare their relative performance. The chapter presents the details of the experimental procedure followed, along with the experimental results for the base system, which set the benchmark for comparison.
3.1 Recommendation as classification
Recommending items to a user can be seen as a classification task: given a set of known relationships, train a model that is able to predict the class of unseen instances. If we consider each user-item pair as an instance and the rating as the class, recommendation can be treated as classification, where the system is called to assign the unknown degree of preference of a user towards a given item to one of the possible class values, namely the points of the rating scale. Assuming a two-point scale, we have a binary classification problem and the outcome is that a user will either like or dislike the item, depending on the classification result. On a scale with more points, the result could be that the instance of the user-item pair UI is assigned to class 4, or in other words that user U is predicted to give item I a rating of 4. If we treat the rating as a continuous value instead of a nominal one, the task becomes a regression problem.
It is from this perspective that the recommendation problem is viewed in the work conducted as part of this dissertation. Prediction models are built on the known instances and then used to predict the values of the unknown ratings.
3.2 Data set
The MovieLens dataset was used for the experiments conducted. The dataset was created by the GroupLens Research Project group at the University of Minnesota through the MovieLens web site (movielens.umn.edu) and contains 100,000 ratings on a numeric five-point scale (1-5) for 1682 movies provided by 943 users, with each user having rated at least 20 movies. Simple demographic data consisting of age, gender, occupation and zip code are provided for the users, while the information about the movies consists of title, release date, video release date, IMDb URL and genre.
3.2.1 Data preparation
Training any model only on the original dataset would prove highly ineffective, since the information provided would not be sufficient to produce reliable rules. For this reason the data had to be enhanced in a way that would allow the model to obtain as much information as possible during the training step.
Following the methodology of Park (2008), a number of derived variables were produced from the original data and used as independent variables in the models. More specifically the derived variables used are:
1. c_aver_rating: The average rating of the user for the items he has rated in the past.
2. c_quantity: The number of items the user has rated.
3. c_seen_popularity: The average popularity of the items that the user has rated in the past.
4. c_seen_rating: The overall average rating of the items the user has rated before. The overall average rating is the average of all the ratings given to the item by all the users.
5. c_like_popularity: The average popularity of the items that the user rated higher than his average rating.
6. c_like_rating: The overall average rating of the items that the user rated higher than his average rating.
7. c_dislike_popularity: The average popularity of the items that the user rated lower than his average rating.
8. c_dislike_rating: The overall average rating of the items that the user rated lower than his average rating.
9. I_aver_rating: The average rating of the item.
10. I_popularity: The popularity of the item.
Variables 1-8 are the user-related variables, while variables 9 and 10 are the item-related variables.
It should be noted here that Park (2008) proposes the use of an extra item-related variable, namely I_likability, defined as the difference between the rating of the user and the item's average rating. The problem with this is that, since we treat the rating as the class, there would be no way to know the value of I_likability for a new instance beforehand. Although the value could be calculated for the experimental dataset, this would not be feasible in a real-case scenario. For this reason the I_likability variable was not included in the variable set.
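The derived variables listed above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the SQL procedure actually used: the popularity of an item is assumed to be the number of ratings it has received, the function names are hypothetical, and the toy triplets are invented rather than taken from MovieLens.

```python
from collections import defaultdict


def mean(values):
    """Arithmetic mean, or None for an empty collection."""
    values = list(values)
    return sum(values) / len(values) if values else None


def derive_features(ratings):
    """Build the derived variables from (user_id, item_id, rating) triplets."""
    by_user = defaultdict(dict)
    by_item = defaultdict(list)
    for user, item, rating in ratings:
        by_user[user][item] = rating
        by_item[item].append(rating)

    # Item-related variables. Popularity is ASSUMED to be the rating count.
    item_feats = {
        item: {"I_aver_rating": mean(rs), "I_popularity": len(rs)}
        for item, rs in by_item.items()
    }

    def avg_of(items, key):
        return mean(item_feats[i][key] for i in items)

    # User-related variables, computed from the items each user has rated.
    user_feats = {}
    for user, rated in by_user.items():
        user_avg = mean(rated.values())
        liked = [i for i, r in rated.items() if r > user_avg]
        disliked = [i for i, r in rated.items() if r < user_avg]
        user_feats[user] = {
            "c_aver_rating": user_avg,
            "c_quantity": len(rated),
            "c_seen_popularity": avg_of(rated, "I_popularity"),
            "c_seen_rating": avg_of(rated, "I_aver_rating"),
            "c_like_popularity": avg_of(liked, "I_popularity"),
            "c_like_rating": avg_of(liked, "I_aver_rating"),
            "c_dislike_popularity": avg_of(disliked, "I_popularity"),
            "c_dislike_rating": avg_of(disliked, "I_aver_rating"),
        }
    return user_feats, item_feats


# Hypothetical toy data, not taken from MovieLens.
toy = [("u1", "A", 5), ("u1", "B", 1),
       ("u2", "A", 3), ("u2", "B", 3), ("u2", "C", 4)]
user_feats, item_feats = derive_features(toy)
print(user_feats["u1"])
print(item_feats["A"])
```

A training instance for a user-item pair would then combine the eight user values with the two item values, with the rating as the class.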
The original data were loaded into an SQL database from the MovieLens dataset and then the derived values were calculated and stored. The final form of the table, used in the data mining models can be seen in. A snapshot of the actual data in the enhanced data table can be found in Appendix . Even though the variables UserID, MovieID, DateStamp and InstanceID are part of the dataset, they were excluded from the models during the training step, since they provide no useful information and would add noise to the data.
3.3 Building the predictive models
The Weka machine learning toolkit was used to build the predictive models of the experiment. Weka provides a large collection of classification, association and clustering algorithms and can be run from the GUI, from the command line, or called from within a program as an external library. The latter approach was followed, since it provides greater flexibility in the process. Java was chosen as the implementation language, since Weka itself is written in Java and the routines provided could be used from within the developed program without any intermediate steps. Although Weka can be used from a number of other environments (including .NET), doing so would have added unnecessary complexity to the project. Eclipse was used as the development environment.
One of the early and important decisions that had to be taken during the implementation was whether to treat the rating, which was the class for the models, as a nominal or a numeric value. This choice would dictate the range of models that could be used, as some can work only with nominal classes while others work only with numeric ones. While the rating is in fact nominal (a user can rate an item with 3 or 4 but not with 3.5), I chose to treat it as numeric, in the expectation that this would give the system a finer granularity and produce more accurate and useful results. If the predicted value needed to be sent back to the user as an actual recommendation, it would be easy to round it to one of the allowed ratings within the rating scale.
Again following the initial methodology of Park (2008), for each of the items in the item list a separate predictive model was built. If we need to predict the rating of item I having 200 ratings for a user U, the model is built using those 200 known instances and is used to predict the unknown rating for U. In this way for a dataset containing n different items, n different models will be built.
Five different types of predictive models were implemented and tested:
· Simple Linear Regression (SLR), which “learns a linear regression model based on the single attribute that yield the smallest squared error” (Witten, 2005: 409).
· Locally Weighted Learning (LWL), which “assigns weight using an instancebased method. After this the classifier is built from the weighted instances” (Witten, 2005: 414).
· RBFNetwork (RBF), which “implements a Gaussian radial basis function network deriving the centers and widths of hidden units using kmeans and combining the outputs obtained from the hidden layers using logistic regression if the class is nominal and linear regression in the case of numeric class” (Witten, 2005: 410).
· Sequential Minimal Optimization algorithm for support vector regression (SMOreg), which “implements the sequential minimal optimization algorithm (SMO) training a support vector classifier with the use of polynomial or Gaussian kernels.” (Witten, 2005: 410). SMOreg is the SMO version for regression problems.
· M5Rules, which “obtains regression rules from model trees built using M5'” (Witten, 2005: 409).
These five models were chosen in an attempt to test the effectiveness of our approach using a diverse set of techniques, namely linear regression, lazy classifiers, radial basis function networks, support vector machines and model trees.
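To make the simplest of these models concrete, the sketch below fits a one-attribute linear regression in plain Python by choosing the attribute with the smallest squared error, as in the SLR description above. It is an illustration only and not Weka's implementation; the function name and the toy data are hypothetical. Under the per-item scheme, `rows` would hold the derived variables of the instances rating one item and `targets` their ratings.

```python
def fit_simple_linear_regression(rows, targets):
    """Return (attribute_index, slope, intercept) of the best 1-attribute fit."""
    n = len(targets)
    y_mean = sum(targets) / n
    best = None
    for attr in range(len(rows[0])):
        xs = [row[attr] for row in rows]
        x_mean = sum(xs) / n
        sxx = sum((x - x_mean) ** 2 for x in xs)
        if sxx == 0:
            continue  # a constant attribute cannot explain the target
        sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, targets))
        slope = sxy / sxx
        intercept = y_mean - slope * x_mean
        sse = sum((y - (intercept + slope * x)) ** 2
                  for x, y in zip(xs, targets))
        if best is None or sse < best[0]:
            best = (sse, attr, slope, intercept)
    if best is None:
        raise ValueError("all attributes are constant")
    _, attr, slope, intercept = best
    return attr, slope, intercept


# Hypothetical instances with two attributes; the target follows the first.
rows = [[1, 5], [2, 3], [3, 8], [4, 1]]
targets = [2, 4, 6, 8]
attr, slope, intercept = fit_simple_linear_regression(rows, targets)
print(attr, slope, intercept)
```

A prediction for a new instance `x` is then `intercept + slope * x[attr]`.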
3.4 Base Model evaluation
In order to evaluate the predictive models built, we use two performance measures, Mean Absolute Error (MAE) and Root Mean Square Error (RMSE). As discussed in section 2.4.1, MAE measures the average absolute deviation between each predicted rating P(u, i) and the user's real rating p(u, i) over the number N of items and is given by the equation
$$\mathrm{MAE} = \frac{1}{N} \sum_{u,i} \lvert P(u,i) - p(u,i) \rvert$$
while RMSE is given by the equation
$$\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{u,i} \bigl(P(u,i) - p(u,i)\bigr)^2}$$
Since the ratings are predicted by treating the problem as regression, these two metrics give a good indication of how well the predicted values produced by the algorithms approximate the actual rating values.
We applied 10-fold cross validation and calculated the MAE and RMSE. The original dataset was randomly partitioned into 10 independent samples, and each time one of the samples was used as test data while the rest were used as training data for the model. The procedure was repeated 10 times, with each of the subsamples acting as test data exactly once. The resulting MAE and RMSE errors are the average errors over the 10 repetitions of the testing. 10 was chosen as the number of folds because, according to Witten (2005), it has been shown to give the best estimate of the error among the k-fold alternatives. Moreover, 10-fold cross validation is a commonly used technique for error estimation in the recommender system literature, which makes the comparison of our results with published work easier.
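The cross validation procedure described above can be sketched in plain Python as follows. The experiments themselves relied on Weka, so this is only an illustration; the function names, the seed and the trivial mean predictor used in the example are hypothetical, and the number of instances is assumed to be at least k.

```python
import random


def k_fold_indices(n, k=10, seed=0):
    """Randomly partition the indices 0..n-1 into k disjoint folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(xs, ys, fit, predict, k=10, seed=0):
    """Return the MAE and RMSE averaged over the k test folds."""
    maes, rmses = [], []
    for test_idx in k_fold_indices(len(ys), k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(ys)) if i not in held_out]
        model = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        errors = [predict(model, xs[i]) - ys[i] for i in test_idx]
        maes.append(sum(abs(e) for e in errors) / len(errors))
        rmses.append((sum(e * e for e in errors) / len(errors)) ** 0.5)
    return sum(maes) / k, sum(rmses) / k


# Hypothetical example: a "model" that always predicts the training mean.
xs = list(range(30))
ys = [1 + (i % 5) for i in range(30)]
print(cross_validate(xs, ys,
                     fit=lambda X, y: sum(y) / len(y),
                     predict=lambda model, x: model))
```

Each instance appears in exactly one test fold, so every rating contributes to the error estimate exactly once.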
The errors calculated for each of the movies using the 5 models are presented in the corresponding figures. The errors are ordered by the popularity of the movie.
The overall average MAE errors over the whole set of items for each model are summarized in Table 3, along with their standard deviations.
Model | Overall average MAE error
LWL | 0.878232 ± 0.2165
M5Rules | 0.823094 ± 0.1959
SMOreg | 0.810135 ± 0.2269
SLR | 0.816061 ± 0.1862
RBF | 0.864512 ± 0.1676
Table 3 Overall average MAE for the different predictive models
The overall average RMSE errors over the whole set of items for each model are summarized in Table 4, along with their standard deviations.
Model | Overall average RMSE error
LWL | 1.092771 ± 0.2460
M5Rules | 1.032141 ± 0.2360
SMOreg | 1.020523 ± 0.2554
SLR | 1.015515 ± 0.2170
RBF | 1.063652 ± 0.1806
Table 4 Overall average RMSE for the different predictive models
The experimental results from this series of tests appear to match the findings that Park (2008) reports, insofar as this can be concluded from the volume and the format of the results presented in that publication.
Observing the graphs and the overall error tables, we see that the different prediction models do perform differently, with the differences remaining roughly constant across the range of item popularity. SMOreg and SLR are the two best predicting models, with LWL being the worst.
But the two figures indicate an underlying trend that governs all the models and is more important than the sheer performance of each algorithm. They show that the errors increase as the popularity of the items decreases, almost doubling from one end to the other. For example, the MAE error for the SMOreg algorithm is 0.666542 for the biggest average popularity of 420 and becomes 1.169974 for the smallest popularity. This problem occurs because the prediction models do not have enough data to produce accurate rules for the less popular items.
In order to illustrate the potential impact of this weakness on the recommender's performance, we present the histogram of the item rating frequencies for the MovieLens dataset.
As can be seen in the figure, the number of items having few ratings is very significant. If the recommendation is done solely on the basis of the approach presented in section 3.3, the system will often perform poorly. It is important to try to improve the performance of the models for exactly this problematic area.
3.5 Enhancing the system's efficiency via latent factor modeling
Dimensionality reduction techniques such as Latent Semantic Indexing (LSI) originate from the information retrieval field and try to solve the problems of polysemy and synonymy (Deerwester, 1990). By using LSI we attempt to capture latent associations between the users, the items and the given ratings that are not apparent in the initial dataset. The resulting reduced-dimensionality space is less sparse than the original data space and can help us determine clusters of users or items by discovering the underlying relationships between the rating instances (Sarwar, 2000). LSI using Singular Value Decomposition (SVD) as the underlying matrix factorization algorithm is the most commonly used technique in the recommender system literature, and it is the one used in the current experimental setup, with the aim of improving the performance of the prediction models.
3.5.1 Singular Value Decomposition (SVD)
SVD is a matrix factorization technique often used for the production of low-rank approximations of matrices. Given an m×n matrix $R$, SVD factors $R$ into three matrices as
$$R = U S V^T$$
where $U$ and $V$ are orthogonal matrices of size m×r and n×r respectively, with r being the rank of matrix $R$. $S$ is a diagonal r×r matrix having all singular values of matrix $R$ as diagonal entries $(s_1, s_2, \ldots, s_r)$, where $s_i > 0$ and $s_1 \geq s_2 \geq \ldots \geq s_r$.
SVD has the very useful ability to provide the best low-rank approximation $R_k$ of matrix $R$ in terms of the Frobenius norm $\lVert R - R_k \rVert_F$. For some value k<r we can produce a matrix $S_k$ by reducing matrix $S$ to have only the k largest diagonal entries. In the same way we can reduce matrices $U$ and $V$ by removing r−k columns from $U$ and r−k rows from $V^T$, obtaining the reduced matrices $U_k$ and $V_k$. $R_k$ can then be reconstructed as $R_k = U_k S_k V_k^T$. Matrix $R_k$ is the closest rank-k approximation of matrix $R$ (Gong, 2009).
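The low-rank approximation described above can be sketched with NumPy as follows. The experiments used Matlab, so this is only an illustrative equivalent; the function name and the random test matrix are hypothetical.

```python
import numpy as np


def rank_k_approximation(R, k):
    """Return the best rank-k approximation of R and its singular values."""
    # full_matrices=False gives the reduced factors; s is sorted descending.
    U, s, Vt = np.linalg.svd(R, full_matrices=False)
    R_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
    return R_k, s


# Hypothetical 6x4 matrix used only to illustrate the approximation error.
rng = np.random.default_rng(0)
R = rng.random((6, 4))
R_k, s = rank_k_approximation(R, k=2)
print(np.linalg.norm(R - R_k))        # Frobenius norm of the residual
print(np.sqrt(np.sum(s[2:] ** 2)))    # norm of the discarded singular values
```

The two printed values coincide up to rounding, because the Frobenius error of the truncated SVD equals the square root of the sum of the squared singular values that were dropped.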
3.5.2 Applying SVD to generate predictions
The initial step in applying SVD to the recommender system is to produce the matrix $R$. Matrix $R$ is the user-item matrix, where users are the rows, items are the columns and the values are the ratings, so the value $R_{i,j}$ represents the rating of user $U_i$ for item $I_j$.
Because of the sparse nature of the data, the resulting matrix $R$ is also very sparse. As a preprocessing step, in order to reduce the sparsity we must fill the empty values of $R$ with some meaningful data. Two available choices are the average rating of the user (row average) or the average rating of the item (column average). Following Sarwar (2000) we proceed with the latter, as his findings report that the item average gave better results. The same author suggests, as the next preprocessing step, normalizing the data of matrix $R$. This is done by subtracting the user's average from each rating. Let the matrix resulting from the two preprocessing steps be $R_{\mathrm{norm}}$. Normalization as described above was used during the implementation.
We can now apply SVD to $R_{\mathrm{norm}}$ to obtain matrices $U$, $S$ and $V$, and reduce them to dimension k, obtaining matrices $U_k$, $S_k$ and $V_k$. These matrices can now be used to produce the prediction value $P_{i,j}$ of the i-th user for the j-th item as:
$$P_{i,j} = \bar{r}_i + \bigl(U_k \sqrt{S_k}\bigr)(i) \cdot \bigl(\sqrt{S_k}\, V_k^T\bigr)(j)$$
where $\bar{r}_i$ is the average rating of user i, $(U_k \sqrt{S_k})(i)$ is the i-th row of $U_k \sqrt{S_k}$ and $(\sqrt{S_k}\, V_k^T)(j)$ is the j-th column of $\sqrt{S_k}\, V_k^T$.
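The preprocessing and prediction steps above can be sketched with NumPy as follows. This is an illustration rather than the Matlab code used in the experiments: the function name and the toy matrix are hypothetical, missing ratings are assumed to be encoded as NaN, and the user average is assumed to be taken over the observed ratings only.

```python
import numpy as np


def svd_predict(R, k):
    """Predict every rating of a user-item matrix R (np.nan = missing)."""
    R = np.asarray(R, dtype=float)
    item_means = np.nanmean(R, axis=0)   # column (item) averages
    user_means = np.nanmean(R, axis=1)   # row averages, observed ratings only
    # Step 1: fill the empty cells with the item average.
    filled = np.where(np.isnan(R), item_means, R)
    # Step 2: normalise by subtracting each user's average.
    R_norm = filled - user_means[:, None]

    U, s, Vt = np.linalg.svd(R_norm, full_matrices=False)
    sqrt_s = np.sqrt(s[:k])
    user_factors = U[:, :k] * sqrt_s            # U_k sqrt(S_k), row per user
    item_factors = sqrt_s[:, None] * Vt[:k, :]  # sqrt(S_k) V_k^T, column per item
    return user_means[:, None] + user_factors @ item_factors


# Hypothetical 4-user, 4-item rating matrix.
nan = np.nan
R = np.array([[5, 3, nan, 1],
              [4, nan, nan, 1],
              [1, 1, nan, 5],
              [1, nan, 4, 4]])
print(np.round(svd_predict(R, k=2), 2))
```

With k equal to the full rank the observed ratings are reproduced exactly; a smaller k trades exactness for a denser, smoother representation.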
3.5.3 Using SVD to form neighborhoods
Although applying SVD to produce score predictions as discussed above is really useful, SVD is used in a rather different context in the current implementation. What we are really interested in is the ability to form quality neighborhoods of either users or items. The reduced-dimensional space produced by the SVD is less sparse than the original space. This advantage can lead to better performance during the neighbor selection (Sarwar, 2000).
The matrix $U_k \sqrt{S_k}$ is the k-dimensional representation of the users, while the matrix $\sqrt{S_k}\, V_k^T$ is the k-dimensional representation of the items. We can calculate the similarity between the observations in one of the two matrices using a measure such as cosine similarity, Pearson's correlation, mean squared distance or Spearman correlation, and produce clusters of users or items respectively. What we are going to evaluate in this implementation is whether the formation of clusters of items in the reduced-dimension space, combined with the base model, will improve its accuracy.
The similarity measure used for building the neighborhood was the cosine similarity. The cosine measure uses the cosine of the angle between two vectors A and B as their similarity. It is defined as:
$$\cos(A, B) = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}$$
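A minimal Python sketch of the cosine measure is given below. The function name and the example vectors are hypothetical; the measure is undefined for a zero vector, which the sketch reports as an error.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Hypothetical 3-dimensional item representations.
print(cosine_similarity([1, 2, 3], [2, 4, 6]))   # same direction
print(cosine_similarity([1, 0, 0], [0, 1, 0]))   # orthogonal
```

Because only the angle matters, two items whose reduced representations differ only in magnitude are treated as identical.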
3.5.4 Training the models
As before, the rating predictions were produced by building the data mining models on the training set, created again using 10-fold cross validation as in the first group of experiments. The big difference this time was that the models were built not for every item but for each cluster of items.
This means that in order to predict the rating of user Ui for the item Ij, it was first determined which cluster Ij belonged to, and then the model was built using the data from all the items belonging to that cluster.
The expectation from this approach is that it provides the predictive model with enough information to produce quality rules. This is especially important for the items with few ratings, for which, as shown in section 3.4, the lack of enough supporting information led the models to produce inaccurate predictions.
4 Results and Evaluation
This chapter describes the implementation decisions tested during the development of our approach and the way they affect the performance of the system. For each implementation decision detailed results from the experiments performed are presented, followed by a discussion about their interpretation. Finally the proposed approach is compared with the base system in order to evaluate the effectiveness of our technique.
4.1 Application of the technique in the implementation
As described in section 3.5.2, the user-item matrix was created in Matlab by porting the (user, item, rating) triplets from the SQL database containing the original dataset. The matrix was then normalized and SVD was applied. The resulting matrices U, S and V were finally reduced to k dimensions.
The choice of the value of k was based on the findings of both Sarwar (2000) and Gong (2009) for the same dataset and was fixed at 14.
The next step of the process was the formation of the neighborhoods. The K-means clustering algorithm using cosine distance was applied on the reduced 1682×14 matrix, allocating each of the items to a corresponding cluster. The resulting cluster memberships were finally ported back to the database in order to be used by the system in building the prediction models.
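The clustering step can be sketched with NumPy as a K-means variant that works on unit-length vectors, so that the dot product equals the cosine similarity. This is an illustration under stated assumptions and not Matlab's implementation: the function name and toy data are hypothetical, the farthest-first initialisation is chosen here only to make the sketch deterministic, and no row of the input is assumed to be a zero vector.

```python
import numpy as np


def cosine_kmeans(X, k, n_iter=50):
    """Cluster the rows of X into k groups by cosine similarity."""
    X = np.asarray(X, dtype=float)
    unit = X / np.linalg.norm(X, axis=1, keepdims=True)  # assumes no zero rows

    # Farthest-first initialisation (an assumption made for determinism).
    centroids = [unit[0]]
    while len(centroids) < k:
        best_sim = np.max(unit @ np.array(centroids).T, axis=1)
        centroids.append(unit[np.argmin(best_sim)])
    centroids = np.array(centroids)

    labels = np.full(len(unit), -1)
    for _ in range(n_iter):
        # Assign each row to the centroid with the highest cosine similarity.
        new_labels = np.argmax(unit @ centroids.T, axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Move each centroid to the normalised mean of its members.
        for c in range(k):
            members = unit[labels == c]
            if len(members) > 0:
                centre = members.mean(axis=0)
                norm = np.linalg.norm(centre)
                if norm > 0:
                    centroids[c] = centre / norm
    return labels


# Hypothetical 2-dimensional item representations forming two directions.
items = [[1.0, 0.1], [1.0, 0.2], [0.9, 0.0],
         [0.1, 1.0], [0.0, 0.8], [0.2, 1.0]]
print(cosine_kmeans(items, k=2))
```

In the setting of this chapter, `X` would be the 1682×14 item representation and the returned labels would be the cluster memberships stored back in the database.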
4.2 Defining the parameters of the system
4.2.1 Choosing the number of clusters
One parameter that had to be considered was the number of clusters used during the neighborhood formation. In order to verify whether, and by how much, the choice of the number of clusters affected the prediction quality, five different numbers of clusters were tested, and the MAE and RMSE errors were calculated for each one using the procedure described in the previous section. In order to get comparable results, the same prediction model was used in all 5 repetitions of the experiment. This model was arbitrarily chosen between the two better performing models and was SMOreg.
The numbers of clusters used in the study were 10, 30, 50, 70 and 100. The MAE and RMSE errors produced by those cluster sizes, using SMOreg as the model, are plotted in the corresponding figures, labeled as MAE_SVD_10 for the MAE errors of the experiment using 10 clusters, MAE_SVD_30 for the MAE errors of the experiment using 30 clusters, etc. The results were once again averaged and ordered by item popularity as before.
The overall average MAE errors over the whole set of items for each case are summarized in Table 5 below.
Cluster Size | Overall average MAE error
10 | 0.725248 ± 0.0441
30 | 0.718343 ± 0.0505
50 | 0.716717 ± 0.0591
70 | 0.71663 ± 0.0614
100 | 0.716177 ± 0.0644
The overall average RMSE errors over the whole set of items for each case are summarized in Table 6.
Cluster Size | Overall average RMSE error
10 | 0.92667 ± 0.0454
30 | 0.920781 ± 0.0564
50 | 0.918162 ± 0.0672
70 | 0.916922 ± 0.0708
100 | 0.917614 ± 0.0750
Table 6 Overall RMSE errors for the different number of clusters using SMOreg as predictive model
Although the number of clusters is usually identified in the literature as one of the most influential variables in neighbor-based recommender systems, in the context in which it was used in the current implementation we can see that it produces only small variations in the final error values, with 70 clusters being marginally better.
In order to formally determine whether the performance of the model changes significantly across the different numbers of clusters, we perform paired t-tests. For each pair of results we perform a paired t-test on the RMSE errors of each instance at the 95% confidence level. The paired t-tests were conducted using Matlab's function [h,p]=ttest(x,y) with the default significance level alpha=0.05, where x and y are the vectors containing the full set of RMSE errors observed for cluster sizes x and y respectively. The test was repeated for every combination of cluster sizes. The results of the 10 t-tests are summarized in Table 7 below.
Pair of cluster sizes tested (x-y) | P value | tstat | sd | Significantly Different
10-30 | 5.2272e-007 | 5.0465 | 0.0396 | YES
10-50 | 8.2876e-008 | 5.3956 | 0.0535 | YES
10-70 | 6.2111e-009 | 5.8551 | 0.0565 | YES
10-100 | 2.5565e-007 | 5.1845 | 0.0593 | YES
30-50 | 0.0595 | 1.8865 | 0.0471 | NO
30-70 | 0.0056 | 2.7740 | 0.0472 | YES
30-100 | 0.0330 | 2.1343 | 0.0504 | YES
50-70 | 0.3624 | 0.9112 | 0.0462 | NO
50-100 | 0.7038 | 0.3803 | 0.0489 | NO
70-100 | 0.5890 | 0.5405 | 0.0434 | NO
Summarizing the t-tests, we see that at the 5% significance level the data provide sufficient evidence to conclude that the accuracy of the proposed approach differs between numbers of clusters for the pairs 10-30, 10-50, 10-70, 10-100, 30-70 and 30-100, and do not provide sufficient evidence for the pairs 30-50, 50-70, 50-100 and 70-100. Only for the comparisons involving 10 clusters is the evidence strong, reinforcing our initial observation that the number of clusters does not critically affect the accuracy unless it is very small relative to the item space.
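The quantities reported for each pair of cluster sizes can be sketched in plain Python as follows. The tests in this work were run with Matlab's ttest; the function below is only an illustration, and its name and sample data are hypothetical. The t statistic and the standard deviation of the differences follow the standard paired-test definitions, whereas the p-value uses a normal approximation to Student's t distribution, an assumption that is reasonable only for large samples.

```python
import math
from statistics import NormalDist, mean, stdev


def paired_t_test(x, y, alpha=0.05):
    """Two-sided paired test; returns (h, p, t_stat, sd).

    h is True when the null hypothesis of equal means is rejected.
    Assumes at least two pairs and differences that are not all equal.
    """
    diffs = [a - b for a, b in zip(x, y)]
    n = len(diffs)
    sd = stdev(diffs)                           # sample sd of the differences
    t_stat = mean(diffs) / (sd / math.sqrt(n))
    # Normal approximation of the two-sided p-value (assumption, large n).
    p_value = 2 * (1 - NormalDist().cdf(abs(t_stat)))
    return p_value < alpha, p_value, t_stat, sd


# Hypothetical per-instance errors for two configurations.
errors_x = [0.95, 0.91, 0.99, 0.93, 0.97, 0.92]
errors_y = [0.92, 0.90, 0.95, 0.93, 0.94, 0.90]
print(paired_t_test(errors_x, errors_y))
```

For small samples the p-value should instead be taken from the t distribution with n − 1 degrees of freedom, which is what a statistics package provides.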
4.2.2 Defining the importance of the predictive model
In section 3.4 we provided the relative performance of the 5 different predictive models applied on the unclustered data. The question that arises at this point is whether this relative performance will be the same when the models are used in combination with the SVD-produced clusters.
In order to answer this question we conducted a set of experiments in which the 5 models were used, all with the same number of clusters (70). Their performance is visualized in the corresponding figure.
The overall average MAE error rates for each predictive model, with the same number of clusters (70), are summarized in Table 8 below.
Model | Overall average MAE error
SVD_70_LWL | 0.7711 ± 0.0662
SVD_70_M5Rule | 0.7199 ± 0.0585
SVD_70_SMOreg | 0.71663 ± 0.0614
SVD_70_SLR | 0.7549 ± 0.0637
SVD_70_RBF | 0.8251 ± 0.0837
The overall average RMSE error rates for each predictive model, with the same number of clusters (70), are summarized in Table 9 below.
Model | Overall average RMSE error
SVD_70_LWL | 0.9625 ± 0.0751
SVD_70_M5Rule | 0.9085 ± 0.0685
SVD_70_SMOreg | 0.916922 ± 0.0708
SVD_70_SLR | 0.9456 ± 0.0711
SVD_70_RBF | 1.0227 ± 0.0807
From the above diagrams we can see that the relative performance of the models did change. SMOreg and M5Rules are now the two best performing models, while SLR moved to third place (from being among the two best in the initial implementation). Also, while RBF was close to the rest of the models in the unclustered experiments, it is now clearly the worst performing model.
Another indication from the above diagram is how the cluster quality affects the resulting accuracy. We can observe that all 5 models follow a uniform pattern, with peaks and troughs of the errors occurring at the same points. This lets us speculate that, since the only thing shared between the models is the cluster in which they are called to operate, it is the quality of this cluster that affects the output errors.
4.3 Comparison of the clustered with the unclustered approach
The most important comparison made as part of this work is the one showing the difference in performance between the suggested technique, which uses SVD to form clusters of items, and the original approach of building a predictive model for each item. The corresponding figures show the MAE and RMSE errors for the same model, SMOreg, as it proved to work well in both cases. The number of clusters used in this comparison for the SVD version is set to 70, the best performing size identified.
We can see that the approach that uses clustering through SVD performs consistently better than the original version that builds separate prediction models for each item.
Most importantly, the performance of the improved system is less severely affected by a low number of ratings and remains almost steady across the whole spectrum of rating frequencies.
The corresponding figures present the magnitude of the performance improvement between the two methods for the MAE and RMSE errors respectively, sorted by the average item popularity. The improvement rate is calculated as:
For 46 out of the 48 averaged intervals, the approach using SVD to cluster the data improves the accuracy of the recommender. Once again it can be observed how the difference in the performance of the two methods scales with the number of available ratings per item. The less popular the item is, the greater the improvement of the SVD-based technique.
To determine the statistical significance of the improvement in the performance of the model using the proposed methodology compared to the initial implementation that built separate predictive models for each item, we perform paired t-tests on the MAE and RMSE errors of each instance at the 95% confidence level, where for both tests:
H0 : (Base Model mean error) = (Proposed model mean error)
H1 : (Base Model mean error) > (Proposed model mean error)
The error rate series compared are the results of the five different predictive algorithms applied to the unclustered data and the same algorithms applied to the data using clustering via Singular Value Decomposition with 70 as the number of clusters. The test results can be seen in Tables 10 and 11.
Predictive models compared (x-y) | P value | tstat | sd | Significantly Different
LWL - SVD_LWL | 0.00 | 17.8968 | 0.2032 | YES
M5Rule - SVD_M5Rule | 0.00 | 19.4605 | 0.1800 | YES
SMOreg - SVD_SMOreg | 0.00 | 14.9327 | 0.2125 | YES
SLR - SVD_SLR | 0.00 | 12.1380 | 0.1711 | YES
RBF - SVD_RBF | 0.00 | 9.3732 | 0.1428 | YES
Table 10 Paired t-test results. Comparison of the MAE error rate difference between the 5 predictive models applied on the original data versus using SVD clustering (95% confidence interval)
Predictive models compared (x - y)   P value   t-stat    sd       Significantly different
LWL - SVD_LWL                        0.00      19.2274   0.2299   YES
M5Rule - SVD_M5Rule                  0.00      19.1513   0.2191   YES
SMOreg - SVD_SMOreg                  0.00      14.7433   0.2385   YES
SLR - SVD_SLR                        0.00      11.8022   0.2011   YES
RBF - SVD_RBF                        0.00      8.9711    0.1549   YES
Table 11 Paired t-test results. Comparison of the RMSE error rate difference between the 5 predictive models applied on the original data versus using SVD clustering (95% confidence level)
Interpreting the results of the t-tests, we reject the null hypothesis in favor of the alternative for all five predictive models, on both the MAE and the RMSE errors. At the 5% significance level, the data provide sufficient evidence to conclude that the mean error of the base model is greater than the mean error of the proposed approach, that is, that the proposed approach improves the accuracy of the system.
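The quantities reported in the tables (the t statistic and the standard deviation of the paired differences) can be computed along the following lines. This is a minimal sketch and not the code used in the experiments: the function names and the toy error values are hypothetical, and the critical value has to be supplied by the caller from a t-table for the relevant degrees of freedom.

```python
import math
import statistics


def paired_t_statistic(base_errors, proposed_errors):
    """Return (t, sd, n) for the paired differences x - y = base - proposed."""
    if len(base_errors) != len(proposed_errors):
        raise ValueError("paired samples must have the same length")
    diffs = [b - p for b, p in zip(base_errors, proposed_errors)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    t_stat = mean_diff / (sd_diff / math.sqrt(n))
    return t_stat, sd_diff, n


def reject_h0_one_sided(t_stat, critical_value):
    """One-sided test of H1: base mean error > proposed mean error."""
    return t_stat > critical_value


# Hypothetical per-instance errors, for illustration only.
base = [1.10, 0.95, 1.30, 1.05, 1.20, 0.90]
proposed = [0.90, 0.85, 1.00, 0.95, 1.00, 0.80]

t_stat, sd_diff, n = paired_t_statistic(base, proposed)
# For n - 1 = 5 degrees of freedom the one-sided 5% critical value
# from a standard t-table is about 2.015.
significant = reject_h0_one_sided(t_stat, critical_value=2.015)
print(f"t = {t_stat:.4f}, sd = {sd_diff:.4f}, significant = {significant}")
```

Because the alternative hypothesis is one-sided, only large positive values of t (base error above proposed error) count as evidence against H0.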
4.3.1 The proposed solution addressing the cold start problem
As discussed in an earlier section, introducing new items to a recommender system can lead to poor performance. By evaluating the base model, we showed how a low number of ratings negatively affects the performance of the system.
In this set of experiments we showed that the proposed method significantly improves the accuracy of the base model, that its accuracy is less sensitive to item popularity, and that the improvement over the unclustered model is larger for items with few ratings.
These three attributes of the proposed approach are good steps towards a solution of the cold-start problem. While the accuracy still drops when moving from the high end of the popularity scale towards the low end, it always remains close to the overall mean error. This means that a newly introduced item will no longer receive poorly accurate recommendations simply because it cannot provide enough information to support the creation of effective rules by the classification models: by using the information of the items belonging to the same cluster, we can improve the accuracy of the recommendation.
4.4 Execution time performance and scalability discussion
Scalability, as discussed in an earlier section, is an important aspect of any recommender system. Recommender systems characteristically operate in large user-item spaces that, depending on the implementation context, can extend to several million transactions on a large-scale e-commerce site. When browsing patterns are used to indicate product preference, these transactions will outnumber the plain users-items combinations, as well as the users-users combinations for systems based on user similarity or the items-items combinations for systems based on item similarity, such as the approach developed here. At the same time, recommendations must be presented to the end user in a timely manner, and many recommendations per second must be produced for all the active customers of a high-traffic site.
In order to test the potential scalability of the proposed system, we measured the execution time required to produce recommendations over the entire experimental dataset of 100,000 instances.
4.4.1 Execution time performance of the proposed system across different models
The reported results give the average execution time (in seconds) needed to complete a full test run for each of the different predictive models, with the number of clusters set at 70. The times are the average of five repetitions of the experiment on the development machine, an Intel Core 2 Duo T9400 processor (2.53 GHz, 1066 MHz FSB, 6 MB L2 cache) with 4 GB of DDR2 memory running 32-bit Windows Vista. We tried to keep the machine load minimal and uniform across all the experiments.
It should be noted that these times can only be used to compare the relative execution time required by the different predictive algorithms. An unknown, and probably large, percentage of the total execution time was spent by Weka on the cross-validation and attribute filtering steps, which makes the results unrepresentative of a real use scenario.
We can see that the execution times vary greatly depending on the algorithm used, with SMOreg being the slowest and Simple Linear Regression the fastest. While this was expected and is directly linked to the inner complexity of each algorithm, an interesting observation was that the trade-off "better accuracy, worse time performance" did not hold true for all the models under test. To demonstrate this more clearly, the RMSE errors and the execution times of the different models are presented together. For example, we can see that M5Rules produces low error rates while performing reasonably well time-wise.
Ultimately, there is no safe answer to the question of which predictive model is best. The decision may vary depending on whether we want to improve accuracy or response time.
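The averaging procedure described above (five repetitions per configuration) can be sketched as follows. This is an illustrative sketch only, not the harness used in the experiments: `average_runtime` and `dummy_test_run` are hypothetical names, and the dummy workload merely stands in for a full test run of one predictive model.

```python
import statistics
import time


def average_runtime(run, repetitions=5):
    """Call run() several times; return (mean seconds, individual timings)."""
    timings = []
    for _ in range(repetitions):
        start = time.perf_counter()  # monotonic, high-resolution clock
        run()
        timings.append(time.perf_counter() - start)
    return statistics.mean(timings), timings


def dummy_test_run():
    """Placeholder workload standing in for one full test run."""
    return sum(i * i for i in range(50_000))


mean_seconds, timings = average_runtime(dummy_test_run, repetitions=5)
print(f"mean over {len(timings)} runs: {mean_seconds:.4f} s")
```

Keeping the individual timings alongside the mean makes it possible to check how uniform the machine load was across repetitions.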
4.4.2 Execution time performance of the proposed system across different cluster sizes
The number of clusters used during neighborhood formation had an immediate effect on the time performance of the system. The fewer clusters used (more items per cluster), the more time is needed to build the predictive models. The reported results give the execution time needed for the different numbers of clusters, with SMOreg as the predictive algorithm.
The results of this experiment lead us to the conclusion that, while having more clusters means that more models have to be trained (one model per cluster), the reduction in the training time of each cluster eventually leads to better overall performance.
Scaling the system to a much bigger real use scenario would require finding the best-performing number of clusters for the size of the dataset. Although this appears feasible, the evidence we have does not allow us to safely assume that the accuracy of the predictions will follow the same patterns as in the experiment linking accuracy to the number of clusters, described in an earlier section.
Nevertheless, we have strong evidence that the worst-performing number of clusters in terms of accuracy was the smallest one (10), while the differences among the best-performing settings were insignificant (for example, between 70 and 100 clusters). Combined with the execution-time tests, which indicate that the larger numbers of clusters were the fastest, this suggests that we can improve the throughput of the system by increasing the number of clusters without suffering a loss of accuracy.
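One way to see why training more, smaller models can be cheaper overall is a simple cost model. The sketch below assumes, purely for illustration, that training a model on s instances costs s^p time units with p > 1 (superlinear training cost) and that the instances are split evenly across the clusters. Neither assumption was measured in this work, and the function name is hypothetical.

```python
def total_training_cost(n_instances, n_clusters, exponent=2.0):
    """Total cost of training one model per cluster under the assumed model.

    Each cluster holds n_instances / n_clusters instances and costs
    (instances per cluster) ** exponent to train.
    """
    per_cluster = n_instances / n_clusters
    return n_clusters * per_cluster ** exponent


# Compare a few cluster counts on a dataset of 100,000 instances.
costs = {k: total_training_cost(100_000, k) for k in (10, 70, 100)}
for k, cost in costs.items():
    print(f"{k:>3} clusters -> relative cost {cost:.3e}")
```

Under this model the total cost equals n^p / k^(p-1), so it decreases as the number of clusters k grows whenever p > 1. With p = 1 (linear training cost) the number of clusters would make no difference.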
4.4.3 Singular Value Decomposition related scalability
Singular value decomposition is computationally expensive. For an m x n matrix of m users and n items, SVD requires time in the order of O((m+n)^3) (Deerwester, 1990). During the experimental procedure this was not critical, since the user-item matrix R was only of dimension 943 x 1682 (with an average SVD execution time of 13.84 seconds using Matlab). However, even though SVD can be calculated offline, the cost would prove prohibitive for large-scale datasets containing millions of customers and items (Sarwar, 2000), (Papagelis, 2005). As a result, alternative factorization techniques should be considered, for example those in (Sarwar, 2002) and (Ma, 2009).
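To give a feeling for the growth implied by the cubic bound, the following sketch compares the experimental matrix with a hypothetical one of one million users and one million items. It is a back-of-the-envelope estimate only: it uses (m+n)^3 as an operation-count proxy, ignores constant factors, sparsity and memory limits, and the function name and the large-scale dimensions are assumptions made for illustration.

```python
def svd_cost_proxy(m_users, n_items):
    """Operation-count proxy (m + n)^3, ignoring constant factors."""
    return (m_users + n_items) ** 3


experiment = svd_cost_proxy(943, 1682)            # matrix used in this work
large = svd_cost_proxy(1_000_000, 1_000_000)      # hypothetical large site

growth = large / experiment
# Naive extrapolation from the measured 13.84 s; real behaviour would differ.
projected_seconds = 13.84 * growth

print(f"proxy cost, experiment: {experiment:,}")
print(f"growth factor: {growth:.2e}")
print(f"naively projected time: {projected_seconds / (3600 * 24 * 365):.0f} years")
```

Even as a rough estimate, a growth factor of this size illustrates why a full SVD is impractical at that scale, including when it is computed offline.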
5 Conclusions
This chapter concludes the project by discussing what was accomplished by the research conducted and by describing the focus of future work.
5.1 Achievements
By conducting accuracy tests in accordance with the experimental procedure followed in the recommender systems literature, and by interpreting the significance of our results with paired t-tests, we concluded that our proposed approach significantly improves the overall accuracy of the recommendations compared with the base system. In the first part of our experiments we had shown that using classification models alone to predict the ratings leads to poor performance, especially for items of low popularity.
The most important achievement of the system is its effectiveness in reducing the negative effects of the item-side cold-start problem.
By using clustering in the reduced dimension space via Singular Value Decomposition, we managed to improve the accuracy of the classification models in this problematic area. The performance of the proposed model shows smaller deviations across all ranges of item popularity than the base model. This means that a newly introduced item will no longer receive poor recommendations compared to items with many ratings, which can inherently provide enough information to support the predictive models.
5.2 Future work
In the future we would like to investigate how effective it would be to form clusters of users instead of items in the reduced dimension space, in combination with classification models, and how the size of these clusters would affect the system's performance.
We would also like to test whether treating the class as nominal has an impact on the prediction accuracy compared to the numeric approach followed in this work.
We would also like to investigate how the proposed system performs on a different dataset. The initial idea was to use more than one dataset to evaluate the performance of the method, but it was dropped due to time restrictions.
Finally, since the classification algorithms proved to be very sensitive to the amount of available information, it would be interesting to investigate whether we can achieve better performance by enriching the data with demographic and contextual information.