Implementation And Result Discussion Computer Science Essay

Published:

This essay has been submitted by a student. This is not an example of the work written by our professional essay writers.

CHAPTER 5

The general overview of our system is shown in Figure 5.1. The web crawler uses the URLs provided in the Each Product dataset to download Product content from IMDB. After appropriate preprocessing, the downloaded content is stored in the Product Content Database. The Each Product dataset also provides the user-ratings matrix, which is a matrix of users versus items, where each cell is the rating given by a user to an item. We will refer to each row of this matrix as a user ratings vector. The user-ratings matrix is very sparse, since most items have not been rated by most users. The content based predictor is trained on each user-ratings vector and a pseudo user-ratings vector is created. A pseudo user-ratings vector contains the user's actual ratings and content-based predictions for the unrated items. All pseudo user-ratings vectors put together form the pseudo ratings matrix, which is a full matrix. Now given an active user's ratings, predictions are made for a new item using CF on the full pseudo ratings matrix.

Fig. 5.1: System Architecture

5.2 USE CASE MODEL

Fig. 5.2: Use case model

Use case diagram for restaurant recommender system. It contains two actors one is user and second one is Administrator.

5.3 CLASS MODEL

Fig. 5.3: Class Model

5.4 SEQUENCE MODEL

Fig 5.4: Sequence Model

5.5 ACTIVITY MODEL

Fig. 5.5: Activity Model

5.6 ALGORITHM

In Content boosted collaborative filtering, pseudo user-rating vectors of all users are combined into pseudo rating matrix. Similarity between the active users is computed by using the Pearson correlation coefficient the following steps describe the proposed algorithm [18].

Input: User Rating Matrix

Output: Recommendation for Product

Description:

Step1. A pseudo user-rating vector for all users in the database is created by Using Harmonic Mean Weighting Factor (HMW).

Step 2: The pseudo user-ratings vector, consists of the item ratings provided by the user u. ru,i denotes the actual rating provided by user u for item i, while Cu,i is the rating predicted by the content-based system.

Step 2.1 Apply k-mean clustering on data and then provide output of it as input to naÃ¯ve bayes.

Step3. Compute pseudo rating matrix V by combine the pseudo user-ratings vectors of all users.

Step4. Compute the similarity between active user a and another user u using the Pearson correlation coefficient by the pseudo user-ratings vectors va.

Step5. Computed mean-cantered ratings of the best-n neighbours of that user as weighted sum of the active user by incorporate a Self-Weighting factor in the final predictions.

Step6. Combined the above two weighting schemes to evaluate the final CBCF predictions.

5.7 ALGORITHM DESCRIPTION

5.7.1 K-Mean Clustering

The goal of clustering is to reduce the amount of data by categorizing or grouping similar data items together. The k-mean algorithm defines the centroid of a cluster as the mean value of the points within cluster [41].The goal of K-means is to ï¬nd the partition of cases which minimizes the sum of the dissimilarities between cases within each cluster. Various clustering algorithms are available. From all those available algorithm k-mean is one of the most popular and widely studied clustering methods for points in Euclidean space. Even if the number of instances is substantially large this algorithm is computationally efficient. The K-means algorithm has an advantage in comparison to other clustering methods (e.g. Hierarchical clustering methods), which have non-linear complexity.

The popularity for this algorithm resides in its ease of interpretation, simplicity of implementation, speed of convergence and adaptability to sparse data. Other Partition Methods K-medoids is more costly than K-mean. K-means clustering is a basically a partitioning method. It partitioned the data into k mutually exclusive clusters, and

Returns the index of the cluster to which it has assigned each observation.

K-means clustering operates on actual observations and creates a single level of clusters. K-mean clustering is often more suitable than hierarchical clustering for large amounts of data. K-means uses an iterative algorithm that minimizes the sum of distances from each object to its cluster centroid, over all clusters. This algorithm moves objects between clusters until the sum cannot be decreased further [41].

The k means clustering algorithm used to cluster the users into some groups as clustering centers. User clustering techniques work by identifying groups of users who appear to have similar ratings. Once the clusters are created, predictions for a target user can be made by averaging the opinions of the other users in that cluster [43].Specific algorithm as follows:

Input: clustering number k, user-item rating matrix

Output: smoothing rating matrix

Description:

Begin

Select user set U={U1, U2, â€¦, Um};

Select item set I={I1, I2, â€¦, In};

Choose the top k rating users as the clustering

CU={CU1, CU2, â€¦, CUk};

The k clustering center is null as c={c1, c2, â€¦, ck};

do

for each user UiâˆˆU

for each cluster center CUjâˆˆCU

calculate the sim(Ui, CUj);

end for

sim(Ui, CUm)=max{sim(Ui, CU1), sim(Ui, CU2), â€¦, sim(Ui, CUk);

cm=cmâˆªUi

end for

for each cluster ciâˆˆc

for each user UjâˆˆU

CUi=average(ci, Uj);

end for

end for

while (C is not change)

End

Through the calculating the vacant user's rating by user clustering algorithm, It gained the dense users' ratings.

5.7.2 CB - NaÃ¯ve Bayes

To provide content-based predictions we treat the prediction task as a text-categorization problem. We view restaurant content information as text documents, and user ratings 0-5 as one of six class labels. We implemented a bag-of-words naive Bayesian text classifier extended to handle a vector of bags of words; where each bag-of-words corresponds to a restaurant feature. We use the classifier to learn a user profile from a set of rated restaurant i.e. labelled documents. The learned profile is then used to predict the label (rating) of unrated restaurant.

NaÃ¯ve Bayes Algorithm

Let D be a training set of tuples and their associated class labels. As usual, each tuple Is represented by an n-dimensional attribute vector, X = (x1, x2, : : : , xn), depicting measurements made on the tuple from n attributes, respectively, A1, A2, : : : , An.\

Suppose that there are m classes, C1, C2, : : : , Cm. Given a tuple, X, the classifier will predict that X belongs to the class having the highest posterior probability, conditioned on X. That is, the naÃ¯ve Bayesian classifier predicts that tuple X belongs to the class Ci if and only if P(cijx) > P(cjjx) for 1 =< j =< m; j != i:

Thus we maximize P(cijx). The classci for which P(cijx) is maximized is called the maximum posteriori hypothesis. By Bayes' theorem below equation,

P(Ci|X) =

As P(X) is constant for all classes, only P(XjCi)P(Ci) need be maximized. If the class prior probabilities are not known, then it is commonly assumed that the classes are equally likely, that is, P(C1) = P(C2) = â€¦â€¦.. = P(Cm), and we would therefore maximize P(XjCi). Otherwise, we maximize P(XjCi)P(Ci). Note that the class prior probabilities may be estimated by P(Ci)=|Ci,D| / |D|,where |Ci,D| is the number of training tuples of class Ci in D.

Given data sets with many attributes, it would be extremely computationally expensive to compute P(X|Ci). In order to reduce computation in evaluating P(XjCi), the naive assumption of class conditional independence is made. This presumes that the values of the attributes are conditionally independent of one another, given the class label of the tuple (i.e., that there are no dependence relationships among the attributes). Thus

P(XjCi) == P(x1|Ci) x P(x2|Ci)â€¦â€¦â€¦P(xn|Ci). We can easily estimate the probabilities P(x1|Ci), P(x2|Ciâ€¦â€¦. P(xn|Ci) from the training tuples. Recall that here xk refers to the value of attribute Ak for tuple X. For each attribute, we look at whether the attribute is categorical or continuous-valued. For instance, to compute P(X|Ci).

In order to predict the class label of X, P(X|Ci)P(Ci) is evaluated for each class Ci. The classifier predicts that the class label of tuple X is the class Ci if and only if

P(X|Ci)P(Ci) > P(X|Cj)P(Cj) for 1 =< j =< m, j != i.

In other words, the predicted class label is the class Ci for which P(X|Ci)P(Ci) is the maximum.

5.7.3 CF- Pearson Co-relation

Content Based Collaborative Filtering first proposed by [8] uses neighbourhood-based recommendations. Here subsets of users are chosen based on their similarity to the active user, and a weighted combination of their ratings is used to produce predictions for the active user.

Input: Pseudo Rating Matrix

Output: Recommendation

Description:

The algorithm is given in the following steps.

Step1. All active users are equally weighted with respect to similarity.

Step2. Measure the similarity between users as the Pearson Correlation Coefficient between their ratings vectors. Pearson correlation coefficient can be computed by the formula.

Step3. Select the n users which are having the highest similarity with the active user.

Step4. Form the neighbourhood of active user recommendations predictions are computed as the weighted average of deviations from the neighbour's mean.

5.7.4 Important Terms [18]

5.7.4.1 Harmonic Mean Weighting

The accuracy of a pseudo user-ratings vector computed for a user depends on the number of restaurant he/she has rated. If the user rated many items, the content-based predictions are good and hence his pseudo user-ratings vector is fairly accurate. On the other hand, if the user rated only a few items, the pseudo user-ratings vector will not be as accurate. We found that inaccuracies in pseudo user-ratings vector often yielded misleadingly high correlations between the active user and other users. Hence to incorporate confidence (or the lack thereof) in our correlations, we weight them using the Harmonic Mean weighting factor (HM weighting).

5.7.4.2 Self Weighting

Recall that in CF, a prediction for the active user is computed as a weighted sum of the mean cantered votes of the best-n neighbours of that user. In our approach, we also add the pseudo active user to the neighbourhood. However, we may want to give the pseudo active user more importance than the other neighbours. In other words, we would like to increase the confidence we place in the pure-content predictions for the active user. We do this by incorporating a Self-Weighting factor in the final prediction.

5.8 IMPLEMENTATION DETAIL

We implemented restaurant recommendation system that uses content boosted collaborative filtering techniques to select and rank restaurants a user interacts with the system by submitting an entry point, either a known restaurant or a set of criteria, and is shown similar restaurants. The user then interacts with the system in a dialog, critiquing the system's suggestions and interactively refining the search until an acceptable option is achieved. Consider a user who starts browsing by entering a query in the form of a known restaurant, Imperial Palace on Dr. Yagnik Road Rajkot. The system finds a similar Rajkot restaurant that combines Gujarati and Punjabi influences, as well as other similar restaurants that are ranked by their similarity. The user, however, is interested in a cheaper meal and selects the 'Less Rs.' button. The result shown is a creative Indian restaurant in a cheaper price bracket: The user can continue browsing and critiquing until an acceptable restaurant has been located.

The similarity relation between restaurants is decomposed into a set of independent attributes, such as the 'niceness' of a restaurant, that correspond to users' high-level perception or interest in the object. For each such attribute, a local similarity metric is defined, which measures how similar two items are with respect to that attribute. Two restaurants with the same price would get the maximum similarity rating on the metric of price, but may differ greatly on another metric, such as quality or type of cuisine. Each local metric has a small range of integer values as its output. Once the similarity relations are defined, they are ordered to create a retrieval strategy. For example, in the restaurant recommender system, the attributes were cuisine, price, quality, and atmosphere applied in rank order. The metrics are combined in a cascaded way: each subsequent metric used only to break ties between restaurants ranked equally by higher level metrics. One benefit of combining the local metrics in this way is that the global metric is much simpler to design and its behavior easier to predict, as compared to the weighted combinations often found in CBR systems. The actual knowledge content of metric is fairly shallow. It knows that all other things being equal, a more expensive restaurant are worse, and a more highly-rated restaurant is better. Its most complex metric is the one that compares restaurants on the basis of cuisine. This metric is based on a semantic network containing approximately 150 different cuisines. Similarity between cuisines is represented as the inverse of distance in the network. This system differs from other recommender systems in several ways. Most strikingly, it uses semantic ratings, evaluations that tell the system not just the user's preference thumbs up or thumbs down but also the reason behind the rating: too expensive, not fancy enough, etc. The semantic ratings give a level of interactivity not found in other recommender systems, and allow it to tailor its suggestions to the user's particular need of the moment, rather than adapting to overall preferences over time. The CoFIND system (Dron et al., 1999) also used semantic ratings, allowing users to create dimensions along which to rate web pages as 'good for beginners', 'comprehensive', etc. In CoFIND, however, these assessments were not used to navigate between options. Relevance feedback (Salton & McGill, 1983) is a well-known technique in information retrieval in which a user's query is refined by taking into account responses to returned items. This system critique-based navigation differs from relevance feedback approaches in both explicitness and flexibility. In relevance feedback approaches, the user selects some retrieved documents as being more relevant than others, and the system determines how a query is refined. In Find Me systems, critiques supply concrete domain-specific feedback that adjusts the search in a particular direction.

The switching and mixed hybrids are appropriate in a non-uniform case, but we did not find that these system recommendations were weaker in any one part of the recommendation space. In addition, we wanted to maintain the conversational interaction that is the hallmark of Find Me systems. That meant that recommendation had to be seen as a direct response to the user's last critique rather than something more holistic. Thus, neither a CB/CF meta-level hybrid nor a CB/CF feature augmentation would have been appropriate, since in both of these configurations; it is the collaborative part this is actually recommending. CB/CF feature augmentation and CB/CF meta-level hybrids would have been possible, but in the either case, the knowledge-based part of the system would have had to make inferences from collaboratively-generated features. This would have entailed additional knowledge engineering exactly what we sought to avoid. That leaves the weighted/cascade category of hybrids. Since this system similarity assessment technique already used a cascade, we found it most elegant to simply add collaborative recommendation as a final cascaded step. To see how such a system would differ from the existing system, consider the following example: Alice connects to system, a version of system that includes a collaborative filtering component. She starts browsing for Gujarat restaurants by entering the name of her favorite restaurant at home, Greens Restaurant in Ahmedabad. A green is characterized as serving 'Gujarati' and 'Punjabi' cuisine. The top recommendation is Lords Restaurant, which serves 'Chinese' and 'North-Indian.' It turns out that Alice is, in fact, a vegetarian, so she critiques the system's cuisine choice and moves back towards vegetarian recommendations. After the system has built up a bigger user base, another new user Bob approaches the system with the same starting point: Greens. Since the recommendation given to Alice was under-discriminated, her feedback and that of other users allow the system to more fully discriminate Bob's recommendation, and return Jane's, a vegetarian restaurant, now preferring it over Lords Restaurant. This thought experiment suggests that a cascade using both content-based and collaborative-filtering techniques may produce a recommender system with some of the best characteristics of both. Initial suggestions are good, since there is a knowledge base to rely on. As the system's database of ratings increases, it can move beyond the knowledge base to characterize users more precisely. Because the knowledge base is always present, users are not trapped by their past behavior. If Alice decides to stop being a vegetarian, she will be able to get recommendations for steakhouses by entering one as a starting point.

BASED FILTERING

The proposed hybrid filtering transparently creates and maintains user preferences. It assists users by providing both collaborative filtering and content-based filtering, which are updated in real time whenever the user changes his/her current page using any navigation technique. The WebBot uses the link provided in the restaurant dataset to download restaurant content from database. WebBot keeps track of each Individual user and provides that a user online assistance. The assistance includes two lists of recommendations based on two different filtering paradigms: collaborative filtering and content-based filtering. WebBot updates the list each time the user changes his/her current page. Content-based filtering is based on the correlation between the content of the pages and the user preferences. The collaborative filtering is based on a comparison between the user path of navigation and the access patterns of past users. Hybrid filtering may eliminate the shortcomings in each approach. By making collaborative filtering, we can deal with any kind of content and explore new domains to find something interesting to the user. By making content-based filtering, we can deal with pages un-seen by others. To overcome the problem of stateless connection in HTTP, WebBot follows users through tracking their IP address. To track user presence, a timeout mechanism is used to delete user's session information after a predetermined amount of idle time. So that, a connection after the specified period having the same IP is identified as a new user. This method is fairly easy to implement. Consequently, the IP of a proxy server may represent two or more people who are accessing the same web site simultaneously in their browsing sessions, causing an obvious conflict. However, the reality is that many large sites use this method and have not any clashes. The restaurant dataset also provides the user-ratings matrix; which is a matrix of users versus items, where each cell is the rating given by a user to an item. We will refer to each row of this matrix as a user-rating vector. The user-ratings matrix is very sparse, because most users have not rated most items. The content-based predictor is trained on each user-ratings vector and a pseudo user-ratings vector is created. A pseudo user-ratings vector contains the user's actual ratings and content-based filtering for the un-rated items. All pseudo user-ratings vector put together from the pseudo ratings matrix, which is a full matrix. Now given an active user's ratings, filtering predictions are made for a new item using Collaborative filtering on the full pseudo ratings matrix.

BUILDING A DATABASE

Our current prototype system, WebBot: Web Robot Agent uses a database of restaurant content information extracted from web page at restaurant database. Therefore, the system's current content information about restaurant name consists of textual metadata rather than the actual text of the items themselves. A restaurant database subject search is performed to obtain a list of restaurant-description of broadly relevant titles. WebBot then downloads each of these data and uses a simple pattern based information extraction system to extract data about each restaurant. Information extraction is the task of locating specific pieces of information from a document, thereby obtaining useful structured data from unstructured text. A WebBot follows the restaurant link provided for every restaurant in the restaurant dataset and collects information from the various links off the main database. We represent the content information of every restaurant as a set of features. Each feature is represented simply as a bag of words. Database produces the information about related facilities and restaurant names using collaborative filtering: however, WebBot treats them as additional content about the restaurant. The text in each feature is then processed into an un-ordered bag of words and the examples represented as a vector of bags of words.

5.11 PERFORMANCE EVALUATION

We used a subset of the restaurant dataset. This dataset contains 7,291 randomly selected users and 1200 restaurant for which content was available from database. To evaluate various approaches of filtering, we divided the rating dataset in test-set and training-set. The rating database is used a subset of the ratings data from the restaurant dataset. The training-set is used to predict ratings in the test-set using a commonly used error measure. The metrics for evaluating the accuracy of a prediction algorithm are used mean absolute error (MAE) and rank scoring measure (RSM) [9]. For evaluation, we uses the following methods: The proposed hybrid collaborative filtering and content-based filtering (RCB), a collaborative filtering (P Corr), the recommendation method using only the content-based filtering (Content), and a nave combined approach (N_Com). The naive combined approach takes the average of the ratings generated by the collaborative filtering and the content based filtering. The various methods were used to compare performance by changing the number of clustering users. Also, the proposed method was compared with the previous methods that use both collaborative filtering and content-based filtering method by changing the number of user evaluations on items. The aforementioned previous method includes the Soboroff method [10] that solved the sparse rating problem, the Fab method that solved the first-rater problem, and the Pazzani method [6] that solved both the sparse rating problem and the first-rater problem. Fig. 5.6 and Fig.5.7 shows the MAE and RSM of varying the number of users. as the number of users' increases, the performance of the RCB, and the P Corr also increases, whereas the method using content shows no notable change in performance. In terms of accuracy of prediction, it is evident that method RCB, which uses both collaborative filtering and content-based filtering, is more superior to method N_Com. Fig. 5.8 and Fig. 5.9 is used to show the MAE and RSM when the number of user's evaluations is increased. In Fig. 5.8 and Fig. 5.9, the Soboroff method, which has the first-rater Problem, shows low performance when there are few evaluations; the other methods outperform the Soboroff method. Although the Pazzani method, which solved both the sparse rating problem and the first-rater problem, along with the RCB show high rates RCB shows highest accuracy of all methods. Since we use a pseudo ratings matrix, which is a full matrix, we eliminate the root of the sparse rating problem and the first rater problem. Pseudo user-ratings vectors contain ratings for all items; and hence all users will be considered as potential neighbors. This increases the chances of finding similar users. The original user-ratings matrix may contain items that have not been rated by any user. In a collaborative filtering approach these items would be ignored. However in RCB, these items would receive a content-based prediction from all users. Hence these items can now be recommended to the active user, thus overcoming the first-rater problem.

J:\Graph1.jpg

J:\Graph2.jpg

J:\Graph3.jpg

J:\Graph4.jpg

Fig 5.9

The restaurant application has been implemented with Flash Macromedia for the user interface layer, MySQL for the database layer and PHP for business logic layer. As a source for restaurant descriptions we used restaurant database, so it was necessary to implement special application to download several thousand complete descriptions of the hotel, with list of hotel, user comments, and additional information like rating, place state, zip, fax, food category etc. All this data has been saved to the system and served as base for verification of implemented HA. The user after registration that consists of delivering both demographic data and favorite restaurant name and features is asked to rank several restaurants. Then the system recommends the user interface for the user that may be further modified. The main goal of the system is restaurant recommendation. We selected that each time we recommend 20 restaurant according to CBF and CF that define the HA. So far, we managed to make experiments only with several users, however we conducted standard usability studies with five typical users, giving them five typical tasks. The results have shown, verified by the questionnaires that they appreciate both recommended restaurant and user interface.

5.12 METHODOLOGY

We compare CBCF to a pure content-based predictor, a CF predictor, and a naive hybrid approach. The naive hybrid approach takes the average of the ratings generated by the pure content-based predictor and the pure CF predictor. For the purposes of comparison, we used a subset of the ratings data from the e data set (described earlier). Ten present of the users were randomly selected to be the test users. From each user in the test set, ratings for 25% of items were withheld. Predictions were computed for the withheld items using each of the different predictors. The quality of the various prediction algorithms were measured by comparing the predicted values for the withheld Ratings to the actual ratings.

Table 5.1: Summary of Results

ROC-4

Pure content-based (CB) predictor

1.069

0.6540

Pure CF

1.015

0.6480

NaÃ¯ve Hybrid

1.026

0.6155

Content-boosted CF

0.954

0.6951

5.13 IMPROVING CBCF

Due to the nature of our hybrid approach, we believe that improving the performance of the individual components would almost certainly improve the performance of the whole system. In other words, if we improved our pure content-based predictor or the CF algorithm, we would be able to improve our system's predictions. A better content based predictor would mean that the pseudo ratings matrix generated would more accurately approximate the actual full user-ratings matrix. This in turn, would improve the chances of finding more representative neighbours. And since the final predictions in our system are based on a CF algorithm, a better CF algorithm can only improve our system's performance. In our current implementation of the content-based predictor, we use a naive Bayesian text-classifier to learn a six way classification task. This approach is probably not ideal, since it disregards the fact that classes represent ratings on a linear scale. This problem can be overcome by using a learning algorithm that can directly produce numerical predictions. For example, logistic regression and locally weighted regression (Duda, Hart, & Stork 2000) could be used to directly predict ratings from item content. We should be able to improve our content-based predictions using one of these approaches. In addition, the CF component in our system may be improved by using a Clustered Pearson Predictor (CPP) (Fisher et al. 2000). The CPP algorithm creates clusters of users based on k-means clustering. Collaborative predictions are made by only using the cluster centroids as potential neighbours. Fisher et al. claim that this approach is more accurate than the pure CF algorithm, with the added advantage of being more scalable.

5.14 IMPLEMENTATION DETAIL

As for the user interface and prediction mechanism of the system, they have been implemented in Java with its servlet and JSP technologies. For recommender system, we have chosen to use JAVA as programming language to implement algorithm. This language is well compatible with Apache and MySQL... The implementation environment also includes Apache-Tomcat Application Server used together with the MySQL database, which has become the world's most popular open source database because of its consistent fast performance, high reliability and ease of use.