Covid-19 Update: We've taken precautionary measures to enable all staff to work away from the office. These changes have already rolled out with no interruptions, and will allow us to continue offering the same great service at your busiest time in the year.

Information Filtering System Based on Clustering Approach

2993 words (12 pages) Essay in Computer Science

03/04/18 Computer Science Reference this

Disclaimer: This work has been submitted by a student. This is not an example of the work produced by our Essay Writing Service. You can view samples of our professional work here.

Any opinions, findings, conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of UK Essays.



The quantity of web information has been increased day by day due to fast development of internet. Now-a-days people make their decision based on the available information from the internet. But the problem is how the people successfully choose or filter the useful information from the enormous amount of information. This problem is referred as information overload.

Recommendation System is a supportive tool to resolve the information overload problem. It is part of information filtering system used to recommend the user based on their own interest, neighborhood similarity and past history. Collaborative Filtering is one of the popular techniques widely used recommendation system.

Every recommendation system should ensure privacy for both user’s neighbour and their data. To overcome the scalability and model reconstruction problem, a power graph based private neighborhood recommendation system is proposed to ensure the user’s privacy. First, the compressed network is constructed and then the feature set is extracted from the compressed network using transformed data. The data is transformed using hybrid transformation fuses principal component analysis and rotation transformation to protect users’ privacy with accurate recommendations. Finally the item to be recommended is predicted which achieve better performance than the existing technique. MovieLens Dataset is used to evaluate this method.


Recommendation System is one of the information filtering system which provides valuable information to the users by filtering the information according to user’s interest. Traditional approaches of recommendation systems are collaborative filtering, content based filtering and hybrid Approach. Content Based Filtering (CBF) approach predicts the recommendation based on the rating given by the user for the similar items in past history. Collaborative Filtering (CF) recommends the user based on rating of that item by similar users. Hybrid approach combines both the approaches. All the approaches have their own advantage and disadvantage.

CF mainly classified as memory based CF and model based CF. Memory based CF first calculate the similarities between the requested user and all other user to find the neighbors then calculate the prediction based on identified neighbors rating pattern. Model based method first built a model based on the preference of the user. Main aim of the recommender system is to minimize the prediction error. The main issues in CF recommender system are scalability, sparsity and privacy.

  • Scalability: Large number of users and items in the network led to the increase in the computational complexity of the system. In E-commerce, scalability plays a important issue because it contain huge number of users.
  • Sparsity: All the users don’t show their interest to rate all the items they interact private, which will lead to data sparseness in the system. This will not give exact recommendation to the seekers.
  • Cold Start: Lack of information for new items and users in recommendation system will leads to unpredictable items in the system.
  • Privacy: Users may provide false information inorder to protect their personal information. This leads to inaccurate recommendation.

The proposed work mainly focuses on two fundamental issues in CF namely scalability and privacy. The first challenge is how to improve the scalability of CF, because these systems should search the entire user for finding the neighbors. The second problem is how to protect the individual users privacy while prediction. Both an issues lead to poor performance of the system. So the important challenge is to handle both a situation properly for better performance.


Recommendation system helps the people to get exact information based on neighbors’ pattern. Remarkable growth in e-commerce site makes the online vendors to develop their sales and profits. They use this technique which suggests product to users’ by their neighbors’ preference about the item. Scalability issue in RS mainly due to enormous growth in users tends to decline in accuracy of prediction on recommendation. Clustering approach reduces scalability problem by grouping the similar users. Recommender System may demand the users’ to expose their ratings to recommendation server to give a proper recommendation. But exposing the rating may allow the recommenders to learn the private information about users. Revealing rating may also direct to do violent behavior by several competitive companies’.


Several different clustering methods are adapted to reduce the scalability problem in RS. A new cluster based matrix tri-factorization is proposed to cluster the user and item simultaneously to get a better recommendation in model based CF. But when the new user enter the system it is necessary to rebuilt the whole model again for other user [].In [0] a cluster based binary tree is built by splitting the dataset and the recommendation is predicted based on the average rating of cluster. Later [] a combined k-means bisecting clustering is performed to overcome the scalability problem while preprocessing and pseudo prediction is adapted. But performance is not much better. Community based clustering model based CF is proposed [] to predict the recommendation but it underperform on outliers. Multilevel clustering is adapted to extract the subgraph which is clustered and propagated to reduce scalability which improved the performance than existing approach. But it will be more complicated when the aspect of the network increases. Therefore it is necessary to group the data in all the aspects to reduce the scalability.


In CF, neighbors are identified by collecting the information for the entire user. Thus the server maintain user preference, purchase, usage data etc which may contain identifiable information may violate the privacy. There are several techniques to protect the user’s sensitive information []. Initial method to ensure the privacy protection in CF was proposed by canny (2002a, 2002b), mainly focus on aggregation. In this method sensitive data are aggregated with some common distribution. In cryptographic approach, Individual user data can be protected using homomorphic encryption to avoid exposing of individual data but it requires high computational cost [5]. In perturbation approach, users mask their data before storing it in a central server. The central server collects the disguised data instead of original data to provide predictions with decent accuracy [18]. In [2] a randomized response techniques (RRT) is proposed to preserve users’ privacy by generating naive Bayesian classifier (NBC) based private recommendations. Another technique, data obfuscation was used to implement privacy preserving collaborative filtering algorithm [16]. In this technique, sensitive data are obfuscated through additive or multiplicative noise in order to protect individual privacy before allowing for analysis. The actual data can be revealed in this technique by applying reverse engineering process [7]. Sensitive information is either concealed or eliminated for the purpose of analyzing the data to extract the knowledge in anonymization technique. The major fault of this technique is some distinctive data may lead to the re-identification of data [1].

In proposed work, a scalable privacy preserving recommendation system is proposed. First the user to user network is constructed from the user preference then compressed network is formed based on the power graph approach. Then feature set extracted from the compressed network based on transformed rating to ensure the privacy during prediction. Finally the linear prediction model is adopted instead of similarity prediction to improve the accuracy besides reduces the complexity.


  • To protect the individual’s neighbour information while prediction based on clustering approach this reduces complexity of model reconstruction.
  • To protect the individual data using data transformation technique.


A cluster based approach is proposed to protect the individual neighborhood privacy and hybrid data transformation technique is proposed to protect the individual data with accurate recommendation using feature extraction based linear regression prediction.


Data Transformation

Experiment is performed using MovieLens Public (MLP) dataset which is the standard dataset to show the better performance of the proposed method. MovieLens dataset is collected by the GroupLens Research Project at the University of Minnesota. This data set consists of three different files of three different sizes 10M, 1M and 10K which mainly contain ratings of different movies provided by the users. To evaluate the proposed method 1M size dataset is used which contains 6040 users, 1 million ratings and 3900 items. The rating values are on five star scales, with five stars being the best and one star being the least. Data collected consist of four attributes separated with double colon as the delimiter [userid :: itemid ::rating :: Timestamp]. To evaluate the proposed work userid, itemid, rating is extracted from the dataset and then extracted data is converted into user x item matrix with dimension (6040 x 3952).Unrated items are filled with value zero to overcome computation complexity.

Data Transformation

A hybrid data transformation technique which fuses Principal Component Analysis (PCA) and Rotation Transformation (RT) is proposed to transform the data in order to protect the user’s data. The input to the PCA technique is the rating matrix. This technique first finds the principal components and then rotates these components which cannot be reverted easily. Rotation transformation will be efficient by identifying the appropriate range of angle such that to satisfy the least privacy requirement. Optimum privacy threshold is determined from range of angle which leads to good privacy protection effect. To determine the range of angles, sequence rotation should be performed on vectors with successive angles. For each pair of attributes, pairwise optimum privacy threshold is assigned by multiplying the privacy threshold and the privacy angle which should be maximum. To determine the privacy angle, calculate the variance of each attribute. For each pair of attribute, minimum variance will be considered as privacy angle. After determining pairwise optimum privacy threshold, select the range of angles to transform the pair of attribute. While choosing the range of angle make sure that it satisfying the following constraint which is mentioned in Eqn(1)

Var(Pi-Pi’) ≥PoptEqn(1)

An angle is randomly chosen from the interval to rotate each and every pair of principal component. After rotating the principal component, it is multiplied with normalized data in order to obtain the transformed data. The Transformed value of the original data for the data is shown in Table 2.

Private Neighborhood Network Construction

Original network is compressed using power graph analysis. Power graph analysis is a representation of complex network which represent the graph into power graph without loss of information. Graph can be clustered to construct a power graph using modular decomposition method in which modules represents the nodes with same neighbour. Power graph cluster both, the nodes and edges to obtain the most compressed network. Power graph analysis is widely used several biological networks such as protein-protein interactionnetworks, domain-peptide binding motifs,Gene regulatory network and Homology/Paralogy networks. Matrix R can be used to represent the social relationship between the users. If any two users rate the same item then there will be a relationship between them. Thus the user-user network is represented as where U is the set of users represented as nodes and is the set of relationship denoted as edges, and then a power graph is a graph defined on the power set of nodes which are connected by power edges. The concepts of power graphs are as follow: if there is a power edge between two power nodes, then nodes in one power nodes are connected to all the nodes in the second power node. In same way if all the nodes are connected to each other which is represented as power node with self loop.

Based on power graph analysis this module involves two steps, power node identification and power edge search. Power nodes are recognized using hierarchical clustering based on jaccard index. The greedy search is performed to identify the power edge.

Feature Set Extraction

After construction of private network, feature set of each user is extracted by categorizing the users into cold start user, powerful user, and malicious user. Cold start is a user who rate exactly twenty items. Powerful user is user who rate more than thirty five items and malicious user who rate less than twenty five items but the difference between any ratings of a particular item and the standard deviation of that item is greater than one. For the constructed compressed network following features are extracted for each category of the users to predict the rating of unrated item. Feature set of particular user includes features of directly connected power node and Friend Of A Friend (FOAF) in the other power nodes. Each category is measured according to number of particular category of user present in the power node and their joint probability of that particular category. Bayes theorem is used to calculate the most probable rating each category of user. The following Table shows the feature set measured for each user.

User type

Cold start users

Intense raters

Malicious users

All users




Number of Users

Rating Prediction

Number of Users

Rating Prediction

Number of Users

Rating Prediction

Number of Users

Rating Prediction

Power Node –X fit in









Direct Neighbours of X



















Table 5.2 Feature Set of User X

Linear Regression prediction

From the above extracted features a linear predictive model is constructed which is user for predicting particular item. Then top predicted items are given as recommendation to user. The model takes the following form as in Equation. (5.6)


where α represents the slope of the dependent variables, X represent the feature vectors and represents the error vector which is assumed to be zero. In linear regression, the value to be predicted is commonly computed from the best fitting line which reduces error in prediction.


The proposed method is preserves both the individual neighbors privacy and data privacy. It also reduces the scalability issues and give accurate recommendation when compared to the previous work (privacy preserving information filtering system). MAE obtained is compared with the proposed method in previous Chapter TRPC as in Table 5.7. Figure shows that MAE is reduced to 0.62 because of coupling of clustering approach with data transformation to handle large volume of data.


The power graph analysis helps to overcome the scalability problem by compressing the original network and results better recommendation to users. The existing methods apply power graph analysis to various domains for analyzing complex networks in a simpler way. And at the same time it also preserves the communities’ information. Therefore, in proposed work this type of clustering approach is used to preserve the neighbours information which also results better prediction. The efficiency of the proposed methodology is evaluated with the experimental results using MovieLens dataset. It performs better compared data transformation and clustering approach. This type of cluster based collaborative filtering recommendation helps to reduce the edges in the original network without loss of any information.

Get Help With Your Essay

If you need assistance with writing your essay, our professional essay writing service is here to help!

Find out more

Cite This Work

To export a reference to this article please select a referencing style below:

Reference Copied to Clipboard.
Reference Copied to Clipboard.
Reference Copied to Clipboard.
Reference Copied to Clipboard.
Reference Copied to Clipboard.
Reference Copied to Clipboard.
Reference Copied to Clipboard.

Related Services

View all

DMCA / Removal Request

If you are the original writer of this essay and no longer wish to have the essay published on the UK Essays website then please:

Related Lectures

Study for free with our range of university lectures!