Movie recommendation based on collaborative filtering method
Zhen Zhu
· Proposal
Online movies have become increasingly popular in the past ten years, and tremendous volumes of movie resources are uploaded to the Internet every day. Recommender systems have also become a vital part of Internet companies [1]. It is therefore both interesting and important to build an intelligent recommender system that identifies users' real interests. For this course project, we choose the recommendation-prediction task on the MovieLens datasets.
· Related work
Such recommendation problems are usually solved by the collaborative filtering (CF) [3] approach, which relies only on information about users' past behavior. There are two primary methods in CF: the neighborhood approach [4] and latent factor modeling [5]. Both have been proven successful for the recommendation problem. A key step in CF is to combine these models; methods such as regression, gradient-boosted decision trees, and neural networks have been used to ensemble CF models.
· Outline of approach
The framework of our approach is shown in Figure 1.
Fig. 1 The framework of the approach
1. We use the MovieLens datasets for our project and transform the data into the form required by the data model.
2. Use several different algorithms to calculate the similarity between users or between items.
3. For the user-based method, perform the neighborhood calculation.
4. Use user-based CF and item-based CF to make recommendations.
5. Design an evaluator to compare the results of the different approaches.
· Similarity calculation
Euclidean similarity
In this approach, we calculate the similarity between two points from their Euclidean distance. The Euclidean distance between points p and q is the length of the line segment connecting them. In Cartesian coordinates, if p = (p_1, p_2, ..., p_n) and q = (q_1, q_2, ..., q_n) are two points in Euclidean n-space, then the distance d from p to q (or from q to p) is given by the following formula:

d(p, q) = d(q, p) = \sqrt{(q_1 - p_1)^2 + (q_2 - p_2)^2 + \cdots + (q_n - p_n)^2} = \sqrt{\sum_{i=1}^{n} (q_i - p_i)^2}
The position of a point in a Euclidean n-space is a Euclidean vector. So p and q are Euclidean vectors, starting from the origin of the space, with their tips indicating the two points. The Euclidean norm (or Euclidean length, or magnitude) of a vector measures the length of the vector:

\|p\| = \sqrt{p_1^2 + p_2^2 + \cdots + p_n^2} = \sqrt{p \cdot p}
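As a concrete sketch, the distance formula and a distance-to-similarity mapping can be written in a few lines of Python. The 1/(1 + d) mapping is one common convention in recommender libraries, not something the formula above dictates, and the function names are ours:

```python
import math

def euclidean_distance(p, q):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((qi - pi) ** 2 for pi, qi in zip(p, q)))

def euclidean_similarity(p, q):
    """Map distance into (0, 1]: identical vectors get similarity 1.
    The 1/(1+d) mapping is a common convention, assumed here."""
    return 1.0 / (1.0 + euclidean_distance(p, q))
```

Identical rating vectors thus get similarity 1, and the similarity decays smoothly toward 0 as the distance grows.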
Log-likelihood similarity
The natural logarithm of the likelihood function, called the log-likelihood, is more convenient to work with for similarity calculation. Because the logarithm is a monotonically increasing function, the logarithm of a function achieves its maximum value at the same points as the function itself, and hence the log-likelihood can be used in place of the likelihood in maximum likelihood estimation and related techniques. Finding the maximum of a function often involves taking its derivative and solving for the parameter being maximized, and this is often easier when the function being maximized is a log-likelihood rather than the original likelihood function.
Take the gamma distribution as an example; it has two parameters, α and β. The likelihood function for a single observed value x is

f(x; \alpha, \beta) = \frac{\beta^{\alpha} x^{\alpha - 1} e^{-\beta x}}{\Gamma(\alpha)}

Finding the maximum likelihood estimate of β from this form looks rather daunting. Its logarithm is much simpler to work with:

\ln f(x; \alpha, \beta) = \alpha \ln \beta + (\alpha - 1) \ln x - \beta x - \ln \Gamma(\alpha)

where the gamma function is

\Gamma(\alpha) = \int_{0}^{\infty} t^{\alpha - 1} e^{-t} \, dt
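To make the convenience concrete, the log-density above can be evaluated directly with Python's standard library, since `math.lgamma` computes ln Γ without overflowing the way Γ itself would for large α (the function name is ours):

```python
import math

def gamma_log_likelihood(x, alpha, beta):
    """Log-density of the gamma distribution (shape alpha, rate beta)
    at x: alpha*ln(beta) + (alpha-1)*ln(x) - beta*x - ln(Gamma(alpha))."""
    return (alpha * math.log(beta) + (alpha - 1) * math.log(x)
            - beta * x - math.lgamma(alpha))
```

Each term is a simple sum, so differentiating with respect to β (giving α/β − x = 0, hence β̂ = α/x) is straightforward, whereas the raw likelihood would require the product rule throughout.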
Pearson Correlation Similarity
Pearson's correlation coefficient is the covariance of the two variables divided by the product of their standard deviations. When applied to a population, it is commonly represented by the Greek letter ρ and may be referred to as the population correlation coefficient or the population Pearson correlation coefficient. The formula is as follows:

\rho_{X,Y} = \frac{\mathrm{cov}(X, Y)}{\sigma_X \sigma_Y}

where cov(X, Y) is the covariance and σ_X is the standard deviation of X. The formula for ρ can also be expressed in terms of the mean and expectation. Since

\mathrm{cov}(X, Y) = E[(X - \mu_X)(Y - \mu_Y)]

where μ_X is the mean of X and E is the expectation, the formula for ρ can also be written as

\rho_{X,Y} = \frac{E[(X - \mu_X)(Y - \mu_Y)]}{\sigma_X \sigma_Y}
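A direct Python sketch of this formula applied to two equal-length rating lists follows; the function name and the zero-variance fallback (returning 0 when either user rates everything identically) are our choices:

```python
import math

def pearson_similarity(x, y):
    """Pearson correlation between two equal-length rating lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sd_x = math.sqrt(sum((xi - mean_x) ** 2 for xi in x))
    sd_y = math.sqrt(sum((yi - mean_y) ** 2 for yi in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0  # undefined for constant ratings; treat as no correlation
    return cov / (sd_x * sd_y)
```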
Cosine similarity
Cosine similarity is a measure of similarity between two vectors of an inner product space that measures the cosine of the angle between them. The cosine of 0° is 1, and it is less than 1 for any other angle. It is thus a judgment of orientation rather than magnitude: two vectors with the same orientation have a cosine similarity of 1, two vectors at 90° have a similarity of 0, and two vectors diametrically opposed have a similarity of −1, independent of their magnitude. Cosine similarity is particularly used in positive space, where the outcome is neatly bounded in [0, 1].
The cosine of two vectors can be derived from the Euclidean dot product formula:

A \cdot B = \|A\| \, \|B\| \cos\theta

Given two vectors of attributes, A and B, the cosine similarity cos(θ) is represented using the dot product and magnitudes as follows:

\text{similarity} = \cos\theta = \frac{A \cdot B}{\|A\| \, \|B\|} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \, \sqrt{\sum_{i=1}^{n} B_i^2}}

The resulting similarity ranges from −1, meaning exactly opposite, to 1, meaning exactly the same, with 0 indicating decorrelation and in-between values indicating intermediate similarity or dissimilarity. To obtain a distance from a similarity S, the most common ways are:

D = 1 - S \quad \text{or} \quad D = \frac{\arccos(S)}{\pi}
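The similarity and the 1 − S distance from the formulas above can be sketched in Python (function names ours; the zero-vector fallback is our choice):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(ai * bi for ai, bi in zip(a, b))
    norm_a = math.sqrt(sum(ai * ai for ai in a))
    norm_b = math.sqrt(sum(bi * bi for bi in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # similarity to a zero vector is undefined; treat as 0
    return dot / (norm_a * norm_b)

def cosine_distance(a, b):
    """The common 1 - similarity conversion."""
    return 1.0 - cosine_similarity(a, b)
```

Note that scaling a vector leaves the similarity unchanged, which is exactly the orientation-not-magnitude property described above.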
· Neighborhood calculation
Neighborhood calculation is only used for user-based collaborative filtering. We use this calculation to select the top-N most similar users for the final recommendation.
Fixed-size neighborhoods (K-neighborhoods)
K-neighborhoods means the recommendations are derived from a neighborhood of the K most similar users, so we need to define a suitable K. If K were smaller, the recommendation would be based on fewer similar users, and some less-similar users would be excluded from consideration. If K were larger, the recommendation would be based on more users, which might add some less-similar ones. Only a suitable K can improve the effectiveness of the recommendation.
Fig. 2 An illustration of defining a neighborhood of the most similar users by picking a fixed number of closest neighbors. Distance illustrates similarity: farther means less similar.
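Fixed-size neighborhood selection can be sketched as follows; `similarity` is any of the measures above applied to two rating vectors, and the names and data layout are ours:

```python
def k_nearest_neighbors(target, users, ratings, similarity, k):
    """Return the k users most similar to `target`.

    ratings: dict mapping user -> rating vector (equal-length lists).
    similarity: function of two rating vectors, e.g. cosine similarity.
    """
    scored = [(similarity(ratings[target], ratings[u]), u)
              for u in users if u != target]
    scored.sort(reverse=True)           # highest similarity first
    return [u for _, u in scored[:k]]
```

Choosing k here is exactly the K trade-off above: a small k keeps only the closest neighbors, a large k admits progressively less similar ones.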
Threshold-based neighborhood
What if we don't want to build a neighborhood of the N most similar users, but rather pick the "pretty similar" users and ignore everyone else? We could pick a similarity threshold and take any users that are at least that similar. The threshold should be between −1 and 1, since all similarity metrics return values in this range. Here, we may use any of the similarity metrics above.
Fig. 3 An illustration of defining a neighborhood of the most similar users with a similarity threshold
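The threshold variant is even simpler to sketch; the neighborhood size now varies per user instead of being fixed at K (names and data layout are ours, as before):

```python
def threshold_neighbors(target, users, ratings, similarity, threshold):
    """Return every user whose similarity to `target` is at least `threshold`."""
    return [u for u in users
            if u != target
            and similarity(ratings[target], ratings[u]) >= threshold]
```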
· Collaborative filtering
Collaborative filtering (CF) approaches assume that those who agreed in the past tend to agree again in the future. For example, a collaborative filtering recommendation system for movie tastes could predict which movie a user should like, given a partial list of that user's tastes (likes or dislikes). CF methods have two important steps: first, CF collects taste information from many users; second, it uses the information gleaned from many users to predict a user's interests and recommend items to that user. Researchers have devised a number of collaborative filtering algorithms, which can mainly be divided into two categories: user-based and item-based algorithms.
User-based CF
User-based CF is also called nearest-neighbor-based collaborative filtering; it utilizes the entire user-item data to generate a prediction. These systems use statistical techniques to find a user's nearest neighbors, who have similar preferences. Once the nearest neighborhood of users is found, these systems use algorithms to combine the neighbors' preferences to produce a prediction or a top-N recommendation for the target user. These techniques are popular and widely used in practice.
User-based CF algorithms have been popular and successful for several years, but their widespread use has revealed some potential challenges. In practice, users have only purchased a small percentage of all items, maybe 1% of 2 million items, so a recommender system based on nearest-neighbor algorithms may be unable to make any item recommendations for a particular user, and the accuracy of recommendations may be poor. Nearest-neighbor algorithms also require computation that grows with the number of users. With millions of users, a typical web-based recommender system running existing algorithms will suffer serious scalability problems.
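The combination step above can be sketched as a similarity-weighted average of the neighbors' ratings, one common way to produce the prediction (the function name and data layout are ours, not the paper's exact method):

```python
def predict_rating(target, item, neighbors, ratings, similarities):
    """Predict `target`'s rating of `item` as the similarity-weighted
    average of the neighbors' ratings of that item.

    ratings: dict of dicts, user -> {item: rating}.
    similarities: dict, neighbor user -> similarity to `target`.
    """
    num = den = 0.0
    for u in neighbors:
        if item in ratings[u]:
            num += similarities[u] * ratings[u][item]
            den += abs(similarities[u])
    return num / den if den > 0 else None  # None: no neighbor rated the item
```

The `None` branch is exactly the sparsity problem described above: if no neighbor has rated the item, the system can make no prediction for it.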
Item-based CF
Unlike the user-based CF algorithm discussed above, the item-based approach focuses on the items the target user has rated. It calculates each candidate item's similarity to those rated items and then selects the k most similar items. According to the weights on those items, we find the items most similar to the target user's tastes that the user has not yet used.
One critical step in the item-based collaborative filtering algorithm is computing the similarity between items. The basic idea in computing the similarity between two items i and j is to build a co-occurrence matrix. Then the dot product is taken with the target user's rating vector. After that, we can find the most similar items, i.e., those with the highest scores.
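The co-occurrence construction and scoring can be sketched as follows; counting shared raters is the simplest form of the co-occurrence matrix, and the function names and data layout are ours:

```python
from collections import defaultdict

def cooccurrence_matrix(user_items):
    """Count, for each ordered pair of items, how many users rated both."""
    co = defaultdict(int)
    for items in user_items.values():
        for i in items:
            for j in items:
                if i != j:
                    co[(i, j)] += 1
    return co

def item_scores(target_ratings, co, candidates):
    """Score each unrated candidate item by the dot product of its
    co-occurrence counts with the target user's ratings."""
    scores = {}
    for c in candidates:
        if c in target_ratings:
            continue  # skip items the user already rated
        scores[c] = sum(co.get((c, i), 0) * r
                        for i, r in target_ratings.items())
    return scores
```

The highest-scoring candidates are the recommendations: items that frequently co-occur with the items the target user rated highly.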
· Experiment
Data sets
These datasets were collected by the GroupLens Research Project at the University of Minnesota and are well suited to recommendation research. The data set consists of 100,000 ratings (1-5) from 943 users on 1,682 movies; each user has rated at least 20 movies and has attributes (age, gender, occupation). The data was collected through the MovieLens web site (movielens.umn.edu) during the seven-month period from September 19th, 1997 through April 22nd, 1998. The data has been cleaned up: users who had fewer than 20 ratings or did not have complete demographic information were removed from the data set.
Evaluation
We use the precision value and recall value to measure the results:

\text{Precision} = \frac{C}{T}, \qquad \text{Recall} = \frac{C}{N}

where C is the number of items the target user likes that were identified by the system, N is the total number of items the user likes in the test data, and T is the number of items identified by the system.
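These two measures can be computed directly from a recommendation list and the test-set likes (the function name is ours; the zero-division fallbacks are our choice):

```python
def precision_recall(recommended, liked):
    """Precision and recall of a recommendation list against the
    items the user actually liked in the test data."""
    hits = len(set(recommended) & set(liked))                    # C
    precision = hits / len(recommended) if recommended else 0.0  # C / T
    recall = hits / len(liked) if liked else 0.0                 # C / N
    return precision, recall
```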
Results
We run experiments with both user-based CF and item-based CF. The neighborhood size is set to 2, the neighborhood threshold to 0.5, and the number of recommendations to 5. The training data accounts for 70% of the set and the test data for 30%. The results are shown in Table I and Table II.
TABLE I. Evaluation on user-based CF

| Similarity algorithm | Rating type | Neighborhood calculation | Precision | Recall | Difference |
|---|---|---|---|---|---|
| Euclidean similarity | Score (1-10) | K-neighborhoods | 0.1541 | 0.1440 | 0.0739 |
| Log-likelihood similarity | Score (1-10) | K-neighborhoods | 0.1131 | 0.1362 | 0.8677 |
| Cosine similarity | Score (1-10) | K-neighborhoods | 0.1501 | 0.1206 | 1.1111 |
| Pearson correlation similarity | Score (1-10) | K-neighborhoods | 0.1215 | 0.1073 | 0.5506 |
| Euclidean similarity | Boolean (0,1) | K-neighborhoods | 0.0813 | 0.1066 | 2.8611 |
| Log-likelihood similarity | Boolean (0,1) | K-neighborhoods | 0.1140 | 0.1311 | 2.3249 |
| Cosine similarity | Boolean (0,1) | K-neighborhoods | 0.0969 | 0.1166 | 2.6268 |
| Pearson correlation similarity | Boolean (0,1) | K-neighborhoods | 0.0726 | 0.0754 | 2.6168 |
| Euclidean similarity | Score (1-10) | Threshold-based | 0.0306 | 0.0424 | 0.5041 |
| Log-likelihood similarity | Score (1-10) | Threshold-based | 0.0036 | 0.0036 | 0.8120 |
| Cosine similarity | Score (1-10) | Threshold-based | 0.0089 | 0.0086 | 0.8117 |
| Pearson correlation similarity | Score (1-10) | Threshold-based | 0.0411 | 0.0325 | 0.6221 |
| Euclidean similarity | Boolean (0,1) | Threshold-based | 0.2091 | 0.2643 | 6.8776 |
| Log-likelihood similarity | Boolean (0,1) | Threshold-based | 0.1065 | 0.1334 | 82.5830 |
| Cosine similarity | Boolean (0,1) | Threshold-based | 0.2244 | 0.2837 | 85.7349 |
| Pearson correlation similarity | Boolean (0,1) | Threshold-based | 0.1794 | 0.1814 | 5.8100 |
TABLE II. Evaluation on item-based CF

| Similarity algorithm | Rating type | Precision | Recall | Difference |
|---|---|---|---|---|
| Euclidean similarity | Score (1-10) | 0.0018 | 0.0018 | 0.7946 |
| Log-likelihood similarity | Score (1-10) | 0.0000 | 0.0000 | 0.8183 |
| Cosine similarity | Score (1-10) | 0.0000 | 0.0000 | 0.8304 |
| Pearson correlation similarity | Score (1-10) | 0.0087 | 0.0092 | 0.6706 |
| Euclidean similarity | Boolean (0,1) | 0.0026 | 0.0026 | 47.1326 |
| Log-likelihood similarity | Boolean (0,1) | 0.1252 | 0.1433 | 91.8503 |
| Cosine similarity | Boolean (0,1) | 0.0722 | 0.0741 | 106.2615 |
| Pearson correlation similarity | Boolean (0,1) | 0.0015 | 0.0015 | 0.7373 |
| Slope one | Score (1-10) | 0.0015 | 0.0015 | 0.7373 |
Unexpectedly, the experimental results of user-based CF are mostly better than those of item-based CF. Predictions on Boolean data are better than predictions on score data, and using a threshold to choose neighborhoods is better than using a fixed size. Cosine similarity performs better than the other similarity measures: precision reaches about 22.4% and recall about 28.4%.
· Conclusions
We find that the similarity algorithms and CF approaches are sensitive to the datasets. The item-based method does better in many application scenarios; however, on these datasets its results are lower than those of the user-based method. So we may use different methods on different datasets. If possible, we may test several approaches and make the decision based on experimental results.
REFERENCES
[1] H. Ma, D. Zhou, C. Liu, M. R. Lyu, and I. King. Recommender systems with social regularization. In Proceedings of the Fourth ACM International Conference on Web Search and Data Mining (WSDM '11), pages 287-296, New York, NY, USA, 2011. ACM.
[2] Y. Niu, Y. Wang, G. Sun, A. Yue, B. Dalessandro, C. Perlich, and B. Hamner. The Tencent dataset and KDD-Cup'12. In KDD-Cup Workshop, 2012.
[3] Y. Niu, Y. Wang, G. Sun, A. Yue, B. Dalessandro, C. Perlich, and B. Hamner. The Tencent dataset and KDD-Cup'12. In KDD-Cup Workshop, 2012.
[4] Y. Koren. The BellKor solution to the Netflix Grand Prize. Technical report, 2009.
[5] M. Piotte and M. Chabbert. The Pragmatic Theory solution to the Netflix Grand Prize. Technical report, 2009.
[6] R. Salakhutdinov and A. Mnih. Probabilistic matrix factorization. In Advances in Neural Information Processing Systems 20 (NIPS'07), pages 1257-1264, 2008.
[7] Y. Koren, R. Bell, and C. Volinsky. Matrix factorization techniques for recommender systems. IEEE Computer, 42(8):30-37, 2009.