The Slope One Collaborative Filtering Algorithm

Source: Internet. Published by: mpp 数据库. Editor: 程序博客网. Time: 2024/05/29 07:35

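1 Background

1.1 Problem description

People who watch movies online often rate the movies they have seen. Those ratings reveal a user's viewing preferences, and the preferences can in turn be used to recommend movies the user is likely to enjoy. The task here is to predict the rating a user would give to a movie they have not yet seen, based on that user's earlier ratings.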

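1.2 Collaborative filtering

The problem described above is a typical collaborative filtering (CF) recommendation problem. Put simply, collaborative filtering uses the preferences of a group of people with similar interests and shared experience to recommend information a user may care about. Individuals respond to items through a cooperative mechanism (the ratings mentioned above, for example), and the system records those responses in order to filter items and help other people sift through information. Responses are not limited to items of strong interest; records of strong disinterest matter just as much [1].

Collaborative filtering techniques currently fall into three categories: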

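(1) User-based collaborative filtering

The main idea is to find neighboring users with similar tastes or interests, and to derive the target user's likely rating from the ratings those neighbors gave to the item being predicted.

(2) Item-based collaborative filtering

Because the number of users can vary widely, user-based collaborative filtering may hit a scalability bottleneck. Item-based collaborative filtering rests on one basic assumption: items that interest a user are similar to items that user rated highly in the past. The algorithm therefore computes similarity between items instead of similarity between users. The idea was proposed in a 2001 paper by Badrul Sarwar et al. [2].

(3) Model-based collaborative filtering

The two approaches above are collectively known as memory-based collaborative filtering. Model-based collaborative filtering uses data mining techniques (such as Bayesian models or decision trees) to build a model of the data, and then uses that model to make predictions.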
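To make the item-based idea concrete, here is a minimal sketch of one way to measure the similarity between two items. It assumes cosine similarity computed only over users who rated both items, with 0 standing for "not rated"; the class name, method name, and toy ratings are all made up for illustration, and cosine is just one of several similarity measures in use.

```java
public class ItemSimilaritySketch {

    /**
     * Cosine similarity between two item columns of a user-by-item matrix.
     * Only users who rated both items contribute; 0 means "not rated" (an assumption of this sketch).
     */
    static double cosine(int[][] ratings, int itemA, int itemB) {
        double dot = 0, normA = 0, normB = 0;
        for (int[] userRow : ratings) {
            int a = userRow[itemA];
            int b = userRow[itemB];
            if (a == 0 || b == 0) {
                continue; // this user did not rate both items
            }
            dot += a * b;
            normA += a * a;
            normB += b * b;
        }
        if (normA == 0 || normB == 0) {
            return 0.0; // no common raters: treat the items as unrelated
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        // Rows are users, columns are items (hypothetical toy data).
        int[][] ratings = {
            {5, 4, 0},
            {3, 3, 1},
            {0, 2, 5},
        };
        System.out.println("sim(item0, item1) = " + cosine(ratings, 0, 1));
        System.out.println("sim(item1, item2) = " + cosine(ratings, 1, 2));
    }
}
```

On this toy data, items 0 and 1 come out more similar than items 1 and 2, because the users who rated both items 0 and 1 gave them nearly proportional scores.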

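2 Introduction to Slope One

Slope One is a family of item-based collaborative filtering algorithms. They are simple and efficient to implement, and their accuracy is comparable to that of more complex and more expensive algorithms. The idea was proposed by Daniel Lemire et al. in 2005 [3].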

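2.1 Basic principle

An example from the original paper [3] illustrates the idea:

The goal is to predict User B's rating of Item J. The basic idea of Slope One is that the average rating difference between two items, measured over the users who rated both, can stand in for the unknown difference for a user who rated only one of them. User A's ratings of Item I and Item J show that Item I is rated 0.5 lower than Item J on average. Slope One therefore assumes that User B also rates Item I 0.5 lower than Item J. Since the result is 2.5, User B's known rating of Item I must be 2, and the missing entry (the question mark in the paper's example) is filled in with 2 + 0.5 = 2.5.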
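The example can be reproduced in a few lines. This is only a sketch: the class and method names are hypothetical, missing ratings are represented as NaN, and User A's ratings (1.0 and 1.5) are values chosen to reproduce the 0.5 gap and the 2.5 result described above.

```java
public class BasicSlopeOneSketch {

    /**
     * Average of (ratingsJ[u] - ratingsI[u]) over the users who rated both items.
     * A NaN entry means "not rated". Returns NaN if no user rated both items.
     */
    static double averageDeviation(double[] ratingsI, double[] ratingsJ) {
        double sum = 0;
        int count = 0;
        for (int u = 0; u < ratingsI.length; u++) {
            if (!Double.isNaN(ratingsI[u]) && !Double.isNaN(ratingsJ[u])) {
                sum += ratingsJ[u] - ratingsI[u];
                count++;
            }
        }
        return count == 0 ? Double.NaN : sum / count;
    }

    /** Basic Slope One: the user's known rating of Item I, shifted by the average deviation. */
    static double predictJ(double userRatingI, double deviationIToJ) {
        return userRatingI + deviationIToJ;
    }

    public static void main(String[] args) {
        // Index 0 is User A, index 1 is User B (who has not rated Item J).
        double[] itemI = {1.0, 2.0};
        double[] itemJ = {1.5, Double.NaN};

        double deviation = averageDeviation(itemI, itemJ); // 0.5, from User A only
        System.out.println(predictJ(itemI[1], deviation)); // prints 2.5
    }
}
```

With more users, averageDeviation simply averages over everyone who rated both items, which is where the "average difference" in the description comes from.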

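2.2 Weighted Slope One

Suppose we want to predict user A's rating of item L from user A's ratings of items J and K. If 2000 users in the dataset rated both J and L, but only 20 users rated both K and L, then intuitively user A's rating of J is a more reliable reference than user A's rating of K when predicting L. The calculation therefore needs to weight each average difference by the number of users behind it.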

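Suppose n users rated both item A and item B, and let R(A->B) be the average difference (A - B) between their ratings of A and of B. Suppose m users rated both item B and item C, and let R(C->B) be the average difference (C - B) between their ratings of C and of B, so that R(B->C) = -R(C->B). If a given user rated A as ra and C as rc, then that user's predicted rating of B is [4]:

rb = (n * (ra - R(A->B)) + m * (rc + R(B->C))) / (m + n)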
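A small worked example may help here. The sketch below assumes each average difference is stored as "known item minus target item" (the same direction as R(A->B) = A - B), and all names and numbers are hypothetical. Item A has 2000 co-raters and sits 0.5 above B on average; item C has 20 co-raters and sits 1.0 below B on average.

```java
public class WeightedSlopeOneSketch {

    /**
     * Weighted Slope One prediction for one target item.
     *
     * userRatings[k]     : the user's rating of known item k
     * avgDiffToTarget[k] : average of (rating of item k - rating of target) over co-raters
     * coRaters[k]        : number of users who rated both item k and the target
     */
    static double predict(double[] userRatings, double[] avgDiffToTarget, int[] coRaters) {
        double weightedSum = 0;
        long totalWeight = 0;
        for (int k = 0; k < userRatings.length; k++) {
            // Each known item gives an estimate of the target rating, weighted by its co-rater count.
            weightedSum += coRaters[k] * (userRatings[k] - avgDiffToTarget[k]);
            totalWeight += coRaters[k];
        }
        if (totalWeight == 0) {
            throw new IllegalArgumentException("no item shares raters with the target");
        }
        return weightedSum / totalWeight;
    }

    public static void main(String[] args) {
        double[] userRatings = {4.0, 2.0};   // ra, rc
        double[] avgDiff = {0.5, -1.0};      // R(A->B), R(C->B)
        int[] coRaters = {2000, 20};         // n, m

        // Item A alone suggests 3.5 and item C alone suggests 3.0;
        // the weighted result is (2000 * 3.5 + 20 * 3.0) / 2020, roughly 3.495.
        System.out.println(predict(userRatings, avgDiff, coRaters));
    }
}
```

An unweighted average of the two estimates would give 3.25; weighting by co-rater count pulls the prediction toward the estimate backed by far more data.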

3 Implementation

RateMatrix.java: reads the input data file train.txt and builds the matrix of user ratings for movies.

    package common;

    import java.io.*;
    import java.util.*;

    public class RateMatrix {
        // rateMat[userIndex][movieIndex] holds the rating as a digit character; '0' means "not rated"
        private char[][] rateMat = null;
        HashMap<Integer, Integer> users = new HashMap<Integer, Integer>();   // user ID -> row index
        HashMap<Integer, Integer> movies = new HashMap<Integer, Integer>();  // movie ID -> column index
        private File train = null;

        public RateMatrix(String path) {
            train = new File(path);
        }

        public char[][] getRateMat() {
            computeRateMat();
            return this.rateMat;
        }

        public HashMap<Integer, Integer> getUsers() {
            return this.users;
        }

        public HashMap<Integer, Integer> getMovies() {
            return this.movies;
        }

        /**
         * Get the ID sets of users and movies, and
         * compute the number of each for the rateMat.
         * @throws IOException
         */
        void computeUser_Movies() throws IOException {
            SortedSet<Integer> userset = new TreeSet<Integer>();
            SortedSet<Integer> movieset = new TreeSet<Integer>();
            // try-with-resources closes the reader when the loop is done
            try (BufferedReader in = new BufferedReader(new FileReader(train))) {
                String line;
                while ((line = in.readLine()) != null) {
                    // each line: userID movieID rate, separated by whitespace
                    String[] uID_mID = line.split("\\s+");
                    int userId = Integer.parseInt(uID_mID[0]);
                    int movieId = Integer.parseInt(uID_mID[1]);
                    userset.add(userId);
                    movieset.add(movieId);
                }
            }
            // assign consecutive indices in ascending ID order
            Iterator<Integer> userIter = userset.iterator();
            int i = 0;
            while (userIter.hasNext()) {
                users.put(userIter.next(), i++);
            }
            i = 0;
            Iterator<Integer> movieIter = movieset.iterator();
            while (movieIter.hasNext()) {
                movies.put(movieIter.next(), i++);
            }
        }

        /**
         * Compute the rate matrix.
         */
        void computeRateMat() {
            try {
                computeUser_Movies();
                this.rateMat = new char[this.users.size()][this.movies.size()];
                for (int i = 0; i < rateMat.length; i++)
                    Arrays.fill(this.rateMat[i], '0');
                try (BufferedReader buf = new BufferedReader(new FileReader(train))) {
                    String line;
                    while ((line = buf.readLine()) != null) {
                        String[] uID_mID_rate = line.split("\\s+");
                        int userID = Integer.parseInt(uID_mID_rate[0]);
                        int movieID = Integer.parseInt(uID_mID_rate[1]);
                        char rate = uID_mID_rate[2].charAt(0);
                        // get the index of the userID in vector of users
                        int i = this.users.get(userID);
                        // get the index of the movieID in vector of movies
                        int j = this.movies.get(movieID);
                        this.rateMat[i][j] = rate;
                    }
                }
            } catch (IOException e) {
                e.printStackTrace();
            }
        }

        /**
         * For test: prints one entry of the matrix.
         * The IDs 187 and 2373 must exist in train.txt.
         */
        void ptMatrix() {
            int i = this.users.get(187);
            int j = this.movies.get(2373);
            System.out.println(i + " " + j + " " + this.rateMat[i][j]);
        }

        public static void main(String[] args) {
            RateMatrix temp = new RateMatrix("train.txt");
            temp.getRateMat();
            temp.ptMatrix();
        }
    }

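SlopeOne.java: the Slope One implementation. From the rating matrix it computes the average rating difference between every pair of movies and the number of users who rated both movies in the pair (the pair's frequency). The main method is buildmDiffs().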

    package SlopeOne;

    import java.util.*;
    import common.RateMatrix;

    class MDiffRate {
        double diff; // average rating difference between two movies
        int num;     // number of users who rated both movies

        public MDiffRate(double diff, int num) {
            this.diff = diff;
            this.num = num;
        }
    }

    public class SlopeOne {
        char[][] rMat = null;
        HashMap<Integer, Integer> userID = null;
        HashMap<Integer, Integer> movieID = null;
        RateMatrix RateMatrix_Fac = null;
        // mDiffMatrix[a][b].diff is the average of (rating of b - rating of a); null if no common raters
        MDiffRate[][] mDiffMatrix = null;

        public SlopeOne(String path) {
            RateMatrix_Fac = new RateMatrix(path);
            this.rMat = RateMatrix_Fac.getRateMat();
            this.userID = RateMatrix_Fac.getUsers();
            this.movieID = RateMatrix_Fac.getMovies();
            this.mDiffMatrix = new MDiffRate[movieID.size()][movieID.size()];
            System.out.println("loading: " + this.userID.size() + " users, " + this.movieID.size() + " movies.");
        }

        public char[][] getRMat() {
            return this.rMat;
        }

        public HashMap<Integer, Integer> getUserID() {
            return this.userID;
        }

        public HashMap<Integer, Integer> getMovieID() {
            return this.movieID;
        }

        public MDiffRate[][] getMDiffs() {
            return this.mDiffMatrix;
        }

        void buildmDiffs() {
            for (int i = 0; i < movieID.size(); i++) {
                // only visit j < i; the mirrored entry is filled in with the opposite sign
                for (int j = 0; j < i; j++) {
                    int frequency = 0;
                    double diffs = 0;
                    for (int z = 0; z < userID.size(); z++) {
                        if (rMat[z][i] != '0' && rMat[z][j] != '0') {
                            diffs += (rMat[z][j] - rMat[z][i]);
                            frequency++;
                        }
                    }
                    if (frequency > 0) { // the two movies have common raters
                        diffs = diffs / frequency;
                        mDiffMatrix[i][j] = new MDiffRate(diffs, frequency);
                        mDiffMatrix[j][i] = new MDiffRate(-diffs, frequency);
                    }
                }
            }
        }
    }

SlopePredict.java: predicts ratings for the entries in test.txt.

    package SlopeOne;

    import java.io.BufferedReader;
    import java.io.BufferedWriter;
    import java.io.File;
    import java.io.FileReader;
    import java.io.FileWriter;
    import java.io.IOException;
    import java.util.HashMap;

    public class SlopePredict {
        File testFile = new File("test.txt");
        File outFile = new File("test.rate");
        SlopeOne slopeModel = null;
        char[][] rateMat = null;
        HashMap<Integer, Integer> userID = null;
        HashMap<Integer, Integer> movieID = null;
        MDiffRate[][] mDiffMatrix = null;

        SlopePredict(String trainPath) {
            slopeModel = new SlopeOne(trainPath);
            slopeModel.buildmDiffs();
            this.rateMat = slopeModel.getRMat();
            this.userID = slopeModel.getUserID();
            this.movieID = slopeModel.getMovieID();
            this.mDiffMatrix = slopeModel.getMDiffs();
        }

        // weighted Slope One prediction for a user and a movie that both appear in the train set
        double predict(int userIndex, int movieIndex) {
            double frequencysum = 0;
            double weightsum = 0;
            for (int i = 0; i < movieID.size(); i++) {
                if (rateMat[userIndex][i] != '0' && mDiffMatrix[movieIndex][i] != null) {
                    double rate = (double) (rateMat[userIndex][i] - '0');
                    weightsum += (rate - mDiffMatrix[movieIndex][i].diff) * mDiffMatrix[movieIndex][i].num;
                    frequencysum += mDiffMatrix[movieIndex][i].num;
                }
            }
            if (frequencysum == 0) {
                // added guard: none of the user's movies shares a rater with this movie, so the
                // original division gave NaN; fall back to the same neutral 3 used for unknown pairs
                return 3;
            }
            return weightsum / frequencysum;
        }

        void predict() throws IOException {
            try (BufferedReader in = new BufferedReader(new FileReader(testFile));
                 BufferedWriter out = new BufferedWriter(new FileWriter(outFile))) {
                String line = null;
                int userID_temp, movieID_temp;
                while ((line = in.readLine()) != null) {
                    int userIndex = -1, movieIndex = -1;
                    String[] temp = line.split("\\s+");
                    userID_temp = Integer.parseInt(temp[0]);
                    movieID_temp = Integer.parseInt(temp[1]);
                    if (userID.containsKey(userID_temp)) {
                        userIndex = userID.get(userID_temp);
                    }
                    if (movieID.containsKey(movieID_temp)) {
                        movieIndex = movieID.get(movieID_temp);
                    }
                    double ans = -555;
                    if (userIndex < 0 || movieIndex < 0) { // oops! new user! new movie!
                        if (userIndex < 0 && movieIndex >= 0) {
                            // new user, known movie: use the movie's average rating
                            double sum = 0;
                            int num = 0;
                            for (int i = 0; i < userID.size(); i++) {
                                if (rateMat[i][movieIndex] != '0') {
                                    sum += rateMat[i][movieIndex] - '0';
                                    num++;
                                }
                            }
                            ans = sum / num;
                        }
                        if (userIndex >= 0 && movieIndex < 0) {
                            // known user, new movie: use the user's average rating
                            double sum = 0;
                            int num = 0;
                            for (int i = 0; i < movieID.size(); i++) {
                                if (rateMat[userIndex][i] != '0') {
                                    sum += rateMat[userIndex][i] - '0';
                                    num++;
                                }
                            }
                            ans = sum / num;
                        }
                        if (userIndex < 0 && movieIndex < 0) {
                            ans = 3;
                        }
                    } else {
                        // the user and the movie exist in the train set
                        ans = predict(userIndex, movieIndex);
                    }
                    // round to an integer rating and clamp to the range 1..5
                    long res = Math.round(ans);
                    if (res <= 0) res = 1;
                    if (res >= 5) res = 5;
                    out.write(res + "\n");
                    out.flush();
                }
            }
        }

        public static void main(String[] args) {
            SlopePredict sp = new SlopePredict("train.txt");
            try {
                sp.predict();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }

References:

[1] http://zh.wikipedia.org/wiki/Slope_one
[2] Sarwar B, Karypis G, Konstan J, et al. Item-based collaborative filtering recommendation algorithms[C]//Proceedings of the 10th international conference on World Wide Web. ACM, 2001: 285-295.
[3] Lemire D, Maclachlan A. Slope one predictors for online rating-based collaborative filtering[J]. Society for Industrial Mathematics, 2005, 5: 471-480.
[4] http://my.oschina.net/liangtee/blog/124987

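PS: This was an assignment for this semester's data mining course, and there were several others before it. It occurred to me to post this one on my blog as a kind of summary. I plan to write up the earlier assignments too, though that will probably have to wait for the overtime stretch after exams. Once something is said out loud, one should do one's best to follow through, so I am saying it here to push myself to finish those write-ups properly.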

0 0