Recommender Systems Summary: Reproducing Several Recommendation Algorithms
Source: Internet. Editor: 程序博客网. Time: 2024/06/06 01:24
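Preface:
This post summarizes my implementations of several recommender system algorithms. I am a second-year undergraduate who has only recently started studying recommender systems, so the summary surely contains mistakes; corrections are welcome. Along the way I consulted many technical blogs and Dr. Xiang Liang's book 《推荐系统实践》 (Recommender System Practice), and I credit other people's summaries wherever I quote them.
Please credit the source when reposting: http://blog.csdn.net/czzffff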
During the summer break, our recommender systems group reproduced several algorithms from the literature on the MovieLens (1M) dataset, including:
Non-personalized models: Movie Average (MovieAvg) and Top Popular (TopPop);
Neighborhood models: Correlation Neighborhood (CorNgbr) and Non-normalized Cosine Neighborhood (NNCosNgbr);
Latent factor models: Asymmetric-SVD (AsySVD), SVD++ and PureSVD.
Paper reproduced: [RecSys '10, 2010, Cremonesi] Performance of recommender algorithms on top-N recommendation tasks
Related papers: [KDD '08, 2008, Koren] Factorization Meets the Neighborhood: a Multifaceted Collaborative Filtering Model
[CARS-2011, 2011, Cremonesi] Top-N recommendations on unpopular items with contextual knowledge
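1. Dataset
(a) Name: MovieLens (1M)
(b) Description: 1,000,209 ratings of about 3,900 movies by 6,040 users who joined MovieLens in 2000. The dataset consists of three files: ratings.dat, users.dat and movies.dat.
ratings.dat: user id, movie id, rating (1 to 5), timestamp; format: UserID::MovieID::Rating::Timestamp;
users.dat: user id, gender, age, occupation, zip code; format: UserID::Gender::Age::Occupation::Zip-code;
movies.dat: movie id, title, genres; format: MovieID::Title::Genres.
(c) Source: http://grouplens.org/datasets/movielens/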
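As a minimal sketch of reading the `UserID::MovieID::Rating::Timestamp` format described above, the snippet below parses one line into a typed record. The `Rating` tuple, the `parse_rating` helper and the sample line are illustrative assumptions, not part of the original code.

```python
from typing import NamedTuple


class Rating(NamedTuple):
    user_id: int
    movie_id: int
    rating: int
    timestamp: int


def parse_rating(line: str) -> Rating:
    """Parse one 'UserID::MovieID::Rating::Timestamp' line of ratings.dat."""
    user_id, movie_id, rating, timestamp = line.strip().split("::")
    return Rating(int(user_id), int(movie_id), int(rating), int(timestamp))


# Illustrative sample line in the ratings.dat format.
sample = "1::1193::5::978300760\n"
parsed = parse_rating(sample)
print(parsed)
```

The same split on `::` applies to users.dat and movies.dat; only the number and meaning of the fields change.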
2. Data preprocessing
Merge the three files into a single file with one record per line, in the following format:
userId,movieId,rating,timestamp,age,gender,occupation,zipcode,movietitle,year,genres
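The merge can be sketched as a join of each rating line against two lookup tables. This is only an illustration under stated assumptions: `load_table`, `split_title` and `merge_record` are hypothetical helpers, the year is assumed to be the four digits in parentheses at the end of the title, and commas inside titles are assumed to be replaced so that a plain `split(',')` on the merged line still yields 11 fields.

```python
import re


def load_table(lines):
    """Map the first '::' field (the id) to the list of remaining fields."""
    table = {}
    for line in lines:
        key, *rest = line.strip().split("::")
        table[key] = rest
    return table


def split_title(title):
    """Split 'Name (1995)' into ('Name', '1995'); the year is '' if absent."""
    match = re.search(r"\((\d{4})\)\s*$", title)
    if match is None:
        return title, ""
    return title[:match.start()].strip(), match.group(1)


def merge_record(rating_line, users, movies):
    """Build one merged CSV line in the field order given in the text."""
    user_id, movie_id, rating, timestamp = rating_line.strip().split("::")
    gender, age, occupation, zipcode = users[user_id]
    title, genres = movies[movie_id]
    name, year = split_title(title)
    # Assumption: drop commas from titles so the merged line has 11 fields.
    name = name.replace(",", " ")
    return ",".join([user_id, movie_id, rating, timestamp, age, gender,
                     occupation, zipcode, name, year, genres])


# Illustrative input lines in the three file formats.
users = load_table(["1::F::1::10::48067"])
movies = load_table(["1193::One Flew Over the Cuckoo's Nest (1975)::Drama"])
row = merge_record("1::1193::5::978300760", users, movies)
print(row)
```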
Following the paper, 1.4% of these records are drawn at random as the probe set and the rest form the training set. All probe-set records with a rating of 5 are then extracted as the test set.
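The split can be sketched as follows. The function name `split_records`, the fixed seed and the synthetic records are assumptions made for the example; only the 1.4% fraction and the "rating equals 5" filter come from the text.

```python
import random


def split_records(records, probe_fraction=0.014, seed=0):
    """records: list of (user, item, rating). Returns (training, probe, test)."""
    rng = random.Random(seed)
    shuffled = records[:]
    rng.shuffle(shuffled)
    probe_size = round(len(shuffled) * probe_fraction)
    probe = shuffled[:probe_size]
    training = shuffled[probe_size:]
    # The test set keeps only the 5-star records of the probe set.
    test = [r for r in probe if r[2] == 5]
    return training, probe, test


# Synthetic demo data: 1000 records whose ratings cycle through 1..5.
records = [(u, u % 50, 1 + u % 5) for u in range(1000)]
training, probe, test = split_records(records)
print(len(training), len(probe), len(test))
```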
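3. Evaluation metrics
(a) Metrics: recall and precision
(b) Procedure:
1) Name: recall and precision
2) Steps:
a) Step 1: take one record (user ID, movie ID) from the test set;
b) Step 2: randomly select 1000 movie IDs that user u has not rated and add the movie ID of the test record, giving 1001 movie IDs in total;
c) Step 3: score these 1001 movies with the algorithm under evaluation and sort them by score in descending order to obtain a ranked list;
d) Step 4: recommend the top N movies of the list (N an integer from 0 to 20); if the top N contain the movie ID from Step 1, count one hit;
e) Step 5: move on to the next test record and repeat the steps above, accumulating the number of hits for every N;
f) Step 6: recall is the number of hits divided by the number of test records;
g) Step 7: precision is recall divided by N (for N ≥ 1).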
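(c) Formulas: with T the test set and #hits the number of hits at a given N,
$$\mathrm{recall}(N) = \frac{\#\mathrm{hits}}{|T|}, \qquad \mathrm{precision}(N) = \frac{\#\mathrm{hits}}{N \cdot |T|} = \frac{\mathrm{recall}(N)}{N}$$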
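The evaluation procedure can be sketched as one function that works with any scoring rule. The names `evaluate_top_n`, `unrated` and `score` are hypothetical, and ties are resolved in favour of the test item, which is an assumption the text does not specify.

```python
import random


def evaluate_top_n(test, unrated, score, max_n=20, num_negatives=1000, seed=0):
    """Return (recall, precision) dicts keyed by N = 1..max_n.

    test:    list of (user, item) pairs, each a held-out 5-star rating
    unrated: dict user -> list of items the user has not rated
    score:   callable (user, item) -> predicted score
    """
    rng = random.Random(seed)
    hits = [0] * (max_n + 1)
    for user, target in test:
        pool = [i for i in unrated[user] if i != target]
        negatives = rng.sample(pool, min(num_negatives, len(pool)))
        target_score = score(user, target)
        # Rank of the target = 1 + number of sampled items scored strictly higher.
        rank = 1 + sum(1 for i in negatives if score(user, i) > target_score)
        # A target at rank r is a hit for every N >= r.
        for n in range(rank, max_n + 1):
            hits[n] += 1
    recall = {n: hits[n] / len(test) for n in range(1, max_n + 1)}
    precision = {n: recall[n] / n for n in range(1, max_n + 1)}
    return recall, precision


# Toy check: 50 items, the score of an item is simply its id.
unrated = {"u": list(range(50))}
test = [("u", 47), ("u", 10)]
recall, precision = evaluate_top_n(test, unrated, lambda u, i: i,
                                   num_negatives=49)
print(recall[3], precision[5])
```

In the toy data, item 47 is beaten only by items 48 and 49, so it ranks third, while item 10 falls outside the top 20.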
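4. Algorithms (source code is given for some of them)
(I) Non-personalized models
(1) Algorithm: Movie Average (MovieAvg)
Steps:
a) Step 1: compute the average rating of every movie that has been rated;
b) Step 2: use each movie's average rating as its score and sort the candidate list in descending order.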
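A minimal sketch of MovieAvg, assuming the training data is a list of (user, item, rating) triples; the helper name `movie_average` and the toy data are illustrative only.

```python
from collections import defaultdict


def movie_average(train):
    """train: iterable of (user, item, rating). Returns item -> mean rating."""
    total = defaultdict(float)
    count = defaultdict(int)
    for _, item, rating in train:
        total[item] += rating
        count[item] += 1
    return {item: total[item] / count[item] for item in total}


train = [("u1", "A", 5), ("u2", "A", 3), ("u1", "B", 5), ("u3", "C", 2)]
avg = movie_average(train)
# Rank items by average rating, highest first.
ranked = sorted(avg, key=avg.get, reverse=True)
print(ranked)
```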
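(2) Algorithm: Top Popular (TopPop)
Steps:
a) Step 1: count how many times each movie has been rated;
b) Step 2: use the number of ratings as the score (the more ratings, the higher the rank) and sort the candidate list in descending order.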
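TopPop can be sketched with a counter over the training triples. The names `top_pop` and `candidates` are assumptions for the example; items that never appear in training get a count of zero.

```python
from collections import Counter


def top_pop(train):
    """train: iterable of (user, item, rating). Returns item -> rating count."""
    return Counter(item for _, item, _ in train)


train = [("u1", "A", 5), ("u2", "A", 3), ("u3", "A", 4),
         ("u1", "B", 4), ("u2", "C", 2), ("u3", "C", 5)]
popularity = top_pop(train)
# Rank an arbitrary candidate list by popularity, most rated first.
candidates = ["B", "C", "A", "D"]
ranked = sorted(candidates, key=lambda item: popularity[item], reverse=True)
print(ranked)
```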
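(II) Neighborhood models
(1) Algorithm: Correlation Neighborhood (CorNgbr)
Steps:
a) Step 1: use the baseline estimate formula to obtain the baseline score b_ui of every user for every movie;
b) Step 2: compute the similarity s_ij between related movies (movies rated by at least one common user) with the Pearson correlation coefficient, and derive the shrunk similarity d_ij;
c) Step 3: choose K and keep the K most similar movies for the item-based collaborative filtering formula;
d) Step 4: combine the similarities and baselines from the previous steps in the CorNgbr formula to obtain the predicted score.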
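Formulas:
$$b_{ui} = \mu + b_u + b_i$$
$$d_{ij} = \frac{n_{ij}}{n_{ij} + \lambda_1}\, s_{ij}$$
$$\hat{r}_{ui} = b_{ui} + \frac{\sum_{j \in D^k(u;i)} d_{ij}\,(r_{uj} - b_{uj})}{\sum_{j \in D^k(u;i)} d_{ij}}$$
Here μ is the global mean rating, n_ij is the number of users who rated both i and j, λ1 is the shrinkage constant (100 in the code further down), and D^k(u;i) is the set of the k items rated by u that are most similar to i.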
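The CorNgbr scoring rule can be sketched for a single (user, item) pair as below. This is an illustration under assumptions, not the author's implementation: `shrunk_pearson` and `predict_cor_ngbr` are hypothetical helpers, the Pearson means are taken over the common raters only, and only positive similarities are kept as neighbors.

```python
from math import sqrt


def shrunk_pearson(ratings_i, ratings_j, shrinkage=100):
    """ratings_*: dict user -> rating for one item. Returns d_ij."""
    common = set(ratings_i) & set(ratings_j)
    n = len(common)
    if n < 2:
        return 0.0
    mean_i = sum(ratings_i[u] for u in common) / n
    mean_j = sum(ratings_j[u] for u in common) / n
    cov = sum((ratings_i[u] - mean_i) * (ratings_j[u] - mean_j) for u in common)
    var_i = sum((ratings_i[u] - mean_i) ** 2 for u in common)
    var_j = sum((ratings_j[u] - mean_j) ** 2 for u in common)
    if var_i == 0 or var_j == 0:
        return 0.0
    s_ij = cov / sqrt(var_i * var_j)
    return n / (n + shrinkage) * s_ij


def predict_cor_ngbr(user_ratings, baseline, sims, item, k=2):
    """user_ratings: dict item -> rating given by the user
    baseline:     callable item -> b_ui for this user
    sims:         dict neighbor item -> d_ij for the target item
    """
    # Keep the k rated items most similar to the target item.
    neighbors = sorted((j for j in user_ratings if sims.get(j, 0.0) > 0),
                       key=lambda j: sims[j], reverse=True)[:k]
    denom = sum(sims[j] for j in neighbors)
    if denom == 0:
        return baseline(item)
    num = sum(sims[j] * (user_ratings[j] - baseline(j)) for j in neighbors)
    return baseline(item) + num / denom


d = shrunk_pearson({"u1": 5, "u2": 3, "u3": 1}, {"u1": 4, "u2": 3, "u3": 2})
pred = predict_cor_ngbr({"j1": 5, "j2": 4, "j3": 1}, lambda item: 3.0,
                        {"j1": 0.5, "j2": 0.3, "j3": 0.1}, "target", k=2)
print(d, pred)
```

With k = 2 only j1 and j2 contribute, so the least similar neighbor j3 has no effect on the prediction.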
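Notes: (1) Computing the baseline b_ui
From two technical blogs I found three ways to compute the baseline. The article 《SVD因式分解实现协同过滤-及源码实现》 by Dustinsea describes the baseline in detail and gives two methods for b_ui:
Method 1: estimate b_u and b_i directly as the average ratings of the user and of the item. For example, b_u = sum(Ru)/len(Ru), where Ru is the set of ratings given by user u, sum(Ru) is their sum and len(Ru) is the size of the set; b_i = sum(Ri)/len(Ri), where Ri is the set of ratings received by item i, sum(Ri) is their sum and len(Ri) is the size of the set.
Method 2: minimize the regularized squared error
$$\min_{b} \sum_{(u,i)} (r_{ui} - \mu - b_u - b_i)^2 + \lambda \Big( \sum_u b_u^2 + \sum_i b_i^2 \Big)$$
where r_ui are the known ratings and μ can be computed directly. The per-user parameters b_u and per-item parameters b_i can then be solved for (this amounts to AX = B with X holding b_u and b_i, solvable by least squares, for example with the optimization routines in Numerical Recipes: The Art of Scientific Computing). The simplest approach, of course, is to compute b_u and b_i directly from the observed ratings:
$$b_i = \frac{\sum_{u \in R(i)} (r_{ui} - \mu)}{\lambda_2 + |R(i)|}, \qquad b_u = \frac{\sum_{i \in R(u)} (r_{ui} - \mu - b_i)}{\lambda_3 + |R(u)|}$$
(the code further down uses λ2 = 25 and λ3 = 10).
Method 3: use gradient descent; see the article 《基于baseline和stochastic gradient descent的个性化推荐系统》: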
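Taking the partial derivatives of the objective with respect to b_u and b_i gives the gradient updates (stochastic gradient descent drives the objective function toward its minimum within the set number of iterations), with learning rate α and regularization weight β:
$$e_{ui} = r_{ui} - \mu - b_u - b_i$$
$$b_u \leftarrow b_u + \alpha\,(e_{ui} - \beta\, b_u), \qquad b_i \leftarrow b_i + \alpha\,(e_{ui} - \beta\, b_i)$$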
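I write the computed b_u and b_i to files so that the CorNgbr algorithm can use them for top-N recommendation. The code is below; it is based on the code in 《基于baseline和stochastic gradient descent的个性化推荐系统》: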
    '''
    Created on 2014/7
    @Author: ZackChan
    @E-mail: zackchan1993@gmail.com
    @Homepage: http://blog.csdn.net/czzffff
    '''
    __author__ = 'zackchan1993@gmail.com'

    from math import sqrt


    def load_ratings(filename):
        """Read 'userId,movieId,rating,...' lines into {user: {item: rating}}."""
        data = {}
        with open(filename) as f:
            for line in f:
                userId, itemId, rating = line.strip().split(',')[:3]
                data.setdefault(userId, {})[itemId] = float(rating)
        return data


    def load_data():
        filename_train = r'G:\文献\movielens\movieLens\movielens处理后数据\whole.csv'
        filename_test = r'G:\文献\movielens\movieLens\Set\TestSet1.csv'
        return load_ratings(filename_train), load_ratings(filename_test)


    def calMean(train):
        """Global mean rating over the training set."""
        total = 0.0
        num = 0
        for u in train:
            for i in train[u]:
                total += train[u][i]
                num += 1
        return total / num


    def initialBias(train, userNum, movieNum):
        """Initialise bi and bu with shrunk averages (shrinkage 25 and 10)."""
        mean = calMean(train)
        print("mean=" + str(mean))
        # Movie ids are the strings '0'..str(movieNum); unrated movies keep bi = 0.
        bi = {str(x): 0.0 for x in range(movieNum + 1)}
        biNum = {str(x): 0 for x in range(movieNum + 1)}
        for u in train:
            for i in train[u]:
                bi[i] += train[u][i] - mean
                biNum[i] += 1
        for i in bi:
            bi[i] = bi[i] / (biNum[i] + 25)
        bu = {}
        for u in train:
            total = sum(train[u][i] - mean - bi[i] for i in train[u])
            bu[u] = total / (len(train[u]) + 10)
        return bu, bi, mean


    def sgd(train, test, userNum, movieNum):
        """Refine bu and bi by stochastic gradient descent and save them as CSV."""
        bu, bi, mean = initialBias(train, userNum, movieNum)
        alpha1 = 0.002    # learning rate
        beta1 = 0.1       # regularisation weight
        slowRate = 0.99   # learning-rate decay per epoch
        preRmse = 1000000000.0
        for step in range(200):
            rmse = 0.0
            n = 0
            for u in train:
                for i in train[u]:
                    pui = mean + bu[u] + bi[i]
                    eui = train[u][i] - pui
                    rmse += eui ** 2
                    n += 1
                    bu[u] += alpha1 * (eui - beta1 * bu[u])
                    bi[i] += alpha1 * (eui - beta1 * bi[i])
            nowRmse = sqrt(rmse / n)
            print("step: %d Rmse: %s" % (step + 1, nowRmse))
            if nowRmse < preRmse:
                preRmse = nowRmse
            alpha1 *= slowRate
        # Write bu and bi to files
        with open('newbu1.csv', 'w') as file_bu:
            for u in train:
                file_bu.write(str(u) + ',' + str(bu[u]) + '\n')
        with open('newbi1.csv', 'w') as file_bi:
            for j in range(1, movieNum + 1):
                file_bi.write(str(j) + ',' + str(bi[str(j)]) + '\n')
        return bu, bi, mean


    def calRmse(test, bu, bi, mean):
        """RMSE of the baseline predictor mean + bu + bi on the test set."""
        rmse = 0.0
        n = 0
        for u in test:
            for i in test[u]:
                pui = mean + bu[u] + bi[i]
                rmse += (pui - test[u][i]) ** 2
                n += 1
        return sqrt(rmse / n)


    if __name__ == "__main__":
        train, test = load_data()
        bu, bi, mean = sgd(train, test, 6040, 3952)
        print('the Rmse of test set is: %s' % calRmse(test, bu, bi, mean))
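CorNgbr implementation: it helps to compute the item similarities itemSim first and write them to a file, because computing the similarities takes about 6 GB of memory; once they have been written out and read back, running this code takes about 3 GB.
Parts of the code follow the article 《基于neighborhood models(item-based) 的个性化推荐系统》: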
    '''
    Created on 2014/7
    @Author: ZackChan
    @E-mail: zackchan1993@gmail.com
    @Homepage: http://blog.csdn.net/czzffff
    '''
    __author__ = 'zackchan1993@gmail.com'

    from math import fabs, sqrt
    import random


    def load_data():
        train = {}
        # List of (userId, itemId): a dict keyed by user would silently drop
        # all but one test record of users with several 5-star probe ratings.
        test = []
        filename_train = r'D:\python341\MyCorNgbr\TrainingSet1.csv'
        filename_test = r'D:\python341\MyCorNgbr\TestSet1.csv'
        with open(filename_train) as f:
            for line in f:
                userId, itemId, rating = line.strip().split(',')[:3]
                train.setdefault(userId, {})[itemId] = float(rating)
        with open(filename_test) as f:
            for line in f:
                userId, itemId = line.strip().split(',')[:2]
                test.append((userId, itemId))
        numtest = len(test)
        print("testnumber")
        print(numtest)
        return train, test, numtest


    def load_unrated():
        """Read 'userId,item1,item2,...' lines; shuffle each user's unrated items."""
        unrated = {}
        filename_unrated = r'D:\python341\MyCorNgbr\without_rated.csv'
        with open(filename_unrated) as f:
            for line in f:
                fields = line.strip().split(',')
                items = fields[1:]
                random.shuffle(items)
                unrated.setdefault(fields[0], items)
        return unrated


    def load_bui():
        bu = {}
        bi = {}
        mean = 3.5813100089534955   # global mean rating, hard-coded
        filename_bu = r'D:\python341\MyCorNgbr\_newbu1.csv'
        filename_bi = r'D:\python341\MyCorNgbr\_newbi1.csv'
        with open(filename_bu) as f:
            for linebu in f:
                userId, valbu = linebu.strip().split(',')
                bu[userId] = float(valbu)
        with open(filename_bi) as f:
            for linebi in f:
                movieId, valbi = linebi.strip().split(',')
                bi[movieId] = float(valbi)
        return bu, bi, mean


    def initial(train):
        """Compute shrunk Pearson similarities between items; write them to sij.csv."""
        filename_sij = r'D:\python341\MyCorNgbr\sij.csv'
        average = {}   # mean rating of each item
        N = {}         # number of ratings of each item
        Sij = {}       # Sij[i][j] = users who rated both i and j
        for u in train:
            for i in train[u]:
                average[i] = average.get(i, 0) + train[u][i]
                N[i] = N.get(i, 0) + 1
                Sij.setdefault(i, {})
                for j in train[u]:
                    if i != j:
                        Sij[i].setdefault(j, []).append(u)
        for i in average:
            average[i] = average[i] / N[i]
        itemSim = {}
        with open(filename_sij, 'w') as file_sij:
            for i in Sij:
                itemSim[i] = {}
                for j in Sij[i]:
                    part1 = part2 = part3 = 0.0
                    for u in Sij[i][j]:
                        di = train[u][i] - average[i]
                        dj = train[u][j] - average[j]
                        part1 += di * dj
                        part2 += di ** 2
                        part3 += dj ** 2
                    # Zero covariance means zero correlation
                    # (the original code defaulted to 1 here).
                    pearson = part1 / sqrt(part2 * part3) if part1 != 0 else 0.0
                    n = len(Sij[i][j])
                    # Shrunk similarity dij with shrinkage constant 100
                    itemSim[i][j] = fabs(pearson * n / (n + 100))
                    file_sij.write('%s,%s,%s\n' % (i, j, itemSim[i][j]))
        return itemSim, average


    def load_itemSim():
        """Read the similarities written by initial() back from sij.csv."""
        itemSim = {}
        filename_itemSim = r'D:\python341\MyCorNgbr\sij.csv'
        with open(filename_itemSim) as f:
            for line in f:
                item1, item2, sim = line.strip().split(',')
                itemSim.setdefault(item1, {})[item2] = float(sim)
        return itemSim


    def CorNgbrModels(train, test, itemSim, mean, average, bu, bi, unrated, numtest):
        """Rank each test item among 1000 unrated items and report recall at N."""
        arr = [0] * 30   # arr[N] = number of hits within the top N
        for u, target in test:
            candidates = unrated[u][:1000]
            candidates.append(target)
            pui = {}
            for i in candidates:
                pui[i] = mean + bu[u] + bi[i]
                stat = 0
                stat2 = 0
                # All items rated by u are used as neighbours (no top-K cut).
                for j in train[u]:
                    if i in itemSim and j in itemSim[i]:
                        stat += (train[u][j] - mean - bu[u] - bi[j]) * itemSim[i][j]
                        stat2 += itemSim[i][j]
                # Guard the denominator; the numerator may legitimately be negative.
                if stat2 > 0:
                    pui[i] += stat / stat2
            # Sort candidates by predicted score
            sorted_pui = sorted(pui.items(), key=lambda x: x[1], reverse=True)
            listnum = 1
            for k, v in sorted_pui:
                if k == target:
                    break
                listnum += 1
            while listnum <= 20:
                arr[listnum] += 1
                listnum += 1
        for temp in arr[:21]:
            print(temp)
        for temp in range(21):
            arr[temp] = arr[temp] / numtest
            print(arr[temp])
        with open('result.csv', 'w') as file_result:
            file_result.write(',')
            for i in range(1, 21):
                file_result.write(str(i) + ',')
            file_result.write('\n')
            for j in arr[:21]:
                file_result.write(str(j) + ',')
            file_result.write('\n')


    if __name__ == '__main__':
        train, test, numtest = load_data()
        unrated = load_unrated()
        bu, bi, mean = load_bui()
        itemSim, average = initial(train)
        CorNgbrModels(train, test, itemSim, mean, average, bu, bi, unrated, numtest)
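The recall produced by the code above still differs from the paper's by about 0.1, so it needs further revision; the code is for reference only.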
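(2) Algorithm: Non-normalized Cosine Neighborhood (NNCosNgbr)
Steps:
a) Step 1: use the baseline estimate formula to obtain the baseline score b_ui of every user for every movie;
b) Step 2: compute the similarity s_ij between related movies (movies rated by at least one common user) with the cosine measure, and derive the shrunk similarity d_ij;
c) Step 3: choose K and keep the K most similar movies for the item-based collaborative filtering formula;
d) Step 4: combine the similarities and baselines from the previous steps in the NNCosNgbr formula to obtain the predicted score.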
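Formulas:
$$d_{ij} = \frac{n_{ij}}{n_{ij} + \lambda_1}\, s_{ij}, \qquad s_{ij} = \cos(i, j)$$
$$\hat{r}_{ui} = b_{ui} + \sum_{j \in D^k(u;i)} d_{ij}\,(r_{uj} - b_{uj})$$
Unlike CorNgbr, the sum is not divided by the total similarity, so the result is a ranking score rather than a rating on the 1 to 5 scale.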
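A sketch of the NNCosNgbr score for one (user, item) pair follows. The helpers `shrunk_cosine` and `predict_nncos_ngbr` are hypothetical, and the cosine is computed over all ratings of each item with missing ratings treated as zeros, which is my reading of the paper and should be checked against it.

```python
from math import sqrt


def shrunk_cosine(ratings_i, ratings_j, shrinkage=100):
    """ratings_*: dict user -> rating for one item. Returns d_ij."""
    common = set(ratings_i) & set(ratings_j)
    if not common:
        return 0.0
    # Missing ratings count as zero, so only common raters enter the dot product.
    dot = sum(ratings_i[u] * ratings_j[u] for u in common)
    norm_i = sqrt(sum(r * r for r in ratings_i.values()))
    norm_j = sqrt(sum(r * r for r in ratings_j.values()))
    s_ij = dot / (norm_i * norm_j)
    n = len(common)
    return n / (n + shrinkage) * s_ij


def predict_nncos_ngbr(user_ratings, baseline, sims, item, k=2):
    """Same inputs as an item-based predictor, but without the denominator."""
    neighbors = sorted((j for j in user_ratings if sims.get(j, 0.0) > 0),
                       key=lambda j: sims[j], reverse=True)[:k]
    return baseline(item) + sum(sims[j] * (user_ratings[j] - baseline(j))
                                for j in neighbors)


d = shrunk_cosine({"u1": 3, "u2": 4}, {"u1": 3, "u2": 4})
score = predict_nncos_ngbr({"j1": 5, "j2": 4, "j3": 1}, lambda item: 3.0,
                           {"j1": 0.5, "j2": 0.3, "j3": 0.1}, "target", k=2)
print(d, score)
```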
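(III) Latent factor models
(1) Algorithm: Asymmetric-SVD (AsySVD)
Steps:
a) Step 1: derive the loss function from the formula;
b) Step 2: initialize the matrices p and q;
c) Step 3: iterate with stochastic gradient descent to obtain the final p and q;
d) Step 4: compute the user's predicted rating of a movie from the resulting p and q.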
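Formulas (as implemented in the code below, with learning rate α and regularization weight β):
$$\hat{r}_{ui} = \mu + b_u + b_i + p_u^{T} q_i$$
$$\min \sum_{(u,i)} (r_{ui} - \hat{r}_{ui})^2 + \beta \left( b_u^2 + b_i^2 + \lVert p_u \rVert^2 + \lVert q_i \rVert^2 \right)$$
$$e_{ui} = r_{ui} - \hat{r}_{ui}, \qquad p_u \leftarrow p_u + \alpha\,(e_{ui}\, q_i - \beta\, p_u), \qquad q_i \leftarrow q_i + \alpha\,(e_{ui}\, p_u - \beta\, q_i)$$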
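Notes:
For this algorithm see Section 2.5 (latent factor models) in Chapter 2 of Dr. Xiang Liang's 《推荐系统实践》, or the article 《SVD因式分解实现协同过滤-及源码实现》, which explains latent factor models in detail.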
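For comparison, the prediction rule of Asymmetric-SVD as I understand it from Koren's KDD '08 paper replaces the explicit user vector with a sum over the items the user rated. The sketch below assumes N(u) = R(u) (no extra implicit feedback) and uses hypothetical names (`predict_asysvd`, item factor tables `q`, `x`, `y`); it only shows prediction, not training.

```python
from math import sqrt


def predict_asysvd(item, user_ratings, baseline, q, x, y):
    """AsySVD-style prediction for one user and one target item.

    user_ratings: dict item -> rating given by the user, R(u) (also used as N(u))
    baseline:     callable item -> b_ui for this user
    q, x, y:      dicts item -> list of F factor values
    """
    f = len(q[item])
    profile = [0.0] * f
    n = len(user_ratings)
    if n > 0:
        norm = 1.0 / sqrt(n)
        for j, r_uj in user_ratings.items():
            residual = r_uj - baseline(j)
            for k in range(f):
                # Explicit part (residual * x_j) plus implicit part (y_j).
                profile[k] += norm * (residual * x[j][k] + y[j][k])
    return baseline(item) + sum(q[item][k] * profile[k] for k in range(f))


# Toy factors with F = 2 and a constant baseline of 4.0.
q = {"i": [1.0, 0.5]}
x = {"a": [0.1, 0.2], "b": [0.3, 0.0]}
y = {"a": [0.0, 0.1], "b": [0.1, 0.1]}
pred = predict_asysvd("i", {"a": 5, "b": 3}, lambda item: 4.0, q, x, y)
print(pred)
```

Because the user is represented only through rated items, a user with no ratings falls back to the baseline.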
    '''
    Created on 2014/7
    @Author: ZackChan
    @E-mail: zackchan1993@gmail.com
    @Homepage: http://blog.csdn.net/czzffff

    Note: as the function names indicate, this script trains a biased latent
    factor model (mean + bu + bi + p.q).
    '''
    __author__ = 'zackchan1993@gmail.com'

    import random
    from math import sqrt


    def load_triples(filename):
        """Read 'userId,movieId,rating,...' lines into (user, movie, rating) tuples."""
        triples = []
        with open(filename) as f:
            for line in f:
                userId, movieId, rating = line.strip().split(',')[:3]
                triples.append((userId, movieId, float(rating)))
        return triples


    def load_data():
        filename_train = r'G:\文献\movielens\movieLens\Set\TrainingSet1.csv'
        filename_test = r'G:\文献\movielens\movieLens\Set\TestSet1.csv'
        trainlist = load_triples(filename_train)
        mean = sum(r for _, _, r in trainlist) / len(trainlist)
        print("mean = " + str(mean))
        testlist = load_triples(filename_test)
        numtest = len(testlist)
        print("testnumber:" + str(numtest))
        return trainlist, testlist, numtest, mean


    def load_unrated():
        """Read 'userId,item1,item2,...' lines; shuffle each user's unrated items."""
        unrated = {}
        filename_unrated = r'G:\文献\movielens\movieLens\Set\without_rated.csv'
        with open(filename_unrated) as f:
            for line in f:
                fields = line.strip().split(',')
                items = fields[1:]
                random.shuffle(items)
                unrated.setdefault(fields[0], items)
        return unrated


    def InitBiasLFM(train, F):
        """Zero biases and small random factor vectors of length F."""
        p = dict()
        q = dict()
        bu = dict()
        bi = dict()
        for u, i, rui in train:
            bu[u] = 0
            bi[i] = 0
            if u not in p:
                p[u] = [random.random() / sqrt(F) for x in range(F)]
            if i not in q:
                q[i] = [random.random() / sqrt(F) for x in range(F)]
        return p, q, bu, bi


    def Predict(u, i, p, q, bu, bi, mean):
        """mean + bu + bi + p[u].q[i], falling back to the mean for unseen ids."""
        if u in bu and i in bi:
            ret = mean + bu[u] + bi[i]
        else:
            ret = mean
        if u in p and i in q:
            ret += sum(p[u][f] * q[i][f] for f in range(len(p[u])))
        return ret


    def LearningBiasLFM(train, F, n, alpha, beta, mean):
        p, q, bu, bi = InitBiasLFM(train, F)
        for step in range(n):
            rmse = 0.0   # reset every epoch so the printed RMSE is per epoch
            num = 0
            for u, i, rui in train:
                pui = Predict(u, i, p, q, bu, bi, mean)
                eui = rui - pui
                rmse += eui ** 2
                num += 1
                bu[u] += alpha * (eui - beta * bu[u])
                bi[i] += alpha * (eui - beta * bi[i])
                for f in range(F):
                    puf = p[u][f]   # keep the old value for the q update
                    p[u][f] += alpha * (q[i][f] * eui - beta * puf)
                    q[i][f] += alpha * (puf * eui - beta * q[i][f])
            alpha *= 0.9
            print(str(step) + ':rmse = ' + str(sqrt(rmse / num)))
        return p, q, bu, bi


    def TestRMSE(testlist, p, q, bu, bi, mean):
        num = 0
        rmse = 0.0
        for u, i, rui in testlist:
            pui = Predict(u, i, p, q, bu, bi, mean)
            rmse += (pui - rui) ** 2
            num += 1
        return sqrt(rmse / num)


    def TopNBiasSVD(testlist, p, q, bu, bi, mean, unrated, numtest):
        """Rank each test item among 1000 unrated items and report recall at N."""
        arr = [0] * 30   # arr[N] = number of hits within the top N
        for u, target, rui in testlist:
            candidates = unrated[u][:1000]
            candidates.append(target)
            # Fresh score table per test record; the loop variable no longer
            # overwrites the test item as it did in the original.
            pui = {i: Predict(u, i, p, q, bu, bi, mean) for i in candidates}
            # Sort candidates by predicted score
            sorted_pui = sorted(pui.items(), key=lambda x: x[1], reverse=True)
            listnum = 1
            for k, v in sorted_pui:
                if k == target:
                    break
                listnum += 1
            while listnum <= 20:
                arr[listnum] += 1
                listnum += 1
        for temp in arr[:21]:
            print(temp)
        for temp in range(21):
            arr[temp] = arr[temp] / numtest
            print(arr[temp])
        with open('result.csv', 'a') as file_result:
            file_result.write('\n')
            file_result.write(',')
            for i in range(1, 21):
                file_result.write(str(i) + ',')
            file_result.write('\n')
            for j in arr[:21]:
                file_result.write(str(j) + ',')
            file_result.write('\n')


    if __name__ == '__main__':
        F = 50
        n = 10
        alpha = 0.02
        beta = 0.01
        unrated = load_unrated()
        trainlist, testlist, numtest, mean = load_data()
        p, q, bu, bi = LearningBiasLFM(trainlist, F, n, alpha, beta, mean)
        rmse = TestRMSE(testlist, p, q, bu, bi, mean)
        print('testSet:' + str(rmse))
        with open('rmse.txt', 'a') as file_rmse:
            file_rmse.write('\n')
            file_rmse.write('F=' + str(F) + ',step=' + str(n) + ',alpha=' + str(alpha)
                            + ',beta=' + str(beta) + ': ' + str(rmse))
        TopNBiasSVD(testlist, p, q, bu, bi, mean, unrated, numtest)

The recall produced by the code above still differs from the paper's by about 0.05, so it needs further revision; the code is for reference only.
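(2) Algorithm: PureSVD
Steps:
a) Step 1: factorize the rating matrix with the svdlibc package to obtain the orthogonal matrix Q;
b) Step 2: compute the user's scores for the movies with the formula.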
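Formulas:
$$\hat{R} = U\, \Sigma\, Q^{T}, \qquad \hat{r}_{ui} = \mathbf{r}_u\, Q\, \mathbf{q}_i^{T}$$
Here R is the user-by-item rating matrix with missing ratings filled with zeros, the factorization keeps the f largest singular values, r_u is row u of R and q_i is row i of Q.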
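A small sketch of the PureSVD scoring rule, using NumPy's dense `numpy.linalg.svd` in place of svdlibc (a substitution made for the example). The function name `pure_svd_scores` and the toy matrix are assumptions; for the full MovieLens matrix a truncated sparse solver would be the more natural choice.

```python
import numpy as np


def pure_svd_scores(R, f):
    """R: dense user x item matrix with 0 for missing ratings.

    Returns the score matrix R Q Q^T, where Q holds the top-f right
    singular vectors of R as columns.
    """
    _, _, vt = np.linalg.svd(R, full_matrices=False)
    Q = vt[:f].T               # items x f, orthonormal columns
    return R @ Q @ Q.T


# Toy matrix with two clear taste groups.
R = np.array([[5, 4, 0, 0],
              [4, 5, 0, 0],
              [0, 0, 5, 4],
              [0, 0, 4, 5]], dtype=float)
scores = pure_svd_scores(R, f=2)
print(np.round(scores, 3))
```

With two factors each user's scores stay within their own taste group, which is why the first row keeps zeros for the last two items.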