chapter2:协同过滤
来源:互联网 发布:淘宝手表店推荐 编辑:程序博客网 时间:2024/05/18 03:34
https://github.com/zacharski/pg2dm-python
一、如何寻找相似用户
曼哈顿距离(Manhattan Distance)
|x1 - x2 | + | y1 - y2 |
欧式距离
sqrt( (x1-x2)^2 + (y1-y2)^2 )
N维下的思考
×××
一个缺陷
当没有缺失值时,曼哈顿距离和欧式距离非常好。缺失值的处理是一个活跃的学术研究问题
一般化
Python
## FILTERINGDATA.py## Code file for the book Programmer's Guide to Data Mining# http://guidetodatamining.com# Ron Zacharski#from math import sqrtusers = {"Angelica": {"Blues Traveler": 3.5, "Broken Bells": 2.0, "Norah Jones": 4.5, "Phoenix": 5.0, "Slightly Stoopid": 1.5, "The Strokes": 2.5, "Vampire Weekend": 2.0}, "Bill":{"Blues Traveler": 2.0, "Broken Bells": 3.5, "Deadmau5": 4.0, "Phoenix": 2.0, "Slightly Stoopid": 3.5, "Vampire Weekend": 3.0}, "Chan": {"Blues Traveler": 5.0, "Broken Bells": 1.0, "Deadmau5": 1.0, "Norah Jones": 3.0, "Phoenix": 5, "Slightly Stoopid": 1.0}, "Dan": {"Blues Traveler": 3.0, "Broken Bells": 4.0, "Deadmau5": 4.5, "Phoenix": 3.0, "Slightly Stoopid": 4.5, "The Strokes": 4.0, "Vampire Weekend": 2.0}, "Hailey": {"Broken Bells": 4.0, "Deadmau5": 1.0, "Norah Jones": 4.0, "The Strokes": 4.0, "Vampire Weekend": 1.0}, "Jordyn": {"Broken Bells": 4.5, "Deadmau5": 4.0, "Norah Jones": 5.0, "Phoenix": 5.0, "Slightly Stoopid": 4.5, "The Strokes": 4.0, "Vampire Weekend": 4.0}, "Sam": {"Blues Traveler": 5.0, "Broken Bells": 2.0, "Norah Jones": 3.0, "Phoenix": 5.0, "Slightly Stoopid": 4.0, "The Strokes": 5.0}, "Veronica": {"Blues Traveler": 3.0, "Norah Jones": 5.0, "Phoenix": 4.0, "Slightly Stoopid": 2.5, "The Strokes": 3.0} }def manhattan(rating1, rating2): """Computes the Manhattan distance. Both rating1 and rating2 are dictionaries of the form {'The Strokes': 3.0, 'Slightly Stoopid': 2.5}""" distance = 0 commonRatings = False for key in rating1: if key in rating2: distance += abs(rating1[key] - rating2[key]) commonRatings = True if commonRatings: return distance else: return -1 #Indicates no ratings in commondef computeNearestNeighbor(username, users): """creates a sorted list of users based on their distance to username""" distances = [] for user in users: if user != username: distance = manhattan(users[user], users[username]) distances.append((distance, user)) # sort based on distance -- closest first distances.sort() return distancesdef recommend(username, users): """Give list of recommendations""" # first find nearest neighbor nearest = computeNearestNeighbor(username, users)[0][1] recommendations = [] # now find bands neighbor rated that user didn't neighborRatings = users[nearest] userRatings = users[username] for artist in neighborRatings: if not artist in userRatings: recommendations.append((artist, neighborRatings[artist])) # using the fn sorted for variety - sort is more efficient return sorted(recommendations, key=lambda artistTuple: artistTuple[1], reverse = True)# examples - uncomment to runprint( recommend('Hailey', users))#print( recommend('Chan', users))
二、用户的评级差异
皮尔逊相关系数(Pearson Correlation Coefficient)
寻找某个感兴趣的用户的最相似用户
除了看上去或许有点复杂之外,上面公式的另一个问题在于算法时可能需要对数据进行多遍扫描。幸运的是,对于算法实现人员而言,还有另一个皮尔逊相关系数的近似计算公式:
from math import sqrtusers = {"Angelica": {"Blues Traveler": 3.5, "Broken Bells": 2.0, "Norah Jones": 4.5, "Phoenix": 5.0, "Slightly Stoopid": 1.5, "The Strokes": 2.5, "Vampire Weekend": 2.0}, "Bill":{"Blues Traveler": 2.0, "Broken Bells": 3.5, "Deadmau5": 4.0, "Phoenix": 2.0, "Slightly Stoopid": 3.5, "Vampire Weekend": 3.0}, "Chan": {"Blues Traveler": 5.0, "Broken Bells": 1.0, "Deadmau5": 1.0, "Norah Jones": 3.0, "Phoenix": 5, "Slightly Stoopid": 1.0}, "Dan": {"Blues Traveler": 3.0, "Broken Bells": 4.0, "Deadmau5": 4.5, "Phoenix": 3.0, "Slightly Stoopid": 4.5, "The Strokes": 4.0, "Vampire Weekend": 2.0}, "Hailey": {"Broken Bells": 4.0, "Deadmau5": 1.0, "Norah Jones": 4.0, "The Strokes": 4.0, "Vampire Weekend": 1.0}, "Jordyn": {"Broken Bells": 4.5, "Deadmau5": 4.0, "Norah Jones": 5.0, "Phoenix": 5.0, "Slightly Stoopid": 4.5, "The Strokes": 4.0, "Vampire Weekend": 4.0}, "Sam": {"Blues Traveler": 5.0, "Broken Bells": 2.0, "Norah Jones": 3.0, "Phoenix": 5.0, "Slightly Stoopid": 4.0, "The Strokes": 5.0}, "Veronica": {"Blues Traveler": 3.0, "Norah Jones": 5.0, "Phoenix": 4.0, "Slightly Stoopid": 2.5, "The Strokes": 3.0} }
def pearson(rating1, rating2): sum_xy = 0 sum_x = 0 sum_y = 0 sum_x2 = 0 sum_y2 = 0 n = 0 for key in rating1: if key in rating2: n += 1 x = rating1[key] y = rating2[key] sum_xy += x * y sum_x += x sum_y += y sum_x2 += pow(x, 2) sum_y2 += pow(y, 2) # now compute denominator denominator = sqrt(sum_x2 - pow(sum_x, 2) / n) * sqrt(sum_y2 - pow(sum_y, 2) / n) if denominator == 0: return 0 else: return (sum_xy - (sum_x * sum_y) / n) / denominator
print( pearson(users['Angelica'], users['Bill'] ) )print( pearson(users['Angelica'], users['Hailey'] ) )print( pearson(users['Angelica'], users['Jordyn'] ) )
三、最后一个公式---余弦相似度
不仅在文本挖掘中使用得非常普遍,而且也广泛用于协同过滤
例子:跟踪用户播放某首音乐的次数并基于该信息进行推荐
四、相似度的选择
如果数据受分数贬值(即不同用户使用不同的评级范围)的影响,则使用皮尔逊相关系数
如果数据稠密(几乎所有属性都没有零值)且属性值大小十分重要,那么使用诸如欧式距离或者曼哈顿距离
如果数据稀疏,考虑使用余弦相似度
五、K近邻
上面的问题:依赖单个“最相似”的用户进行推荐。该用户的其他怪癖爱好都会被推荐。
一种解决办法是基于多个相似的用户进行推荐。这里可以使用K近邻方法
利用K个最相似的用户来确定推荐结果。基本思路
六、Python的一个推荐类
import codecs from math import sqrtusers = {"Angelica": {"Blues Traveler": 3.5, "Broken Bells": 2.0, "Norah Jones": 4.5, "Phoenix": 5.0, "Slightly Stoopid": 1.5, "The Strokes": 2.5, "Vampire Weekend": 2.0}, "Bill":{"Blues Traveler": 2.0, "Broken Bells": 3.5, "Deadmau5": 4.0, "Phoenix": 2.0, "Slightly Stoopid": 3.5, "Vampire Weekend": 3.0}, "Chan": {"Blues Traveler": 5.0, "Broken Bells": 1.0, "Deadmau5": 1.0, "Norah Jones": 3.0, "Phoenix": 5, "Slightly Stoopid": 1.0}, "Dan": {"Blues Traveler": 3.0, "Broken Bells": 4.0, "Deadmau5": 4.5, "Phoenix": 3.0, "Slightly Stoopid": 4.5, "The Strokes": 4.0, "Vampire Weekend": 2.0}, "Hailey": {"Broken Bells": 4.0, "Deadmau5": 1.0, "Norah Jones": 4.0, "The Strokes": 4.0, "Vampire Weekend": 1.0}, "Jordyn": {"Broken Bells": 4.5, "Deadmau5": 4.0, "Norah Jones": 5.0, "Phoenix": 5.0, "Slightly Stoopid": 4.5, "The Strokes": 4.0, "Vampire Weekend": 4.0}, "Sam": {"Blues Traveler": 5.0, "Broken Bells": 2.0, "Norah Jones": 3.0, "Phoenix": 5.0, "Slightly Stoopid": 4.0, "The Strokes": 5.0}, "Veronica": {"Blues Traveler": 3.0, "Norah Jones": 5.0, "Phoenix": 4.0, "Slightly Stoopid": 2.5, "The Strokes": 3.0} }class recommender: def __init__(self, data, k=1, metric='pearson', n=5): """ initialize recommender currently, if data is dictionary the recommender is initialized to it. For all other data types of data, no initialization occurs k is the k value for k nearest neighbor metric is which distance formula to use n is the maximum number of recommendations to make""" self.k = k self.n = n self.username2id = {} self.userid2name = {} self.productid2name = {} # for some reason I want to save the name of the metric self.metric = metric if self.metric == 'pearson': self.fn = self.pearson # # if data is dictionary set recommender data to it # if type(data).__name__ == 'dict': self.data = data def convertProductID2name(self, id): """Given product id number return product name""" if id in self.productid2name: return self.productid2name[id] else: return id def userRatings(self, id, n): """Return n top ratings for user with id""" print ("Ratings for " + self.userid2name[id]) ratings = self.data[id] print(len(ratings)) ratings = list(ratings.items()) ratings = [(self.convertProductID2name(k), v) for (k, v) in ratings] # finally sort and return ratings.sort(key=lambda artistTuple: artistTuple[1], reverse = True) ratings = ratings[:n] for rating in ratings: print("%s\t%i" % (rating[0], rating[1])) def loadBookDB(self, path=''): """loads the BX book dataset. Path is where the BX files are located""" self.data = {} i = 0 # # First load book ratings into self.data # f = codecs.open(path + "BX-Book-Ratings.csv", 'r', 'utf8') for line in f: i += 1 #separate line into fields fields = line.split(';') user = fields[0].strip('"') book = fields[1].strip('"') rating = int(fields[2].strip().strip('"')) if user in self.data: currentRatings = self.data[user] else: currentRatings = {} currentRatings[book] = rating self.data[user] = currentRatings f.close() # # Now load books into self.productid2name # Books contains isbn, title, and author among other fields # f = codecs.open(path + "BX-Books.csv", 'r', 'utf8') for line in f: i += 1 #separate line into fields fields = line.split(';') isbn = fields[0].strip('"') title = fields[1].strip('"') author = fields[2].strip().strip('"') title = title + ' by ' + author self.productid2name[isbn] = title f.close() # # Now load user info into both self.userid2name and # self.username2id # f = codecs.open(path + "BX-Users.csv", 'r', 'utf8') for line in f: i += 1 #print(line) #separate line into fields fields = line.split(';') userid = fields[0].strip('"') location = fields[1].strip('"') if len(fields) > 3: age = fields[2].strip().strip('"') else: age = 'NULL' if age != 'NULL': value = location + ' (age: ' + age + ')' else: value = location self.userid2name[userid] = value self.username2id[location] = userid f.close() print(i) def pearson(self, rating1, rating2): sum_xy = 0 sum_x = 0 sum_y = 0 sum_x2 = 0 sum_y2 = 0 n = 0 for key in rating1: if key in rating2: n += 1 x = rating1[key] y = rating2[key] sum_xy += x * y sum_x += x sum_y += y sum_x2 += pow(x, 2) sum_y2 += pow(y, 2) if n == 0: return 0 # now compute denominator denominator = (sqrt(sum_x2 - pow(sum_x, 2) / n) * sqrt(sum_y2 - pow(sum_y, 2) / n)) if denominator == 0: return 0 else: return (sum_xy - (sum_x * sum_y) / n) / denominator def computeNearestNeighbor(self, username): """creates a sorted list of users based on their distance to username""" distances = [] for instance in self.data: if instance != username: distance = self.fn(self.data[username], self.data[instance]) distances.append((instance, distance)) # sort based on distance -- closest first distances.sort(key=lambda artistTuple: artistTuple[1], reverse=True) return distances def recommend(self, user): """Give list of recommendations""" recommendations = {} # first get list of users ordered by nearness nearest = self.computeNearestNeighbor(user) # # now get the ratings for the user # userRatings = self.data[user] # # determine the total distance totalDistance = 0.0 for i in range(self.k): totalDistance += nearest[i][1] # now iterate through the k nearest neighbors # accumulating their ratings for i in range(self.k): # compute slice of pie weight = nearest[i][1] / totalDistance # get the name of the person name = nearest[i][0] # get the ratings for this person neighborRatings = self.data[name] # get the name of the person # now find bands neighbor rated that user didn't for artist in neighborRatings: if not artist in userRatings: if artist not in recommendations: recommendations[artist] = (neighborRatings[artist] * weight) else: recommendations[artist] = (recommendations[artist] + neighborRatings[artist] * weight) # now make list from dictionary recommendations = list(recommendations.items()) recommendations = [(self.convertProductID2name(k), v) for (k, v) in recommendations] # finally sort and return recommendations.sort(key=lambda artistTuple: artistTuple[1], reverse = True) # Return the first n items return recommendations[:self.n]
七、一个实际的数据集
BX-Dump.zip,Cai-Nicolas Zeigler从Book Crossing网站上收集了超过100万书评,其中包含278858个用户对271379本书的评级。
该CSV文件包括3张表
- BX-Users表:正如名字的含义一样,该表包含的是用户的信息。具体包括整型的用户ID字段以及地址字段和年龄字段
- BX-Books表:书通过ISBN、书名、作者、出版年份和出版商来表示
- BX-Book-Rating表:包括用户ID、书的ISBN和一个0到10之间的评级分数
import remmender as r>>> rc = r.recommender(r.users)>>> rc.recommend('Jordyn')[('Blues Traveler', 5.0)]>>> rc.loadBookDB('BX-Dump/')1700018
# 现在我可以得到来自多伦多的一个用户17118的推荐结果>>> rc.recommend('171118')[(u"Devil's Waltz (Alex Delaware Novels (Paperback)) by Jonathan Kellerman", 9.0), (u'Silent Partner (Alex Delaware Novels (Paperback)) by Jonathan Kellerman', 8.0), (u'The Outsiders (Now in Speak!) by S. E. Hinton', 8.0), (u'Thinner by Stephen King', 8.0), (u'Sein Language by JERRY SEINFELD', 8.0)]>>> rc.userRatings('171118', 5)Ratings for toronto, ontario, canada2421The Careful Writer by Theodore M. Bernstein10The Darkest Road (The Fionavar Tapestry, Book 3) by Guy Gavriel Kay10Wonderful Life: The Burgess Shale and the Nature of History by Stephen Jay Gould10Time Power: The Revolutionary Time Management System That Can Change Your Professional and Personal by Charles Hobbs10Just So Stories (Penguin Twentieth-Century Classics) by Rudyard Kipling10
阅读全文
0 0
- chapter2:协同过滤
- 协同过滤
- 协同过滤
- 协同过滤
- 协同过滤
- 协同过滤
- 协同过滤
- 协同过滤
- 协同过滤
- 协同过滤
- 协同过滤
- 协同过滤
- 协同过滤
- 协同过滤
- 协同过滤
- 协同过滤
- 协同过滤
- 协同过滤
- 解决了电脑经常死机的问题
- 2012ICPC长春站 A Browsing History 【字符串】
- 【SQL Server学习笔记】18:对字符数据的处理
- bzoj 3143: [Hnoi2013]游走(高斯消元)
- 一些Linux命令简要笔记——文件系统
- chapter2:协同过滤
- VIJOS-P1626 爱在心中 tarjan
- 2012ICPC长春站 B Candy 【快速排列组合】
- bzoj-4627 [BeiJing2016]回转寿司 hash+权值线段树
- Go的闭包——计数器
- Unity_DOTween动画的学习(五)_Tweener的使用和注意事项_DOPlay播一次_DOPlayForward播多次_DOPlayBackwards倒放_SetAutoKill动画的自动销
- 2012ICPC长春站 I Count【暴力+模拟】
- Kubelet源码分析之diskSpaceManager
- JavaScript-2-6:canvas