Chapter 2: Collaborative Filtering


https://github.com/zacharski/pg2dm-python

1. Finding Similar Users

  Manhattan distance

    |x1 - x2| + |y1 - y2|

  Euclidean distance

    sqrt( (x1 - x2)^2 + (y1 - y2)^2 )

  Thinking in N dimensions

     ×××

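  A flaw

    Manhattan distance and Euclidean distance work very well when there are no missing values. How to handle missing values is an active area of academic research.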

  Generalization
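Manhattan and Euclidean distance are commonly treated as special cases of the Minkowski distance, d(x, y) = (sum over k of |x_k - y_k|^r)^(1/r), with r = 1 and r = 2 respectively. The sketch below assumes that this is the generalization meant here. The function name `minkowski`, the choice to compare only items both users rated, and the `None` return value are assumptions made for illustration.

```python
def minkowski(rating1, rating2, r):
    """Minkowski distance over the items both users rated.

    r = 1 gives Manhattan distance, r = 2 gives Euclidean distance.
    Returns None when the users share no rated items (an illustrative
    choice, not taken from the original listings).
    """
    common = [key for key in rating1 if key in rating2]
    if not common:
        return None
    total = sum(abs(rating1[key] - rating2[key]) ** r for key in common)
    return total ** (1.0 / r)


# Ratings copied from the users dictionary used throughout this chapter
hailey = {"Broken Bells": 4.0, "Deadmau5": 1.0, "Norah Jones": 4.0,
          "The Strokes": 4.0, "Vampire Weekend": 1.0}
veronica = {"Blues Traveler": 3.0, "Norah Jones": 5.0, "Phoenix": 4.0,
            "Slightly Stoopid": 2.5, "The Strokes": 3.0}

print(minkowski(hailey, veronica, 1))  # Manhattan distance
print(minkowski(hailey, veronica, 2))  # Euclidean distance
```

Hailey and Veronica share two artists (Norah Jones and The Strokes), each rated one point apart, so the Manhattan distance is 2.0 and the Euclidean distance is the square root of 2.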

    

Python

#
#  FILTERINGDATA.py
#
#  Code file for the book Programmer's Guide to Data Mining
#  http://guidetodatamining.com
#  Ron Zacharski
#

from math import sqrt

users = {"Angelica": {"Blues Traveler": 3.5, "Broken Bells": 2.0, "Norah Jones": 4.5, "Phoenix": 5.0, "Slightly Stoopid": 1.5, "The Strokes": 2.5, "Vampire Weekend": 2.0},
         "Bill":{"Blues Traveler": 2.0, "Broken Bells": 3.5, "Deadmau5": 4.0, "Phoenix": 2.0, "Slightly Stoopid": 3.5, "Vampire Weekend": 3.0},
         "Chan": {"Blues Traveler": 5.0, "Broken Bells": 1.0, "Deadmau5": 1.0, "Norah Jones": 3.0, "Phoenix": 5, "Slightly Stoopid": 1.0},
         "Dan": {"Blues Traveler": 3.0, "Broken Bells": 4.0, "Deadmau5": 4.5, "Phoenix": 3.0, "Slightly Stoopid": 4.5, "The Strokes": 4.0, "Vampire Weekend": 2.0},
         "Hailey": {"Broken Bells": 4.0, "Deadmau5": 1.0, "Norah Jones": 4.0, "The Strokes": 4.0, "Vampire Weekend": 1.0},
         "Jordyn":  {"Broken Bells": 4.5, "Deadmau5": 4.0, "Norah Jones": 5.0, "Phoenix": 5.0, "Slightly Stoopid": 4.5, "The Strokes": 4.0, "Vampire Weekend": 4.0},
         "Sam": {"Blues Traveler": 5.0, "Broken Bells": 2.0, "Norah Jones": 3.0, "Phoenix": 5.0, "Slightly Stoopid": 4.0, "The Strokes": 5.0},
         "Veronica": {"Blues Traveler": 3.0, "Norah Jones": 5.0, "Phoenix": 4.0, "Slightly Stoopid": 2.5, "The Strokes": 3.0}
        }


def manhattan(rating1, rating2):
    """Computes the Manhattan distance. Both rating1 and rating2 are dictionaries
       of the form {'The Strokes': 3.0, 'Slightly Stoopid': 2.5}"""
    distance = 0
    commonRatings = False
    for key in rating1:
        if key in rating2:
            distance += abs(rating1[key] - rating2[key])
            commonRatings = True
    if commonRatings:
        return distance
    else:
        return -1 #Indicates no ratings in common


def computeNearestNeighbor(username, users):
    """creates a sorted list of users based on their distance to username"""
    distances = []
    for user in users:
        if user != username:
            distance = manhattan(users[user], users[username])
            distances.append((distance, user))
    # sort based on distance -- closest first
    distances.sort()
    return distances


def recommend(username, users):
    """Give list of recommendations"""
    # first find nearest neighbor
    nearest = computeNearestNeighbor(username, users)[0][1]

    recommendations = []
    # now find bands neighbor rated that user didn't
    neighborRatings = users[nearest]
    userRatings = users[username]
    for artist in neighborRatings:
        if not artist in userRatings:
            recommendations.append((artist, neighborRatings[artist]))
    # using the fn sorted for variety - sort is more efficient
    return sorted(recommendations, key=lambda artistTuple: artistTuple[1], reverse = True)


# examples - uncomment to run

print( recommend('Hailey', users))
#print( recommend('Chan', users))

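2. Differences in How Users Rate

   Pearson correlation coefficient

     Used to find the users most similar to a user of interest.


    Besides looking somewhat complicated, the defining formula for the Pearson correlation coefficient has another problem: an implementation may need to make several passes over the data. Fortunately for implementers, there is an approximate formula for the Pearson correlation coefficient that needs only a single pass; the pearson function below implements it.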
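The formula images did not survive in this post. As a reconstruction, assuming the standard textbook definition for the first formula and reading the second directly off the `pearson` function in the listing that follows, the two forms are:

```latex
% Defining formula: the means \bar{x} and \bar{y} must be known first,
% which is why a direct implementation needs more than one pass.
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,
          \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}

% Single-pass form: only running sums are required.
r = \frac{\sum_{i=1}^{n} x_i y_i - \dfrac{\sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{n}}
         {\sqrt{\sum_{i=1}^{n} x_i^2 - \dfrac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}}\,
          \sqrt{\sum_{i=1}^{n} y_i^2 - \dfrac{\left(\sum_{i=1}^{n} y_i\right)^2}{n}}}
```

Here x_i and y_i are the two users' ratings of the i-th item they have both rated, and n is the number of such items. The two forms agree in exact arithmetic; in floating point the single-pass form can differ slightly.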


from math import sqrt

users = {"Angelica": {"Blues Traveler": 3.5, "Broken Bells": 2.0, "Norah Jones": 4.5, "Phoenix": 5.0, "Slightly Stoopid": 1.5, "The Strokes": 2.5, "Vampire Weekend": 2.0},
         "Bill":{"Blues Traveler": 2.0, "Broken Bells": 3.5, "Deadmau5": 4.0, "Phoenix": 2.0, "Slightly Stoopid": 3.5, "Vampire Weekend": 3.0},
         "Chan": {"Blues Traveler": 5.0, "Broken Bells": 1.0, "Deadmau5": 1.0, "Norah Jones": 3.0, "Phoenix": 5, "Slightly Stoopid": 1.0},
         "Dan": {"Blues Traveler": 3.0, "Broken Bells": 4.0, "Deadmau5": 4.5, "Phoenix": 3.0, "Slightly Stoopid": 4.5, "The Strokes": 4.0, "Vampire Weekend": 2.0},
         "Hailey": {"Broken Bells": 4.0, "Deadmau5": 1.0, "Norah Jones": 4.0, "The Strokes": 4.0, "Vampire Weekend": 1.0},
         "Jordyn":  {"Broken Bells": 4.5, "Deadmau5": 4.0, "Norah Jones": 5.0, "Phoenix": 5.0, "Slightly Stoopid": 4.5, "The Strokes": 4.0, "Vampire Weekend": 4.0},
         "Sam": {"Blues Traveler": 5.0, "Broken Bells": 2.0, "Norah Jones": 3.0, "Phoenix": 5.0, "Slightly Stoopid": 4.0, "The Strokes": 5.0},
         "Veronica": {"Blues Traveler": 3.0, "Norah Jones": 5.0, "Phoenix": 4.0, "Slightly Stoopid": 2.5, "The Strokes": 3.0}
        }


def pearson(rating1, rating2):
    sum_xy = 0
    sum_x = 0
    sum_y = 0
    sum_x2 = 0
    sum_y2 = 0
    n = 0
    # accumulate the running sums over items both users rated
    for key in rating1:
        if key in rating2:
            n += 1
            x = rating1[key]
            y = rating2[key]
            sum_xy += x * y
            sum_x += x
            sum_y += y
            sum_x2 += pow(x, 2)
            sum_y2 += pow(y, 2)
    # no items in common: avoid dividing by zero below
    if n == 0:
        return 0
    # now compute denominator
    denominator = sqrt(sum_x2 - pow(sum_x, 2) / n) * sqrt(sum_y2 - pow(sum_y, 2) / n)
    if denominator == 0:
        return 0
    else:
        return (sum_xy - (sum_x * sum_y) / n) / denominator


print( pearson(users['Angelica'], users['Bill'] ) )
print( pearson(users['Angelica'], users['Hailey'] ) )
print( pearson(users['Angelica'], users['Jordyn'] ) )

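3. One Last Formula: Cosine Similarity

   Cosine similarity is very common in text mining and is also widely used in collaborative filtering.

   Example: track how many times a user plays each song and make recommendations based on that information.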
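The notes give no formula or code for this section. The usual definition is cos(x, y) = (x · y) / (||x|| ||y||), the dot product divided by the product of the vector lengths. The sketch below applies it to play counts stored as dictionaries. The function name `cosine_similarity` and the sample play counts are made up for illustration, and treating a missing key as 0 is an assumption that suits sparse data.

```python
from math import sqrt


def cosine_similarity(counts1, counts2):
    """Cosine similarity between two sparse vectors stored as dicts.

    A key missing from a dict is treated as 0.
    """
    # only keys present in both dicts contribute to the dot product
    dot = sum(value * counts2[key]
              for key, value in counts1.items() if key in counts2)
    norm1 = sqrt(sum(value ** 2 for value in counts1.values()))
    norm2 = sqrt(sum(value ** 2 for value in counts2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return dot / (norm1 * norm2)


# Hypothetical play counts, invented for this example
ann = {"track_a": 10, "track_b": 5}
ben = {"track_a": 20, "track_b": 10}
cat = {"track_c": 7}

print(cosine_similarity(ann, ben))  # same proportions, so close to 1
print(cosine_similarity(ann, cat))  # no tracks in common, so 0.0
```

Ben plays everything twice as often as Ann but in the same proportions, so the two vectors point in the same direction. This is why cosine similarity is insensitive to how heavily a user listens overall.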


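4. Which Similarity Measure to Use

   If the data is subject to grade inflation (different users use different rating scales), use the Pearson correlation coefficient.

   If the data is dense (almost all attributes have nonzero values) and the magnitude of the attribute values matters, use a measure such as Euclidean or Manhattan distance.

   If the data is sparse, consider cosine similarity.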

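5. K-Nearest Neighbors

   The problem with the approach above: it relies on a single “most similar” user to make recommendations, so any quirky tastes of that user get recommended as well.

   One solution is to base recommendations on several similar users. The k-nearest-neighbor method can be used here.

   It uses the k most similar users to determine the recommendations. The basic idea is that each of the k neighbors contributes to the result in proportion to its similarity to the user.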
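A minimal sketch of that weighting, assuming the similarities are positive Pearson scores, as in the recommender class in the next section. The function name `project_rating`, the band name and the similarity values are hypothetical.

```python
def project_rating(neighbors, item):
    """Project a rating for item from the k nearest neighbors.

    neighbors is a list of (similarity, ratings_dict) pairs. Each
    neighbor's weight is its similarity divided by the sum of all
    k similarities, so the weights add up to 1.
    """
    total = sum(similarity for similarity, _ in neighbors)
    if total == 0:
        return None  # no usable similarity information
    projected = 0.0
    for similarity, ratings in neighbors:
        if item in ratings:
            projected += ratings[item] * (similarity / total)
    return projected


# Hypothetical neighbors: (Pearson similarity to the user, their ratings)
neighbors = [
    (0.8, {"Band X": 3.5}),
    (0.7, {"Band X": 5.0}),
    (0.5, {"Band X": 4.5}),
]

print(project_rating(neighbors, "Band X"))  # about 4.275
```

The similarities sum to 2.0, so the weights are 0.4, 0.35 and 0.25, and the projected rating is 3.5 × 0.4 + 5.0 × 0.35 + 4.5 × 0.25 = 4.275.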


6. A Recommender Class in Python

import codecs
from math import sqrt

users = {"Angelica": {"Blues Traveler": 3.5, "Broken Bells": 2.0,
                      "Norah Jones": 4.5, "Phoenix": 5.0,
                      "Slightly Stoopid": 1.5,
                      "The Strokes": 2.5, "Vampire Weekend": 2.0},

         "Bill":{"Blues Traveler": 2.0, "Broken Bells": 3.5,
                 "Deadmau5": 4.0, "Phoenix": 2.0,
                 "Slightly Stoopid": 3.5, "Vampire Weekend": 3.0},

         "Chan": {"Blues Traveler": 5.0, "Broken Bells": 1.0,
                  "Deadmau5": 1.0, "Norah Jones": 3.0, "Phoenix": 5,
                  "Slightly Stoopid": 1.0},

         "Dan": {"Blues Traveler": 3.0, "Broken Bells": 4.0,
                 "Deadmau5": 4.5, "Phoenix": 3.0,
                 "Slightly Stoopid": 4.5, "The Strokes": 4.0,
                 "Vampire Weekend": 2.0},

         "Hailey": {"Broken Bells": 4.0, "Deadmau5": 1.0,
                    "Norah Jones": 4.0, "The Strokes": 4.0,
                    "Vampire Weekend": 1.0},

         "Jordyn":  {"Broken Bells": 4.5, "Deadmau5": 4.0,
                     "Norah Jones": 5.0, "Phoenix": 5.0,
                     "Slightly Stoopid": 4.5, "The Strokes": 4.0,
                     "Vampire Weekend": 4.0},

         "Sam": {"Blues Traveler": 5.0, "Broken Bells": 2.0,
                 "Norah Jones": 3.0, "Phoenix": 5.0,
                 "Slightly Stoopid": 4.0, "The Strokes": 5.0},

         "Veronica": {"Blues Traveler": 3.0, "Norah Jones": 5.0,
                      "Phoenix": 4.0, "Slightly Stoopid": 2.5,
                      "The Strokes": 3.0}
        }


class recommender:

    def __init__(self, data, k=1, metric='pearson', n=5):
        """ initialize recommender
        currently, if data is dictionary the recommender is initialized
        to it.
        For all other data types of data, no initialization occurs
        k is the k value for k nearest neighbor
        metric is which distance formula to use
        n is the maximum number of recommendations to make"""
        self.k = k
        self.n = n
        self.username2id = {}
        self.userid2name = {}
        self.productid2name = {}
        # for some reason I want to save the name of the metric
        self.metric = metric
        if self.metric == 'pearson':
            self.fn = self.pearson
        #
        # if data is dictionary set recommender data to it
        #
        if type(data).__name__ == 'dict':
            self.data = data

    def convertProductID2name(self, id):
        """Given product id number return product name"""
        if id in self.productid2name:
            return self.productid2name[id]
        else:
            return id

    def userRatings(self, id, n):
        """Return n top ratings for user with id"""
        print ("Ratings for " + self.userid2name[id])
        ratings = self.data[id]
        print(len(ratings))
        ratings = list(ratings.items())
        ratings = [(self.convertProductID2name(k), v)
                   for (k, v) in ratings]
        # finally sort and return
        ratings.sort(key=lambda artistTuple: artistTuple[1],
                     reverse = True)
        ratings = ratings[:n]
        for rating in ratings:
            print("%s\t%i" % (rating[0], rating[1]))

    def loadBookDB(self, path=''):
        """loads the BX book dataset. Path is where the BX files are
        located"""
        self.data = {}
        i = 0
        #
        # First load book ratings into self.data
        #
        f = codecs.open(path + "BX-Book-Ratings.csv", 'r', 'utf8')
        for line in f:
            i += 1
            #separate line into fields
            fields = line.split(';')
            user = fields[0].strip('"')
            book = fields[1].strip('"')
            rating = int(fields[2].strip().strip('"'))
            if user in self.data:
                currentRatings = self.data[user]
            else:
                currentRatings = {}
            currentRatings[book] = rating
            self.data[user] = currentRatings
        f.close()
        #
        # Now load books into self.productid2name
        # Books contains isbn, title, and author among other fields
        #
        f = codecs.open(path + "BX-Books.csv", 'r', 'utf8')
        for line in f:
            i += 1
            #separate line into fields
            fields = line.split(';')
            isbn = fields[0].strip('"')
            title = fields[1].strip('"')
            author = fields[2].strip().strip('"')
            title = title + ' by ' + author
            self.productid2name[isbn] = title
        f.close()
        #
        #  Now load user info into both self.userid2name and
        #  self.username2id
        #
        f = codecs.open(path + "BX-Users.csv", 'r', 'utf8')
        for line in f:
            i += 1
            #print(line)
            #separate line into fields
            fields = line.split(';')
            userid = fields[0].strip('"')
            location = fields[1].strip('"')
            if len(fields) > 3:
                age = fields[2].strip().strip('"')
            else:
                age = 'NULL'
            if age != 'NULL':
                value = location + '  (age: ' + age + ')'
            else:
                value = location
            self.userid2name[userid] = value
            self.username2id[location] = userid
        f.close()
        print(i)

    def pearson(self, rating1, rating2):
        sum_xy = 0
        sum_x = 0
        sum_y = 0
        sum_x2 = 0
        sum_y2 = 0
        n = 0
        for key in rating1:
            if key in rating2:
                n += 1
                x = rating1[key]
                y = rating2[key]
                sum_xy += x * y
                sum_x += x
                sum_y += y
                sum_x2 += pow(x, 2)
                sum_y2 += pow(y, 2)
        if n == 0:
            return 0
        # now compute denominator
        denominator = (sqrt(sum_x2 - pow(sum_x, 2) / n)
                       * sqrt(sum_y2 - pow(sum_y, 2) / n))
        if denominator == 0:
            return 0
        else:
            return (sum_xy - (sum_x * sum_y) / n) / denominator

    def computeNearestNeighbor(self, username):
        """creates a sorted list of users based on their distance to
        username"""
        distances = []
        for instance in self.data:
            if instance != username:
                distance = self.fn(self.data[username],
                                   self.data[instance])
                distances.append((instance, distance))
        # sort based on distance -- closest first
        distances.sort(key=lambda artistTuple: artistTuple[1],
                       reverse=True)
        return distances

    def recommend(self, user):
        """Give list of recommendations"""
        recommendations = {}
        # first get list of users  ordered by nearness
        nearest = self.computeNearestNeighbor(user)
        #
        # now get the ratings for the user
        #
        userRatings = self.data[user]
        #
        # determine the total distance
        totalDistance = 0.0
        for i in range(self.k):
            totalDistance += nearest[i][1]
        # now iterate through the k nearest neighbors
        # accumulating their ratings
        for i in range(self.k):
            # compute slice of pie
            weight = nearest[i][1] / totalDistance
            # get the name of the person
            name = nearest[i][0]
            # get the ratings for this person
            neighborRatings = self.data[name]
            # get the name of the person
            # now find bands neighbor rated that user didn't
            for artist in neighborRatings:
                if not artist in userRatings:
                    if artist not in recommendations:
                        recommendations[artist] = (neighborRatings[artist]
                                                   * weight)
                    else:
                        recommendations[artist] = (recommendations[artist]
                                                   + neighborRatings[artist]
                                                   * weight)
        # now make list from dictionary
        recommendations = list(recommendations.items())
        recommendations = [(self.convertProductID2name(k), v)
                           for (k, v) in recommendations]
        # finally sort and return
        recommendations.sort(key=lambda artistTuple: artistTuple[1],
                             reverse = True)
        # Return the first n items
        return recommendations[:self.n]


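7. A Real Dataset

BX-Dump.zip: Cai-Nicolas Ziegler collected over one million book ratings from the Book Crossing website, covering ratings by 278,858 users of 271,379 books.

  The archive contains three tables in CSV form:

  • BX-Users: as the name suggests, this table holds user information: an integer user ID field, a location field, and an age field.
  • BX-Books: each book is described by its ISBN, title, author, year of publication, and publisher.
  • BX-Book-Ratings: contains the user ID, the book's ISBN, and a rating from 0 to 10.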

>>> import remmender as r
>>> rc = r.recommender(r.users)
>>> rc.recommend('Jordyn')
[('Blues Traveler', 5.0)]
>>> rc.loadBookDB('BX-Dump/')
1700018
>>> # Now I can get recommendations for user 171118, who is from Toronto
>>> rc.recommend('171118')
[(u"Devil's Waltz (Alex Delaware Novels (Paperback)) by Jonathan Kellerman", 9.0), (u'Silent Partner (Alex Delaware Novels (Paperback)) by Jonathan Kellerman', 8.0), (u'The Outsiders (Now in Speak!) by S. E. Hinton', 8.0), (u'Thinner by Stephen King', 8.0), (u'Sein Language by JERRY SEINFELD', 8.0)]
>>> rc.userRatings('171118', 5)
Ratings for toronto, ontario, canada
2421
The Careful Writer by Theodore M. Bernstein	10
The Darkest Road (The Fionavar Tapestry, Book 3) by Guy Gavriel Kay	10
Wonderful Life: The Burgess Shale and the Nature of History by Stephen Jay Gould	10
Time Power: The Revolutionary Time Management System That Can Change Your Professional and Personal by Charles Hobbs	10
Just So Stories (Penguin Twentieth-Century Classics) by Rudyard Kipling	10