Chapter 4: Content-Based Filtering and Classification: Filtering on Item Attributes
Source: Internet. Editor: 程序博客网. Time: 2024/05/22 06:03
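Collaborative filtering, also called social filtering, uses the power of a community of users to help make recommendations. Its difficulties include problems caused by data sparsity and scalability. Another problem is that recommender systems based on collaborative filtering tend to recommend items that are already popular; that is, they are biased toward popularity, which produces a "rich get richer" effect. As an extreme example, consider an album just released by a brand-new band: because nobody has rated or purchased the band or the album, it will never be recommended. This is the so-called "cold start" problem.
Here is a different approach to recommendation. Consider the streaming music site Pandora, whose recommendations are based on the Music Genome Project. Pandora hires professional musicians with strong backgrounds in music theory as analysts, and they determine the features of each song (Pandora calls them genes). These analysts receive more than 150 hours of training. Once trained, they spend an average of 20 to 30 minutes analyzing a song to determine its genes, or features. Many of these features are technical. Analysts score each song on more than 400 genes. With roughly 15,000 new songs added each month, this is a labor-intensive approach.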
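I. The importance of choosing appropriate values
Feature selection: pick features such as a song's genre and mood, and score each one on a scale of 1 to 5.
The data format in Python: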
music = {"Dr Dog/Fate": {"piano": 2.5, "vocals": 4, "beat": 3.5, "blues": 3, "guitar": 5, "backup vocals": 4, "rap": 1},
         "Phoenix/Lisztomania": {"piano": 2, "vocals": 5, "beat": 5, "blues": 3, "guitar": 2, "backup vocals": 1, "rap": 1},
         "Heartless Bastards/Out at Sea": {"piano": 1, "vocals": 5, "beat": 4, "blues": 2, "guitar": 4, "backup vocals": 1, "rap": 1},
         "Todd Snider/Don't Tempt Me": {"piano": 4, "vocals": 5, "beat": 4, "blues": 4, "guitar": 1, "backup vocals": 5, "rap": 1},
         "The Black Keys/Magic Potion": {"piano": 1, "vocals": 4, "beat": 5, "blues": 3.5, "guitar": 5, "backup vocals": 1, "rap": 1},
         "Glee Cast/Jessie's Girl": {"piano": 1, "vocals": 5, "beat": 3.5, "blues": 3, "guitar": 4, "backup vocals": 5, "rap": 1},
         "La Roux/Bulletproof": {"piano": 5, "vocals": 5, "beat": 4, "blues": 2, "guitar": 1, "backup vocals": 1, "rap": 1},
         "Mike Posner": {"piano": 2.5, "vocals": 4, "beat": 4, "blues": 1, "guitar": 1, "backup vocals": 1, "rap": 1},
         "Black Eyed Peas/Rock That Body": {"piano": 2, "vocals": 5, "beat": 5, "blues": 1, "guitar": 2, "backup vocals": 2, "rap": 4},
         "Lady Gaga/Alejandro": {"piano": 1, "vocals": 5, "beat": 3, "blues": 2, "guitar": 1, "backup vocals": 2, "rap": 1}}
Recommending with Manhattan distance:
users = {"Angelica": {"Blues Traveler": 3.5, "Broken Bells": 2.0, "Norah Jones": 4.5, "Phoenix": 5.0,
                      "Slightly Stoopid": 1.5, "The Strokes": 2.5, "Vampire Weekend": 2.0},
         "Bill": {"Blues Traveler": 2.0, "Broken Bells": 3.5, "Deadmau5": 4.0, "Phoenix": 2.0,
                  "Slightly Stoopid": 3.5, "Vampire Weekend": 3.0},
         "Chan": {"Blues Traveler": 5.0, "Broken Bells": 1.0, "Deadmau5": 1.0, "Norah Jones": 3.0,
                  "Phoenix": 5, "Slightly Stoopid": 1.0},
         "Dan": {"Blues Traveler": 3.0, "Broken Bells": 4.0, "Deadmau5": 4.5, "Phoenix": 3.0,
                 "Slightly Stoopid": 4.5, "The Strokes": 4.0, "Vampire Weekend": 2.0},
         "Hailey": {"Broken Bells": 4.0, "Deadmau5": 1.0, "Norah Jones": 4.0, "The Strokes": 4.0,
                    "Vampire Weekend": 1.0},
         "Jordyn": {"Broken Bells": 4.5, "Deadmau5": 4.0, "Norah Jones": 5.0, "Phoenix": 5.0,
                    "Slightly Stoopid": 4.5, "The Strokes": 4.0, "Vampire Weekend": 4.0},
         "Sam": {"Blues Traveler": 5.0, "Broken Bells": 2.0, "Norah Jones": 3.0, "Phoenix": 5.0,
                 "Slightly Stoopid": 4.0, "The Strokes": 5.0},
         "Veronica": {"Blues Traveler": 3.0, "Norah Jones": 5.0, "Phoenix": 4.0,
                      "Slightly Stoopid": 2.5, "The Strokes": 3.0}}

# `music` is the feature dictionary defined above.


def manhattan(rating1, rating2):
    """Computes the Manhattan distance. Both rating1 and rating2 are
    dictionaries of the form {'The Strokes': 3.0, 'Slightly Stoopid': 2.5}"""
    distance = 0
    # only keys present in both dictionaries contribute to the distance
    for key in rating1:
        if key in rating2:
            distance += abs(rating1[key] - rating2[key])
    return distance


def computeNearestNeighbor(username, users):
    """creates a sorted list of users based on their distance to username"""
    distances = []
    for user in users:
        if user != username:
            distance = manhattan(users[user], users[username])
            distances.append((distance, user))
    # sort based on distance -- closest first
    distances.sort()
    return distances


def recommend(username, users):
    """Give list of recommendations"""
    # first find nearest neighbor
    nearest = computeNearestNeighbor(username, users)[0][1]
    recommendations = []
    # now find bands neighbor rated that user didn't
    neighborRatings = users[nearest]
    userRatings = users[username]
    for artist in neighborRatings:
        if artist not in userRatings:
            recommendations.append((artist, neighborRatings[artist]))
    # using the fn sorted for variety - sort is more efficient
    return sorted(recommendations,
                  key=lambda artistTuple: artistTuple[1],
                  reverse=True)
A problem with value ranges
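When a single feature dominates the distance calculation, that is not a good thing. In fact, differences in the value ranges of different attributes are a big problem for any recommender system.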
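To make the problem concrete, here is a minimal sketch with made-up numbers. The three songs and the beats-per-minute feature are hypothetical and not part of the data above; the point is only that a feature measured in the hundreds swamps features scored from 1 to 5.

```python
def manhattan(v1, v2):
    """Manhattan distance between two equal-length feature vectors."""
    return sum(abs(a - b) for a, b in zip(v1, v2))


# Hypothetical songs: [piano (1-5), vocals (1-5), beats per minute]
song_a = [1, 5, 140]
song_b = [5, 1, 142]  # opposite of song_a on piano and vocals, similar tempo
song_c = [1, 5, 170]  # identical to song_a on piano and vocals, faster tempo

d_ab = manhattan(song_a, song_b)  # 4 + 4 + 2
d_ac = manhattan(song_a, song_c)  # 0 + 0 + 30

# song_b comes out "closer" to song_a purely because of the tempo column
print(d_ab, d_ac)
```

On these numbers the tempo difference alone decides which song counts as the nearest neighbor, even though song_c matches song_a exactly on the two 1-to-5 features.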
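II. Normalization
The fix for the problem above is normalization. To remove the skew from the data, we must standardize, or normalize, it.
A common normalization method maps every value of a feature into the range 0 to 1, for example (val - min) / (max - min).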
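The formula above can be sketched in a few lines. The function name `min_max_normalize`, the sample tempo values, and the choice to return 0.0 when all values are equal are assumptions made for illustration.

```python
def min_max_normalize(values):
    """Rescale numbers to the range 0..1 using (val - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: if every value is the same, map them all to 0.0
        # rather than dividing by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


bpm = [140, 142, 170, 110]  # hypothetical tempo values
normalized = min_max_normalize(bpm)
print(normalized)
```

The smallest value maps to 0, the largest to 1, and everything else lands proportionally in between, so the feature can be compared with others on a similar footing.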
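If you have taken a statistics course, you may be familiar with a more precise way to standardize data: the Standard Score.
The problem with the standard score is that it is strongly affected by outliers.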
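A small sketch can show both the standard score and its sensitivity to outliers. It assumes the usual definition, (value - mean) / standard deviation, with the population standard deviation; the salary figures are invented for illustration.

```python
from statistics import mean, pstdev


def standard_scores(values):
    """Standard score of each value: (value - mean) / standard deviation."""
    m = mean(values)
    sd = pstdev(values)  # population standard deviation (an assumption)
    return [(v - m) / sd for v in values]


salaries = [43, 45, 47, 48, 50]     # hypothetical, in thousands
with_outlier = salaries + [500]     # the same data plus one extreme value

plain = standard_scores(salaries)
skewed = standard_scores(with_outlier)[:5]  # scores of the original five

# How far apart the lowest and highest of the five ordinary values end up
spread_plain = max(plain) - min(plain)
spread_skewed = max(skewed) - min(skewed)
print(spread_plain, spread_skewed)
```

With the outlier present, the mean and standard deviation are both pulled upward, so the five ordinary salaries are squeezed into a narrow band of scores and become hard to tell apart.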
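The Modified Standard Score: to reduce the influence of outliers, it uses the median in place of the mean and the absolute standard deviation in place of the standard deviation.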
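For reference, the modified standard score can be written as follows. The notation is an assumption made here ($\tilde{x}$ for the median, $\mathrm{asd}$ for the absolute standard deviation, $n$ for the number of values); the definition of $\mathrm{asd}$ agrees with the expected values in the unit test for getMedian and getAbsoluteStandardDeviation in section IV.

```latex
\[
\text{modified standard score}(x_i) = \frac{x_i - \tilde{x}}{\mathrm{asd}},
\qquad
\mathrm{asd} = \frac{1}{n} \sum_{i=1}^{n} \lvert x_i - \tilde{x} \rvert
\]
```

As a worked example, for the values 54, 72, 78, 49, 65, 63, 75, 67, 54 the median is 65 and the absolute deviations sum to 72, so $\mathrm{asd} = 72 / 9 = 8$ and the modified standard score of 78 is $(78 - 65) / 8 = 1.625$.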
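When should you normalize? Keep in mind that normalization carries a computational cost. It is worth doing when:
1. The data mining method computes the distance between two objects from the values of their features.
2. The features are on different scales (especially when the difference is large, as with a house's asking price and its number of bedrooms).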
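III. Python code for a nearest-neighbor classifier
Recommending songs to a user who likes Green Day
Data required:
The attributes of each song: music = { }
music converted into vectors to make the computation easier: items = { }
Each user's ratings of some of the songs: users = { }
Create a classification function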
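The steps above can be sketched roughly as follows. This is one possible shape for the classification function, not the original listing: the user "Alex", the "L"/"D" (like/dislike) labels, the new song's vector, and the signature of `classify` are all assumptions. The item vectors reuse four entries of the music dictionary in the order piano, vocals, beat, blues, guitar, backup vocals, rap.

```python
def manhattan(v1, v2):
    """Manhattan distance between two equal-length feature vectors."""
    return sum(abs(a - b) for a, b in zip(v1, v2))


# Feature order: piano, vocals, beat, blues, guitar, backup vocals, rap
items = {"Dr Dog/Fate": [2.5, 4, 3.5, 3, 5, 4, 1],
         "Phoenix/Lisztomania": [2, 5, 5, 3, 2, 1, 1],
         "Heartless Bastards/Out at Sea": [1, 5, 4, 2, 4, 1, 1],
         "Lady Gaga/Alejandro": [1, 5, 3, 2, 1, 2, 1]}

# Hypothetical user with like ("L") / dislike ("D") labels
users = {"Alex": {"Dr Dog/Fate": "L",
                  "Phoenix/Lisztomania": "D",
                  "Heartless Bastards/Out at Sea": "L",
                  "Lady Gaga/Alejandro": "D"}}


def classify(user, item_name, item_vector, items, users):
    """Predict the user's label for an item by copying the label the user
    gave to the most similar item they have already rated."""
    rated = users[user]
    candidates = [(manhattan(item_vector, items[name]), name)
                  for name in rated
                  if name != item_name and name in items]
    if not candidates:
        return None  # nothing to compare against
    _, nearest = min(candidates)  # smallest distance wins
    return rated[nearest]


# Hypothetical unrated song described with the same seven features
prediction = classify("Alex", "New Song", [1, 5, 4, 2, 5, 1, 1], items, users)
print(prediction)
```

Only songs the user has already rated are considered as neighbors, so the function always has a label to copy. The raw 1-to-5 scores are used directly here; with features on different scales the vectors would need to be normalized first.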
IV. Identifying sports
A small dataset in two files: athletesTrainingSet.txt (used to train the classifier) and athletesTestSet.txt (used to evaluate the classifier)
class Classifier:

    def __init__(self, filename):

        self.medianAndDeviation = []

        # reading the data in from the file
        with open(filename) as f:
            lines = f.readlines()
        self.format = lines[0].strip().split('\t')
        self.data = []
        for line in lines[1:]:
            fields = line.strip().split('\t')
            ignore = []
            vector = []
            for i in range(len(fields)):
                if self.format[i] == 'num':
                    vector.append(int(fields[i]))
                elif self.format[i] == 'comment':
                    ignore.append(fields[i])
                elif self.format[i] == 'class':
                    classification = fields[i]
            self.data.append((classification, vector, ignore))
        self.rawData = list(self.data)

    ##################################################
    ###
    ###  FINISH THE FOLLOWING TWO METHODS

    def getMedian(self, alist):
        """return median of alist"""

        """TO BE DONE"""
        return 0

    def getAbsoluteStandardDeviation(self, alist, median):
        """given alist and median return absolute standard deviation"""

        """TO BE DONE"""
        return 0

    ###
    ###
    ##################################################


def unitTest():
    list1 = [54, 72, 78, 49, 65, 63, 75, 67, 54]
    list2 = [54, 72, 78, 49, 65, 63, 75, 67, 54, 68]
    list3 = [69]
    list4 = [69, 72]
    classifier = Classifier('athletesTrainingSet.txt')
    m1 = classifier.getMedian(list1)
    m2 = classifier.getMedian(list2)
    m3 = classifier.getMedian(list3)
    m4 = classifier.getMedian(list4)
    asd1 = classifier.getAbsoluteStandardDeviation(list1, m1)
    asd2 = classifier.getAbsoluteStandardDeviation(list2, m2)
    asd3 = classifier.getAbsoluteStandardDeviation(list3, m3)
    asd4 = classifier.getAbsoluteStandardDeviation(list4, m4)
    assert(round(m1, 3) == 65)
    assert(round(m2, 3) == 66)
    assert(round(m3, 3) == 69)
    assert(round(m4, 3) == 70.5)
    assert(round(asd1, 3) == 8)
    assert(round(asd2, 3) == 7.5)
    assert(round(asd3, 3) == 0)
    assert(round(asd4, 3) == 1.5)

    print("getMedian and getAbsoluteStandardDeviation work correctly")


unitTest()
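One possible way to fill in the two unfinished methods is sketched below as standalone functions, so it runs without the training file. The names `get_median` and `get_absolute_standard_deviation` are chosen here for illustration, and the absolute standard deviation is assumed to be the average absolute difference from the median, which matches the values the unit test expects.

```python
def get_median(alist):
    """Return the median of a non-empty list of numbers."""
    if not alist:
        raise ValueError("median of an empty list is undefined")
    ordered = sorted(alist)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]  # odd length: the middle element
    return (ordered[mid - 1] + ordered[mid]) / 2  # even: mean of the middle two


def get_absolute_standard_deviation(alist, median):
    """Average absolute difference between each value and the median."""
    return sum(abs(x - median) for x in alist) / len(alist)


list1 = [54, 72, 78, 49, 65, 63, 75, 67, 54]
m1 = get_median(list1)
asd1 = get_absolute_standard_deviation(list1, m1)
print(m1, asd1)
```

Inside the class, the same bodies could be dropped into getMedian and getAbsoluteStandardDeviation, with self added as the first parameter.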
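V. The Iris dataset
VI. The Auto MPG data
This dataset comes from Carnegie Mellon University and was first used at the 1983 American Statistical Association exposition.
VII. Miscellany
Pay attention to normalization; it matters.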