# Recommender Systems: Movie Recommendation


Mining association features with the Apriori algorithm:

- Frequent itemset: a set of items that occurs together often enough to meet a minimum support threshold.
- FP-growth: a frequent-itemset mining algorithm that improves on Apriori.
- Eclat: another frequent-itemset mining algorithm that improves on Apriori.

Before mining the association rules used in affinity analysis, we first generate frequent itemsets with the Apriori algorithm, and then produce association rules by testing combinations of premise and conclusion drawn from those itemsets. The procedure has two stages: (1) give the Apriori algorithm the minimum support an itemset must reach to count as frequent; (2) once the frequent itemsets are found, select association rules according to their confidence.
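As a quick illustration of the two measures, here is a toy sketch on hand-made data (the movie ids and user names are made up, not from MovieLens):

```python
# Hand-made data: each user's set of liked movie ids (hypothetical).
favorites = {
    'u1': {1, 2, 3},
    'u2': {1, 2},
    'u3': {2, 3},
    'u4': {1, 3},
}

# Support of the itemset {1, 2}: the number of users who like both movies.
itemset = {1, 2}
support = sum(1 for liked in favorites.values() if itemset <= liked)
print("support = {0}".format(support))              # 2 (u1 and u2)

# Confidence of the rule {1} -> 2: among the users who like movie 1,
# the fraction who also like movie 2.
premise, conclusion = {1}, 2
premise_count = sum(1 for liked in favorites.values() if premise <= liked)
rule_count = sum(1 for liked in favorites.values() if premise <= liked and conclusion in liked)
print("confidence = {0:.2f}".format(rule_count / float(premise_count)))   # 0.67
```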

## 1. Background

Build a movie recommender from the GroupLens team's movie ratings data.

## 2. Getting the data

The data can be downloaded from http://grouplens.org/datasets/movielens/ ; the ml-20m release used here contains roughly 20 million ratings. Download the archive and extract it into a folder.
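For completeness, here is one way to fetch and unpack the archive directly from Python (a sketch: the exact URL and the file name ml-20m.zip are assumptions based on the dataset page, so adjust them to the release you actually use):

```python
# Download and extract the MovieLens archive (assumed file name: ml-20m.zip).
import urllib
import zipfile

url = "http://files.grouplens.org/datasets/movielens/ml-20m.zip"
urllib.urlretrieve(url, "ml-20m.zip")        # Python 2; on Python 3 use urllib.request.urlretrieve
with zipfile.ZipFile("ml-20m.zip") as archive:
    archive.extractall(".")                  # unpacks ratings.csv, movies.csv, ... into a folder
```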
```python
# (Optional) build the file paths explicitly; left commented out in the original notebook.
# import os
# data_folder = os.path.join(os.path.expanduser("~"), "ml_20m")
# ratings_filename = os.path.join(data_folder, "u.data")
```
## 3. Loading the data

The ratings file is a CSV with the header userId,movieId,rating,timestamp.
```python
import pandas as pd

all_ratings = pd.read_csv('ratings.csv')
# Convert the Unix timestamps into datetime values.
all_ratings['timestamp'] = pd.to_datetime(all_ratings['timestamp'], unit='s')
print all_ratings.head()        # take a quick look at the data
print all_ratings.describe()
# Look at the ratings given by user 100.
print all_ratings[all_ratings['userId'] == 100].sort_values('movieId')

# ************************** Apriori implementation **************************
# A rating above 3 is treated as "the user liked this movie".
all_ratings['Favorable'] = all_ratings['rating'] > 3
print all_ratings[10:15]
print all_ratings[all_ratings['userId'] == 100].head()   # reviews of user 100

# Use only the users with an id below 200 as the training set.
ratings = all_ratings[all_ratings['userId'].isin(range(200))]
# Dataset containing only the rows where the user liked the movie.
favorable_ratings = ratings[ratings['Favorable']]

# We need to know which movies each user liked, so group by userId and collect
# the movieIds. Storing v.values as a frozenset makes it cheap to test whether a
# user rated a given movie: membership checks on sets are faster than on lists.
favorable_reviews_by_users = dict((k, frozenset(v.values))
                                  for k, v in favorable_ratings.groupby('userId')['movieId'])
print len(favorable_reviews_by_users)

# Build a data frame that tells us how many fans each movie has.
num_favorable_by_movie = ratings[['movieId', 'Favorable']].groupby('movieId').sum()
# The five most popular movies.
num_favorable_by_movie.sort_values('Favorable', ascending=False)[:5]
```
```
   userId  movieId  rating           timestamp
0       1        2     3.5 2005-04-02 23:53:47
1       1       29     3.5 2005-04-02 23:31:16
2       1       32     3.5 2005-04-02 23:33:39
3       1       47     3.5 2005-04-02 23:32:07
4       1       50     3.5 2005-04-02 23:29:40
             userId       movieId        rating
count  2.000026e+07  2.000026e+07  2.000026e+07
mean   6.904587e+04  9.041567e+03  3.525529e+00
std    4.003863e+04  1.978948e+04  1.051989e+00
min    1.000000e+00  1.000000e+00  5.000000e-01
25%    3.439500e+04  9.020000e+02  3.000000e+00
50%    6.914100e+04  2.167000e+03  3.500000e+00
75%    1.036370e+05  4.770000e+03  4.000000e+00
max    1.384930e+05  1.312620e+05  5.000000e+00
       userId  movieId  rating           timestamp
11049     100       14     3.0 1996-06-25 16:40:02
11050     100       25     4.0 1996-06-25 16:31:02
11051     100       32     3.0 1996-06-25 16:24:49
11052     100       39     3.0 1996-06-25 16:25:12
11053     100       50     5.0 1996-06-25 16:24:49
11054     100       70     3.0 1996-06-25 16:38:47
11055     100      161     3.0 1996-06-25 16:23:18
11056     100      162     4.0 1996-06-25 16:43:19
11057     100      185     2.0 1996-06-25 16:23:45
11058     100      194     3.0 1996-06-25 16:40:13
11059     100      223     4.0 1996-06-25 16:31:02
11060     100      235     4.0 1996-06-25 16:28:27
11061     100      260     4.0 1997-06-09 16:40:56
11062     100      265     4.0 1996-06-25 16:29:49
11063     100      288     4.0 1996-06-25 16:24:07
11064     100      293     5.0 1996-06-25 16:28:27
11065     100      296     4.0 1996-06-25 16:21:49
11066     100      318     3.0 1996-06-25 16:22:54
11067     100      329     3.0 1996-06-25 16:22:54
11068     100      337     3.0 1996-06-25 16:25:52
11069     100      339     3.0 1996-06-25 16:23:18
11070     100      342     4.0 1996-06-25 16:33:36
11071     100      344     3.0 1996-06-25 16:22:14
11072     100      356     4.0 1996-06-25 16:25:52
11073     100      427     2.0 1996-06-25 16:36:08
11074     100      431     3.0 1996-06-25 16:34:10
11075     100      434     2.0 1996-06-25 16:23:18
11076     100      435     3.0 1996-06-25 16:25:33
11077     100      471     3.0 1996-06-25 16:37:19
11078     100      481     3.0 1996-06-25 16:47:57
11079     100      500     2.0 1996-06-25 16:30:44
11080     100      508     3.0 1996-06-25 16:35:35
11081     100      527     4.0 1996-06-25 16:30:44
11082     100      535     4.0 1996-06-25 16:46:16
11083     100      538     4.0 1996-06-25 16:47:44
11084     100      562     4.0 1996-07-29 14:57:42
11085     100      586     1.0 1996-06-25 16:32:37
11086     100      587     3.0 1996-06-25 16:31:42
11087     100      589     3.0 1996-06-25 16:29:49
11088     100      593     4.0 1996-06-25 16:23:45
11089     100      608     4.0 1996-06-25 16:33:06
11090     100      610     4.0 1996-06-25 16:35:35
11091     100      673     4.0 1996-06-25 16:58:05
11092     100      680     5.0 1996-06-25 16:58:31
11093     100      708     4.0 1996-06-25 16:44:04
11094     100      728     4.0 1996-07-16 16:26:17
11095     100      778     4.0 1997-06-09 16:41:27
11096     100      780     3.0 1996-07-11 16:20:12
11097     100     1112     4.0 1996-11-13 14:12:25
11098     100     1210     4.0 1997-06-09 16:43:14
11099     100     1449     5.0 1997-06-09 16:38:17
11100     100     1527     4.0 1997-06-09 16:40:04
    userId  movieId  rating           timestamp  Favorable
10       1      293     4.0 2005-04-02 23:31:43       True
11       1      296     4.0 2005-04-02 23:32:47       True
12       1      318     4.0 2005-04-02 23:33:18       True
13       1      337     3.5 2004-09-10 03:08:29       True
14       1      367     3.5 2005-04-02 23:53:00       True
       userId  movieId  rating           timestamp  Favorable
11049     100       14     3.0 1996-06-25 16:40:02      False
11050     100       25     4.0 1996-06-25 16:31:02       True
11051     100       32     3.0 1996-06-25 16:24:49      False
11052     100       39     3.0 1996-06-25 16:25:12      False
11053     100       50     5.0 1996-06-25 16:24:49       True
199
```
```
         Favorable
movieId
296           80.0
356           78.0
318           76.0
593           63.0
480           58.0
```
The Apriori algorithm is designed to find frequent items in a dataset. The basic flow is to build new candidate itemsets from the frequent itemsets of the previous step, test whether those candidates are frequent enough, and iterate:

1. Put each item into an itemset of its own to create the initial frequent itemsets, keeping only the items that reach the minimum support.
2. Look for supersets of the existing frequent itemsets to discover new frequent itemsets, and use them to generate new candidate itemsets.
3. Test how frequent the newly generated candidates are; discard the ones that are not frequent enough. If no new frequent itemsets were found, jump to the last step.
4. Store the newly discovered frequent itemsets and go back to step 2.
5. Return all of the frequent itemsets that were discovered.

Steps 2 and 3 are implemented by a function that takes the itemsets discovered so far and checks how frequent their extensions are:

```python
from collections import defaultdict

def find_frequent_itemsets(favorable_reviews_by_users, k_l_itemsets, min_support):
    counts = defaultdict(int)
    # Iterate over every user and the movies they liked.
    for user, reviews in favorable_reviews_by_users.items():
        # For each previously found itemset, check whether it is a subset of this
        # user's reviews, i.e. whether the user liked every movie in the itemset.
        for itemset in k_l_itemsets:
            if itemset.issubset(reviews):
                # Extend the itemset with each other movie the user liked to form candidate supersets.
                for other_reviewed_movie in reviews - itemset:
                    current_superset = itemset | frozenset((other_reviewed_movie,))
                    counts[current_superset] += 1
    # Finally, keep only the candidates whose counts reach the minimum support.
    return dict([(itemset, frequency) for itemset, frequency in counts.items()
                 if frequency >= min_support])
```
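The candidate generation above relies on a few frozenset operations; a tiny standalone illustration with hypothetical ids may make it easier to read:

```python
reviews = frozenset([1, 2, 3])      # movies one user liked
itemset = frozenset([1, 2])         # a frequent itemset from the previous round
print(itemset.issubset(reviews))    # True: the user liked every movie in the itemset
print(reviews - itemset)            # frozenset([3]): movies that could extend the itemset
print(itemset | frozenset((3,)))    # frozenset([1, 2, 3]): the new candidate superset
```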
```python
import sys

frequent_itemsets = {}   # frequent itemsets, keyed by their length
min_support = 50         # minimum support; adjust it in small steps when experimenting

# Step 1: build a singleton itemset for every movie and keep the ones that are frequent.
frequent_itemsets[1] = dict((frozenset((movie_id,)), row['Favorable'])
                            for movie_id, row in num_favorable_by_movie.iterrows()
                            if row['Favorable'] > min_support)
print "There are {} movies with more than {} favorable reviews".format(len(frequent_itemsets[1]), min_support)
sys.stdout.flush()

# Main Apriori loop. k is the length of the itemsets about to be discovered; the
# itemsets found in the previous round are available under the key k-1, and each
# newly discovered batch is stored in the dictionary under its own length.
for k in range(2, 20):
    cur_frequent_itemsets = find_frequent_itemsets(favorable_reviews_by_users,
                                                   frequent_itemsets[k-1], min_support)
    # Stop as soon as a round produces no new frequent itemsets.
    if len(cur_frequent_itemsets) == 0:
        print "Did not find any frequent itemsets of length {}".format(k)
        sys.stdout.flush()  # force buffered output to the terminal; don't overuse, it slows the run down
        break
    # Otherwise report what was found and keep going.
    else:
        print "I found {} frequent itemsets of length {}".format(len(cur_frequent_itemsets), k)
        sys.stdout.flush()
        frequent_itemsets[k] = cur_frequent_itemsets

# Itemsets of length 1 are useless for generating rules, so drop them.
del frequent_itemsets[1]
print "Found a total of {0} frequent itemsets".format(sum(len(itemsets) for itemsets in frequent_itemsets.values()))
```
```
There are 11 movies with more than 50 favorable reviews
I found 34 frequent itemsets of length 2
I found 49 frequent itemsets of length 3
I found 36 frequent itemsets of length 4
I found 12 frequent itemsets of length 5
I found 1 frequent itemsets of length 6
Did not find any frequent itemsets of length 7
Found a total of 132 frequent itemsets
```
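To get a feel for what was found, the dictionary can be inspected directly. A small sketch (keys are frozensets of movie ids, values are the counts the algorithm compared against min_support):

```python
# Peek at a few of the length-2 frequent itemsets and their counts.
for itemset, count in list(frequent_itemsets[2].items())[:5]:
    print("{0}: {1}".format(sorted(itemset), count))
```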
Extracting association rules. When the Apriori algorithm finishes we have a collection of frequent itemsets, not association rules. A frequent itemset is a group of items that reaches the minimum support, whereas an association rule consists of a premise and a conclusion. To extract rules from a frequent itemset, we take some of its movies as the premise and one movie as the conclusion, forming a rule of the form: if a user likes all the movies in the premise, they will also like the movie in the conclusion. Every itemset can generate rules this way, by iterating over the frequent itemsets of every length:

```python
candidate_rules = []
for itemset_length, itemset_counts in frequent_itemsets.items():
    for itemset in itemset_counts.keys():
        # Take each movie in the itemset as the conclusion in turn; the other
        # movies of the itemset form the premise of the candidate rule.
        for conclusion in itemset:
            premise = itemset - set((conclusion,))
            candidate_rules.append((premise, conclusion))

# That produces a large number of candidates; look at the first five.
print "There are {} candidate rules".format(len(candidate_rules))
# The frozenset holds the movie ids of the premise; the number after it is the conclusion movie id.
candidate_rules[:5]
```
```
There are 425 candidate rules
[(frozenset({47}), 50),
 (frozenset({50}), 47),
 (frozenset({318}), 480),
 (frozenset({480}), 318),
 (frozenset({356}), 480)]
```
Next, compute the confidence of every candidate rule. correct_counts records how often a rule held (a user who satisfied the premise also liked the conclusion) and incorrect_counts how often it failed:

```python
correct_counts = defaultdict(int)
incorrect_counts = defaultdict(int)

# Walk over every user and the movies they liked, testing every candidate rule.
for user, reviews in favorable_reviews_by_users.items():
    for candidate_rule in candidate_rules:
        premise, conclusion = candidate_rule
        # The rule applies to this user only if they liked every movie in the premise.
        if premise.issubset(reviews):
            if conclusion in reviews:
                correct_counts[candidate_rule] += 1
            else:
                incorrect_counts[candidate_rule] += 1

# Confidence = times the rule held / times the premise applied.
rule_confidence = {candidate_rule: correct_counts[candidate_rule] / float(correct_counts[candidate_rule] + incorrect_counts[candidate_rule])
                   for candidate_rule in candidate_rules}
print len(rule_confidence)
```
```
425
```
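Because correct_counts[rule] counts the users who liked both the premise movies and the conclusion movie, it also serves as the rule's support count, so a support threshold can be applied alongside confidence if desired. A small optional sketch (the threshold of 30 is an arbitrary example, not from the original notebook):

```python
# Keep only rules that applied, and held, for at least min_rule_support users.
min_rule_support = 30
strong_rules = {rule: confidence for rule, confidence in rule_confidence.items()
                if correct_counts[rule] >= min_rule_support}
print(len(strong_rules))
```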
```python
# min_confidence = 0.9
# rule_confidence = {rule: confidence for rule, confidence in rule_confidence.items() if confidence > min_confidence}
# print len(rule_confidence)

# Sort the confidence dictionary and print the five rules with the highest confidence.
from operator import itemgetter
sorted_confidence = sorted(rule_confidence.items(), key=itemgetter(1), reverse=True)
print max(sorted_confidence)  # note: max()/min() compare the (rule, confidence) tuples, not the confidence values
print min(sorted_confidence)  # the highest-confidence rule is simply sorted_confidence[0]
for index in range(5):
    print "Rule #{0}".format(index + 1)
    (premise, conclusion) = sorted_confidence[index][0]
    print "Rule: If a person recommends {0} they will also recommend {1}".format(premise, conclusion)
    print "- Confidence: {0: .3f}".format(rule_confidence[(premise, conclusion)])
    print ""
```
```
((frozenset([296, 593, 50, 318, 47]), 356), 0.11055276381909548)
((frozenset([296]), 47), 0.4020100502512563)
Rule #1
Rule: If a person recommends frozenset([296]) they will also recommend 527
- Confidence: 0.402

Rule #2
Rule: If a person recommends frozenset([296]) they will also recommend 2858
- Confidence: 0.402

Rule #3
Rule: If a person recommends frozenset([296]) they will also recommend 480
- Confidence: 0.402

Rule #4
Rule: If a person recommends frozenset([296]) they will also recommend 50
- Confidence: 0.402

Rule #5
Rule: If a person recommends frozenset([296]) they will also recommend 593
- Confidence: 0.402
```
Next, bring in the movie metadata so rules can be shown with titles instead of ids. The movies.csv file has the header movieId,title,genres:

```python
movie_name_data = pd.read_csv("movies.csv")
movie_name_data.head()

# Helper that returns a movie's title given its movieId.
def get_movie_name(movie_id):
    title_object = movie_name_data[movie_name_data['movieId'] == movie_id]['title']
    title = title_object.values[0]
    return title

get_movie_name(4)
```
```
'Waiting to Exhale (1995)'
```
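get_movie_name assumes every movieId that appears in the ratings also appears in movies.csv; if one does not, title_object.values[0] raises an IndexError. A slightly more defensive variant, as a sketch (the fallback text is arbitrary):

```python
def get_movie_name_safe(movie_id):
    # Same lookup as above, but fall back to a placeholder when the id is missing.
    title_object = movie_name_data[movie_name_data['movieId'] == movie_id]['title']
    if len(title_object) == 0:
        return "Unknown movie ({0})".format(movie_id)
    return title_object.values[0]
```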
```python
# Print the same five rules, this time with movie titles.
for index in range(5):
    print "Rule #{0}".format(index + 1)
    (premise, conclusion) = sorted_confidence[index][0]
    premise_names = ", ".join(get_movie_name(idx) for idx in premise)
    conclusion_name = get_movie_name(conclusion)
    print "Rule: If a person recommends {0} they will also recommend {1}".format(premise_names, conclusion_name)
    print " - Confidence: {0:.3f}".format(rule_confidence[(premise, conclusion)])
    print ""
```
```
Rule #1
Rule: If a person recommends Pulp Fiction (1994) they will also recommend Schindler's List (1993)
 - Confidence: 0.402

Rule #2
Rule: If a person recommends Pulp Fiction (1994) they will also recommend American Beauty (1999)
 - Confidence: 0.402

Rule #3
Rule: If a person recommends Pulp Fiction (1994) they will also recommend Jurassic Park (1993)
 - Confidence: 0.402

Rule #4
Rule: If a person recommends Pulp Fiction (1994) they will also recommend Usual Suspects, The (1995)
 - Confidence: 0.402

Rule #5
Rule: If a person recommends Pulp Fiction (1994) they will also recommend Silence of the Lambs, The (1991)
 - Confidence: 0.402
```
Evaluation. As a simple check of how each rule performs, all of the users that were not used for training become the test set:

```python
test_dataset = all_ratings[~all_ratings['userId'].isin(range(200))]
test_favorable = test_dataset[test_dataset['Favorable']]
test_favorable_by_users = dict((k, frozenset(v.values))
                               for k, v in test_favorable.groupby('userId')['movieId'])
test_dataset.head()
```
```
       userId  movieId  rating           timestamp  Favorable
25048     200        6     5.0 1996-08-11 12:59:30       True
25049     200       10     3.0 1996-08-11 12:53:11      False
25050     200       17     4.0 1996-08-11 12:57:25       True
25051     200       19     2.0 1996-08-11 12:54:08      False
25052     200       20     4.0 1996-08-11 13:05:27       True
```
```python
# Count, on the test data, how often each rule held and how often it failed.
correct_counts = defaultdict(int)
incorrect_counts = defaultdict(int)
for user, reviews in test_favorable_by_users.items():
    for candidate_rule in candidate_rules:
        premise, conclusion = candidate_rule
        if premise.issubset(reviews):
            if conclusion in reviews:
                correct_counts[candidate_rule] += 1
            else:
                incorrect_counts[candidate_rule] += 1
print len(correct_counts)
```
```
425
```
```python
# Compute each rule's confidence on the test data.
test_confidence = {candidate_rule: correct_counts[candidate_rule] / float(correct_counts[candidate_rule] + incorrect_counts[candidate_rule])
                   for candidate_rule in rule_confidence}
print len(test_confidence)
sorted_test_confidence = sorted(test_confidence.items(), key=itemgetter(1), reverse=True)
print sorted_test_confidence[:5]
```
```
425
[((frozenset([296]), 2858), 0.4020100502512563),
 ((frozenset([296]), 480), 0.4020100502512563),
 ((frozenset([296]), 50), 0.4020100502512563),
 ((frozenset([296]), 593), 0.4020100502512563),
 ((frozenset([296]), 47), 0.4020100502512563)]
```
```python
# Print the best rules with movie titles, showing both training and test confidence.
for index in range(5):
    print "Rule #{0}".format(index + 1)
    (premise, conclusion) = sorted_confidence[index][0]
    premise_names = ", ".join(get_movie_name(idx) for idx in premise)
    conclusion_name = get_movie_name(conclusion)
    print "Rule: If a person recommends {0} they will also recommend {1}".format(premise_names, conclusion_name)
    print "- Train Confidence: {0:.3f}".format(rule_confidence.get((premise, conclusion), -1))
    print "- Test Confidence: {0:.3f}".format(test_confidence.get((premise, conclusion), -1))
    print ""
```
```
Rule #1
Rule: If a person recommends Pulp Fiction (1994) they will also recommend Schindler's List (1993)
- Train Confidence: 0.402
- Test Confidence: 0.402

Rule #2
Rule: If a person recommends Pulp Fiction (1994) they will also recommend American Beauty (1999)
- Train Confidence: 0.402
- Test Confidence: 0.402

Rule #3
Rule: If a person recommends Pulp Fiction (1994) they will also recommend Jurassic Park (1993)
- Train Confidence: 0.402
- Test Confidence: 0.402

Rule #4
Rule: If a person recommends Pulp Fiction (1994) they will also recommend Usual Suspects, The (1995)
- Train Confidence: 0.402
- Test Confidence: 0.402

Rule #5
Rule: If a person recommends Pulp Fiction (1994) they will also recommend Silence of the Lambs, The (1991)
- Train Confidence: 0.402
- Test Confidence: 0.402
```
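Finally, the learned rules can be turned into actual recommendations: for a given user, fire every rule whose premise is contained in their liked movies and suggest the conclusions they have not liked yet. This closing sketch is not part of the original notebook, and the function name recommend_for_user is made up:

```python
def recommend_for_user(user_id, rules_by_confidence, favorable_by_users, top_n=5):
    """Suggest movies for one user by applying the rules, highest confidence first."""
    liked = favorable_by_users.get(user_id, frozenset())
    suggestions = []
    for (premise, conclusion), confidence in rules_by_confidence:
        if premise.issubset(liked) and conclusion not in liked and conclusion not in suggestions:
            suggestions.append(conclusion)
        if len(suggestions) >= top_n:
            break
    return [get_movie_name(movie_id) for movie_id in suggestions]

# Example: titles recommended to user 1, based on the training rules.
print(recommend_for_user(1, sorted_confidence, favorable_reviews_by_users))
```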