Koubei Merchant Recommendation Based on Taobao Click and Purchase Records: User-Based Collaborative Filtering
Source: Internet  Editor: 程序博客网  Time: 2024/04/29 05:16
Table_1: Online user behavior before Dec. 2015. (ijcai2016_taobao)
User_id, Seller_id, Item_id, Category_id, Online_Action_id, Time_Stamp
Category_id is the product category (72 categories in total); Online_Action_id is 0 for a click and 1 for a purchase.
Table_2: Users' shopping records at brick-and-mortar stores before Dec. 2015. (ijcai2016_koubei_train)
User_id, Merchant_id, Location_id, Time_Stamp
Table_3: Merchant information. (ijcai2016_merchant_info)
Merchant_id, Budget (budget constraints imposed on the merchant), Location_id_list
Table_4: Prediction result. (ijcai2016_koubei_test)
User_id, Location_id, Merchant_id_list
1. Filter Table_1 by the User_ids that appear in Table_2 and Table_4; this reduces the data from over 40 million records to over 20 million;
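import pandas as pd, os

os.chdir('c:/Bai/taobao/data sets')

# Collect every User_id that appears in the Koubei test or train table
df1 = pd.read_csv('ijcai2016_koubei_test', header=None)
df2 = pd.read_csv('ijcai2016_koubei_train', header=None)
sets = set(df1[0]) | set(df2[0])

# Keep only the Taobao records that belong to those users
df3 = pd.read_csv('ijcai2016_taobao', header=None)
df3 = df3[df3[0].isin(sets)]
# Note: step 2 reads 'taobao.csv'; the original post does not show how the
# filtered frame was saved (presumably something like df3.to_csv('taobao.csv', index=False)).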
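With 40 million+ records, loading the whole table into memory may be uncomfortable. The following is a minimal sketch of the same filter done row by row with the standard csv module. The helper name filter_rows_by_user and the sample rows are made up for illustration; they are not from the original post.

```python
import csv
import io


def filter_rows_by_user(rows, wanted_users):
    """Yield only the rows whose first field (User_id) is in wanted_users."""
    for row in rows:
        if row and row[0] in wanted_users:
            yield row


# Tiny in-memory stand-in for ijcai2016_taobao (made-up rows):
# User_id, Seller_id, Item_id, Category_id, Online_Action_id, Time_Stamp
sample = io.StringIO(
    "1,10,100,5,0,20150801\n"
    "2,11,101,6,1,20150802\n"
    "3,12,102,5,0,20150803\n"
)

# In the real task this would be the union of User_ids in Table_2 and Table_4
wanted = {"1", "3"}

kept = list(filter_rows_by_user(csv.reader(sample), wanted))
print(len(kept))
```

On the real file, the generator could be fed from csv.reader(open(...)) and written straight to a csv.writer, so that only one row is held in memory at a time.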
2. Aggregate the Taobao data into feature.csv, with the columns User_id, Category_id, Online_Action_id, number (the count of matching records);
import csv, os

os.chdir('c:/Bai/taobao/data sets')

# u_dict[action] maps (User_id, Category_id) -> number of records,
# where action 0 = click and 1 = purchase
u_dict = [{} for i in range(2)]
with open("taobao.csv") as f:
    for line in f:
        array = line.rstrip('\n').split(',')
        if int(array[0]) == 0:
            continue  # skip the row whose first field is 0 (presumably a header row)
        u_id = (array[0], array[3])
        action = int(array[4])
        u_dict[action][u_id] = u_dict[action].get(u_id, 0) + 1

# Python 3: text mode with newline='' replaces the Python 2 call file('feature.csv', 'wb')
with open('feature.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    for i in range(2):
        for (user_id, category_id), value in u_dict[i].items():
            writer.writerow([int(user_id), int(category_id), i, value])
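The counting in step 2 can also be written with collections.Counter keyed on the full (User_id, Category_id, Online_Action_id) triple. This is only a sketch on made-up rows; the function name count_actions is an assumption for illustration.

```python
from collections import Counter


def count_actions(rows):
    """Count records per (User_id, Category_id, Online_Action_id)."""
    counts = Counter()
    for user_id, _seller, _item, category_id, action_id, _ts in rows:
        counts[(int(user_id), int(category_id), int(action_id))] += 1
    return counts


# Made-up rows in the Table_1 column order
rows = [
    ("1", "10", "100", "5", "0", "20150801"),
    ("1", "10", "101", "5", "0", "20150802"),
    ("1", "11", "102", "5", "1", "20150803"),
    ("2", "12", "103", "7", "0", "20150804"),
]

counts = count_actions(rows)
# Each entry has the same shape as a line of feature.csv:
# (User_id, Category_id, Online_Action_id, number)
feature_rows = sorted(key + (n,) for key, n in counts.items())
print(feature_rows)
```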
3. From feature.csv, build the user feature dictionary {User_id: [144-dimensional array]} (72 categories, each with 2 Online_Action values: 0 click, 1 purchase);
# -*- coding:utf-8 -*-
__author__ = 'Bai'
import os, pandas as pd, numpy as np, math, csv
from sklearn import preprocessing

os.chdir('c:/Bai/taobao/data sets')

## User feature dictionary
u_feature = {}
with open("feature.csv") as f:
    for line in f:
        array = line.rstrip('\n').split(',')
        # Create the vector the first time a user is seen, then always store the
        # count (the original version dropped the first record of every user)
        if array[0] not in u_feature:
            u_feature[array[0]] = [0 for i in range(144)]
        # slot = 2 * (Category_id - 1) + Online_Action_id
        i = 2*int(array[1])+int(array[2])-2
        u_feature[array[0]][i] = int(array[3])
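The index formula 2*Category_id + Online_Action_id - 2 lays the 144 slots out as two adjacent slots per category (click, then purchase). The sketch below spells that out, assuming Category_id runs from 1 to 72, which is what the formula implies; the names feature_index and build_features are hypothetical.

```python
N_CATEGORIES = 72  # from the data description
N_ACTIONS = 2      # 0 = click, 1 = purchase


def feature_index(category_id, action_id):
    """Slot of a (category, action) pair; assumes category ids 1..72."""
    return N_ACTIONS * (category_id - 1) + action_id


def build_features(feature_rows):
    """Turn (User_id, Category_id, Online_Action_id, number) rows into vectors."""
    features = {}
    for user_id, category_id, action_id, number in feature_rows:
        vec = features.setdefault(user_id, [0] * (N_CATEGORIES * N_ACTIONS))
        vec[feature_index(category_id, action_id)] = number
    return features


# Made-up feature rows
features = build_features([(1, 5, 0, 2), (1, 5, 1, 1), (2, 72, 1, 4)])
print(feature_index(1, 0), feature_index(72, 1))
```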
4. Normalize the feature dictionary so that each user's feature vector has length 1 (L2 norm);
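## Normalization
for key in u_feature:
    # normalize expects a 2-D array, so reshape to a single row and take it back out
    row = np.array(u_feature[key], dtype=float).reshape(1, -1)
    u_feature[key] = preprocessing.normalize(row, norm='l2')[0]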
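For reference, L2 normalization just divides every component by the vector's Euclidean length. A minimal standard-library sketch follows; the helper l2_normalize is illustrative, and leaving all-zero vectors unchanged is a choice made here to avoid dividing by zero.

```python
import math


def l2_normalize(vec):
    """Scale vec to unit Euclidean length; all-zero vectors are returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0:
        return list(vec)
    return [x / norm for x in vec]


unit = l2_normalize([3, 4, 0])
length = math.sqrt(sum(x * x for x in unit))
print(unit, length)
```

After this step a user with 300 clicks and a user with 3 clicks in the same categories get the same vector, so the comparison in the later steps depends on the mix of categories rather than on activity volume.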
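5. For each user to predict, User_a, look at the user's Location: if that Location has fewer than ten merchants, recommend all of them;
6. If the Location has more than ten merchants, select the users who have visited any merchant in that Location on Koubei, forming the user set Set;
7. Within Set, compute each user's similarity to User_a (Euclidean distance);
8. Take the 100 most similar users and count how many times they visited each merchant; merchants outside the Location being predicted are not counted;
9. Multiply each merchant's count by its Budget to get the merchant's score, and recommend the ten merchants with the highest scores to User_a.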
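Steps 5 to 9 can be sketched on in-memory data as follows. This is an illustration under assumptions, not the author's script: the function recommend, its parameters, and the toy data are all made up, and a Location with exactly ten merchants (a case the steps leave open) is treated like the "fewer than ten" case.

```python
import heapq
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def recommend(target_vec, candidates, visits, budgets, local_merchants,
              n_neighbors=100, n_recommend=10):
    """
    target_vec      feature vector of User_a
    candidates      {user_id: feature vector} for users who visited the Location
    visits          {user_id: [merchant_id, ...]} Koubei visits, repeats allowed
    budgets         {merchant_id: Budget}
    local_merchants set of merchant ids available in the Location
    """
    # Step 5: too few merchants, recommend them all
    if len(local_merchants) <= n_recommend:
        return sorted(local_merchants)
    # Steps 7-8: nearest users by Euclidean distance
    nearest = heapq.nsmallest(
        n_neighbors, candidates,
        key=lambda u: euclidean(target_vec, candidates[u]))
    # Step 8: count visits, ignoring merchants outside the Location
    counts = Counter(m for u in nearest for m in visits.get(u, [])
                     if m in local_merchants)
    # Step 9: score = visit count * Budget, keep the highest scores
    scores = {m: n * budgets[m] for m, n in counts.items()}
    return heapq.nlargest(n_recommend, scores, key=scores.get)


# Toy data with small limits so the example is easy to follow
target = [1.0, 0.0]
candidates = {"u1": [1.0, 0.0], "u2": [0.8, 0.6], "u3": [0.0, 1.0]}
visits = {"u1": ["m1", "m2"], "u2": ["m1", "m9"], "u3": ["m3"]}
budgets = {"m1": 10, "m2": 50, "m3": 100}
result = recommend(target, candidates, visits, budgets, {"m1", "m2", "m3"},
                   n_neighbors=2, n_recommend=2)
print(result)
```

Because the feature vectors are normalized to length 1, the squared Euclidean distance between two of them equals 2 minus twice their dot product, so ranking by Euclidean distance gives the same order as ranking by cosine similarity. Users with no Taobao features (all-zero vectors) are the exception.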
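Shortcomings:
1. It is unclear how to use Time_Stamp, or whether Taobao click times and Koubei purchase times are related.
2. The efficiency is far too low...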
f1 = open("koubei_test.csv")
f2 = open("koubei_train.csv")
f3 = open("merchant_info.csv")
context1 = f1.readlines()
context2 = f2.readlines()
context3 = f3.readlines()

# shangquan: Merchant_id -> list of Location_ids; budget: Merchant_id -> Budget.
# Budgets are looked up by string id here; the original pandas lookup compared
# string ids against integer ids and never matched.
shangquan = {}
budget = {}
for v in context3:
    v1 = v.replace('\n','').split(',')
    budget[v1[0]] = v1[1]
    # accept a location list separated by ',' or by ':'
    shangquan[v1[0]] = [loc for part in v1[2:] for loc in part.split(':')]

csvfile = open('forcast.csv', 'w', newline='')  # Python 3 replacement for file(..., 'wb')
writer = csv.writer(csvfile)
zero = np.zeros(144)
for line1 in context1:
    array1 = line1.replace('\n','').split(',')
    # Users who visited any merchant in the target Location
    sim_user = set()
    for line2 in context2:
        array2 = line2.replace('\n','').split(',')
        if int(array2[0])==0:
            continue
        if array1[1] == array2[2]:
            sim_user.add(array2[0])
    # Euclidean distance from every candidate user to the target user
    a2 = np.asarray(u_feature.get(array1[0], zero))
    dist = {}
    for a in sim_user:
        a1 = np.asarray(u_feature.get(a, zero))
        dist[a] = np.linalg.norm(a1 - a2)
    d = sorted(dist.items(), key=lambda t: t[1])
    # Count the visits of the 100 nearest users to merchants in the target Location
    merchant = {}
    for key, _ in d[:100]:  # the original compared array[0] with the whole (user, distance) tuple
        for line in context2:
            array = line.replace('\n','').split(',')
            if array[0] == key and array1[1] in shangquan[array[1]]:
                merchant[array[1]] = merchant.get(array[1], 0) + 1
    # Score = visit count * Budget; keep the ten highest
    for x in merchant:
        merchant[x] = merchant[x] * int(budget[x])
    s = sorted(merchant.items(), key=lambda t: t[1], reverse=True)
    array1.append(':'.join(y[0] for y in s[:10]))
    writer.writerow(array1)
csvfile.close()
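Much of the slowness comes from rescanning every training line for each test user and again for each neighbor. One possible remedy, sketched here under assumptions (the function build_indexes and the sample rows are made up), is to read the training records once and index them by Location and by user.

```python
from collections import defaultdict


def build_indexes(train_rows):
    """Index (User_id, Merchant_id, Location_id, Time_Stamp) rows in one pass."""
    users_by_location = defaultdict(set)    # Location_id -> users seen there
    merchants_by_user = defaultdict(list)   # User_id -> visited merchants (with repeats)
    for user_id, merchant_id, location_id, _ts in train_rows:
        users_by_location[location_id].add(user_id)
        merchants_by_user[user_id].append(merchant_id)
    return users_by_location, merchants_by_user


# Made-up training rows in the Table_2 column order
train_rows = [
    ("u1", "m1", "L1", "20151001"),
    ("u1", "m1", "L1", "20151002"),
    ("u1", "m2", "L2", "20151003"),
    ("u2", "m3", "L1", "20151004"),
]
users_by_location, merchants_by_user = build_indexes(train_rows)
print(sorted(users_by_location["L1"]), merchants_by_user["u1"])
```

With these two dictionaries, the candidate set for a test Location and the visit list of each neighbor become dictionary lookups instead of full scans of the training file.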
Evaluation metric