Recommending Koubei merchants from Taobao click and purchase records: user-based collaborative filtering

Source: Internet  Editor: 程序博客网  Time: 2024/04/29 05:16

Table_1: Online user behavior before Dec. 2015. (ijcai2016_taobao)
User_id, Seller_id, Item_id, Category_id, Online_Action_id, Time_Stamp
Category_id is the item category (72 categories in total); Online_Action_id is 0 for a click and 1 for a purchase.
Table_2: Users' shopping records at brick-and-mortar stores before Dec. 2015. (ijcai2016_koubei_train)
User_id, Merchant_id, Location_id, Time_Stamp
Table_3: Merchant information. (ijcai2016_merchant_info)
Merchant_id, Budget (budget constraints imposed on the merchant), Location_id_list
Table_4: Prediction result. (ijcai2016_koubei_test)
User_id, Location_id, Merchant_id_list
1. Filter Table_1 by the User_id values that appear in Table_2 and Table_4; this cuts the data from more than 40 million rows to more than 20 million;

import os
import pandas as pd

os.chdir('c:/Bai/taobao/data sets')

# Users that appear in the Koubei test or training data
df1 = pd.read_csv('ijcai2016_koubei_test', header=None)
df2 = pd.read_csv('ijcai2016_koubei_train', header=None)
sets = set(df1[0]) | set(df2[0])

# Keep only the Taobao rows that belong to those users
df3 = pd.read_csv('ijcai2016_taobao', header=None)
df3 = df3[df3[0].isin(sets)]
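
With more than 40 million rows, loading the whole Taobao table at once may strain memory. As a hedged alternative, the same filter can be applied chunk by chunk with the `chunksize` option of `pd.read_csv`. The helper name `filter_by_users` and the toy rows below are made up for illustration; they are not part of the original code.

```python
import io
import pandas as pd

def filter_by_users(source, keep_users, chunksize=1_000_000):
    """Read a header-less CSV in chunks and keep rows whose first column is in keep_users."""
    kept = []
    for chunk in pd.read_csv(source, header=None, chunksize=chunksize):
        kept.append(chunk[chunk[0].isin(keep_users)])
    return pd.concat(kept, ignore_index=True)

# Toy rows in the Table_1 column order (values are invented).
toy = io.StringIO(
    "1,10,100,3,0,20150701\n"
    "2,11,101,4,1,20150702\n"
    "3,12,102,5,0,20150703\n"
)
filtered = filter_by_users(toy, {1, 3}, chunksize=2)
print(filtered[0].tolist())  # [1, 3]
```

On the real data, `source` would be the path to ijcai2016_taobao and `keep_users` the set built from the two Koubei files.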

2. Aggregate the Taobao data into feature.csv, whose columns are: User_id, Category_id, Online_Action_id, number (the count of matching records);

import csv
import os

os.chdir('c:/Bai/taobao/data sets')

# u_dict[action] maps (User_id, Category_id) to the number of records
u_dict = [{} for i in range(2)]
with open('taobao.csv') as f:
    for line in f:
        array = line.strip().split(',')
        if int(array[0]) == 0:  # skip rows whose first field is 0, as in the original
            continue
        u_id = (array[0], array[3])
        action = int(array[4])  # 0 = click, 1 = purchase
        u_dict[action][u_id] = u_dict[action].get(u_id, 0) + 1

# One row per (user, category, action): User_id, Category_id, Online_Action_id, number
with open('feature.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    for i in range(2):
        for key, value in u_dict[i].items():
            writer.writerow([int(key[0]), int(key[1]), i, value])
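
The same counts can probably be obtained more compactly with a pandas `groupby`. This is a minimal sketch on invented rows, not the author's code; the column names follow the Table_1 description and the variable names are illustrative.

```python
import io
import pandas as pd

# Toy rows in the Table_1 column order (values are invented).
toy_csv = io.StringIO(
    "1,10,100,3,0,20150701\n"
    "1,10,101,3,0,20150702\n"
    "1,11,102,3,1,20150703\n"
    "2,12,103,5,0,20150704\n"
)
columns = ["User_id", "Seller_id", "Item_id", "Category_id",
           "Online_Action_id", "Time_Stamp"]
taobao = pd.read_csv(toy_csv, header=None, names=columns)

# Count rows per (user, category, action), mirroring the dictionary counting.
feature = (
    taobao.groupby(["User_id", "Category_id", "Online_Action_id"])
    .size()
    .reset_index(name="number")
)
print(feature)
```

Calling `feature.to_csv('feature.csv', header=False, index=False)` would then write the four columns in the layout described in step 2.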

3. Build the user feature dictionary from feature.csv: {User_id: [144-dimensional array]} (72 categories × 2 action types, 0 for click and 1 for purchase);

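# -*- coding:utf-8 -*-
__author__ = 'Bai'
import csv
import math
import os
import numpy as np
import pandas as pd
from sklearn import preprocessing

os.chdir('c:/Bai/taobao/data sets')

## User feature dictionary
u_feature = {}
with open('feature.csv') as f:
    for line in f:
        array = line.strip().split(',')
        # Create the vector the first time a user is seen, then always store the count
        # (the original else-branch created the vector but dropped that first record)
        if array[0] not in u_feature:
            u_feature[array[0]] = [0 for i in range(144)]
        # Slot for (category, action); this assumes Category_id runs from 1 to 72
        i = 2 * int(array[1]) + int(array[2]) - 2
        u_feature[array[0]][i] = int(array[3])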

4. Normalize the feature dictionary so that every user's feature vector has length (L2 norm) 1;

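## Normalization
for key in u_feature:
    # normalize() expects a 2-D array, so reshape to a single row and take that row back
    vec = np.array(u_feature[key], dtype=float).reshape(1, -1)
    u_feature[key] = preprocessing.normalize(vec, norm='l2')[0]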
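
To make steps 3 and 4 concrete, here is a small worked example using NumPy only. It assumes Category_id runs from 1 to 72, which the index formula in step 3 implies but the post does not state. The helpers `feature_index` and `l2_normalize` are illustrative names, not part of the original code.

```python
import numpy as np

N_CATEGORIES = 72  # from the data description
N_ACTIONS = 2      # 0 = click, 1 = purchase

def feature_index(category_id, action_id):
    """Slot of a (category, action) pair, assuming category ids run from 1 to 72."""
    return N_ACTIONS * (category_id - 1) + action_id

def l2_normalize(vec):
    """Scale vec to unit L2 norm; an all-zero vector is returned unchanged."""
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec

# A user with 3 clicks and 4 purchases in category 3.
vec = np.zeros(N_CATEGORIES * N_ACTIONS)
vec[feature_index(3, 0)] = 3
vec[feature_index(3, 1)] = 4
unit = l2_normalize(vec)
print(unit[feature_index(3, 0)], unit[feature_index(3, 1)])  # 0.6 0.8
```

The vector (3, 4) has norm 5, so after normalization the two non-zero entries become 0.6 and 0.8. A heavy user and a light user with the same proportions therefore end up with the same feature vector.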

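5. For each user User_a to predict, look at the user's Location: if that Location has fewer than ten merchants, recommend all of them;
6. If the Location has more than ten merchants, select the users who have visited any merchant at that Location on Koubei; they form the user set Set;
7. Within Set, compute each user's similarity to User_a (Euclidean distance between feature vectors; a smaller distance means a more similar user);
8. Take the 100 most similar users and count, per merchant, how many visits these 100 users made; merchants that are not at the Location of the user being predicted are not counted;
9. Multiply each merchant's count by its Budget to get the merchant's score, and recommend the ten highest-scoring merchants to User_a.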
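
Steps 5 to 9 can be sketched compactly in memory. The following is a toy illustration under stated assumptions, not the author's script. The function `recommend`, its arguments and all data are invented, and the neighbour count and list length are parameters so that the small example exercises every step.

```python
from collections import Counter
import numpy as np

def recommend(user, location, visits, features, merchant_locations, budget,
              n_neighbors=100, top_k=10):
    """Toy version of steps 5 to 9; visits holds (user, merchant, location) tuples."""
    # Step 5: few merchants at this Location, so recommend them all
    local = sorted(m for m, locs in merchant_locations.items() if location in locs)
    if len(local) <= top_k:
        return local
    # Step 6: users who visited any merchant at this Location
    candidates = {u for u, _, loc in visits if loc == location}
    # Step 7: Euclidean distance to the target user (zero vector if a user is unknown)
    zero = np.zeros_like(next(iter(features.values())))
    target = features.get(user, zero)
    dist = {u: float(np.linalg.norm(features.get(u, zero) - target)) for u in candidates}
    # Step 8: visit counts of the nearest users, restricted to local merchants
    nearest = set(sorted(dist, key=dist.get)[:n_neighbors])
    counts = Counter(m for u, m, _ in visits if u in nearest and m in local)
    # Step 9: score = count * Budget, highest first
    score = {m: n * budget[m] for m, n in counts.items()}
    return sorted(score, key=score.get, reverse=True)[:top_k]

features = {
    "a": np.array([1.0, 0.0]), "u1": np.array([1.0, 0.0]),
    "u2": np.array([0.8, 0.6]), "u3": np.array([0.0, 1.0]),
}
visits = [("u1", "m1", "L1"), ("u1", "m1", "L1"), ("u2", "m2", "L1"),
          ("u3", "m3", "L1"), ("u3", "m3", "L1"), ("u3", "m3", "L1")]
merchant_locations = {"m1": {"L1"}, "m2": {"L1"}, "m3": {"L1"}, "m4": {"L2"}}
budget = {"m1": 10, "m2": 50, "m3": 100, "m4": 5}

result = recommend("a", "L1", visits, features, merchant_locations, budget,
                   n_neighbors=2, top_k=2)
print(result)  # ['m2', 'm1']
```

Here u1 and u2 are the two nearest users; m1 scores 2 × 10 = 20 and m2 scores 1 × 50 = 50, so m2 ranks first. The popular merchant m3 is never counted because its only visitor, u3, is too far from the target user.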

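Shortcomings:
1. I do not know how to use Time_Stamp, or whether the Taobao click times are related to the Koubei purchase times.
2. The code is far too slow: for every test row it rescans the whole training file, once to build the user set and again for each of the similar users.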
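
On the second point, one possible speed-up is to index the training records once and to compute all distances for a Location in a single NumPy operation. This is only a sketch under those assumptions and has not been measured on the real data. The names `build_indexes` and `nearest_users` and the toy rows are invented for illustration.

```python
from collections import defaultdict
import numpy as np

def build_indexes(train_rows):
    """One pass over (user, merchant, location) rows instead of a rescan per test row."""
    users_by_location = defaultdict(set)
    merchants_by_user = defaultdict(list)
    for user, merchant, location in train_rows:
        users_by_location[location].add(user)
        merchants_by_user[user].append(merchant)
    return users_by_location, merchants_by_user

def nearest_users(target_vec, candidates, features, k=100):
    """Return up to k candidates closest to target_vec by Euclidean distance."""
    candidates = sorted(candidates)
    if not candidates:
        return []
    zero = np.zeros_like(target_vec)
    # Stack all candidate vectors and compute every distance at once
    matrix = np.vstack([features.get(u, zero) for u in candidates])
    dists = np.linalg.norm(matrix - target_vec, axis=1)
    order = np.argsort(dists, kind="stable")[:k]
    return [candidates[i] for i in order]

train_rows = [("u1", "m1", "L1"), ("u2", "m2", "L1"),
              ("u3", "m3", "L2"), ("u1", "m3", "L2")]
users_by_location, merchants_by_user = build_indexes(train_rows)
features = {"u1": np.array([1.0, 0.0]), "u2": np.array([0.0, 1.0])}
closest = nearest_users(np.array([0.9, 0.1]), users_by_location["L1"], features, k=1)
print(closest)  # ['u1']
```

With these indexes, each test row needs a dictionary lookup for its Location and one matrix operation for the distances, rather than repeated passes over the training file.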

f1 = open("koubei_test.csv")
f2 = open("koubei_train.csv")
f3 = open("merchant_info.csv")
context1 = f1.readlines()
context2 = f2.readlines()
context3 = f3.readlines()

# Merchant_id -> list of Location_ids, and Merchant_id -> Budget
shangquan = {}
budget = {}
for v in context3:
    v1 = v.strip().split(',')
    # Accept location lists separated by ',' or ':' (the exact file layout is an assumption)
    shangquan[v1[0]] = [loc for field in v1[2:] for loc in field.split(':')]
    budget[v1[0]] = float(v1[1])

csvfile = open('forcast.csv', 'w', newline='')
writer = csv.writer(csvfile)
zero = np.zeros(144)
# Note: step 5 (recommend everything when the Location has fewer than ten merchants)
# is not handled in this script.
for line1 in context1:
    array1 = line1.strip().split(',')  # User_id, Location_id
    # Step 6: users who visited the target Location
    sim_user = set()
    for line2 in context2:
        array2 = line2.strip().split(',')  # User_id, Merchant_id, Location_id, ...
        if int(array2[0]) == 0:
            continue
        if array1[1] == array2[2]:
            sim_user.add(array2[0])
    # Step 7: Euclidean distance between each candidate and the target user
    a2 = np.asarray(u_feature.get(array1[0], zero), dtype=float).ravel()
    dist = {}
    for a in sim_user:
        a1 = np.asarray(u_feature.get(a, zero), dtype=float).ravel()
        dist[a] = np.linalg.norm(a1 - a2)
    d = sorted(dist.items(), key=lambda t: t[1])
    # Step 8: count visits by the 100 nearest users to merchants at the target Location
    # (the original compared a user id with a (user, distance) tuple, which never matched)
    nearest = set(user for user, _ in d[:100])
    merchant = {}
    for line in context2:
        array = line.strip().split(',')
        if array[0] in nearest and array1[1] in shangquan.get(array[1], []):
            merchant[array[1]] = merchant.get(array[1], 0) + 1
    # Step 9: score = count * Budget; keep the ten best, joined with ':'
    for x in merchant:
        merchant[x] = merchant[x] * budget[x]
    s = sorted(merchant.items(), key=lambda t: t[1], reverse=True)
    content = ':'.join(m for m, _ in s[:10])
    array1.append(content)
    writer.writerow(array1)  # one output row per test row
csvfile.close()
