Product Affinity Analysis Example - Python Data Mining

Source: Internet. Editor: 程序博客网. Time: 2024/06/16 02:34

= Before You Start =

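  1. Install Python 3 (download link: python3).
  2. Install the packages you need with pip install xxxx.
  3. The default pip index is very slow; after switching to a mirror, downloads became several hundred times faster for me.
  4. Use PyCharm and Sublime Text as editors.
  5. The data and source code used in the book: click to download.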

= Getting Started =

A simple affinity analysis example

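Given enough data, we can test a hypothesis against it. Affinity analysis determines how similar individual samples are and how closely they are related. Some example applications:
* Offering diverse services or targeted advertising to website users
* Recommending related products to customers who have bought a product
* Finding related people based on their genes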

In the example that follows, a store sells five products:

features = ["bread", "milk", "cheese", "apples", "bananas"]

The purchase records are stored in affinity_dataset.txt. We load the file with NumPy and print the first few rows to inspect its format:

import numpy as np

dataset_filename = "affinity_dataset.txt"
X = np.loadtxt(dataset_filename)
# X.shape is (n_samples, n_goods): each row (sample) is one person's
# purchase record, and each column is one of the five goods.
n_samples, n_goods = X.shape
print(X[:5])
# Output:
# [[ 0.  0.  1.  1.  1.]
#  [ 1.  1.  0.  1.  0.]
#  [ 1.  0.  1.  1.  0.]
#  [ 0.  0.  1.  1.  1.]
#  [ 0.  1.  0.  0.  1.]]

Based on the structure of the data, we can propose a hypothesis N: "If someone buys x1, they will also buy x2", and then look for a way to verify it, for example:

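We verify it with support and confidence. Support counts how often hypothesis N holds: every time a record contains both x1 and x2, support += 1. Confidence is the number of times N holds divided by the number of records that contain x1, a fraction that can be read as a percentage. Both can be computed with a short algorithm.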
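To make the two definitions concrete, here is a minimal sketch on a tiny hand-made array. The `toy` data and the `rule_stats` helper are hypothetical names introduced only for illustration; they are not part of the original dataset or code.

```python
import numpy as np

# Hypothetical toy data: 5 purchase records x 2 products (columns: bread, milk).
toy = np.array([
    [1, 1],
    [1, 0],
    [1, 1],
    [0, 1],
    [0, 0],
])


def rule_stats(data, premise, conclusion):
    """Return (support, confidence) for the rule 'buys premise -> buys conclusion'."""
    bought_premise = data[:, premise] == 1
    bought_both = bought_premise & (data[:, conclusion] == 1)
    support = int(bought_both.sum())        # records where the rule holds
    n_premise = int(bought_premise.sum())   # records that contain the premise
    confidence = support / n_premise if n_premise else 0.0
    return support, confidence


support, confidence = rule_stats(toy, premise=0, conclusion=1)
print(support, round(confidence, 3))  # prints: 2 0.667
```

Three records contain bread and two of those also contain milk, so the rule "bread -> milk" has support 2 and confidence 2/3.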

We use defaultdict to build the counters, then compute support and confidence:
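For readers unfamiliar with `collections.defaultdict`, a short sketch of why it suits counting: a missing key is created with a default value on first access, so no existence check is needed. The item names below are arbitrary illustration data.

```python
from collections import defaultdict

counts = defaultdict(int)          # missing keys start at int() == 0
for item in ["milk", "bread", "milk"]:
    counts[item] += 1              # no KeyError the first time a key is seen

print(dict(counts))                # prints: {'milk': 2, 'bread': 1}
print(counts["cheese"])            # prints: 0 (reading a missing key also inserts it)
```

Note the side effect in the last line: merely reading a missing key adds it to the dictionary, which matters if you later iterate over the keys.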

from collections import defaultdict

valid_rules = defaultdict(int)
invalid_rules = defaultdict(int)
num_occurances = defaultdict(int)

# For each individual record (sin) in X: if the premise product was bought,
# increment its purchase count (num_occurances). Then loop over every
# conclusion product: if the person also bought it, the rule is confirmed
# (valid_rules += 1); if not, the rule is contradicted (invalid_rules += 1).
for sin in X:
    # Consider every product as a possible premise
    for premise in range(n_goods):
        if sin[premise] == 0:
            continue
        num_occurances[premise] += 1
        for conclusion in range(n_goods):
            # A rule from a product to itself is meaningless
            if premise == conclusion:
                continue
            if sin[conclusion] == 1:
                valid_rules[(premise, conclusion)] += 1
            else:
                invalid_rules[(premise, conclusion)] += 1

# Support is the number of times each rule was confirmed
support = valid_rules
confidence = defaultdict(float)
# Tuple unpacking assigns each key to premise and conclusion in turn
for premise, conclusion in valid_rules.keys():
    rule = (premise, conclusion)
    confidence[rule] = valid_rules[rule] / num_occurances[premise]
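As a cross-check on the loop above, the same counts can be obtained with matrix operations. This is only a sketch of an alternative, not the method used in the original code; `rule_tables` is a hypothetical helper name, and the demo array is the five rows printed earlier rather than the full dataset.

```python
import numpy as np


def rule_tables(X):
    """Return (support, confidence) matrices indexed by [premise, conclusion]."""
    X = np.asarray(X, dtype=int)
    co_occurrence = X.T @ X                # [i, j] = records containing both i and j
    occurrences = np.diag(co_occurrence)   # [i] = records containing product i
    support = co_occurrence.copy()
    np.fill_diagonal(support, 0)           # ignore rules from a product to itself
    with np.errstate(divide="ignore", invalid="ignore"):
        confidence = np.where(
            occurrences[:, None] > 0,
            support / occurrences[:, None],
            0.0,
        )
    return support, confidence


# The first five records shown in the printed output above.
sample = [
    [0, 0, 1, 1, 1],
    [1, 1, 0, 1, 0],
    [1, 0, 1, 1, 0],
    [0, 0, 1, 1, 1],
    [0, 1, 0, 0, 1],
]
support_m, confidence_m = rule_tables(sample)
print(support_m[0, 3], confidence_m[0, 3])  # bread -> apples: prints 2 1.0
```

The matrix form trades the readable nested loop for a single product `X.T @ X`, which tends to be faster on large datasets but keeps a full n_goods x n_goods table in memory.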

As the last step, we sort the rules by support and by confidence:

from operator import itemgetter

# reverse=True sorts from largest to smallest.
# itemgetter(n) builds a function that fetches item n of its argument;
# passing several indices fetches several items at once.
sorted_support = sorted(support.items(), key=itemgetter(1), reverse=True)
sorted_confidence = sorted(confidence.items(), key=itemgetter(1), reverse=True)

# print_rule and print_line are printing helpers; their source is in the
# full listing at the end of this post.
for index in range(5):
    print("Rule #{}".format(index + 1))
    premise, conclusion = sorted_support[index][0]
    print_rule(premise, conclusion, support, confidence, features)
print_line()

for index in range(5):
    print("Rule #{}".format(index + 1))
    premise, conclusion = sorted_confidence[index][0]
    print_rule(premise, conclusion, support, confidence, features)
print_line()
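The sorting idiom can be seen in isolation in the sketch below. The counts in `demo_support` are made-up illustration values, not results from the dataset.

```python
from operator import itemgetter

# Hypothetical support counts keyed by (premise, conclusion) index pairs.
demo_support = {(0, 1): 14, (3, 2): 25, (2, 4): 7}

# .items() yields (key, value) pairs; itemgetter(1) picks the value as sort key.
ranked = sorted(demo_support.items(), key=itemgetter(1), reverse=True)
print(ranked)  # prints: [((3, 2), 25), ((0, 1), 14), ((2, 4), 7)]

# Each element is ((premise, conclusion), count), so [0] recovers the rule.
top_premise, top_conclusion = ranked[0][0]
```

`key=itemgetter(1)` behaves like `key=lambda pair: pair[1]`; either form works here.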

Source code for this section

import numpy as np
from collections import defaultdict
import pprint
from operator import itemgetter


def print_line():
    print("====================================================")
    print("\n")


dataset_filename = "affinity_dataset.txt"
X = np.loadtxt(dataset_filename)
n_samples, n_goods = X.shape
goods = ["bread", "milk", "cheese", "apples", "bananas"]

print(X[:5])
print_line()

# Count how many records contain apples (column 3)
num_apple_purchases = 0
for sin in X:
    if sin[3] == 1:
        num_apple_purchases += 1
print("{0} people bought Apples".format(num_apple_purchases))
print_line()

valid_rules = defaultdict(int)
invalid_rules = defaultdict(int)
num_occurances = defaultdict(int)

for sin in X:
    # Consider every product as a possible premise
    for premise in range(n_goods):
        if sin[premise] == 0:
            continue
        num_occurances[premise] += 1
        for conclusion in range(n_goods):
            if premise == conclusion:
                continue
            if sin[conclusion] == 1:
                valid_rules[(premise, conclusion)] += 1
            else:
                invalid_rules[(premise, conclusion)] += 1

support = valid_rules
confidence = defaultdict(float)
for premise, conclusion in valid_rules.keys():
    rule = (premise, conclusion)
    confidence[rule] = valid_rules[rule] / num_occurances[premise]


def print_rule(premise, conclusion, support, confidence, features):
    premise_name = features[premise]
    conclusion_name = features[conclusion]
    print("Rule : If a person buys {0} they will also buy {1}".format(premise_name, conclusion_name))
    print(" - Support : {}".format(support[(premise, conclusion)]))
    print(" - Confidence : {0:.3f}".format(confidence[(premise, conclusion)]))
    print_line()


# Print a single rule: milk (1) -> apples (3)
premise = 1
conclusion = 3
print_rule(premise, conclusion, support, confidence, goods)
print_line()

pprint.pprint(support)
pprint.pprint(list(support.items()))
print_line()

sorted_support = sorted(support.items(), key=itemgetter(1), reverse=True)
sorted_confidence = sorted(confidence.items(), key=itemgetter(1), reverse=True)

# Top five rules by support
for index in range(5):
    print("Rule #{}".format(index + 1))
    premise, conclusion = sorted_support[index][0]
    print_rule(premise, conclusion, support, confidence, goods)
print_line()

# Top five rules by confidence
for index in range(5):
    print("Rule #{}".format(index + 1))
    premise, conclusion = sorted_confidence[index][0]
    print_rule(premise, conclusion, support, confidence, goods)
print_line()