Data Analysis Notes for Market Research (8): Association Rules

Source: Internet | Editor: 程序博客网 | Date: 2024/05/19 05:03

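Association rules

Association rules are evaluated with three measures:
Support: the share of all transactions that contain the item set. A useful item set has to occur reasonably often.
Confidence: the probability that item B appears in a transaction given that item A appears in it. A high value indicates a strong relationship.
Lift: how many times more often the combination occurs than would be expected if the items occurred independently of each other. A value well above 1 means the co-occurrence is more than chance.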
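The three measures above can be written compactly. The notation here is an assumption added for illustration: $N$ is the total number of transactions, $t$ is a single transaction, and $A$ and $B$ are disjoint item sets.

```latex
\begin{align*}
\mathrm{supp}(A) &= \frac{\lvert \{\, t : A \subseteq t \,\} \rvert}{N} \\
\mathrm{conf}(A \Rightarrow B) &= \frac{\mathrm{supp}(A \cup B)}{\mathrm{supp}(A)} \\
\mathrm{lift}(A \Rightarrow B) &= \frac{\mathrm{conf}(A \Rightarrow B)}{\mathrm{supp}(B)}
  = \frac{\mathrm{supp}(A \cup B)}{\mathrm{supp}(A)\,\mathrm{supp}(B)}
\end{align*}
```

A lift of 1 corresponds to independence between $A$ and $B$; the support reported for a rule is $\mathrm{supp}(A \cup B)$.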

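Receipt data

Loading the data

    library(arules)
    library(car)                  # provides some(), which is not part of arules
    data("Groceries")
    summary(Groceries)
    inspect(some(Groceries, 5))   # show 5 randomly chosen transactions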

Output:

    items
    [1] {UHT-milk,
         domestic eggs,
         brown bread,
         coffee,
         soda,
         canned beer,
         liquor (appetizer),
         newspapers}
    [2] {brown bread,
         margarine}
    [3] {fruit/vegetable juice,
         waffles}
    [4] {butter milk,
         coffee}
    [5] {frozen meals}

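As the output shows, association rules work on data that is not laid out as a matrix: each "{}" is one transaction, and the number of items varies from one transaction to the next.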
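To make the data structure concrete, here is a minimal base-R sketch that stores baskets of unequal length as a list of character vectors and computes support, confidence and lift by hand for one rule. The baskets, the item names and the helper `support()` are made up for illustration; this is not how arules stores or computes things internally.

```r
# Toy baskets (made-up data): a list of character vectors of unequal length
baskets <- list(
  c("milk", "bread", "eggs"),
  c("milk", "bread"),
  c("bread"),
  c("milk", "eggs"),
  c("coffee")
)

# Hypothetical helper: share of baskets that contain every item in `items`
support <- function(items, baskets) {
  mean(vapply(baskets, function(b) all(items %in% b), logical(1)))
}

# Measures for the rule {milk} => {bread}
lhs <- "milk"
rhs <- "bread"
supp_rule <- support(c(lhs, rhs), baskets)      # share of baskets with both items
conf_rule <- supp_rule / support(lhs, baskets)  # share of milk baskets that also hold bread
lift_rule <- conf_rule / support(rhs, baskets)  # confidence relative to the base rate of bread

print(c(support = supp_rule, confidence = conf_rule, lift = lift_rule))
```

On real data the apriori() function does this search over all candidate item sets, which is why the hand calculation is only practical for checking a single rule.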

Exploring association rules

    ar <- apriori(Groceries, parameter = list(supp = 0.01, conf = 0.3, target = "rules"))
    # Adjust supp and conf to suit the data and the business question.
    inspect(subset(ar, lift > 2.5))

Output:

    # apriori output
    Parameter specification:
     confidence minval smax arem  aval originalSupport maxtime support minlen maxlen target   ext
            0.3    0.1    1 none FALSE            TRUE       5    0.01      1     10  rules FALSE

    Algorithmic control:
     filter tree heap memopt load sort verbose
        0.1 TRUE TRUE  FALSE TRUE    2    TRUE

    Absolute minimum support count: 98

    set item appearances ...[0 item(s)] done [0.00s].
    set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
    sorting and recoding items ... [88 item(s)] done [0.00s].
    creating transaction tree ... done [0.00s].
    checking subsets of size 1 2 3 4 done [0.00s].
    writing ... [125 rule(s)] done [0.00s].
    creating S4 object  ... done [0.00s].

    # inspect output
         lhs                                      rhs                support    confidence lift
    [1]  {beef}                                => {root vegetables}  0.01738688 0.3313953  3.040367
    [2]  {whole milk,curd}                     => {yogurt}           0.01006609 0.3852140  2.761356
    [3]  {yogurt,whipped/sour cream}           => {other vegetables} 0.01016777 0.4901961  2.533410
    [4]  {other vegetables,whipped/sour cream} => {yogurt}           0.01016777 0.3521127  2.524073
    [5]  {citrus fruit,root vegetables}        => {other vegetables} 0.01037112 0.5862069  3.029608
    [6]  {citrus fruit,other vegetables}       => {root vegetables}  0.01037112 0.3591549  3.295045
    [7]  {tropical fruit,root vegetables}      => {other vegetables} 0.01230300 0.5845411  3.020999
    [8]  {tropical fruit,other vegetables}     => {root vegetables}  0.01230300 0.3427762  3.144780
    [9]  {tropical fruit,whole milk}           => {yogurt}           0.01514997 0.3581731  2.567516
    [10] {root vegetables,yogurt}              => {other vegetables} 0.01291307 0.5000000  2.584078
    [11] {root vegetables,rolls/buns}          => {other vegetables} 0.01220132 0.5020921  2.594890
    [12] {other vegetables,whole milk}         => {root vegetables}  0.02318251 0.3097826  2.842082

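Interpretation: 12 rules have a lift greater than 2.5. Take rule [2] as an example. A basket that contains {whole milk, curd} is about 2.8 times as likely to contain yogurt as a basket chosen at random (lift). The three items appear together in about 1% of all transactions (support). And yogurt appears in 38.5% of the transactions that contain {whole milk, curd} (confidence).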
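The figures reported for the rule {whole milk, curd} => {yogurt} can be cross-checked with a little arithmetic. The sketch below uses only the numbers printed in the output (support, confidence, lift and the 9835 transactions) and the standard relation lift = confidence / supp(rhs).

```latex
\begin{align*}
\mathrm{supp}(\{\text{yogurt}\})
  &= \frac{\mathrm{conf}}{\mathrm{lift}}
   = \frac{0.3852140}{2.761356} \approx 0.1395 \\
\text{transactions containing all three items}
  &= 0.01006609 \times 9835 \approx 99
\end{align*}
```

So yogurt appears in roughly 14% of all baskets but in 38.5% of the baskets with whole milk and curd, and the rule rests on about 99 transactions, just above the absolute minimum support count of 98.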

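Retail data

Loading the data

Import the Belgian retail transaction data set published by Brijs for the retail analysis.

    rtl.raw <- readLines("http://fimi.ua.ac.be/data/retail.dat")
    rtl.list <- strsplit(rtl.raw, " ")                            # split each line of text into individual items
    names(rtl.list) <- paste("trsct", seq_along(rtl.list), sep = "")  # name each transaction
    rtl.trsct <- as(rtl.list, "transactions")                     # convert to a transactions object
    summary(rtl.trsct)

    # Simulate margin data so that profit per transaction can be calculated
    rtl.vct <- sort(unique(unlist(as(rtl.trsct, "list"))))        # compile the list of item names
    set.seed(0017)
    rtl.mrgn <- data.frame(margin = rnorm(n = length(rtl.vct), mean = 0.30, sd = 0.30))  # generate random margins
    rownames(rtl.mrgn) <- rtl.vct                                 # index the generated values by item name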
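Exploring and visualizing association rules

    rtl.rules <- apriori(rtl.trsct, parameter = list(supp = 0.01, conf = 0.5))
    library(arulesViz)
    plot(rtl.rules, interactive = TRUE)  # recent arulesViz releases may expect engine = "interactive" instead

apriori output:

    Apriori

    Parameter specification:
     confidence minval smax arem  aval originalSupport maxtime support minlen maxlen
            0.5    0.1    1 none FALSE            TRUE       5    0.01      1     10
     target   ext
      rules FALSE

    Algorithmic control:
     filter tree heap memopt load sort verbose
        0.1 TRUE TRUE  FALSE TRUE    2    TRUE

    Absolute minimum support count: 881

    set item appearances ...[0 item(s)] done [0.00s].
    set transactions ...[16470 item(s), 88162 transaction(s)] done [0.24s].
    sorting and recoding items ... [70 item(s)] done [0.01s].
    creating transaction tree ... done [0.06s].
    checking subsets of size 1 2 3 4 done [0.01s].
    writing ... [111 rule(s)] done [0.00s].
    creating S4 object  ... done [0.03s].

Plot output:
[Figure: scatter plot of the rules]
Interactive plot interface:
[Figure: interactive plot window]

With supp = 0.01 and conf = 0.5, apriori returns 111 rules. The figure 881 in the output is the absolute minimum support count, that is, 1% of the 88,162 transactions, not the number of rules. The scatter plot shows how the rules are distributed, and the interactive option produces an interactive version of the plot that is driven from the console.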

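Plotting a subset of rules
Beyond the full set of 111 rules, it is also worth examining the subset of rules with the highest lift:

    rtl.top30 <- head(sort(rtl.rules, by = "lift"), 30)
    plot(rtl.top30, method = "graph", control = list(type = "items"))

Two clear clusters of rules appear, centered on items 39, 48 and 38, with item 39 acting as the bridge between the two clusters.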

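Non-transactional data

Association rules are not limited to retail transaction data. By treating the variable values in one record as a single "transaction", the same method can be used to look for relationships between variables. This section uses the customer segmentation data set that was used earlier for cluster analysis.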
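The idea of treating a record as a transaction can be sketched in base R: each row becomes a basket of variable=value items. The data frame, its column names and the helper `row_to_items()` are made up for illustration; arules performs an equivalent conversion when a data frame of factors is coerced to a transactions object.

```r
# Made-up customer records; the column names are illustrative only
customers <- data.frame(
  age_group = c("19~24", "55~64"),
  ownHome   = c("ownNo", "ownYes"),
  stringsAsFactors = TRUE
)

# Hypothetical helper: turn row i into a "basket" of variable=value items
row_to_items <- function(i, df) {
  values <- vapply(df, function(col) as.character(col[i]), character(1))
  paste0(names(df), "=", values)
}

baskets <- lapply(seq_len(nrow(customers)), row_to_items, df = customers)
print(baskets)
```

Every record yields exactly one item per variable, so all baskets have the same length, unlike the receipt data.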

Loading and preparing the data

    # stringsAsFactors = TRUE keeps text columns as factors, which the conversion to transactions expects
    cls <- read.csv("http://r-marketing.r-forge.r-project.org/data/rintro-chapter5.csv",
                    stringsAsFactors = TRUE)

    # Cut the continuous variables into ordered categories
    cls_cut <- cls
    cls_cut$age <- cut(cls$age,
                       breaks = c(0, 25, 35, 45, 55, 65, 100),
                       labels = c("19~24", "25~34", "35~44", "45~54", "55~64", "65+"),
                       right = FALSE, ordered_result = TRUE)
    cls_cut$kids <- cut(cls$kids,
                        breaks = c(0, 1, 2, 3, 100),
                        labels = c("no_kid", "one_kid", "two_kid", "three_kid+"),
                        right = FALSE, ordered_result = TRUE)
    cls_cut$income <- cut(cls$income,
                          breaks = c(-100000, 40000, 70000, 1000000),
                          labels = c("low", "medium", "high"),
                          right = FALSE, ordered_result = TRUE)
    summary(cls_cut)
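Because the calls above use right = FALSE, each interval includes its lower bound and excludes its upper bound. A small sketch with made-up ages (the vector `ages` is an assumption for illustration) shows where the boundary values land when the same breaks and labels are applied:

```r
# Made-up ages sitting on either side of two interval boundaries
ages <- c(24, 25, 64, 65)

age_group <- cut(ages,
                 breaks = c(0, 25, 35, 45, 55, 65, 100),
                 labels = c("19~24", "25~34", "35~44", "45~54", "55~64", "65+"),
                 right = FALSE, ordered_result = TRUE)

# Show each age next to the category it was assigned
print(data.frame(age = ages, group = age_group))
```

An age of 25 falls into "25~34" and an age of 65 into "65+". A value exactly equal to the top break (100 here) would fall outside the last interval and become NA, so the upper break should sit above the largest value in the data.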

Exploring and visualizing association rules

    cls_trsct <- as(cls_cut, "transactions")
    cls.rules <- apriori(cls_trsct, parameter = list(supp = 0.1, conf = 0.4))
    plot(cls.rules)
    inspect(subset(cls.rules, lift > 5))
    cls_top50 <- sort(cls.rules, by = "lift")[1:50]
    plot(cls_top50, method = "graph", control = list(type = "items"))

Output:

         lhs                                                 rhs                   support   confidence lift
    [1]  {age=19~24}                                      => {Segment=Urban hip}   0.1266667 1.00       6.000000
    [2]  {Segment=Urban hip}                              => {age=19~24}           0.1266667 0.76       6.000000
    [3]  {age=19~24, income=low}                          => {Segment=Urban hip}   0.1266667 1.00       6.000000
    [4]  {income=low, Segment=Urban hip}                  => {age=19~24}           0.1266667 0.76       6.000000
    [5]  {age=19~24, ownHome=ownNo}                       => {Segment=Urban hip}   0.1000000 1.00       6.000000
    [6]  {ownHome=ownNo, Segment=Urban hip}               => {age=19~24}           0.1000000 0.75       5.921053
    [7]  {age=19~24, subscribe=subNo}                     => {Segment=Urban hip}   0.1000000 1.00       6.000000
    [8]  {subscribe=subNo, Segment=Urban hip}             => {age=19~24}           0.1000000 0.75       5.921053
    [9]  {age=19~24, income=low, ownHome=ownNo}           => {Segment=Urban hip}   0.1000000 1.00       6.000000
    [10] {income=low, ownHome=ownNo, Segment=Urban hip}   => {age=19~24}           0.1000000 0.75       5.921053
    [11] {age=19~24, income=low, subscribe=subNo}         => {Segment=Urban hip}   0.1000000 1.00       6.000000
    [12] {income=low, subscribe=subNo, Segment=Urban hip} => {age=19~24}           0.1000000 0.75       5.921053

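Graph plot of the rules:
[Figure: graph plot of the top 50 rules by lift]

The graph shows two groups of characteristics that are tightly associated internally (high lift). Age 19~24, low income and not owning a home form one group; age 55~64, home owner, the Travelers segment and related items form the other. Changing the lift range reveals further clusters.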
