市场研究中的数据分析知识整理 (十)-贝叶斯的方案

来源:互联网 发布:2016最大网络传销案 编辑:程序博客网 时间:2024/05/19 06:35

贝叶斯

本文的所涉及的贝叶斯方法所涉及的数据和应对的问题都与之前总结相同,价值在于面对同一个问题给出不同的解决方案。

library(BayesFactor)library(MCMCpack)

贝叶斯ANOVA

贝叶斯ANOVA模型

anova <- read.csv("http://r-marketing.r-forge.r-project.org/data/rintro-chapter5.csv")names(anova) <- tolower(names(anova))#贝叶斯在方差估计过程中需要从后验分布中进行随机抽样,故需要设置随机种子set.seed(651700)anv1 <- lmBF(income ~ segment, data = anova)anv2 <- lmBF(income ~ segment + ownhome, data = anova)#通过模型比较,确定拟合是否收敛;贝叶斯可直接将两个模型用「/」相连进行比较anv1/anv2

结果输出:

Bayes factor analysis--------------[1] segment : 6.623943 ±1.29%Against denominator:  income ~ segment + ownhome ---Bayes factor type: BFlinearModel, JZS

输出解读:

  • 什么是贝叶斯因子(Bayes factor):

  • 贝叶斯因子的作用:

    The aim of the Bayes factor is to quantify the support for a model over another, regardless of whether these models are correct.

  • 如何看待贝叶斯因子

具体参照 Harold Jeffreys给出的K值范围解释

两个模型的贝叶斯因子比例是6.6,很明显是anv1好于anv2。

ANOVA模型参数估计和置信区间

#参数估计和置信区间函数 posterior(model, index,iterations)为马尔可夫过程估计, 其中iterations为从参数后验中抽取的样本数anv_chain <- posterior(anv1, 1, iterations = 10000)#本案例中,选择查看马尔科夫随基抽样结果的群值(mu)和方差(sigma),以及segment中四各类的均值估计plot(anv_chain[,1:6])

这里写图片描述

这里写图片描述

其中,左边是踪迹图展示样本的收敛程度,本案例呈现较宽直线分布,稳定且收敛(与此相对的不收敛情况是呈现上下波动或者存在大量离群值)。右边的密度图则显示近似正态分布符合回归模型假设。至此证明,模型是稳定的。

summary(anv_chain)

结果输出:

Iterations = 1:10000Thinning interval = 1 Number of chains = 1 Sample size per chain = 10000 1. Empirical mean and standard deviation for each variable,   plus standard error of the mean:                         Mean        SD  Naive SE Time-series SEmu                  4.806e+04 8.925e+02 8.925e+00      8.691e+00segment-Moving up   4.980e+03 1.528e+03 1.528e+01      1.554e+01segment-Suburb mix  6.916e+03 1.381e+03 1.381e+01      1.381e+01segment-Travelers   1.400e+04 1.471e+03 1.471e+01      1.471e+01segment-Urban hip  -2.590e+04 1.759e+03 1.759e+01      1.785e+01sig2                2.257e+08 1.876e+07 1.876e+05      1.938e+05g_segment           2.204e+00 4.341e+00 4.341e-02      4.341e-022. Quantiles for each variable:                         2.5%        25%        50%        75%      97.5%mu                  4.631e+04  4.745e+04  4.805e+04  4.866e+04  4.979e+04segment-Moving up   2.017e+03  3.927e+03  4.981e+03  5.997e+03  7.925e+03segment-Suburb mix  4.195e+03  5.993e+03  6.903e+03  7.832e+03  9.663e+03segment-Travelers   1.111e+04  1.299e+04  1.400e+04  1.498e+04  1.689e+04segment-Urban hip  -2.931e+04 -2.708e+04 -2.589e+04 -2.474e+04 -2.242e+04sig2                1.913e+08  2.126e+08  2.248e+08  2.379e+08  2.657e+08g_segment           3.903e-01  8.269e-01  1.340e+00  2.354e+00  8.685e+00

由于anv_chain是由群体均值(mu)和各个参数和群体均值的偏差构成,还不是每个参数的均值,需要进一步处理。

anv_chain_t <- anv_chain[,2:5] + anv_chain[,1]#获取每个参数的分位数anv_ci <- t(apply(anv_chain_t,2,quantile, pr = c(0.025,0.5,0.975)))#置信区间可视化library(ggplot2)anv_ci_df <- data.frame(anv_ci)anv_ci_df$segment <- rownames(anv_ci_df)ggplot(anv_ci_df, aes(x = segment, y = X50., ymax = X97.5., ymin = X2.5.))+  geom_point(size = 5)+geom_errorbar(width = 0.2) + coord_flip() + ylab("income")

这里写图片描述

贝叶斯线性模型

amu <- read.csv("http://r-marketing.r-forge.r-project.org/data/rintro-chapter7.csv")amu$logdis <- log(amu$distance) #转化原有distance变量amu.std <- amu[ ,-3]  # 去原有distance变量amu.std[ , 3:8] <- scale(amu.std[ ,3:8])

线性模型拟合:

amu.std$child.or.not <- as.factor(amu.std$num.child >0)#一般线性模型lm4 <- lm(overall ~ rides + games + wait + clean +  logdis + child.or.not + wait:child.or.not, data = amu.std)summary(lm4)#贝叶斯线性模型lm_bys <- MCMCregress(overall ~ rides + games + wait + clean +  logdis + child.or.not + wait:child.or.not, data = amu.std)summary(lm_bys)

结果输出:

#一般线性模型Call:lm(formula = overall ~ rides + games + wait + clean + logdis +     child.or.not + wait:child.or.not, data = amu.std)Residuals:     Min       1Q   Median       3Q      Max -1.12939 -0.32350 -0.00167  0.31705  1.43253 Coefficients:                      Estimate Std. Error t value Pr(>|t|)    (Intercept)           -0.69316    0.03684 -18.814  < 2e-16 ***rides                  0.21264    0.03313   6.419 3.24e-10 ***games                  0.04870    0.02394   2.034   0.0425 *  wait                   0.15095    0.03688   4.093 4.98e-05 ***clean                  0.30244    0.03485   8.678  < 2e-16 ***logdis                 0.02919    0.02027   1.440   0.1504    child.or.notTRUE       0.99830    0.04416  22.606  < 2e-16 ***wait:child.or.notTRUE  0.34688    0.04380   7.920 1.59e-14 ***---Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1Residual standard error: 0.4508 on 492 degrees of freedomMultiple R-squared:  0.7996,    Adjusted R-squared:  0.7968 F-statistic: 280.5 on 7 and 492 DF,  p-value: < 2.2e-16#贝叶斯线性模型Iterations = 1001:11000Thinning interval = 1 Number of chains = 1 Sample size per chain = 10000 1. Empirical mean and standard deviation for each variable,   plus standard error of the mean:                          Mean      SD  Naive SE Time-series SE(Intercept)           -0.69331 0.03702 0.0003702      0.0003702rides                  0.21262 0.03351 0.0003351      0.0003301games                  0.04885 0.02400 0.0002400      0.0002400wait                   0.15096 0.03683 0.0003683      0.0003683clean                  0.30205 0.03515 0.0003515      0.0003515logdis                 0.02891 0.02029 0.0002029      0.0002029child.or.notTRUE       0.99837 0.04441 0.0004441      0.0004441wait:child.or.notTRUE  0.34733 0.04358 0.0004358      0.0004358sigma2                 0.20374 0.01306 0.0001306      0.00013062. Quantiles for each variable:                           2.5%      25%      50%      75%    97.5%(Intercept)           -0.764177 -0.71841 -0.69345 -0.66861 -0.62004rides                  0.145773  0.19015  0.21290  0.23499  0.27833games                  0.001507  0.03285  0.04876  0.06453  0.09668wait                   0.079481  0.12629  0.15060  0.17602  0.22353clean                  0.233243  0.27832  0.30218  0.32581  0.37076logdis                -0.010923  0.01539  0.02885  0.04262  0.06869child.or.notTRUE       0.910071  0.96896  0.99857  1.02800  1.08498wait:child.or.notTRUE  0.261291  0.31780  0.34720  0.37724  0.43211sigma2                 0.179781  0.19454  0.20311  0.21213  0.23094

对比一般线性模型和贝叶斯线性模型可知:

  • 贝叶斯线性模型由两部分构成,描述集中趋势的参数均值和标准差(「1.」),以及描述分布趋势的参数分位数;
    *贝叶斯参数估计结果没有统计检验(p值),要获知参数是否为非零(模型是否考虑具体某一个参数),可以看其分位值。如该模型中,只有logdis在2.5% ~ 97.5% 区间内的取值为(-0.01,0.07),是不显著非零的。

贝叶斯分层线性模型

cja_1 <- read.csv("http://r-marketing.r-forge.r-project.org/data/rintro-chapter9conjoint.csv")cja_1$speed <- factor(cja_1$speed)cja_1$height <- factor(cja_1$height)

分层初始线性模型拟合

#从非分层的模型入手,建立对于该数据基本关系认知set.seed(651700)cja_mc <- MCMCregress(rating ~ speed + height + const +theme, data = cja_1)summary(cja_mc)

结果输出:

Iterations = 1001:11000Thinning interval = 1 Number of chains = 1 Sample size per chain = 10000 1. Empirical mean and standard deviation for each variable,   plus standard error of the mean:               Mean      SD  Naive SE Time-series SE(Intercept)  3.0729 0.08112 0.0008112      0.0008112speed50      0.8208 0.11061 0.0011061      0.0011126speed60      1.5754 0.12889 0.0012889      0.0012889speed70      4.4873 0.15002 0.0015002      0.0015002height300    2.9444 0.09122 0.0009122      0.0009337height400    1.4461 0.12934 0.0012934      0.0013367constWood   -0.1187 0.11310 0.0011310      0.0011310themeEagle  -0.7533 0.11308 0.0011308      0.0011308sigma2       3.8705 0.09737 0.0009737      0.00097372. Quantiles for each variable:               2.5%     25%     50%      75%   97.5%(Intercept)  2.9175  3.0175  3.0722  3.12754  3.2349speed50      0.6027  0.7470  0.8225  0.89443  1.0331speed60      1.3230  1.4872  1.5766  1.66176  1.8302speed70      4.1944  4.3855  4.4850  4.59031  4.7802height300    2.7640  2.8834  2.9450  3.00725  3.1190height400    1.1920  1.3582  1.4442  1.53329  1.7048constWood   -0.3403 -0.1953 -0.1190 -0.04138  0.1023themeEagle  -0.9714 -0.8301 -0.7522 -0.67797 -0.5303sigma2       3.6839  3.8039  3.8683  3.93464  4.0691

本案例所用数据与上一篇文章(市场研究中的数据分析知识整理 (九)- 联合分析)中特征贡献模型数据和模型结构相同,可以看到结果也近乎一致。

分层线性模型拟合

MCMChregress()是贝叶斯估计分层模型的函数,其多出来的h正是hierarchical。

#其中fixed是固定效应的模型公式,random则是对每个个体估计的随机效应,group则是随机效应对应的分类变量,r和R是聚集受访者信息程度的两个参数,其中r是参数个数(此案例有8个参数),R是参数矩阵,通过函数diag(K)取值,其中的K与r相同。set.seed(651700)cja_mc2 <- MCMChregress(fixed = rating ~ speed + height + const +theme,                         random = ~ speed + height + const +theme,                        group = "resp.id",                        data = cja_1,r = 8, R = diag(8))

分层线性模型输出解读

#前8列是总体系数和群体偏好水平(固定效应)summary(cja_mc2$mcmc[,1:8])

结果输出:

Iterations = 1001:10991Thinning interval = 10 Number of chains = 1 Sample size per chain = 1000 1. Empirical mean and standard deviation for each variable,   plus standard error of the mean:                    Mean     SD Naive SE Time-series SEbeta.(Intercept)  3.0739 0.1694 0.005356       0.005457beta.speed50      0.8168 0.1398 0.004422       0.004422beta.speed60      1.5691 0.1618 0.005117       0.005569beta.speed70      4.4849 0.1862 0.005889       0.005889beta.height300    2.9474 0.1235 0.003904       0.003681beta.height400    1.4578 0.1796 0.005680       0.005680beta.constWood   -0.1128 0.1952 0.006172       0.005615beta.themeEagle  -0.7542 0.1857 0.005871       0.0058712. Quantiles for each variable:                    2.5%     25%     50%      75%   97.5%beta.(Intercept)  2.7389  2.9594  3.0764  3.18818  3.4099beta.speed50      0.5421  0.7251  0.8114  0.91274  1.0801beta.speed60      1.2604  1.4636  1.5725  1.68365  1.8804beta.speed70      4.1213  4.3599  4.4834  4.60792  4.8599beta.height300    2.7114  2.8642  2.9501  3.03263  3.1779beta.height400    1.0898  1.3429  1.4589  1.58500  1.8017beta.constWood   -0.5219 -0.2464 -0.1105  0.01628  0.2698beta.themeEagle  -1.0999 -0.8745 -0.7571 -0.63284 -0.3609

系数与贝叶斯一般线性模型的接近。

#查看特定用户的偏好水平#仍以之前联合分析里的121号游客为例,此处用grepl()识别字符向量中的字符串具体位置summary(cja_mc2$mcmc[,grepl(".121",colnames(cja_mc2$mcmc),fixed = T)])

结果输出:

Iterations = 1001:10991Thinning interval = 10 Number of chains = 1 Sample size per chain = 1000 1. Empirical mean and standard deviation for each variable,   plus standard error of the mean:                      Mean     SD Naive SE Time-series SEb.(Intercept).121  0.85838 0.6568  0.02077        0.02077b.speed50.121     -0.10290 0.5284  0.01671        0.01671b.speed60.121     -0.04224 0.5990  0.01894        0.01894b.speed70.121      0.50233 0.6677  0.02111        0.02111b.height300.121   -0.28284 0.5280  0.01670        0.01670b.height400.121   -0.41411 0.7217  0.02282        0.02282b.constWood.121   -0.58305 0.7918  0.02504        0.02504b.themeEagle.121  -0.91875 0.8052  0.02546        0.025462. Quantiles for each variable:                     2.5%     25%      50%      75%  97.5%b.(Intercept).121 -0.4006  0.3978  0.86991  1.31876 2.1281b.speed50.121     -1.1450 -0.4582 -0.08788  0.23811 0.9736b.speed60.121     -1.2381 -0.4410 -0.01685  0.36285 1.0969b.speed70.121     -0.6837  0.0239  0.50241  0.96847 1.8190b.height300.121   -1.2732 -0.6583 -0.27573  0.07488 0.6883b.height400.121   -1.7211 -0.9126 -0.39759  0.05674 1.0329b.constWood.121   -2.0921 -1.1408 -0.57240 -0.04476 0.9689b.themeEagle.121  -2.4475 -1.4443 -0.94002 -0.40926 0.7440

可以看出121号游客对于70码的过山车的偏好十分突出(与基准水平40码相比),而对鹰主题则有明显的厌恶(与基准水平龙主题相比)。
对于分层个体层面效应结果的运用,一方面是进一步聚类的重要基础,另一方面是根据受访者信息,进一步精准化营销的基础。

# 偏好分布查看# 查看400米高度的每个个体与基准水平的偏差程度分布,其中标签含有「b.height400」的列是个体与群体的偏差水平,「beta.height400」则是群体平均水平,二者共同构成该选项的总体受访者偏好分布mc_height400 <- summary(cja_mc2$mcmc[,grepl("b.height400",colnames(cja_mc2$mcmc),fixed = T)] + cja_mc2$mcmc[,"beta.height400"])par(family = "STSong")hist(mc_height400$statistics[,1],xlim = c(-0.5,4), main = "高400米 对 高200米的偏好分布", xlab = "评分相对差值")

这里写图片描述

贝叶斯分层选择模型

数据转换

#基于第九部分联合分析中汽车配置组合满意度数据cja_2 <- read.csv("http://r-marketing.r-forge.r-project.org/data/rintro-chapter13conjoint.csv", colClasses = c(seat ="factor", price = "factor"))#ChoiceModelR要求数据是长数据框的数据,故要进行转换#选项(choice)的确定choice <- rep(0,nrow(cja_2))choice[cja_2[,"alt"]==1] <- cja_2[cja_2[,"choice"]==1,"alt"]#因子变量的确定cja_code <- model.matrix(~seat + eng + cargo + price, data = cja_2)cja_code <- cja_code[,-1]#最终合并成可用于贝叶斯的数据框chc_df <- cbind(cja_2[,1:3],cja_code,choice)head(chc_df,5)#为了体现贝叶斯分类框架对于能够很好联系消费者属性效应值和消费者特征的特性,将效用值作为受访者特征函数。此案例中刚好有carpool这个消费者是否愿意拼车这个字段,此处输出为效应值矩阵carpool <- cja_2$carpool[cja_2$ques == 1 & cja_2$alt == 1] == "yes"carpool <- as.numeric(carpool)chc_demo <- as.matrix(carpool, nrow = length(carpool))

模型拟合

library(ChoiceModelR)#xcoding表明对属性进行何种方式的编码,mcmc则是说明要产生20000个后验样本,且只使用其中的后10000个,options则说明要存储最后10000个样本;最终输出的是包含了4个元素列表的对象chc_mdchc_md <- choicemodelr(data= chc_df, xcoding = rep(1,7),                       demos = chc_demo,                       mcmc = list(R=20000, use =10000),                       options = list(save = T))

总结

文中与频率学派的结果相比,特别是频率学派擅长的线性模型方面,贝叶斯的结果多数都与之相差无几,但贝叶斯的价值,一方面在与能在样本量更少的前提下获得更好的结果,另一方面则是更多更复杂的模型嵌套中,能给出更为直观的答案。

阅读全文
0 0
原创粉丝点击