数据挖掘期中作业参考

来源：互联网发布：qq添加好友软件编辑：程序博客网时间：2024/05/17 04:25

期中作业参考

1. 采用最小描述长度有指导地进行分箱处理
2. 进行关联分析，并选择打印出关联规则
3. 采用Fisher线性判别分析进行预测
4. 采用主成分因子分析等产生新的变量，之后进行判别分析，
5. 进行广义线性回归，对测试集进行预测

1 采用最小描述长度有指导地进行分箱处理

bank<-read.table("data/bank/bank.csv",header=TRUE,sep=";")m<-ncol(bank)factors<-lapply(bank,is.factor)iffactor<-as.logical(factors)m<-ncol(bank)bank.factors<-(1:m)[iffactor]bank.int<-(1:m)[!iffactor]bankint<-bank[,c(bank.int,m)] # 产生数值型数据，并保存为bank.intbankfac<-bank[,iffactor]summary(bankfac)

         job          marital         education    default    housing   management :969   divorced: 528   primary  : 678   no :4445   no :1962  blue-collar:946   married :2797   secondary:2306   yes:  76   yes:2559  technician :768   single  :1196   tertiary :1350                        admin.     :478                   unknown  : 187                        services   :417                                                         retired    :230                                                         (Other)    :713                                                          loan           contact         month         poutcome      y       no :3830   cellular :2896   may    :1398   failure: 490   no :4000  yes: 691   telephone: 301   jul    : 706   other  : 197   yes: 521             unknown  :1324   aug    : 633   success: 129                                         jun    : 531   unknown:3705                                         nov    : 389                                                        apr    : 293                                                        (Other): 571

采用mdlp进行有指导的分箱，对连续型数据变量进行离散化处理，进一步转化为因子型数据，替换原有的连续性数据，替换后的数据框记为banknew

library(discretization)bankintDisc<-mdlp(bankint)$Disc.data #对数据进行有指导的分箱bankintFa<-as.data.frame(lapply(bankintDisc,as.factor)) ##将分箱后的离散型变量转换为字 ##符型，并转化为数据框banknew<-bank   ##转换为新的数据p<-ncol(bankintFa)banknew[,bank.int]<-bankintFa[,-p] # 将其中的数值型变量替换为分箱后的因子型变量

2 进行关联分析，并选择打印出关联规则

首先对数据集进行分割为训练集和测试集

##首先对数据集进行分割为训练集和测试集set.seed(1314) id<-sample(1:4521,2712)bank.train<-banknew[id,]bank.test<-bank[-id,]

对训练集进行关联分析

library("arules") banktransac <- as(bank.train, "transactions")rules <- apriori(banktransac,parameter = list(support = 0.1, confidence = 0.6)) summary(rules)

parameter specification: confidence minval smax arem  aval originalSupport support minlen maxlen target        0.6    0.1    1 none FALSE            TRUE     0.1      1     10  rules   ext FALSEalgorithmic control: filter tree heap memopt load sort verbose    0.1 TRUE TRUE  FALSE TRUE    2    TRUEapriori - find association rules with the apriori algorithmversion 4.21 (2004.05.09)        (c) 1996-2004   Christian Borgeltset item appearances ...[0 item(s)] done [0.00s].set transactions ...[64 item(s), 2712 transaction(s)] done [0.00s].sorting and recoding items ... [36 item(s)] done [0.00s].creating transaction tree ... done [0.00s].checking subsets of size 1 2 3 4 5 6 7 8 9 10 done [0.27s].writing ... [271827 rule(s)] done [0.04s].creating S4 object  ... done [0.11s].set of 271827 rulesrule length distribution (lhs + rhs):sizes    1     2     3     4     5     6     7     8     9    10    11   341  2977 13135 34796 59692 68851 54041 28443  9540    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.   1.000   6.000   7.000   6.842   8.000  10.000 summary of quality measures:    support         confidence          lift        Min.   :0.1003   Min.   :0.6000   Min.   :0.7879   1st Qu.:0.1136   1st Qu.:0.8701   1st Qu.:1.0000   Median :0.1364   Median :0.9904   Median :1.0164   Mean   :0.1673   Mean   :0.9282   Mean   :1.0889   3rd Qu.:0.1881   3rd Qu.:1.0000   3rd Qu.:1.2147   Max.   :1.0000   Max.   :1.0000   Max.   :5.4567  mining info:        data ntransactions support confidence banktransac          2712     0.1        0.6

## 选出 "y=no" 为后项的关联规则，rulesyno <- subset(rules, subset = rhs %in% "y=no" & lift > 1.13) summary(rulesyno)## 观察所得的关联规则的大致形式，按照置信度进行排序inspect(head(sort(rulesyno, by = "lift"), n = 10))## 将全部关联规则另存为文本文件write(rulesyno, file = "data/bank/bankyapriorirules.csv", sep = ",")

set of 319 rulesrule length distribution (lhs + rhs):sizes 3  4  5  6  7  8  9 10  3 18 48 76 80 58 28  8    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.   3.000   6.000   7.000   6.687   8.000  10.000 summary of quality measures:    support         confidence      lift       Min.   :0.1010   Min.   :1    Min.   :1.132   1st Qu.:0.1034   1st Qu.:1    1st Qu.:1.132   Median :0.1125   Median :1    Median :1.132   Mean   :0.1118   Mean   :1    Mean   :1.132   3rd Qu.:0.1206   3rd Qu.:1    3rd Qu.:1.132   Max.   :0.1239   Max.   :1    Max.   :1.132  mining info:        data ntransactions support confidence banktransac          2712     0.1        0.6   lhs                   rhs      support confidence     lift1  {duration=1,                                                  poutcome=unknown} => {y=no} 0.1231563          1 1.1318862  {duration=1,                                                  previous=1}       => {y=no} 0.1231563          1 1.1318863  {duration=1,                                                  pdays=1}          => {y=no} 0.1238938          1 1.1318864  {duration=1,                                                  previous=1,                                                  poutcome=unknown} => {y=no} 0.1231563          1 1.1318865  {duration=1,                                                  pdays=1,                                                     poutcome=unknown} => {y=no} 0.1231563          1 1.1318866  {loan=no,                                                     duration=1,                                                  poutcome=unknown} => {y=no} 0.1039823          1 1.1318867  {age=1,                                                       duration=1,                                                  poutcome=unknown} => {y=no} 0.1224189          1 1.1318868  {default=no,                                                  duration=1,                                                  poutcome=unknown} => {y=no} 0.1205752          1 1.1318869  {duration=1,                                                  campaign=1,                                                  poutcome=unknown} => {y=no} 0.1231563          1 1.13188610 {day=1,                                                       duration=1,                                                  poutcome=unknown} => {y=no} 0.1231563          1 1.131886

3 采用Fisher线性判别分析进行预测

## 首先采用训练数据集建立判别规则require("MASS")z <- lda(y ~ ., bankint, prior = c(1,1)/2,subset = id)z.predict<- predict(z,bank.test)predicty<-z.predict$classtruey<-bank.test$ytable(predicty, truey)

载入需要的程辑包：MASS        trueypredicty   no  yes     no  1389   71     yes  215  134

4 采用主成分因子分析等产生新的变量，之后进行判别分析，

这里以主成分为例进行

bankinttrain<-bankint[id,]p<-ncol(bankint)trainpr<-princomp(bankinttrain[,-p],cor=TRUE,loadings=TRUE)summary(trainpr)

Importance of components:                          Comp.1    Comp.2    Comp.3    Comp.4    Comp.5Standard deviation     1.2904231 1.0802812 1.0501263 0.9857049 0.9408610Proportion of Variance 0.2378845 0.1667153 0.1575379 0.1388020 0.1264599Cumulative Proportion  0.2378845 0.4045999 0.5621378 0.7009398 0.8273997                          Comp.6     Comp.7Standard deviation     0.8991575 0.63223248Proportion of Variance 0.1154978 0.05710256Cumulative Proportion  0.9428974 1.00000000

挑选其中前五个主成分构成新的变量

loading5<-as.matrix(trainpr$loadings)[,1:5]components<-as.matrix(bankinttrain[,-p])%*%loading5bankprtrain<-data.frame(components,y=bankint[id,8])#相应的测试集合需要进行类似的处理components<-as.matrix(bankint[-id,-p])%*%loading5bankprtest<-data.frame(components,y=bankint[-id,8])#head(bankprtrain)

对训练集进行新的 Fisher线性判别分析，再运用到新的测试集合上

z<-lda(y~.,data=bankprtrain)prpredic<-predict(z,newdata=bankprtest)predicy<-prpredic$classtruey<-bankprtest$ytable(predicy,truey)

       trueypredicy   no  yes    no  1565  158    yes   39   47

效果不如直接进行判别分析的好，不再进行下去

5 进行广义线性回归，对测试集进行预测

bankinttest<-bankint[-id,]m<-glm(y ~.,family="binomial",subset=id,data=bankint)step(m) # 逐步回归方法选择最优变量

Start:  AIC=1611.09y ~ age + balance + day + duration + campaign + pdays + previous           Df Deviance    AIC- day       1   1595.5 1609.5- balance   1   1596.6 1610.6- age       1   1596.7 1610.7<none>          1595.1 1611.1- pdays     1   1602.3 1616.3- campaign  1   1602.9 1616.9- previous  1   1604.3 1618.3- duration  1   1905.0 1919.0Step:  AIC=1609.46y ~ age + balance + duration + campaign + pdays + previous           Df Deviance    AIC- balance   1   1596.9 1608.9- age       1   1597.0 1609.0<none>          1595.5 1609.5- pdays     1   1602.5 1614.5- campaign  1   1602.9 1614.9- previous  1   1604.6 1616.6- duration  1   1905.1 1917.1Step:  AIC=1608.89y ~ age + duration + campaign + pdays + previous           Df Deviance    AIC- age       1   1598.8 1608.8<none>          1596.9 1608.9- pdays     1   1603.9 1613.9- campaign  1   1604.4 1614.4- previous  1   1606.2 1616.2- duration  1   1905.4 1915.4Step:  AIC=1608.8y ~ duration + campaign + pdays + previous           Df Deviance    AIC<none>          1598.8 1608.8- pdays     1   1605.6 1613.6- campaign  1   1606.7 1614.7- previous  1   1608.3 1616.3- duration  1   1908.1 1916.1Call:  glm(formula = y ~ duration + campaign + pdays + previous, family = "binomial",     data = bankint, subset = id)Coefficients:(Intercept)     duration     campaign        pdays     previous    -3.191878     0.003464    -0.084405     0.001790     0.110634  Degrees of Freedom: 2711 Total (i.e. Null);  2707 ResidualNull Deviance:    1952 Residual Deviance: 1599 AIC: 1609

m<-glm(y ~ age + duration + campaign + pdays + previous,family="binomial",subset=id,data=bankint)summary(m)

Call:glm(formula = y ~ age + duration + campaign + pdays + previous,     family = "binomial", data = bankint, subset = id)Deviance Residuals:     Min       1Q   Median       3Q      Max  -3.8052  -0.4398  -0.3412  -0.2859   2.5410  Coefficients:              Estimate Std. Error z value Pr(>|z|)    (Intercept) -3.5381351  0.2880890 -12.281  < 2e-16 ***age          0.0083428  0.0060098   1.388  0.16507    duration     0.0034554  0.0002189  15.785  < 2e-16 ***campaign    -0.0829148  0.0328637  -2.523  0.01164 *  pdays        0.0018123  0.0006601   2.746  0.00604 ** previous     0.1094893  0.0347442   3.151  0.00163 ** ---Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1(Dispersion parameter for binomial family taken to be 1)    Null deviance: 1952.3  on 2711  degrees of freedomResidual deviance: 1596.9  on 2706  degrees of freedomAIC: 1608.9Number of Fisher Scoring iterations: 6

mpredict<-predict(m, newdata=bankinttest,type = "response")predicty<-ifelse(mpredict<0.5,"no","yes") # 对预测值进行重新编码# 列表输出结果truey<-bankint$y[-id]table(predicty,truey)

        trueypredicty   no  yes     no  1582  175     yes   22   30

效果也不如Fisher 判别分析，也不如主成分之后判别分析

Validate