数据挖掘期中作业参考
来源:互联网 发布:qq添加好友软件 编辑:程序博客网 时间:2024/05/17 04:25
期中作业参考
Table of Contents
- 1. 采用最小描述长度有指导地进行分箱处理
- 2. 进行关联分析,并选择打印出关联规则
- 3. 采用Fisher线性判别分析进行预测
- 4. 采用主成分因子分析等产生新的变量,之后进行判别分析,
- 5. 进行广义线性回归,对测试集进行预测
1 采用最小描述长度有指导地进行分箱处理
bank<-read.table("data/bank/bank.csv",header=TRUE,sep=";")m<-ncol(bank)factors<-lapply(bank,is.factor)iffactor<-as.logical(factors)m<-ncol(bank)bank.factors<-(1:m)[iffactor]bank.int<-(1:m)[!iffactor]bankint<-bank[,c(bank.int,m)] # 产生数值型数据,并保存为bank.intbankfac<-bank[,iffactor]summary(bankfac)
job marital education default housing management :969 divorced: 528 primary : 678 no :4445 no :1962 blue-collar:946 married :2797 secondary:2306 yes: 76 yes:2559 technician :768 single :1196 tertiary :1350 admin. :478 unknown : 187 services :417 retired :230 (Other) :713 loan contact month poutcome y no :3830 cellular :2896 may :1398 failure: 490 no :4000 yes: 691 telephone: 301 jul : 706 other : 197 yes: 521 unknown :1324 aug : 633 success: 129 jun : 531 unknown:3705 nov : 389 apr : 293 (Other): 571
采用mdlp进行有指导的分箱,对连续型数据变量进行离散化处理,进一步转化为 因子型数据,替换原有的连续性数据,替换后的数据框记为banknew
library(discretization)bankintDisc<-mdlp(bankint)$Disc.data #对数据进行有指导的分箱bankintFa<-as.data.frame(lapply(bankintDisc,as.factor)) ##将分箱后的离散型变量转换为字 ##符型,并转化为数据框banknew<-bank ##转换为新的数据p<-ncol(bankintFa)banknew[,bank.int]<-bankintFa[,-p] # 将其中的数值型变量替换为分箱后的因子型变量
2 进行关联分析,并选择打印出关联规则
首先对数据集进行分割为训练集和测试集
##首先对数据集进行分割为训练集和测试集set.seed(1314) id<-sample(1:4521,2712)bank.train<-banknew[id,]bank.test<-bank[-id,]
- 对训练集进行关联分析
library("arules") banktransac <- as(bank.train, "transactions")rules <- apriori(banktransac,parameter = list(support = 0.1, confidence = 0.6)) summary(rules)
parameter specification: confidence minval smax arem aval originalSupport support minlen maxlen target 0.6 0.1 1 none FALSE TRUE 0.1 1 10 rules ext FALSEalgorithmic control: filter tree heap memopt load sort verbose 0.1 TRUE TRUE FALSE TRUE 2 TRUEapriori - find association rules with the apriori algorithmversion 4.21 (2004.05.09) (c) 1996-2004 Christian Borgeltset item appearances ...[0 item(s)] done [0.00s].set transactions ...[64 item(s), 2712 transaction(s)] done [0.00s].sorting and recoding items ... [36 item(s)] done [0.00s].creating transaction tree ... done [0.00s].checking subsets of size 1 2 3 4 5 6 7 8 9 10 done [0.27s].writing ... [271827 rule(s)] done [0.04s].creating S4 object ... done [0.11s].set of 271827 rulesrule length distribution (lhs + rhs):sizes 1 2 3 4 5 6 7 8 9 10 11 341 2977 13135 34796 59692 68851 54041 28443 9540 Min. 1st Qu. Median Mean 3rd Qu. Max. 1.000 6.000 7.000 6.842 8.000 10.000 summary of quality measures: support confidence lift Min. :0.1003 Min. :0.6000 Min. :0.7879 1st Qu.:0.1136 1st Qu.:0.8701 1st Qu.:1.0000 Median :0.1364 Median :0.9904 Median :1.0164 Mean :0.1673 Mean :0.9282 Mean :1.0889 3rd Qu.:0.1881 3rd Qu.:1.0000 3rd Qu.:1.2147 Max. :1.0000 Max. :1.0000 Max. :5.4567 mining info: data ntransactions support confidence banktransac 2712 0.1 0.6
## 选出 "y=no" 为后项的关联规则,rulesyno <- subset(rules, subset = rhs %in% "y=no" & lift > 1.13) summary(rulesyno)## 观察所得的关联规则的大致形式,按照置信度进行排序inspect(head(sort(rulesyno, by = "lift"), n = 10))## 将全部关联规则另存为文本文件write(rulesyno, file = "data/bank/bankyapriorirules.csv", sep = ",")
set of 319 rulesrule length distribution (lhs + rhs):sizes 3 4 5 6 7 8 9 10 3 18 48 76 80 58 28 8 Min. 1st Qu. Median Mean 3rd Qu. Max. 3.000 6.000 7.000 6.687 8.000 10.000 summary of quality measures: support confidence lift Min. :0.1010 Min. :1 Min. :1.132 1st Qu.:0.1034 1st Qu.:1 1st Qu.:1.132 Median :0.1125 Median :1 Median :1.132 Mean :0.1118 Mean :1 Mean :1.132 3rd Qu.:0.1206 3rd Qu.:1 3rd Qu.:1.132 Max. :0.1239 Max. :1 Max. :1.132 mining info: data ntransactions support confidence banktransac 2712 0.1 0.6 lhs rhs support confidence lift1 {duration=1, poutcome=unknown} => {y=no} 0.1231563 1 1.1318862 {duration=1, previous=1} => {y=no} 0.1231563 1 1.1318863 {duration=1, pdays=1} => {y=no} 0.1238938 1 1.1318864 {duration=1, previous=1, poutcome=unknown} => {y=no} 0.1231563 1 1.1318865 {duration=1, pdays=1, poutcome=unknown} => {y=no} 0.1231563 1 1.1318866 {loan=no, duration=1, poutcome=unknown} => {y=no} 0.1039823 1 1.1318867 {age=1, duration=1, poutcome=unknown} => {y=no} 0.1224189 1 1.1318868 {default=no, duration=1, poutcome=unknown} => {y=no} 0.1205752 1 1.1318869 {duration=1, campaign=1, poutcome=unknown} => {y=no} 0.1231563 1 1.13188610 {day=1, duration=1, poutcome=unknown} => {y=no} 0.1231563 1 1.131886
3 采用Fisher线性判别分析进行预测
## 首先采用训练数据集建立判别规则require("MASS")z <- lda(y ~ ., bankint, prior = c(1,1)/2,subset = id)z.predict<- predict(z,bank.test)predicty<-z.predict$classtruey<-bank.test$ytable(predicty, truey)
载入需要的程辑包:MASS trueypredicty no yes no 1389 71 yes 215 134
4 采用主成分因子分析等产生新的变量,之后进行判别分析,
这里以主成分为例进行
bankinttrain<-bankint[id,]p<-ncol(bankint)trainpr<-princomp(bankinttrain[,-p],cor=TRUE,loadings=TRUE)summary(trainpr)
Importance of components: Comp.1 Comp.2 Comp.3 Comp.4 Comp.5Standard deviation 1.2904231 1.0802812 1.0501263 0.9857049 0.9408610Proportion of Variance 0.2378845 0.1667153 0.1575379 0.1388020 0.1264599Cumulative Proportion 0.2378845 0.4045999 0.5621378 0.7009398 0.8273997 Comp.6 Comp.7Standard deviation 0.8991575 0.63223248Proportion of Variance 0.1154978 0.05710256Cumulative Proportion 0.9428974 1.00000000
- 挑选其中前五个主成分构成新的变量
loading5<-as.matrix(trainpr$loadings)[,1:5]components<-as.matrix(bankinttrain[,-p])%*%loading5bankprtrain<-data.frame(components,y=bankint[id,8])#相应的测试集合需要进行类似的处理components<-as.matrix(bankint[-id,-p])%*%loading5bankprtest<-data.frame(components,y=bankint[-id,8])#head(bankprtrain)
- 对训练集进行新的 Fisher线性判别分析,再运用到新的测试集合上
z<-lda(y~.,data=bankprtrain)prpredic<-predict(z,newdata=bankprtest)predicy<-prpredic$classtruey<-bankprtest$ytable(predicy,truey)
trueypredicy no yes no 1565 158 yes 39 47
效果不如直接进行判别分析的好,不再进行下去
5 进行广义线性回归,对测试集进行预测
bankinttest<-bankint[-id,]m<-glm(y ~.,family="binomial",subset=id,data=bankint)step(m) # 逐步回归方法选择最优变量
Start: AIC=1611.09y ~ age + balance + day + duration + campaign + pdays + previous Df Deviance AIC- day 1 1595.5 1609.5- balance 1 1596.6 1610.6- age 1 1596.7 1610.7<none> 1595.1 1611.1- pdays 1 1602.3 1616.3- campaign 1 1602.9 1616.9- previous 1 1604.3 1618.3- duration 1 1905.0 1919.0Step: AIC=1609.46y ~ age + balance + duration + campaign + pdays + previous Df Deviance AIC- balance 1 1596.9 1608.9- age 1 1597.0 1609.0<none> 1595.5 1609.5- pdays 1 1602.5 1614.5- campaign 1 1602.9 1614.9- previous 1 1604.6 1616.6- duration 1 1905.1 1917.1Step: AIC=1608.89y ~ age + duration + campaign + pdays + previous Df Deviance AIC- age 1 1598.8 1608.8<none> 1596.9 1608.9- pdays 1 1603.9 1613.9- campaign 1 1604.4 1614.4- previous 1 1606.2 1616.2- duration 1 1905.4 1915.4Step: AIC=1608.8y ~ duration + campaign + pdays + previous Df Deviance AIC<none> 1598.8 1608.8- pdays 1 1605.6 1613.6- campaign 1 1606.7 1614.7- previous 1 1608.3 1616.3- duration 1 1908.1 1916.1Call: glm(formula = y ~ duration + campaign + pdays + previous, family = "binomial", data = bankint, subset = id)Coefficients:(Intercept) duration campaign pdays previous -3.191878 0.003464 -0.084405 0.001790 0.110634 Degrees of Freedom: 2711 Total (i.e. Null); 2707 ResidualNull Deviance: 1952 Residual Deviance: 1599 AIC: 1609
m<-glm(y ~ age + duration + campaign + pdays + previous,family="binomial",subset=id,data=bankint)summary(m)
Call:glm(formula = y ~ age + duration + campaign + pdays + previous, family = "binomial", data = bankint, subset = id)Deviance Residuals: Min 1Q Median 3Q Max -3.8052 -0.4398 -0.3412 -0.2859 2.5410 Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -3.5381351 0.2880890 -12.281 < 2e-16 ***age 0.0083428 0.0060098 1.388 0.16507 duration 0.0034554 0.0002189 15.785 < 2e-16 ***campaign -0.0829148 0.0328637 -2.523 0.01164 * pdays 0.0018123 0.0006601 2.746 0.00604 ** previous 0.1094893 0.0347442 3.151 0.00163 ** ---Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1(Dispersion parameter for binomial family taken to be 1) Null deviance: 1952.3 on 2711 degrees of freedomResidual deviance: 1596.9 on 2706 degrees of freedomAIC: 1608.9Number of Fisher Scoring iterations: 6
mpredict<-predict(m, newdata=bankinttest,type = "response")predicty<-ifelse(mpredict<0.5,"no","yes") # 对预测值进行重新编码# 列表输出结果truey<-bankint$y[-id]table(predicty,truey)
trueypredicty no yes no 1582 175 yes 22 30
效果也不如Fisher 判别分析,也不如主成分之后判别分析
Validate
- 数据挖掘期中作业参考
- 期中作业
- 数据挖掘参考资源
- C语言期中作业
- java期中作业
- 数据挖掘大作业参考文献
- 数据挖掘 作业1 神经网络
- 数据挖掘 作业2 SVM
- 【可视化】数据仓库与数据挖掘大作业
- JAVA期中作业——弹球游戏
- Flex期中大作业实验单
- 用 WEKA 进行数据挖掘(参考链接整理)
- 第十次作业数据挖掘:关联规则NO.1
- 数据挖掘概念与技术作业(留复习用)
- JAVA期中作业——弹球游戏 +砖块
- 数据挖掘--序列挖掘
- 数据挖掘--文本挖掘
- 数据挖掘
- HDU 4548 -------美素数
- 关闭自动播放功能
- 访问变量-作用域链 访问属性-原型链
- 3D-HEVC/HTM相关资料下载
- struts2 文件上传
- 数据挖掘期中作业参考
- 编程面试的10大算法概念汇总
- android:sharedUserId问题
- 最小生成树
- easyui-------表格
- unity3d 重要函数方法
- C/C++在ndk开发中的区别
- 类和接口设计的几个注意点
- SQL Server 高性能写入的一些总结