R: Logistic Regression + Principal Component Analysis for the Employee Attrition Prediction Training Competition

Source: Internet  Editor: 程序博客网  Time: 2024/06/01 09:05

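Task: Employee Attrition Prediction Training Competition
URL: http://www.pkbigdata.com/common/cmpt/员工离职预测训练赛_竞赛信息.html
Requirements:
The data covers factors that influence employee attrition (salary, business travel, work-environment satisfaction, job involvement, overtime, promotion, percentage salary increase, and so on), along with a record of whether each employee has already left.
The data is split into training and test sets, stored in pfm_train.csv and pfm_test.csv respectively.
The training data contains 1100 records with 31 fields.
The test data contains 350 records with 30 fields. Unlike the training data, it does not include whether the employee has left; participants must use a model built on the training data, together with the test data provided, to predict attrition for each test record.
Data: https://pan.baidu.com/s/1qXZOS8W  Password: bxgm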
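Because the test file has one field fewer than the training file, positional column indices differ between the two. The sketch below, using toy data frames with made-up columns (`ConstCol` and the values are purely illustrative, not taken from pfm_train.csv), shows one way to find the label column and drop unwanted fields by name so that both feature sets stay aligned.

```r
# Toy stand-ins for the training and test files; columns are illustrative only.
train <- data.frame(Age = c(30, 41, 25),
                    Attrition = c(0, 1, 0),
                    OverTime = c("No", "Yes", "No"),
                    ConstCol = c(1, 1, 1))
test <- data.frame(Age = c(35, 29),
                   OverTime = c("Yes", "No"),
                   ConstCol = c(1, 1))

# The label is the only column present in train but absent from test.
label_col <- setdiff(names(train), names(test))

# Hypothetical list of uninformative fields to exclude from both sets.
drop_cols <- c("ConstCol")

# Dropping by name avoids the off-by-one shift that positional indices suffer.
train_x <- train[, !(names(train) %in% c(label_col, drop_cols)), drop = FALSE]
test_x  <- test[, !(names(test) %in% drop_cols), drop = FALSE]

# Both feature sets now have the same columns in the same order.
stopifnot(identical(names(train_x), names(test_x)))
```

With the real files, the same idea would replace index vectors that have to be shifted by one between the two data frames.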

Code:

data <- read.csv("E:/.../员工离职预测训练赛/数据/pfm_train.csv", sep = ",", header = TRUE)
colnames(data)[1] <- "Age"     # the first column name comes through garbled

####################  Logistic regression  ####################
str(data)
# Initial fit on all fields except columns 8, 18 and 23; AIC: 730.18
fit.full <- glm(Attrition ~ ., data = data[, -c(8, 18, 23)], family = binomial())
summary(fit.full)
step(fit.full)
# Model selected by stepwise regression; AIC: 721.3
fit.reduce <- glm(formula = Attrition ~ Age + BusinessTravel + Department +
    DistanceFromHome + EducationField + EnvironmentSatisfaction +
    Gender + JobInvolvement + JobLevel + JobSatisfaction + MaritalStatus +
    NumCompaniesWorked + OverTime + RelationshipSatisfaction +
    TotalWorkingYears + TrainingTimesLastYear + WorkLifeBalance +
    YearsAtCompany + YearsInCurrentRole + YearsSinceLastPromotion +
    YearsWithCurrManager, family = binomial(), data = data[, -c(8, 18, 23)])
summary(fit.reduce)
# Predicted probabilities on the training set, cut at 0.5
test <- predict(fit.full, newdata = data, type = "response")
test1 <- predict(fit.reduce, newdata = data, type = "response")
test[test < 0.5] <- 0
test[test >= 0.5] <- 1
test1[test1 < 0.5] <- 0
test1[test1 >= 0.5] <- 1
result <- cbind(test, data$Attrition)
table(test, data$Attrition)      # full model
table(test1, data$Attrition)     # model after step()
# Rows are predictions, columns are actual Attrition values
# Full model        After step
# test   0   1      test   0   1
#    0 902  91         0 898  98
#    1  20  87         1  24  80
# On the training set the full model fits slightly better, which may hint at overfitting

####################  Logistic regression + principal components  ####################
data[, 2] <- as.factor(data[, 2])     # column 2 is Attrition
# First standardize the numeric factors so that they are all on the same scale,
# then run PCA on the standardized data to remove high correlation between factors
library(caret)
library(ipred)
# Recombine the features through PCA
p_2009 <- preProcess(data[, -c(2, 8, 18, 23)], method = c("scale", "center", "pca"))
src1_2009_p <- cbind(Attrition = data[, 2], predict(p_2009, data[, -c(2, 8, 18, 23)]))
# AIC: 728.81
fit.full <- glm(Attrition ~ ., data = src1_2009_p, family = binomial())
summary(fit.full)
step(fit.full)
# Model selected by stepwise regression; AIC: 716.06
fit.reduce <- glm(formula = Attrition ~ BusinessTravel + EducationField + Gender +
    JobRole + MaritalStatus + OverTime + PC1 + PC4 + PC7 + PC8 +
    PC9 + PC13 + PC14 + PC15, family = binomial(), data = src1_2009_p)
summary(fit.reduce)
test <- predict(fit.full, newdata = src1_2009_p, type = "response")
test1 <- predict(fit.reduce, newdata = src1_2009_p, type = "response")
test[test < 0.5] <- 0
test[test >= 0.5] <- 1
test1[test1 < 0.5] <- 0
test1[test1 >= 0.5] <- 1
result <- cbind(test, data$Attrition)
table(test, data$Attrition)      # full model
table(test1, data$Attrition)     # model after step()
# Full model        After step        Actual Attrition counts
# test   0   1      test   0   1        0   1
#    0 896  93         0 896  96      922 178
#    1  26  85         1  26  82

####################  Prediction  ####################
data1 <- read.csv("E:/.../员工离职预测训练赛/数据/pfm_test.csv", sep = ",", header = TRUE)
colnames(data1)[1] <- "Age"     # the first column name comes through garbled
# The test file has no Attrition column, so the dropped column indices shift down by one
pre_data1 <- predict(p_2009, data1[, -c(7, 17, 22)])
# predict() on a glm returns the link scale (log-odds) by default, so request
# probabilities explicitly before applying the 0.5 cutoff
result <- predict(fit.reduce, newdata = pre_data1, type = "response")
result.full <- predict(fit.full, newdata = pre_data1, type = "response")   # full model, for comparison
result1 <- result               # the reduced model's predictions are the ones written out
result1[result1 >= 0.5] <- 1
result1[result1 < 0.5] <- 0
table(result1)
out.file <- "E:/PACT-上海/私の稿/比赛/员工离职预测训练赛/out_log.csv"
write.table(result1, out.file, col.names = TRUE, row.names = FALSE, quote = FALSE, sep = ",")
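The confusion tables printed in the code comments can be turned into summary rates. The sketch below re-enters the four training-set tables by hand (rows are predictions, columns are actual values) and computes accuracy and the share of actual leavers that were caught. The helper names `make_cm`, `accuracy` and `recall_leavers` are introduced here for illustration only.

```r
# Build a 2x2 table: rows = predicted (0/1), columns = actual (0/1).
make_cm <- function(p0a0, p0a1, p1a0, p1a1) {
  matrix(c(p0a0, p1a0, p0a1, p1a1), nrow = 2,
         dimnames = list(predicted = c("0", "1"), actual = c("0", "1")))
}

# Counts copied from the tables in the code comments above.
cms <- list(
  glm_full = make_cm(902, 91, 20, 87),
  glm_step = make_cm(898, 98, 24, 80),
  pca_full = make_cm(896, 93, 26, 85),
  pca_step = make_cm(896, 96, 26, 82)
)

# Accuracy: share of records on the diagonal.
accuracy <- function(cm) sum(diag(cm)) / sum(cm)

# Recall for class 1: share of actual leavers predicted as leavers.
recall_leavers <- function(cm) cm["1", "1"] / sum(cm[, "1"])

summary_table <- data.frame(
  accuracy = sapply(cms, accuracy),
  recall_leavers = sapply(cms, recall_leavers)
)
print(round(summary_table, 4))
```

All four models land within about one percentage point of each other in training accuracy, while fewer than half of the 178 actual leavers are flagged at the 0.5 cutoff in every case.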
The actual competition submission scored 0.89***6, a decent result, ranking 5th.
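The tables in the code are all computed on the training data itself, so they cannot show how much a model overfits. One common check is to hold out part of the training data before fitting. The following is a minimal sketch on synthetic data; the variables `x1`, `x2`, `y`, the 80/20 split and the helper `accuracy_at` are assumptions for illustration, not part of the original solution.

```r
set.seed(42)

# Synthetic binary-outcome data standing in for the real training set.
n <- 500
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$y <- rbinom(n, 1, plogis(-1 + 1.5 * toy$x1 - 0.8 * toy$x2))

# Hold out 20% of the rows for validation.
holdout_idx <- sample(n, size = round(0.2 * n))
train_part <- toy[-holdout_idx, ]
valid_part <- toy[holdout_idx, ]

# Fit the logistic regression on the remaining 80% only.
fit <- glm(y ~ x1 + x2, data = train_part, family = binomial())

# Accuracy at a given probability cutoff on any data frame with a y column.
accuracy_at <- function(model, df, cutoff = 0.5) {
  prob <- predict(model, newdata = df, type = "response")
  mean(as.integer(prob >= cutoff) == df$y)
}

train_acc <- accuracy_at(fit, train_part)
valid_acc <- accuracy_at(fit, valid_part)
cat("training accuracy:  ", round(train_acc, 3), "\n")
cat("validation accuracy:", round(valid_acc, 3), "\n")
```

A validation accuracy well below the training accuracy would support the overfitting concern raised in the code comments; applied to the real data, the same split would have to happen before `preProcess` so that the PCA is learned from the training part only.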
