Applied_Predictive_Model_11

来源：互联网发布：spark时间序列算法编辑：程序博客网时间：2024/06/11 05:24

一直在用Python/R/Shell做后台开发，但缺少交流。第一次尝试把自己学习中东西贴到博客中，希望得到高手们的指点让自己也能得到一些阳光普照。

library(AppliedPredictiveModeling)data(concrete)library(caret)library(plyr)

head(concrete,2)

CementBlastFurnaceSlagFlyAshWaterSuperplasticizerCoarseAggregateFineAggregateAgeCompressiveStrength 540 0 0 162 2.5 1040 676 28 79.99 540 0 0 162 2.5 1055 676 28 61.89

featurePlot(concrete[, -9], concrete$CompressiveStrength,            between = list(x = 1, y = 1),            type = c("g", "p", "smooth"))

png

averaged <- ddply(mixtures,                  .(Cement, BlastFurnaceSlag, FlyAsh, Water,                     Superplasticizer, CoarseAggregate,                     FineAggregate, Age),                  function(x) c(CompressiveStrength =                     mean(x$CompressiveStrength)))

head(mixtures,2)

CementBlastFurnaceSlagFlyAshWaterSuperplasticizerCoarseAggregateFineAggregateAgeCompressiveStrength 0.2230944 0 0 0.06692832 0.0010328440.4296633 0.2792811 28 79.99 0.2217204 0 0 0.06651612 0.0010264830.4331759 0.2775611 28 61.89

set.seed(975)inTrain <- createDataPartition(averaged$CompressiveStrength, p = 3/4)[[1]]training <- averaged[ inTrain,]testing  <- averaged[-inTrain,]

### Repeated Training/Test Splits

ctrl <- trainControl(method = "repeatedcv", repeats = 5, number = 10)

### Create a model formula that can be used repeatedly

modForm <- paste("CompressiveStrength ~ (.)^2 + I(Cement^2) + I(BlastFurnaceSlag^2) +",                 "I(FlyAsh^2)  + I(Water^2) + I(Superplasticizer^2)  +",                 "I(CoarseAggregate^2) +  I(FineAggregate^2) + I(Age^2)")modForm <- as.formula(modForm)

### lm & pls: Partial Least Squares

set.seed(669)lmFit <- train(modForm, data = training,               method = "lm",               trControl = ctrl)set.seed(669)plsFit <- train(modForm, data = training,                method = "pls",                preProc = c("center", "scale"),                tuneLength = 15,                trControl = ctrl)

Attaching package: 'pls'The following object is masked from 'package:caret':    R2The following object is masked from 'package:stats':    loadings

### ridge regression

lassoGrid <- expand.grid(lambda = c(0, .001, .01, .1),                          fraction = seq(0.05, 1, length = 20))set.seed(669)lassoFit <- train(modForm, data = training,                  method = "enet",                  preProc = c("center", "scale"),                  tuneGrid = lassoGrid,                  trControl = ctrl)

Loading required package: elasticnet

### Multivariate Adaptive Regression Splines

set.seed(669)earthFit <- train(CompressiveStrength ~ ., data = training,                  method = "earth",                  tuneGrid = expand.grid(degree = 1,                                          nprune = 2:25),                  trControl = ctrl)

Loading required package: earthLoading required package: plotmoLoading required package: plotrixLoading required package: TeachingDemosWarning message:"package 'TeachingDemos' was built under R version 3.2.5"

### SVM

set.seed(669)svmRFit <- train(CompressiveStrength ~ ., data = training,                 method = "svmRadial",                 tuneLength = 15,                 preProc = c("center", "scale"),                 trControl = ctrl)

Loading required package: kernlabAttaching package: 'kernlab'The following object is masked from 'package:ggplot2':    alpha

### neural network models### don't run it due that it'll take more times!

nnetGrid <- expand.grid(decay = c(0.001, .01, .1),                         size = seq(1, 27, by = 2),                         bag = FALSE)set.seed(669)nnetFit <- train(CompressiveStrength ~ .,                 data = training,                 method = "avNNet",                 tuneGrid = nnetGrid,                 preProc = c("center", "scale"),                 linout = TRUE,                 trace = FALSE,                 maxit = 50,                 allowParallel = FALSE,                 trControl = ctrl)

Loading required package: nnet

set.seed(669)rpartFit <- train(CompressiveStrength ~ .,                  data = training,                  method = "rpart",                  tuneLength = 30,                  trControl = ctrl)

Loading required package: rpart

set.seed(669)treebagFit <- train(CompressiveStrength ~ .,                    data = training,                    method = "treebag",                    trControl = ctrl)

Loading required package: ipredWarning message:"package 'ipred' was built under R version 3.2.5"Loading required package: e1071

set.seed(669)ctreeFit <- train(CompressiveStrength ~ .,                  data = training,                  method = "ctree",                  tuneLength = 10,                  trControl = ctrl)

Loading required package: partyWarning message:"package 'party' was built under R version 3.2.5"Loading required package: gridLoading required package: mvtnormWarning message:"package 'mvtnorm' was built under R version 3.3.0"Loading required package: modeltoolsWarning message:"package 'modeltools' was built under R version 3.2.5"Loading required package: stats4Attaching package: 'modeltools'The following object is masked from 'package:kernlab':    priorThe following object is masked from 'package:plyr':    emptyLoading required package: strucchangeWarning message:"package 'strucchange' was built under R version 3.2.5"Loading required package: zooAttaching package: 'zoo'The following objects are masked from 'package:base':    as.Date, as.Date.numericLoading required package: sandwichWarning message:"package 'sandwich' was built under R version 3.2.5"

set.seed(669)rfFit <- train(CompressiveStrength ~ .,               data = training,               method = "rf",               tuneLength = 10,               ntrees = 1000,               importance = TRUE,               trControl = ctrl)

Loading required package: randomForestrandomForest 4.6-12Type rfNews() to see new features/changes/bug fixes.Attaching package: 'randomForest'The following object is masked from 'package:ggplot2':    marginnote: only 7 unique complexity parameters in default grid. Truncating the grid to 7 .

cbGrid <- expand.grid(committees = c(1, 5, 10, 50, 75, 100),                       neighbors = c(0, 1, 3, 5, 7, 9))set.seed(669)cbFit <- train(CompressiveStrength ~ .,               data = training,               method = "cubist",               tuneGrid = cbGrid,               trControl = ctrl)

Loading required package: CubistWarning message:"package 'Cubist' was built under R version 3.2.5"

rs <- resamples(list("Linear Reg" = lmFit, "                     PLS" = plsFit,                     "Elastic Net" = lassoFit,                      MARS = earthFit,                     SVM = svmRFit,                      CART = rpartFit,                      "Cond Inf Tree" = ctreeFit,                     "Bagged Tree" = treebagFit,                      "Random Forest" = rfFit,                     Cubist = cbFit))

parallelplot(rs)

png

parallelplot(rs, metric = "Rsquared")

png

library(proxy)

Attaching package: 'proxy'The following objects are masked from 'package:stats':    as.dist, distThe following object is masked from 'package:base':    as.matrix

### Create a function to maximize compressive strength* while keeping### the predictor values as mixtures. Water (in x[7]) is used as the ### 'slack variable'. ### * We are actually minimizing the negative compressive strengthmodelPrediction <- function(x, mod, limit = 2500){  if(x[1] < 0 | x[1] > 1) return(10^38)  if(x[2] < 0 | x[2] > 1) return(10^38)  if(x[3] < 0 | x[3] > 1) return(10^38)  if(x[4] < 0 | x[4] > 1) return(10^38)  if(x[5] < 0 | x[5] > 1) return(10^38)  if(x[6] < 0 | x[6] > 1) return(10^38)  x <- c(x, 1 - sum(x))  if(x[7] < 0.05) return(10^38)  tmp <- as.data.frame(t(x))  names(tmp) <- c('Cement','BlastFurnaceSlag','FlyAsh',                  'Superplasticizer','CoarseAggregate',                  'FineAggregate', 'Water')  tmp$Age <- 28  -predict(mod, tmp)}

### Get mixtures at 28 days subTrain <- subset(training, Age == 28)

### Center and scale the data to use dissimilarity samplingpp1 <- preProcess(subTrain[, -(8:9)], c("center", "scale"))scaledTrain <- predict(pp1, subTrain[, 1:7])

scaledTrain

CementBlastFurnaceSlagFlyAshWaterSuperplasticizerCoarseAggregateFineAggregate 10 2.3329285 -0.95398889-0.9328639 2.4629888 -1.26706876 0.22516196-2.06983269 12-0.6596216 0.55740691-0.9328639 0.3720387 -1.26706876 0.29553760 0.74828894 15 0.5633247 -0.04052946-0.9328639 2.5559748 -1.26706876 0.33286556-0.89886277 37-0.1630025 1.87623956-0.9328639 2.4629888 -1.26706876 0.22516196-2.06983269 60 0.8353699 0.74414818-0.9328639 2.4629888 -1.26706876 0.22516196-2.06983269 69-0.6450221 1.32965969-0.9328639 2.5559748 -1.26706876 0.33286556-0.89886277 70 1.7946573 -0.95398889-0.9328639 -2.3011001 -1.26706876 0.85191964-0.56573558 107 0.9520378 1.11740399-0.9328639 -0.9394919 0.51349746-0.86593570-0.62050421 108 0.3758607 1.92599765-0.9328639 -0.6791642 0.25400975 0.66807315-2.50078991 109 1.4203994 0.20298191-0.9328639 -1.7021478 1.62473280-1.84142877 1.02689518 110 1.4195638 0.20274493-0.9328639 -1.7944769 1.99211254-0.81376125-0.06619910 111 0.9068727 0.05740009-0.9328639 -2.9188388 2.79574248-1.93909627 2.28569937 112 1.9438879 0.35138785-0.9328639 -0.4313610 0.30765668-1.74146601-0.25675506 113 1.8274724 0.31729412-0.9328639 -2.4026084 4.35719798-1.87688558 0.38006401 114 1.0817458 0.10588898-0.9328639 -1.4918327 0.85968581-1.81133744 1.56664654 115 2.4269585 -0.95398889-0.9328639 -2.2256367 3.66195312-1.86964293 1.08169898 116 0.4094699 1.36523136-0.9328639 -1.5882589 1.24605864-1.81263062 0.97095700 117 1.1761870 0.07097361-0.9328639 -2.0020890 0.71975001-0.74370465 0.50684538 119 0.4963023 2.17178411-0.9328639 -0.2657583 0.56613469-0.54952402-1.81109078 122 1.6153761 1.00018695-0.9328639 -0.1781449 0.70630233-1.28251610-1.18401900 124 0.6229710 1.12756884-0.9328639 -0.6847885 0.41772083-0.56658695-0.57296972 162-0.4423172 -0.95398889 0.5054081 0.1637024 -0.44812784 0.02605801 1.24404267 167-0.3084945 -0.95398889 0.4863714 0.6277940 -0.42267186-0.12551885 1.09854129 172-0.7430965 -0.95398889 0.5429020 -0.9973667 0.08046046 0.36375913 1.57097911 177-0.7819832 -0.95398889 0.9090037 -1.1510096 0.13471493 1.41938024 0.20722207 182-0.5506266 -0.95398889 0.8709243 -0.2628853 -0.23251673 1.14370331-0.02029362 192-0.7843449 -0.95398889 0.9076128 -1.1538457 0.51361460 1.40698196 0.19749768 212-0.5513467 0.14257941-0.5724490 -0.2676561 -0.06109769 1.13814288-0.02451948 217-0.3544599 -0.95398889 0.8367847 0.4870601 -0.14962698 0.89012443-0.21887286 222-0.2577430 -0.95398889 0.4833069 0.1461733 0.02192133-0.04984295 1.06376804 …………………… 958-1.03416302 0.7757177 0.8405317 -0.25569006 1.53878101 0.1886832 -0.52920597 959-1.09337034 1.8228929 -0.9328639 0.90911463-0.08690866 0.8431535 -1.01521847 961 0.47977350-0.9539889 0.5977088 0.75779299 0.57559550 0.3338215 -0.68057278 962-1.10611857-0.9539889 1.8911770 -0.61423147 0.91251101 0.7434844 0.25206520 963-1.20772058 0.9780357 1.0465997 -0.34333881 0.76837711-0.7573945 0.36798308 964 0.39680959 0.6423870 0.7040606 0.97095013-0.14290982-0.7740819 -1.41348148 965 0.57880318 0.9222704 -0.9328639 0.31244131-0.42067402-1.0891511 0.05224448 966 1.01738347 1.1759267 -0.9328639 0.32084641-0.05658178-1.7123400 -0.21601860 968-0.07314017 0.1753625 0.2243178 0.62003195 0.28740494-1.3414356 0.90471118 969-0.95322680-0.9539889 1.3929832 0.14184625 0.94676347 1.2607123 -0.28326537 970-1.08090715 1.8462336 -0.9328639 -0.09897249 0.71433397 0.9543590 -0.93419075 971-0.98922043 1.2679192 1.3442210 1.20719545 0.88344437-1.2550518 -0.49644276 972 0.38800925-0.9539889 0.6992291 0.24729258-0.12861217-0.8110073 0.77826829 973 0.64063572-0.9539889 1.0265043 1.47190990-0.18681473-0.8789036 -0.15792941 974 0.26867818 0.4383784 0.4950507 0.27630137 0.44471931-0.5149481 -0.91304524 975 0.59151186 0.9296633 -0.9328639 -0.51195505 0.35523735-1.0456901 0.17071766 977-1.30648807 1.4451467 1.5272809 -0.07888306-0.23812218-0.9419310 -0.30587479 979-0.95201063-0.9539889 2.1713142 0.14984145 1.12457015-1.0100769 1.40811562 980 0.51688755-0.9539889 0.7677287 -0.59119326 0.59934866-0.3248713 0.22091938 981 0.62430106-0.9539889 1.0169151 0.06258255 0.87798510-0.9338088 0.27761185 982-1.17009444 1.0267134 1.0980946 3.03405763-0.13840739-0.4623723 -1.05516994 983 0.23904230 0.4247238 -0.9328639 -0.30627830 0.01727672-0.6337927 0.78217528 985 0.05096172 0.3355627 0.3896545 0.68844758-0.16333797-1.3919022 0.45881845 986-1.05091334 1.8986292 -0.9328639 -0.70367053 0.97453510 1.2026513 -1.13944546 987-0.97353781 2.0358973 -0.9328639 0.04323462 1.08733983-1.1500287 0.86297824 988 0.14443572 0.3801135 0.4338853 -0.13030515 0.38115660-1.0150275 0.04249164 989 0.65793099-0.9539889 0.8430374 0.75499036 0.68781012-1.5331936 0.83435698 990-1.12495282 0.6788567 0.7412387 0.63244536-0.11651321-0.5118373 0.40341638 991-1.04638908 1.1917394 -0.9328639 -0.31899857 0.82417313 0.5263082 0.31968011 992 0.02759098 0.2194803 0.2703346 0.97120199 0.34989206-0.9166100 0.10905214

### Randomly select a few mixtures as a starting poolset.seed(91)startMixture <- sample(1:nrow(subTrain), 1)starters <- scaledTrain[startMixture, 1:7]pool <- scaledTrainindex <- maxDissim(starters, pool, 14)startPoints <- c(startMixture, index)

starters <- subTrain[startPoints,1:7]

startingValues <- starters[, -4]

### For each starting mixture, optimize the Cubist model using### a simplex search routinecbResults <- startingValuescbResults$Water <- NAcbResults$Prediction <- NAfor(i in 1:nrow(cbResults)){  results <- optim(unlist(cbResults[i,1:6]),                   modelPrediction,                   method = "Nelder-Mead",                   control=list(maxit=5000),                   mod = cbFit)  cbResults$Prediction[i] <- -results$value  cbResults[i,1:6] <- results$par}cbResults$Water <- 1 - apply(cbResults[,1:6], 1, sum)cbResults <- subset(cbResults, Prediction > 0 & Water > .02)cbResults <- cbResults[order(-cbResults$Prediction),][1:3,]cbResults$Model <- "Cubist"

### Do the same for the neural network modelnnetResults <- startingValuesnnetResults$Water <- NAnnetResults$Prediction <- NAfor(i in 1:nrow(nnetResults)){  results <- optim(unlist(nnetResults[i, 1:6,]),                   modelPrediction,                   method = "Nelder-Mead",                   control=list(maxit=5000),                   mod = nnetFit)  nnetResults$Prediction[i] <- -results$value  nnetResults[i,1:6] <- results$par}nnetResults$Water <- 1 - apply(nnetResults[,1:6], 1, sum)nnetResults <- subset(nnetResults, Prediction > 0 & Water > .02)nnetResults <- nnetResults[order(-nnetResults$Prediction),][1:3,]nnetResults$Model <- "NNet"

nnetResults

CementBlastFurnaceSlagFlyAshSuperplasticizerCoarseAggregateFineAggregateWaterPredictionModel 5230.1974024 0.1386961 4.884813e-030.007925268 0.3310551 0.2700363 0.05000000 83.97978 NNet 7950.2189491 0.1309298 2.742898e-090.007305410 0.3587300 0.2327945 0.05129121 82.11033 NNet 690.1933065 0.1303526 1.712704e-090.006000672 0.3932416 0.2270986 0.05000001 81.44083 NNet

### Convert the predicted mixtures to PCA space and plotpp2 <- preProcess(subTrain[, 1:7], "pca")pca1 <- predict(pp2, subTrain[, 1:7])pca1$Data <- "Training Set"pca1$Data[startPoints] <- "Starting Values"pca3 <- predict(pp2, cbResults[, names(subTrain[, 1:7])])pca3$Data <- "Cubist"pca4 <- predict(pp2, nnetResults[, names(subTrain[, 1:7])])pca4$Data <- "Neural Network"pcaData <- rbind(pca1, pca3, pca4)pcaData$Data <- factor(pcaData$Data,                       levels = c("Training Set","Starting Values",                                  "Cubist","Neural Network"))lim <- extendrange(pcaData[, 1:2])

xyplot(PC2 ~ PC1,        data = pcaData,        groups = Data,       auto.key = list(columns = 2),       xlim = lim,        ylim = lim,       type = c("g", "p"))

png

xyplot(PC2 ~ PC1,        data = pcaData,        groups = Data,       auto.key = list(columns = 2),       par.settings = simpleTheme(pch = 20:25),       xlim = lim,        ylim = lim,       type = c("g", "p"))

png

Section 11.1 Class Predictions

Basic: Simulate some two class data with two predictors

library(AppliedPredictiveModeling)### Simulate some two class data with two predictorsset.seed(975)training <- quadBoundaryFunc(500)testing <- quadBoundaryFunc(1000)testing$class2 <- ifelse(testing$class == "Class1", 1, 0)testing$ID <- 1:nrow(testing)### Fit modelslibrary(MASS)qdaFit <- qda(class ~ X1 + X2, data = training)library(randomForest)rfFit <- randomForest(class ~ X1 + X2, data = training, ntree = 2000)

### Predict the test settesting$qda <- predict(qdaFit, testing)$posterior[,1]testing$rf <- predict(rfFit, testing, type = "prob")[,1]

One way to assess the quality of the class probabilities is using a calibration plot.

For a given set of data, this plot shows some measure of the observed
probability of an event versus the predicted class probability. One approach
for creating this visualization is to score a collection of samples with known
outcomes (preferably a test set) using a classification model. The next step
is to bin the data into groups based on their class probabilities. For example,
a set of bins might be [0, 10%], (10%, 20%], …, (90%, 100%]. For each
bin, determine the observed event rate. Suppose that 50 samples fell into
the bin for class probabilities less than 10% and there was a single event.
The midpoint of the bin is 5% and the observed event rate would be 2 %.
The calibration plot would display the midpoint of the bin on the x-axis and
the observed event rate on the y-axis. If the points fall along a 45◦ line, the
model has produced well-calibrated probabilities.

### Generate the calibration analysislibrary(caret)calData1 <- calibration(class ~ qda + rf, data = testing, cuts = 10)

### Plot the curvexyplot(calData1, auto.key = list(columns = 2))

png

### To calibrate the data, treat the probabilities as inputs into the### modeltrainProbs <- trainingtrainProbs$qda <- predict(qdaFit)$posterior[,1]

### These models take the probabilities as inputs and, based on the### true class, re-calibrate them.library(klaR)nbCal <- NaiveBayes(class ~ qda, data = trainProbs, usekernel = TRUE)### We use relevel() here because glm() models the probability of the### second factor level.lrCal <- glm(relevel(class, "Class2") ~ qda, data = trainProbs, family = binomial)

Warning message:"package 'klaR' was built under R version 3.2.5"Attaching package: 'klaR'The following object is masked from 'package:TeachingDemos':    triplot

### Now re-predict the test set using the modified class probability### estimatestesting$qda2 <- predict(nbCal, testing[, "qda", drop = FALSE])$posterior[,1]testing$qda3 <- predict(lrCal, testing[, "qda", drop = FALSE], type = "response")

### Manipulate the data a bit for pretty plottingsimulatedProbs <- testing[, c("class", "rf", "qda3")]names(simulatedProbs) <- c("TrueClass", "RandomForestProb", "QDACalibrated")simulatedProbs$RandomForestClass <-  predict(rfFit, testing)

Manipulate the data a bit for pretty plotting

calData2 <- calibration(class ~ qda + qda2 + qda3, data = testing)calData2$data$calibModelVar <- as.character(calData2$data$calibModelVar)calData2$data$calibModelVar <- ifelse(calData2$data$calibModelVar == "qda",                                       "QDA",                                      calData2$data$calibModelVar)calData2$data$calibModelVar <- ifelse(calData2$data$calibModelVar == "qda2",                                       "Bayesian Calibration",                                      calData2$data$calibModelVar)calData2$data$calibModelVar <- ifelse(calData2$data$calibModelVar == "qda3",                                       "Sigmoidal Calibration",                                      calData2$data$calibModelVar)calData2$data$calibModelVar <- factor(calData2$data$calibModelVar,                                      levels = c("QDA",                                                  "Bayesian Calibration",                                                  "Sigmoidal Calibration"))

QDA is re-calibrated use Bayesian & Sigmoidal

xyplot(calData2, auto.key = list(columns = 1))

png

Presenting Class Probabilities

### Recreate the model used in the over-fitting chapterlibrary(caret)data(GermanCredit)

head(GermanCredit,2)

statusdurationcredit_historypurposeamountsavingsemployment_durationinstallment_ratepersonal_status_sexother_debtors…propertyageother_installment_planshousingnumber_creditsjobpeople_liabletelephoneforeign_workercredit_risk … < 100 DM 6 critical account/other credits existing domestic appliances 1169 unknown/no savings account … >= 7 years 4 male : single none … real estate 67 none own 2 skilled employee/official 1 yes yes good 0 <= … < 200 DM 48 existing credits paid back duly till now domestic appliances 5951 … < 100 DM 1 <= … < 4 years 2 female : divorced/separated/married none … real estate 22 none own 1 skilled employee/official 1 no yes bad

## First, remove near-zero variance predictors then get rid of a few predictors ## that duplicate values. For example, there are two possible values for the ## housing variable: "Rent", "Own" and "ForFree". So that we don't have linear## dependencies, we get rid of one of the levels (e.g. "ForFree")GermanCredit <- GermanCredit[, -nearZeroVar(GermanCredit)]GermanCredit$CheckingAccountStatus.lt.0 <- NULLGermanCredit$SavingsAccountBonds.lt.100 <- NULLGermanCredit$EmploymentDuration.lt.1 <- NULLGermanCredit$EmploymentDuration.Unemployed <- NULLGermanCredit$Personal.Male.Married.Widowed <- NULLGermanCredit$Property.Unknown <- NULLGermanCredit$Housing.ForFree <- NULLnames(GermanCredit)[ncol(GermanCredit) ]<- 'Class'

## Split the data into training (80%) and test sets (20%)set.seed(100)inTrain <- createDataPartition(GermanCredit$Class, p = .8)[[1]]GermanCreditTrain <- GermanCredit[ inTrain, ]GermanCreditTest  <- GermanCredit[-inTrain, ]set.seed(1056)logisticReg <- train(Class ~ .,                     data = GermanCreditTrain,                     method = "glm",                     trControl = trainControl(method = "repeatedcv",                                               repeats = 5))

Warning message in predict.lm(object, newdata, se.fit, scale = 1, type = ifelse(type == :"prediction from a rank-deficient fit may be misleading"Warning message in predict.lm(object, newdata, se.fit, scale = 1, type = ifelse(type == :"prediction from a rank-deficient fit may be misleading"Warning message in predict.lm(object, newdata, se.fit, scale = 1, type = ifelse(type == :"prediction from a rank-deficient fit may be misleading"Warning message in predict.lm(object, newdata, se.fit, scale = 1, type = ifelse(type == :"prediction from a rank-deficient fit may be misleading"Warning message in predict.lm(object, newdata, se.fit, scale = 1, type = ifelse(type == :"prediction from a rank-deficient fit may be misleading"Warning message in predict.lm(object, newdata, se.fit, scale = 1, type = ifelse(type == :"prediction from a rank-deficient fit may be misleading"Warning message in predict.lm(object, newdata, se.fit, scale = 1, type = ifelse(type == :"prediction from a rank-deficient fit may be misleading"Warning message in predict.lm(object, newdata, se.fit, scale = 1, type = ifelse(type == :"prediction from a rank-deficient fit may be misleading"Warning message in predict.lm(object, newdata, se.fit, scale = 1, type = ifelse(type == :"prediction from a rank-deficient fit may be misleading"Warning message in predict.lm(object, newdata, se.fit, scale = 1, type = ifelse(type == :"prediction from a rank-deficient fit may be misleading"Warning message in predict.lm(object, newdata, se.fit, scale = 1, type = ifelse(type == :"prediction from a rank-deficient fit may be misleading"Warning message in predict.lm(object, newdata, se.fit, scale = 1, type = ifelse(type == :"prediction from a rank-deficient fit may be misleading"Warning message in predict.lm(object, newdata, se.fit, scale = 1, type = ifelse(type == :"prediction from a rank-deficient fit may be misleading"Warning message in predict.lm(object, newdata, se.fit, scale = 1, type = ifelse(type == :"prediction from a rank-deficient fit may be misleading"Warning message in predict.lm(object, newdata, se.fit, scale = 1, type = ifelse(type == :"prediction from a rank-deficient fit may be misleading"Warning message in predict.lm(object, newdata, se.fit, scale = 1, type = ifelse(type == :"prediction from a rank-deficient fit may be misleading"Warning message in predict.lm(object, newdata, se.fit, scale = 1, type = ifelse(type == :"prediction from a rank-deficient fit may be misleading"Warning message in predict.lm(object, newdata, se.fit, scale = 1, type = ifelse(type == :"prediction from a rank-deficient fit may be misleading"Warning message in predict.lm(object, newdata, se.fit, scale = 1, type = ifelse(type == :"prediction from a rank-deficient fit may be misleading"Warning message in predict.lm(object, newdata, se.fit, scale = 1, type = ifelse(type == :"prediction from a rank-deficient fit may be misleading"Warning message in predict.lm(object, newdata, se.fit, scale = 1, type = ifelse(type == :"prediction from a rank-deficient fit may be misleading"Warning message in predict.lm(object, newdata, se.fit, scale = 1, type = ifelse(type == :"prediction from a rank-deficient fit may be misleading"Warning message in predict.lm(object, newdata, se.fit, scale = 1, type = ifelse(type == :"prediction from a rank-deficient fit may be misleading"Warning message in predict.lm(object, newdata, se.fit, scale = 1, type = ifelse(type == :"prediction from a rank-deficient fit may be misleading"Warning message in predict.lm(object, newdata, se.fit, scale = 1, type = ifelse(type == :"prediction from a rank-deficient fit may be misleading"Warning message in predict.lm(object, newdata, se.fit, scale = 1, type = ifelse(type == :"prediction from a rank-deficient fit may be misleading"Warning message in predict.lm(object, newdata, se.fit, scale = 1, type = ifelse(type == :"prediction from a rank-deficient fit may be misleading"Warning message in predict.lm(object, newdata, se.fit, scale = 1, type = ifelse(type == :"prediction from a rank-deficient fit may be misleading"Warning message in predict.lm(object, newdata, se.fit, scale = 1, type = ifelse(type == :"prediction from a rank-deficient fit may be misleading"Warning message in predict.lm(object, newdata, se.fit, scale = 1, type = ifelse(type == :"prediction from a rank-deficient fit may be misleading"Warning message in predict.lm(object, newdata, se.fit, scale = 1, type = ifelse(type == :"prediction from a rank-deficient fit may be misleading"Warning message in predict.lm(object, newdata, se.fit, scale = 1, type = ifelse(type == :"prediction from a rank-deficient fit may be misleading"Warning message in predict.lm(object, newdata, se.fit, scale = 1, type = ifelse(type == :"prediction from a rank-deficient fit may be misleading"Warning message in predict.lm(object, newdata, se.fit, scale = 1, type = ifelse(type == :"prediction from a rank-deficient fit may be misleading"Warning message in predict.lm(object, newdata, se.fit, scale = 1, type = ifelse(type == :"prediction from a rank-deficient fit may be misleading"Warning message in predict.lm(object, newdata, se.fit, scale = 1, type = ifelse(type == :"prediction from a rank-deficient fit may be misleading"Warning message in predict.lm(object, newdata, se.fit, scale = 1, type = ifelse(type == :"prediction from a rank-deficient fit may be misleading"Warning message in predict.lm(object, newdata, se.fit, scale = 1, type = ifelse(type == :"prediction from a rank-deficient fit may be misleading"Warning message in predict.lm(object, newdata, se.fit, scale = 1, type = ifelse(type == :"prediction from a rank-deficient fit may be misleading"Warning message in predict.lm(object, newdata, se.fit, scale = 1, type = ifelse(type == :"prediction from a rank-deficient fit may be misleading"Warning message in predict.lm(object, newdata, se.fit, scale = 1, type = ifelse(type == :"prediction from a rank-deficient fit may be misleading"Warning message in predict.lm(object, newdata, se.fit, scale = 1, type = ifelse(type == :"prediction from a rank-deficient fit may be misleading"Warning message in predict.lm(object, newdata, se.fit, scale = 1, type = ifelse(type == :"prediction from a rank-deficient fit may be misleading"Warning message in predict.lm(object, newdata, se.fit, scale = 1, type = ifelse(type == :"prediction from a rank-deficient fit may be misleading"Warning message in predict.lm(object, newdata, se.fit, scale = 1, type = ifelse(type == :"prediction from a rank-deficient fit may be misleading"Warning message in predict.lm(object, newdata, se.fit, scale = 1, type = ifelse(type == :"prediction from a rank-deficient fit may be misleading"Warning message in predict.lm(object, newdata, se.fit, scale = 1, type = ifelse(type == :"prediction from a rank-deficient fit may be misleading"Warning message in predict.lm(object, newdata, se.fit, scale = 1, type = ifelse(type == :"prediction from a rank-deficient fit may be misleading"Warning message in predict.lm(object, newdata, se.fit, scale = 1, type = ifelse(type == :"prediction from a rank-deficient fit may be misleading"Warning message in predict.lm(object, newdata, se.fit, scale = 1, type = ifelse(type == :"prediction from a rank-deficient fit may be misleading"

logisticReg

Generalized Linear Model 800 samples 19 predictor  2 classes: 'good', 'bad' No pre-processingResampling: Cross-Validated (10 fold, repeated 5 times) Summary of sample sizes: 720, 720, 720, 720, 720, 720, ... Resampling results:  Accuracy  Kappa      0.75425   0.3797266

### Predict the test setcreditResults <- data.frame(obs = GermanCreditTest$Class)creditResults$prob <- predict(logisticReg, GermanCreditTest, type = "prob")[, "bad"]creditResults$pred <- predict(logisticReg, GermanCreditTest)creditResults$Label <- ifelse(creditResults$obs == "bad",                               "True Outcome: Bad Credit",                               "True Outcome: Good Credit")

Warning message in predict.lm(object, newdata, se.fit, scale = 1, type = ifelse(type == :"prediction from a rank-deficient fit may be misleading"Warning message in predict.lm(object, newdata, se.fit, scale = 1, type = ifelse(type == :"prediction from a rank-deficient fit may be misleading"

### Plot the probability of bad credithistogram(~prob|Label,          data = creditResults,          layout = c(2, 1),          nint = 20,          xlab = "Probability of Bad Credit",          type = "count")### Calculate and plot the calibration curvecreditCalib <- calibration(obs ~ prob, data = creditResults)xyplot(creditCalib)

png

Top: Histograms for a set of probabilities associated with bad credit.
The two panels split the customers by their true class. Bottom: A calibration
plot for these probabilities

The top panel shows histograms of
the test set probabilities for the logistic regression model (the panels indicate
the true credit status). The probability of bad credit for the customers with
good credit shows a skewed distribution where most customers’ probabilities
are quite low. In contrast, the probabilities for the customers with bad
credit are flat (or uniformly distributed), reflecting the model’s inability to
distinguish bad credit cases.

### Create the confusion matrix from the test set.confusionMatrix(data = creditResults$pred,                 reference = creditResults$obs)

Confusion Matrix and Statistics          ReferencePrediction good bad      good  123  31      bad    17  29               Accuracy : 0.76                             95% CI : (0.6947, 0.8174)    No Information Rate : 0.7                 P-Value [Acc > NIR] : 0.03595                           Kappa : 0.3878           Mcnemar's Test P-Value : 0.06060                     Sensitivity : 0.8786                      Specificity : 0.4833                   Pos Pred Value : 0.7987                   Neg Pred Value : 0.6304                       Prevalence : 0.7000                   Detection Rate : 0.6150             Detection Prevalence : 0.7700                Balanced Accuracy : 0.6810                 'Positive' Class : good

The Kappa statistic (also known as
Cohen’s Kappa) was originally designed to assess the agreement between two
raters (Cohen 1960). Kappa takes into account the accuracy that would be
generated simply by chance.

ROC: Receiver Operating Characteristic Curves

### ROC curves:### Like glm(), roc() treats the last level of the factor as the event### of interest so we use relevel() to change the observed class datalibrary(pROC)creditROC <- roc(relevel(creditResults$obs, "good"), creditResults$prob)

coords(creditROC, "all")[,1:3]auc(creditROC)ci.auc(creditROC)

allallall threshold-Inf 6.509215e-094.267632e-08 specificity 0 7.142857e-031.428571e-02 sensitivity 1 1.000000e+001.000000e+00

0.78202380952381

0.71286003739189
0.78202380952381
0.851187581655729

### Note the x-axis is reversedplot(creditROC)

png

Lift Charts

Lift charts (Ling and Li 1998) are a visualization tool for assessing the ability
of a model to detect events in a data set with two classes.

Lift charts do just this: rank the samples by
their scores and determine the cumulative event rate as more samples are
evaluated.In the optimal case, the M highest-ranked samples would contain
all M events. When the model is non-informative, the highest-ranked X %
of the data would contain, on average, X events.

creditLift <- lift(obs ~ prob, data = creditResults)xyplot(creditLift)

png

阅读全文

0 0