Machine Learning in R: Improving Models with Cross-Validation


What is cross-validation?

Cross-validation means holding back part of the sample so that it is not used to train the model, and using it instead to evaluate the model's predictions.
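For example, a minimal holdout sketch in Python (assuming scikit-learn and the iris data, which the R example later in this post also uses; the 30% holdout size is an arbitrary choice):

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
# keep 30% of the rows aside; they never influence the fitted model
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model = RandomForestClassifier(n_estimators=100).fit(X_train, y_train)
print("held-out accuracy:", model.score(X_test, y_test))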

Cross-validation methods

  1. 50% test set, 50% training set
    Drawback: training on only half of the data may throw away useful information, i.e. high bias.
  2. Leave-one-out
    2.1 Uses every data point, so the bias is low.
    2.2 The validation has to be run n times, so the execution time is high.
    2.3 It easily produces high variance on the test set, because if the single point held out as the test set happens to be an outlier, the estimate is ruined!

  3. k-fold cross-validation
    k-fold cross-validation fixes both of the problems above (both it and leave-one-out are illustrated in the sketch after this list):
    3.1 Most of the data is used for training.
    3.2 A reasonable proportion is still kept as the test set.
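Both leave-one-out and k-fold splits can be generated with scikit-learn's splitter classes; a small sketch on made-up data:

import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut

X = np.arange(10).reshape(10, 1)          # 10 toy samples

loo = LeaveOneOut()                       # one split per sample, each test set holds exactly 1 point
print(loo.get_n_splits(X))                # -> 10

kf = KFold(n_splits=5, shuffle=True, random_state=0)   # k = 5 folds
for train_idx, test_idx in kf.split(X):
    # each fold: ~80% of the rows train, ~20% test; every row is tested exactly once
    print(len(train_idx), len(test_idx))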

The general procedure:

  1. Randomly split your entire dataset into k "folds".
  2. For each fold, build your model on the other k – 1 folds of the dataset, then test the model to check its effectiveness on the kth (held-out) fold.
  3. Record the error you see on each of the predictions.
  4. Repeat this until each of the k folds has served as the test set.
  5. The average of your k recorded errors is called the cross-validation error and will serve as your performance metric for the model (a one-line equivalent is sketched right after this list).
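In scikit-learn this whole loop is automated by cross_val_score; a minimal sketch, again assuming the iris data and a random forest classifier (note it returns k scores, accuracy by default for a classifier, rather than errors, but their mean plays the same role as the cross-validation error above):

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
# fit on k-1 folds, score the k-th, repeat for all 10 folds
scores = cross_val_score(RandomForestClassifier(n_estimators=100), X, y, cv=10)
print("10-fold scores:", scores)
print("cross-validation estimate:", scores.mean())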

But how do you choose k?
A small k tends toward high bias, while a large k tends toward high variance. In practice k = 10 is the usual choice (a quick comparison across several values of k is sketched below).
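One rough way to see the effect of k is simply to re-run cross-validation for several candidate values and compare the mean and spread of the fold scores; a sketch assuming scikit-learn, the iris data and a random forest (the candidate values are arbitrary):

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
for k in (2, 5, 10, 20):
    # more folds -> larger training sets but smaller, noisier test sets
    scores = cross_val_score(RandomForestClassifier(n_estimators=100, random_state=0), X, y, cv=k)
    print(f"k={k:2d}  mean={scores.mean():.3f}  std={scores.std():.3f}")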

How do you measure the model's bias and variance?

k-fold cross-validation produces k error estimates from k different models; in the ideal case they would sum to zero. Taking the average of these errors gives an estimate of the model's bias.

Similarly, to estimate the model's variance we take the standard deviation of all those errors (a short sketch of both numbers follows below).
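A tiny sketch of those two summary numbers, assuming the k fold errors have already been collected in an array (the values below are made up):

import numpy as np

fold_errors = np.array([0.12, 0.08, 0.15, 0.10, 0.11])   # hypothetical errors from k = 5 folds
print("bias estimate (mean of fold errors):", fold_errors.mean())
print("variance estimate (std of fold errors):", fold_errors.std())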

We need to strike a trade-off between bias and variance: keep the bias under control while reducing the variance.

Python code

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold  # was sklearn.cross_validation in old scikit-learn versions

# "train" and "target" are your feature matrix and label vector (numpy arrays)
model = RandomForestClassifier(n_estimators=100)

# Simple k-fold cross validation, 10 folds
cv = KFold(n_splits=10)
results = []
# "Error_function" can be replaced by the error function of your analysis
for traincv, testcv in cv.split(train):
    probas = model.fit(train[traincv], target[traincv]).predict_proba(train[testcv])
    results.append(Error_function)

print("Results: " + str(np.array(results).mean()))

R code

setwd('C:/Users/manish/desktop/RData')  # optional: point this at your own working directory
library(plyr)
library(dplyr)
library(randomForest)

data <- iris
glimpse(data)

# cross validation, using a random forest to predict Sepal.Length
k <- 5
data$id <- sample(1:k, nrow(data), replace = TRUE)
list <- 1:k

# prediction and test set data frames that we add to with each iteration over the folds
prediction <- data.frame()
testsetCopy <- data.frame()

# create a progress bar to track the status of the CV loop
progress.bar <- create_progress_bar("text")
progress.bar$init(k)

# loop over the k folds
for (i in 1:k) {
  # remove rows with id i from the data frame to create the training set,
  # select rows with id i to create the test set
  trainingset <- subset(data, id %in% list[-i])
  testset <- subset(data, id %in% c(i))

  # run a random forest model: predict Sepal.Length from the other columns,
  # excluding the fold id
  mymodel <- randomForest(Sepal.Length ~ . - id, data = trainingset, ntree = 100)

  # remove the response column 1, Sepal.Length, before predicting
  temp <- as.data.frame(predict(mymodel, testset[, -1]))

  # append this iteration's predictions to the end of the prediction data frame
  prediction <- rbind(prediction, temp)

  # append this iteration's test set to the test set copy data frame,
  # keeping only the Sepal.Length column
  testsetCopy <- rbind(testsetCopy, as.data.frame(testset[, 1]))

  progress.bar$step()
}

# combine predictions and actual Sepal.Length values
result <- cbind(prediction, testsetCopy[, 1])
names(result) <- c("Predicted", "Actual")
result$Difference <- abs(result$Actual - result$Predicted)

# as an example, use Mean Absolute Error as the evaluation metric
summary(result$Difference)