Data Sampling with SMOTE in R (notes from Xie Jiabiao's lecture)


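Before getting to sampling: I had forgotten most of what I learned earlier.

So before working through this example, here is a quick look at how the data is obtained.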

hyper<-read.csv("http://archive.ics.uci.edu/ml/machine-learning-databases/thyroid-disease/hypothyroid.data",header=F)

names<-read.csv("http://archive.ics.uci.edu/ml/machine-learning-databases/thyroid-disease/hypothyroid.names",header=F,sep='\t')[[1]]

> names
 [1] hypothyroid, negative.     age:                       sex:
 [4] on_thyroxine:              query_on_thyroxine:        on_antithyroid_medication:
 [7] thyroid_surgery:           query_hypothyroid:         query_hyperthyroid:
[10] pregnant:                  sick:                      tumor:
[13] lithium:                   goitre:                    TSH_measured:
[16] TSH:                       T3_measured:               T3:
[19] TT4_measured:              TT4:                       T4U_measured:
[22] T4U:                       FTI_measured:              FTI:
[25] TBG_measured:              TBG:

Clean up the column names. One contains "." and the rest end with ":", so strip both symbols.

name1<-gsub(".|:","",names)

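Does this remove every "." and ":"? No. In a regular expression an unescaped . matches any character, so this pattern deletes everything. To match a literal dot, put it inside "[]".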

> gg<-gsub(":|[.]","",names)
> gg
 [1] "hypothyroid, negative"     "age"                       "sex"
 [4] "on_thyroxine"              "query_on_thyroxine"        "on_antithyroid_medication"
 [7] "thyroid_surgery"           "query_hypothyroid"         "query_hyperthyroid"
[10] "pregnant"                  "sick"                      "tumor"
[13] "lithium"                   "goitre"                    "TSH_measured"
[16] "TSH"                       "T3_measured"               "T3"
[19] "TT4_measured"              "TT4"                       "T4U_measured"
[22] "T4U"                       "FTI_measured"              "FTI"
[25] "TBG_measured"              "TBG"
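The difference between the two patterns is easy to check on a tiny vector. A minimal sketch, using a made-up two-element vector `demo` rather than the real column names:

```r
# Hypothetical sample values, shaped like the raw names above
demo <- c("hypothyroid, negative.", "age:")

# Unescaped "." matches any character, so every character is deleted
wrong <- gsub(".|:", "", demo)

# "[.]" matches a literal dot only, so just the two symbols are removed
right <- gsub(":|[.]", "", demo)

print(wrong)
print(right)
```

Writing the dot as "\\." instead of "[.]" has the same effect.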

A short digression on gsub(pattern, replacement, x).

For example:

> x <- "line 4322: He is now 25 years old, and weights 130lbs"
> y <- gsub("\\d+","---",x)
> y

[1] "line ---: He is now --- years old, and weights ---lbs"

Each run of digits is replaced with "---" (for more on replacement, see http://www.endmemo.com/program/R/gsub.php).

> colnames(hyper)
 [1] "V1"  "V2"  "V3"  "V4"  "V5"  "V6"  "V7"  "V8"  "V9"  "V10" "V11" "V12" "V13" "V14" "V15" "V16"
[17] "V17" "V18" "V19" "V20" "V21" "V22" "V23" "V24" "V25" "V26"
> colnames(hyper)<-gg

colnames(hyper)[1]<-"target"

> table(hyper$target)

hypothyroid    negative
        151        2983

> hyper$target<-ifelse(hyper$target=="negative",0,1) # a neat trick: recode the class label as 0/1

> table(hyper$target)

   0    1
2983  151

table() gives frequency counts. What about proportions? Use prop.table().

> prop.table(table(hyper$target))

         0          1
0.95181876 0.04818124
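The relationship between the two calls can be seen on a small vector. A minimal sketch with made-up labels (`y` is hypothetical, not taken from the thyroid data):

```r
# Hypothetical labels: 8 negatives (0) and 2 positives (1)
y <- c(rep(0, 8), rep(1, 2))

counts <- table(y)            # frequency count per class
shares <- prop.table(counts)  # each count divided by the total

print(counts)
print(shares)
```

prop.table() simply divides each count by the sum of all counts, so the shares always add up to 1.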

Convert the 0/1 values to a factor.

> str(hyper$target)
 num [1:3134] 1 1 1 1 1 1 1 1 1 1 ...
> hyper$target<-as.factor(hyper$target)
> str(hyper$target)
 Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
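Note that str() shows 2s after the conversion: these are internal level codes, not the original values. A small sketch of the difference, using a made-up vector `f`:

```r
# Hypothetical 0/1 labels converted to a factor with levels "0" and "1"
f <- as.factor(c(1, 1, 0))

codes  <- as.numeric(f)                # level codes (position in levels(f))
values <- as.numeric(as.character(f))  # the original 0/1 values

print(levels(f))
print(codes)
print(values)
```

To get the original numbers back from a factor, go through as.character() first.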

Now for the main topic.

1. The SMOTE function

SMOTE(form, data, perc.over = 200, k = 5, perc.under = 200,
learner = NULL, ...)

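form is a formula, usually of the form V1~. where V1 is the name of the class (target) variable.

data is the data set to sample from.

perc.over = 200 means the minority class is over-sampled by 200%: two synthetic cases are generated for each original minority case.

perc.under = 200 means the number of majority cases drawn is 200% of the number of newly generated minority cases, i.e. twice as many (four times the original minority count in total).

k is the number of nearest neighbours used to generate each new minority case. The default of 5 can usually be left alone.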
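The role of k is easier to see with the core SMOTE idea written out: a synthetic case is placed at a random position on the line segment between a minority case and one of its k nearest minority neighbours. The following base R sketch illustrates that idea for numeric features only. It is not DMwR's implementation; the helper `smote_one` and the toy matrix `minority` are hypothetical.

```r
set.seed(1)

# Toy minority-class data: 6 cases, 2 numeric features (made-up values)
minority <- matrix(c(1.0, 1.2, 0.9, 1.1, 1.3, 0.8,
                     2.0, 2.1, 1.9, 2.3, 2.2, 1.8), ncol = 2)

# Hypothetical helper: build one synthetic case from row i of X
smote_one <- function(X, i, k = 5) {
  # Euclidean distance from row i to every row of X
  diffs <- X - matrix(X[i, ], nrow(X), ncol(X), byrow = TRUE)
  d <- sqrt(rowSums(diffs^2))
  d[i] <- Inf                                   # exclude the case itself
  nn <- order(d)[seq_len(min(k, nrow(X) - 1))]  # indices of the k nearest neighbours
  j <- nn[sample.int(length(nn), 1)]            # choose one neighbour at random
  gap <- runif(1)                               # random position along the segment
  X[i, ] + gap * (X[j, ] - X[i, ])
}

new_point <- smote_one(minority, i = 1, k = 5)
print(new_point)
```

A smaller k keeps synthetic cases close to their parent; a larger k spreads them over a wider neighbourhood of the minority class.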

Putting this together with the data prepared above:

install.packages("DMwR")
library(DMwR)

> hyper_new<-SMOTE(target~.,hyper,perc.over=200,perc.under=200)

> table(hyper_new$target)

  0   1
604 453
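The counts in this table follow directly from the two percentages and the 151 original minority cases. A small sketch of the arithmetic, assuming perc.over is a multiple of 100; `smote_counts` is a hypothetical helper, not part of DMwR:

```r
# Hypothetical helper: expected class sizes after SMOTE
smote_counts <- function(n_minority, perc.over = 200, perc.under = 200) {
  n_new <- n_minority * perc.over / 100    # synthetic minority cases generated
  n_majority <- n_new * perc.under / 100   # majority cases drawn
  c(majority = n_majority, minority = n_minority + n_new)
}

res <- smote_counts(151)
print(res)
```

With 151 minority cases this gives 302 synthetic cases, hence 151 + 302 = 453 minority cases and 302 * 2 = 604 majority cases.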

The example in help(SMOTE) follows the same principle and is worth trying. There, too, the target variable has to be converted to a factor.