Learning R---dummy

Source: Internet  Time: 2024/06/05 21:53

Intro

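The dummy package quickly and efficiently creates dummy variables from the factor and character variables in a data frame. I came across it by chance while searching online for dummy variables and one-hot encoding. I still feel Python is the better fit for this, since depending on a single library is enough there. In R everything is spread across separate packages, and if a package stops being maintained there may well be plenty of pitfalls.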


Function

categories

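Main purpose: extract the values of the categorical variables, as the preprocessing step for generating dummy variables.
The categories function extracts all factor and character variables in a data frame and ignores numeric variables. Its output is the input that the dummy function works from.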

Arguments

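x    a data frame
p    selects the p most frequent values. Can be "all" (every value of each categorical variable), an integer p (the p most frequent values of every categorical variable), or a vector (one setting per categorical variable)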

Examples

library(dummy)
traindata <- data.frame(var1=as.factor(c("a","b","b","c")),
                        var2=as.factor(c(1,1,2,3)),
                        var3=c("val1","val2","val3","val3"),
                        stringsAsFactors=FALSE)
newdata <- data.frame(var1=as.factor(c("a","b","b","c","d","d")),
                      var2=as.factor(c(1,1,2,3,4,5)),
                      var3=c("val1","val2","val3","val3","val4","val4"),
                      stringsAsFactors=FALSE)
categories(x=traindata,p="all")
categories(x=traindata,p=2)
categories(x=traindata,p=c(2,1,3))
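To make the meaning of p more concrete, here is a minimal base-R sketch of "keep the p most frequent values of each categorical variable". The helper name top_categories and the extra numeric column num are my own assumptions for illustration. This is not the dummy package's implementation, it only handles p = "all" or a single integer, and how the package breaks ties between equally frequent values is not something I have checked.

```r
# Hypothetical helper (NOT part of the dummy package): for every factor or
# character column, return its p most frequent values; numeric columns are skipped.
top_categories <- function(x, p = "all") {
  is_cat <- vapply(x, function(col) is.factor(col) || is.character(col), logical(1))
  lapply(x[is_cat], function(col) {
    # Count each value and order the counts from most to least frequent
    counts <- sort(table(as.character(col)), decreasing = TRUE)
    if (identical(p, "all")) {
      names(counts)
    } else {
      names(counts)[seq_len(min(p, length(counts)))]
    }
  })
}

traindata <- data.frame(var1 = as.factor(c("a", "b", "b", "c")),
                        var2 = as.factor(c(1, 1, 2, 3)),
                        var3 = c("val1", "val2", "val3", "val3"),
                        num  = c(0.1, 0.2, 0.3, 0.4),   # assumed extra column, ignored
                        stringsAsFactors = FALSE)

top_all <- top_categories(traindata)          # every value of var1, var2, var3
top1    <- top_categories(traindata, p = 1)   # only the most frequent value of each
print(top1)
```

The numeric column num does not appear in the result, which mirrors the description that numeric variables are ignored.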

dummy

Arguments

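dummy(x, p = "all", object = NULL, int = FALSE, verbose = FALSE)

x        a data frame
p        only used when object is NULL; same meaning as p in categories
object   the object returned by categories
int      TRUE returns the dummy variables as numbers (integers); otherwise they are factors
verbose  whether to show progress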

Examples

library(dummy)
traindata <- data.frame(var1=as.factor(c("a","b","b","c")),
                        var2=as.factor(c(1,1,2,3)),
                        var3=c("val1","val2","val3","val3"),
                        stringsAsFactors=FALSE)
newdata <- data.frame(var1=as.factor(c("a","b","b","c","d","d")),
                      var2=as.factor(c(1,1,2,3,4,5)),
                      var3=c("val1","val2","val3","val3","val4","val4"),
                      stringsAsFactors=FALSE)

#create dummies of training set
(dummies_train <- dummy(x=traindata))

#create dummies of new set
(dummies_new <- dummy(x=newdata))

#how many new dummy variables should not have been created?
sum(! colnames(dummies_new) %in% colnames(dummies_train))

#create dummies of new set using categories found in training set
(dummies_new <- dummy(x=newdata,object=categories(traindata,p="all")))

#how many new dummy variables should not have been created?
sum(! colnames(dummies_new) %in% colnames(dummies_train))

#create dummies of training set,
#using the top 2 categories of all variables found in the training data
dummy(x=traindata,p=2)

#create dummies of training set,
#using respectively the top 2,3 and 1 categories of the three
#variables found in training data
dummy(x=traindata,p=c(2,3,1))

#create all dummies of training data
dummy(x=traindata)

Others

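In practice, should the training set and the test set be combined first and only then turned into dummy variables? Then again, if a category never appears in the training set, the model probably cannot do much with it on the test set anyway. As I understand it, the practical effect is that all of those unknown categories end up lumped together with the last category of the training set.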
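To look at the unseen-category question without relying on any package, here is a small base-R sketch. The helper make_dummies and the column naming var1_a, var1_b, ... are assumptions of mine, not the dummy package's code, and I have not verified that the package treats unseen values in exactly this way. It builds one 0/1 column per category learned from the training values and applies those same columns to new values.

```r
# Hypothetical helper (NOT the dummy package's code): one 0/1 column per
# category listed in `cats`, named <prefix>_<category>.
make_dummies <- function(values, cats, prefix) {
  out <- sapply(cats, function(ct) as.integer(as.character(values) == ct))
  matrix(out, nrow = length(values),
         dimnames = list(NULL, paste(prefix, cats, sep = "_")))
}

train_var1 <- c("a", "b", "b", "c")
new_var1   <- c("a", "b", "b", "c", "d", "d")

# Categories are learned from the training values only
cats <- sort(unique(train_var1))

dummies_new <- make_dummies(new_var1, cats, "var1")
print(dummies_new)

# Rows whose value never occurred in training have no 1 in any column
unseen_rows <- unname(which(rowSums(dummies_new) == 0))
print(unseen_rows)
```

In this sketch the rows holding "d" come out as all zeros rather than as a separate column. If a model then drops one category as its reference level, an all-zero row looks the same as that reference category, which may be what the "lumped together" observation above amounts to.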
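As for dummy variables and one-hot encoding themselves, I still need to find more material and study them. I had never thought about any of this before, so I clearly have a lot left to learn.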