The Impact of Missing Values (NA)
Source: Internet  Editor: 程序博客网  Time: 2024/04/30 12:41
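While studying logistic regression recently, I ran into trouble at the step that builds the regression design matrix (model.matrix). The key points are summarized below:
1. Determine whether each explanatory variable is continuous or discrete. For a discrete variable, check exactly how many levels it has, so that no NA level appears and changes the length of the data.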
## read data and create relevant variables
credit <- read.csv("C:/Users/Administrator/Desktop/DataMining/germancredit.csv")
dim(credit)
str(credit)
'data.frame': 1000 obs. of 22 variables:
 $ Default        : Factor w/ 2 levels "0","1": 1 2 1 1 2 1 1 1 1 2 ...
 $ checkingstatus1: Factor w/ 4 levels "A11","A12","A13",..: 1 2 4 1 1 4 4 2 4 2 ...
 $ duration       : int 6 48 12 42 24 36 24 36 12 30 ...
 $ history        : Factor w/ 3 levels "good","poor",..: 3 2 3 2 2 2 2 2 2 3 ...
 $ purpose        : Factor w/ 10 levels "A40","A41","A410",..: 5 5 8 4 1 8 4 2 5 1 ...
 $ amount         : int 1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
 $ savings        : Factor w/ 5 levels "A61","A62","A63",..: 5 1 1 1 1 5 3 1 4 1 ...
 $ employ         : Factor w/ 5 levels "A71","A72","A73",..: 5 3 4 4 3 3 5 3 4 1 ...
 $ installment    : int 4 2 2 2 3 2 3 2 2 4 ...
 $ status         : Factor w/ 4 levels "A91","A92","A93",..: 3 2 3 3 3 3 3 3 1 4 ...
 $ others         : Factor w/ 3 levels "A101","A102",..: 1 1 1 3 1 1 1 1 1 1 ...
 $ residence      : int 4 2 3 4 4 4 4 2 4 2 ...
 $ property       : Factor w/ 4 levels "A121","A122",..: 1 1 1 2 4 4 2 3 1 3 ...
 $ age            : int 67 22 49 45 53 35 53 35 61 28 ...
 $ otherplans     : Factor w/ 3 levels "A141","A142",..: 3 3 3 3 3 3 3 3 3 3 ...
 $ housing        : Factor w/ 3 levels "A151","A152",..: 2 2 2 3 3 3 2 1 2 2 ...
 $ cards          : int 2 1 1 1 2 1 1 1 1 2 ...
 $ job            : Factor w/ 4 levels "A171","A172",..: 3 3 2 3 3 2 3 4 2 4 ...
 $ liable         : int 1 1 2 2 2 2 1 1 1 1 ...
 $ tele           : Factor w/ 2 levels "A191","A192": 2 1 1 1 1 2 1 2 1 1 ...
 $ foreign        : Factor w/ 2 levels "foreign","german": 1 1 1 1 1 1 1 1 1 1 ...
 $ rent           : Factor w/ 2 levels "FALSE","TRUE": 1 1 1 1 1 1 1 2 1 1 ...
The output shows the type of each variable.
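As a minimal sketch of the check described in point 1, the snippet below counts the levels and the missing values of a discrete variable and compares the row count of the data with that of the design matrix. The data frame `df` and its columns `y` and `g` are hypothetical toy data, not the German credit file.

```r
# Hypothetical toy data: one response and one factor with a missing value
df <- data.frame(
  y = c(0, 1, 0, 1, 0, 1),
  g = factor(c("a", "b", NA, "a", "c", "b"))
)

nlevels(df$g)                  # number of levels (NA is not counted as a level)
table(df$g, useNA = "ifany")   # observations per level, including NA
sum(is.na(df$g))               # number of missing values

# model.matrix() builds a model frame first; with the default na.action
# (na.omit), rows that contain NA are dropped from the result
X <- model.matrix(y ~ g, data = df)
nrow(df)   # rows in the data
nrow(X)    # rows in the design matrix
```

If the two row counts differ, some row was dropped because of an NA, which is the change in data length mentioned above.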
credit$Default <- factor(credit$Default)  # convert from integer to factor
## re-level the credit history and a few other variables
levels(credit$history) <- c("good","good","poor","poor","terrible")
levels(credit$foreign) <- c("foreign","german")
credit$rent <- factor(credit$housing=="A151")
table(credit$rent)
credit$purpose <- factor(credit$purpose, levels=c("A40","A41","A42","A43","A44","A45","A46","A47","A48","A49","A410"))  # re-specify the levels as 11 categories
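Here the factor levels are specified again. I originally thought this was unnecessary, because read.csv already converts string variables to factors when reading them in (this was the default before R 4.0.0; newer versions need stringsAsFactors = TRUE). The purpose variable is a special case, though, and its levels must be specified explicitly.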
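The reason lies in the purpose variable itself: in the raw data as read in, purpose has only 10 levels. If the levels of purpose are not re-specified at this point with
credit$purpose <- factor(credit$purpose, levels=c("A40","A41","A42","A43","A44","A45","A46","A47","A48","A49","A410"))  # re-specify the levels as 11 categories
then assigning directly via levels(credit$purpose) sets the original eighth level to missing, which leaves 50 NA values in purpose, and computing the regression design matrix then fails with a subscript out of bounds error.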
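The effect can be reproduced on a small scale. The sketch below uses a hypothetical toy vector (not the real data) that contains 10 of the 11 purpose codes, with "A47" absent. It blanks out position 8 of the levels once without and once with the explicit 11-level specification. Using `levels(x)[8] <- NA` is an assumption made for illustration; the original post does not show the exact assignment it used.

```r
# Hypothetical toy vector: 10 purpose codes (no "A47"), with "A46" repeated
codes <- c("A40", "A41", "A410", "A42", "A43", "A44", "A45", "A46", "A48", "A49")
purpose_raw <- factor(c(codes, "A46", "A46"))

# Default levels are sorted as strings, so "A410" comes before "A42"
levels(purpose_raw)

# Explicit 11-level specification, as in the post
all_codes <- c("A40", "A41", "A42", "A43", "A44", "A45",
               "A46", "A47", "A48", "A49", "A410")
purpose_11 <- factor(purpose_raw, levels = all_codes)

levels(purpose_raw)[8]   # a level that holds data
levels(purpose_11)[8]    # the empty level "A47"

# Blank out position 8 in each version
bad  <- purpose_raw; levels(bad)[8]  <- NA   # the "A46" observations become NA
good <- purpose_11;  levels(good)[8] <- NA   # only the empty level is removed

sum(is.na(bad))
sum(is.na(good))

# With NA values present, the design matrix has fewer rows than the data
df <- data.frame(y = rep(c(0, 1), length.out = length(bad)), purpose = bad)
nrow(df)
nrow(model.matrix(y ~ purpose, data = df))
```

Any index computed from the original number of rows can then point past the end of the shorter design matrix, which is one plausible route to the subscript out of bounds error.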