Using R to Fix Data Quality: Section 7
Section 7: Fix missing data
Overview
In previous sections, we showed how to find missing data in a table. In this section, we use linear regression to fill in the missing values.
Read CSV Data
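In this demo, we use hours.csv as our data. We load it into a data frame named data (assuming the file is in the working directory) and look at the first rows:
> data <- read.csv("hours.csv")
> head(data)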
  Hours Score Questions.Posted Days.Missed
1     1    NA                0           3
2     1    55                2           0
3     3    NA                0           2
4     5    60                0           1
5     5    65                0           2
6     5    70                1           3
This table has four columns. Hours, Questions.Posted and Days.Missed are complete, but some values of Score are missing. We therefore need a way to fill in the missing Score values.
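A quick way to confirm which columns contain missing values is to count the NA entries per column. The sketch below rebuilds only the six rows printed above (the rest of hours.csv is not reproduced here), so the name demo is an assumption and the counts apply to that excerpt, not to the full file.

```r
# The six rows shown by head(data); the remaining rows of hours.csv are not included.
demo <- data.frame(
  Hours            = c(1, 1, 3, 5, 5, 5),
  Score            = c(NA, 55, NA, 60, 65, 70),
  Questions.Posted = c(0, 2, 0, 0, 0, 1),
  Days.Missed      = c(3, 0, 2, 1, 2, 3)
)

# Count the missing values in each column
na_counts <- colSums(is.na(demo))
print(na_counts)

# Row numbers whose Score is missing
missing_rows <- which(is.na(demo$Score))
print(missing_rows)
```

Running the same two calls on the full data frame would show how many Score values need to be imputed and where they are.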
Typing data$ in front of every column name quickly becomes tedious, so we attach the data frame, which lets us refer to its columns by name alone:
> attach(data)
Linear Regression
A simple way to fill in the missing data is random imputation, which replaces each NA with a randomly drawn value, but this hurts the accuracy of our data. Linear regression instead uses the existing data to predict each missing value. Implementing linear regression from scratch takes real effort in other programming languages such as Java, C, or Python. Fortunately, R has a built-in linear regression function, lm(), that we can use directly.
Use Hours, Questions.Posted and Days.Missed to fit a linear regression for Score:
> lmod <- lm(Score ~ Hours + Questions.Posted + Days.Missed)
> lmod
The object lmod holds the result of our linear regression. We can use it to predict Score. For example, suppose we want to predict the score when Hours = 3, Questions.Posted = 50, and Days.Missed = 2.
The code to predict:
> predict(lmod, data.frame(Hours=c(3), Questions.Posted=c(50), Days.Missed=c(2)))
       1 
31.65578 
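The fitting and prediction steps can be tried without hours.csv using the minimal sketch below. The data frame toy and its generating formula (Score = 40 + 5*Hours + 2*Questions.Posted - 3*Days.Missed) are made up for illustration and are not values from the original file, so the numbers differ from the 31.65578 shown above. The sketch passes data = toy rather than relying on attach(), shows that lm() with R's default na.action setting leaves out rows whose Score is NA, and checks that predict() matches the coefficients applied by hand.

```r
# Made-up data: Score is an exact linear function of the predictors,
# so the fitted coefficients are easy to verify. Not taken from hours.csv.
toy <- data.frame(
  Hours            = c(1, 1, 3, 5, 5, 5, 7, 8),
  Questions.Posted = c(0, 2, 0, 0, 0, 1, 3, 1),
  Days.Missed      = c(3, 0, 2, 1, 2, 3, 0, 1)
)
toy$Score <- 40 + 5 * toy$Hours + 2 * toy$Questions.Posted - 3 * toy$Days.Missed
toy$Score[c(1, 3)] <- NA  # pretend two scores were never recorded

# Same formula as in the text, with the data frame passed explicitly
toy_fit <- lm(Score ~ Hours + Questions.Posted + Days.Missed, data = toy)
print(coef(toy_fit))  # intercept and one slope per predictor
print(nobs(toy_fit))  # number of rows actually used in the fit

# Predict for one new observation, first with predict() ...
new_student <- data.frame(Hours = 3, Questions.Posted = 50, Days.Missed = 2)
by_predict <- unname(predict(toy_fit, new_student))

# ... then by applying the fitted coefficients by hand
b <- coef(toy_fit)
by_hand <- unname(b["(Intercept)"] + b["Hours"] * 3 +
                  b["Questions.Posted"] * 50 + b["Days.Missed"] * 2)
print(c(by_predict, by_hand))
```

Note that Questions.Posted = 50 lies far outside the range of the toy predictors, so this illustrates the mechanics of predict() rather than a trustworthy estimate.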
Deterministic Regression Imputation
We have used linear regression to predict Score from the other three variables. What remains is to replace each NA with its predicted value. In fact, predict() can be applied to the whole data table directly, giving one prediction per row:
> p <- predict(lmod, data)
Next, we write a function impute() that replaces NA values:
impute <- function(a, a.impute) {
  # Keep a where it is observed; use a.impute where a is NA
  ifelse(is.na(a), a.impute, a)
}
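To see how impute() behaves element by element, here is a small sketch on two short vectors. The values in observed and predicted are hypothetical and chosen only to make the result easy to read.

```r
impute <- function(a, a.impute) {
  ifelse(is.na(a), a.impute, a)
}

observed  <- c(55, NA, 60, NA)  # hypothetical scores with two gaps
predicted <- c(50, 52, 61, 63)  # hypothetical predictions, one per element

# Observed values are kept; only the NA positions take the predicted value
filled <- impute(observed, predicted)
print(filled)
```

Positions 1 and 3 keep their observed values even though a prediction exists for them, which is what makes this a replacement of missing entries only.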
Replace each NA with its predicted value:
> data$Score <- impute(data$Score, p)
Congratulations! You have fixed the missing data problem in your data table.
Practice Question
1. What are the 5 new values in the table after our regression imputation?
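One way to approach this question is to record which rows were missing before imputing and print those rows afterwards. The sketch below runs the whole procedure on made-up data; the data frame toy, its generating formula and the helper name was_missing are assumptions for illustration, so its output is not the answer for hours.csv.

```r
# Made-up data, not taken from hours.csv
toy <- data.frame(
  Hours            = c(1, 1, 3, 5, 5, 5, 7, 8),
  Questions.Posted = c(0, 2, 0, 0, 0, 1, 3, 1),
  Days.Missed      = c(3, 0, 2, 1, 2, 3, 0, 1)
)
toy$Score <- 40 + 5 * toy$Hours + 2 * toy$Questions.Posted - 3 * toy$Days.Missed
toy$Score[c(1, 3)] <- NA  # two scores treated as missing

# Remember which rows are missing before anything is overwritten
was_missing <- is.na(toy$Score)

# Fit on the rows with an observed Score, then predict for every row
toy_fit <- lm(Score ~ Hours + Questions.Posted + Days.Missed, data = toy)
p <- predict(toy_fit, toy)

# Deterministic regression imputation: fill only the missing entries
toy$Score <- ifelse(was_missing, p, toy$Score)

print(toy[was_missing, ])  # the rows that received an imputed Score
print(anyNA(toy$Score))    # FALSE once every gap has been filled
```

Applying the same was_missing pattern to data before the imputation step would list the newly imputed Score values for the real table.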