Using R to Fix Data Quality: Section 3


Section 3: Correlation Test


Overview

This section looks at the relationship between two variables: we draw a scatter plot of two columns and then run a correlation test on them.


Scatter Plots

In the last section, we looked at several ways to visualize a single variable. To visualize two variables together, we can use a scatter plot, which puts one variable on the x-axis and the other on the y-axis. Each point is one observation, so the overall shape of the cloud of points shows how the two variables are related.

Read the data from weather.csv again:

> data <- read.csv("weather.csv")

Create a scatter plot with Wind on the x-axis and Temp on the y-axis:

> plot(data$Wind, data$Temp)
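A slightly fuller version of the same plot might add axis labels and a straight trend line. This is only a sketch: because weather.csv is not available here, it assumes R's built-in airquality data set as a stand-in (it also has Wind and Temp columns), and the names weather and fit are chosen for illustration.

```r
# Assumption: the built-in airquality data set stands in for weather.csv.
weather <- airquality

# Scatter plot with Wind on the x-axis and Temp on the y-axis.
plot(weather$Wind, weather$Temp,
     xlab = "Wind", ylab = "Temp",
     main = "Temp against Wind")

# Fit a straight line (Temp as a linear function of Wind) and draw it
# on top of the points to make the overall trend easier to see.
fit <- lm(Temp ~ Wind, data = weather)
abline(fit, col = "red")
```

The direction in which the line slopes gives a first visual hint of whether the two variables move together or in opposite directions.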


Correlation Test

One way to test the relationship between two variables is to check their correlation. If one variable tends to increase as the other increases, the two are positively correlated. If one tends to increase as the other decreases, they are negatively correlated. If there is no consistent upward or downward trend between them, they are uncorrelated.
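The three cases can be illustrated with small made-up vectors and the cor() function. The values below are invented purely for illustration and have nothing to do with weather.csv.

```r
# Made-up example data (not from weather.csv).
x <- -3:3

up   <- 2 * x + 1   # rises whenever x rises
down <- 10 - x      # falls whenever x rises
bowl <- x^2         # falls, then rises: no overall upward or downward trend

r_up   <- cor(x, up)    # positive correlation
r_down <- cor(x, down)  # negative correlation
r_bowl <- cor(x, bowl)  # no correlation

print(c(up = r_up, down = r_down, bowl = r_bowl))
```

Here the first value is 1, the second is -1, and the third is 0. The last case is worth noting: bowl clearly depends on x, yet the correlation is 0, because correlation only measures a consistent upward or downward trend.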

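There is more than one way to run a correlation test. The usual Pearson test assumes that the data are normally distributed, and we are not sure that ours are. Spearman’s rho works on the ranks of the values instead of the values themselves, so it does not need that assumption and is a good choice here.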
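To see how a rank-based measure differs from the default Pearson correlation, here is a small sketch on made-up data where y always increases with x, but not along a straight line. The vectors are invented for illustration only.

```r
# Made-up example data: y grows with x, but along a curve.
x <- 1:20
y <- exp(x / 2)

# Pearson measures how close the points are to a straight line.
pearson <- cor(x, y, method = "pearson")

# Spearman only looks at the ranks, so any relationship that is
# always increasing counts as a perfect positive correlation.
spearman <- cor(x, y, method = "spearman")

print(c(pearson = pearson, spearman = spearman))
```

Spearman's rho is 1 here because the ranks of x and y agree exactly, while the Pearson value is lower because the points do not lie on a straight line.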

Run Spearman’s rank correlation test:

> cor.test(data$Wind, data$Temp, method="spearman")

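The output includes a sample estimate of rho, which always lies between -1 and 1, and a p-value for the null hypothesis that rho is 0. A rho of -1 means a perfect negative relationship (one variable always decreases when the other increases), and a rho of 1 means a perfect positive relationship (both always increase together); values in between indicate weaker relationships. A rho near 0 means the two variables are uncorrelated. If the data contain tied values, R warns that it cannot compute an exact p-value and uses an approximation instead.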
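The result of cor.test() can also be stored and inspected piece by piece. The sketch below again assumes the built-in airquality data set as a stand-in for weather.csv. The names weather, result, rho and p_value are illustrative, and exact = FALSE is passed so that R uses the approximate p-value directly instead of warning about ties.

```r
# Assumption: the built-in airquality data set stands in for weather.csv.
weather <- airquality

# exact = FALSE asks for the approximate p-value, which avoids the
# warning about ties when the data contain repeated values.
result <- cor.test(weather$Wind, weather$Temp,
                   method = "spearman", exact = FALSE)

rho     <- unname(result$estimate)  # sample estimate of rho
p_value <- result$p.value           # p-value for the null hypothesis rho = 0

cat("rho:", rho, "\n")
cat("p-value:", p_value, "\n")
```

The sign of rho gives the direction of the relationship, and a small p-value suggests that the correlation is unlikely to be due to chance alone.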


Practice Questions

1. What is the relationship between Wind and Temp?

2. If two variables are correlated, does that mean that one must cause the other?

