R Programming -- real-world data
Source: Internet. Editor: 程序博客网. Time: 2024/04/29 06:25
Real-World Data
So far, we've been working purely in the abstract. It's time to take a look at some real data, and see if we can make any observations about it.
7.1 Some Real World Data
Modern pirates plunder software, not silver. We have a file with the software piracy rate, sorted by country. Here's a sample of its format:
Country,Piracy
Australia,23
Bangladesh,90
Brunei,67
China,77
...
We'll load that into the piracy data frame for you:
> piracy <- read.csv("piracy.csv")
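If you don't have piracy.csv on hand, here is a minimal sketch of the same call that reads the sample rows shown above from an inline string, using the text argument of read.csv. The names csv_text and piracy_sample are made up for this illustration.

```r
# The four sample rows from the text, supplied inline instead of from piracy.csv.
csv_text <- "Country,Piracy
Australia,23
Bangladesh,90
Brunei,67
China,77"

# read.csv treats the first line as the header by default.
piracy_sample <- read.csv(text = csv_text)

# Inspect the result: one row per country, columns Country and Piracy.
str(piracy_sample)
```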
We also have another file with GDP per capita for each country (wealth produced, divided by population):
Rank Country GDP
1 Liechtenstein 141100
2 Qatar 104300
3 Luxembourg 81100
4 Bermuda 69900
...
That will go into the gdp frame:
> gdp <- read.table("gdp.txt", sep=" ", header=TRUE)
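To see what header=TRUE is doing there, here is a small sketch that reads the sample rows from an inline string, once with the header option and once without. The names gdp_text, gdp_sample and no_header are invented for the example.

```r
# The sample rows from the text, space-separated, supplied inline.
gdp_text <- "Rank Country GDP
1 Liechtenstein 141100
2 Qatar 104300
3 Luxembourg 81100
4 Bermuda 69900"

# header = TRUE: the first line supplies the column names.
gdp_sample <- read.table(text = gdp_text, sep = " ", header = TRUE)
names(gdp_sample)

# Without header = TRUE, the first line is read as a data row instead
# and the columns get default names.
no_header <- read.table(text = gdp_text, sep = " ")
names(no_header)
nrow(no_header)
```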
We'll merge the frames on the country names:
> countries <- merge(x = gdp, y = piracy)
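When merge is called without a by argument, it joins on the columns the two frames have in common, and by default it keeps only the rows that appear in both. The sketch below shows this on two tiny frames. The frame names are hypothetical and the GDP figures for Australia and China are invented placeholders, not real data.

```r
# Toy frames; the GDP figures for Australia and China are invented placeholders.
gdp_toy <- data.frame(
  Country = c("Australia", "China", "Qatar"),
  GDP     = c(40000, 8000, 104300)
)
piracy_toy <- data.frame(
  Country = c("Australia", "Bangladesh", "China"),
  Piracy  = c(23, 90, 77)
)

# With no 'by' argument, merge() joins on the shared column (Country)
# and keeps only countries present in both frames.
merged <- merge(x = gdp_toy, y = piracy_toy)
print(merged)

# all.x = TRUE keeps every row of x; a missing piracy rate becomes NA.
merged_all <- merge(x = gdp_toy, y = piracy_toy, all.x = TRUE)
print(merged_all)
```

This also explains why the merged countries frame can end up with fewer rows than either input file.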
Let's do a plot of GDP versus piracy. Call the plot function, using the "GDP" column of countries for the horizontal axis, and the "Piracy" column for the vertical axis:
> plot(countries$GDP, countries$Piracy)
[Plot: countries$Piracy (vertical axis, 20 to 80) against countries$GDP (horizontal axis, 0 to 80000)]
It looks like there's a negative correlation between wealth and piracy - generally, the higher a nation's GDP, the lower the percentage of software installed that's pirated. But do we have enough data to support this connection? Is there really a connection at all?
R can test for correlation between two vectors with the cor.test function. Try calling it on the GDP and Piracy columns of the countries data frame:
> cor.test(countries$GDP, countries$Piracy)

	Pearson's product-moment correlation

data:  countries$GDP and countries$Piracy
t = -14.8371, df = 107, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.8736179 -0.7475690
sample estimates:
       cor
-0.8203183
The key result we're interested in is the "p-value". Conventionally, any correlation with a p-value less than 0.05 is considered statistically significant, and this sample data's p-value is definitely below that threshold. In other words, yes, these data do show a statistically significant negative correlation between GDP and software piracy.
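Rather than reading the p-value off the printed report, you can pull it out of the object that cor.test returns. The following is a sketch on invented toy vectors (gdp_values and piracy_rates are not the course data); the list elements estimate, p.value and conf.int are part of the standard cor.test result.

```r
# Invented toy vectors with an obvious downward trend; not the course data.
gdp_values   <- c(5000, 12000, 20000, 35000, 50000, 80000)
piracy_rates <- c(85, 70, 66, 45, 30, 22)

result <- cor.test(gdp_values, piracy_rates)

result$estimate   # the sample correlation coefficient (named "cor")
result$p.value    # the p-value as a plain number
result$conf.int   # the 95 percent confidence interval

# Compare against the conventional 0.05 threshold.
significant <- result$p.value < 0.05
```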
We have more countries represented in our GDP data than we do in our piracy rate data. If we know a country's GDP, can we use that to estimate its piracy rate?
We can, if we calculate the linear model that best represents all our data points (with a certain degree of error). The lm function takes a model formula, which is represented by a response variable (piracy rate), a tilde character (~), and a predictor variable (GDP). (Note that the response variable comes first.)
Try calculating the linear model for piracy rate by GDP, and assign it to the line variable:
> line <- lm(countries$Piracy ~ countries$GDP)
You can draw the line on the plot by passing it to the abline function. Try it now:
> abline(line)
Now, if we know a country's GDP, we should be able to make a reasonable prediction of how common piracy is there!
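One way to actually make that prediction is the predict function. The sketch below uses invented toy data (countries_toy is a stand-in, not the merged frame) and writes the formula with a data argument rather than countries$... terms, because predict matches the columns of newdata against the variable names in the formula. Treat it as an illustration of the idea, not as the course's own solution.

```r
# Invented toy data standing in for the merged countries frame.
countries_toy <- data.frame(
  GDP    = c(5000, 12000, 20000, 35000, 50000, 80000),
  Piracy = c(85, 70, 66, 45, 30, 22)
)

# Same kind of model as in the text, but written with data = so that the
# predictor is named plain "GDP" and predict() can find it in newdata.
fit <- lm(Piracy ~ GDP, data = countries_toy)

coef(fit)   # intercept and slope of the fitted line

# Estimate the piracy rate for a hypothetical country with a GDP of 30000.
new_country <- data.frame(GDP = 30000)
estimate <- predict(fit, newdata = new_country)
estimate
```

A model fitted as lm(countries$Piracy ~ countries$GDP) still works with abline, but its predictor is literally named countries$GDP, so a newdata column called GDP would not be picked up.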
[Plot: countries$Piracy against countries$GDP, with the fitted line drawn over the points]
7.2 ggplot2
The functionality we've shown you so far is all included with R by default. (And it's pretty powerful, isn't it?) But in case the default installation doesn't include that function you need, there are still more libraries available on the servers of the Comprehensive R Archive Network, or CRAN. They can add anything from new statistical functions to better graphics capabilities. Better yet, installing any of them is just a command away.
Let's install the popular ggplot2 graphics package. Call the install.packages function with the package name in a string:
> install.packages("ggplot2")
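In a script you may not want to reinstall a package every time it runs. A common pattern, sketched here, is to check with requireNamespace first. The helper name needs_install is hypothetical, and the interactive() guard is just one possible way to avoid surprise downloads in unattended runs.

```r
# Hypothetical helper: TRUE when a package still has to be installed.
needs_install <- function(pkg) {
  !requireNamespace(pkg, quietly = TRUE)
}

# The stats package ships with R, so there is nothing to install here.
needs_install("stats")

# Install ggplot2 only when it is missing (needs a network connection),
# and only in an interactive session.
if (interactive() && needs_install("ggplot2")) {
  install.packages("ggplot2")
}
```

Installing a package does not load it: in a regular R session you would call library(ggplot2) before using its functions.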
You can get help for a package by calling the help function and passing the package name in the package argument. Try displaying help for the "ggplot2" package:
> help(package = "ggplot2")

Information on package 'ggplot2'

Description:

Package: ggplot2
Type: Package
Title: An implementation of the Grammar of Graphics
Version: 0.9.1
...
Here's a quick demo of the power you've just added to R. To use it, let's revisit some data from a previous chapter.
> weights <- c(300, 200, 100, 250, 150)
> prices <- c(9000, 5000, 12000, 7500, 18000)
> chests <- c('gold', 'silver', 'gems', 'gold', 'gems')
> types <- factor(chests)
The qplot function is a commonly-used part of ggplot2. We'll pass the weights and prices of our cargo to it, using the chest types vector for the color argument:
> qplot(weights, prices, color = types)
Not bad! An attractive grid background and colorful legend, without any of the configuration hassle from before!
ggplot2 is just the first of many powerful packages awaiting discovery on CRAN. And of course, there's much, much more functionality in the standard R libraries. This course has only scratched the surface!
[Plot: prices (vertical axis, 8000 to 16000) against weights (horizontal axis, 100 to 300), points colored by type; legend: gems, gold, silver]
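For reference, the same chart can be built with ggplot2's fuller ggplot interface, which newer ggplot2 releases recommend over qplot (as far as we know, qplot is marked deprecated from version 3.4.0). This sketch assumes ggplot2 is already installed; the cargo data frame and the variable p are names made up here, holding the same vectors used above.

```r
library(ggplot2)

# The cargo vectors from the text, gathered into one data frame.
cargo <- data.frame(
  weights = c(300, 200, 100, 250, 150),
  prices  = c(9000, 5000, 12000, 7500, 18000),
  types   = factor(c("gold", "silver", "gems", "gold", "gems"))
)

# Map weights to x, prices to y and chest type to color,
# then draw one point per chest.
p <- ggplot(cargo, aes(x = weights, y = prices, color = types)) +
  geom_point()

print(p)
```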
Chapter 7 Completed
Captain's Log: The end of chapter 7. Supplies are running low. Luckily, we've spotted another badge!
We've covered how to take some real-world data sets, and test whether they're correlated with `cor.test`. Then we learned how to show that correlation on plots, with a linear model.