R Programming: Part 4 - Simulation & Base Graphics
来源:互联网 发布:林弯弯淘宝店 编辑:程序博客网 时间:2024/05/16 03:05
Part 4 - Simulation & Base Graphics
- Part 4 - Simulation Base Graphics
- Simulation
- Generating Random Numbers
- Simulating a Linear Model
- Random Sampling
- Base Graphics
- Basic Histogram
- Basic Line Graph with Regression
- Scatterplot with Legend
- Boxplot with Reordered and Formatted Axes
- Barplot with Error Bars
- Limitations of Base Graphics
- Programming Assignment B Hospital Quality
- Simulation
Simulation
Generating Random Numbers
Generally, every probability distribution function is associated with four R corresponding functions, which are prefixed with
d for probability densityr for random number generationp for cumulative distributionq for quantile function
Take normal distribution for example, the four functions have the following prototypes:
dnorm(x, mean = 0, sd = 1, log = FALSE)rnorm(n, mean = 0, sd = 1)pnorm(q, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE)qnorm(p, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE)
where the arguments mean and sd (standard deviation) stand for
The argument log/log.p implies whether the probabilities p are given as log(p). Lower.tail implies whether the probabilities involved are P[X <= x] or P[X > x].
Assuming the normal distribution to be
then the cumulative distribution function is
and the above four functions give the result as follows:
- dnorm(x,
μ ,σ ) =f(x) - rnorm(n,
μ ,σ ) = a numeric vector with n values sampled from f(x) - pnorm(q,
μ ,σ ) =P(X≤q) =Φ(q) - qnorm(p,
μ ,σ ) =Φ−1(p)
Poisson distribution is a discrete probability distribution that expresses the probability of a given number of events occurring in a fixed interval of time or space if these events occur with a known average rate and independently of the time since the last event.
The probability of observing k events in an interval, if they follow Poisson distribution, is given by the equation
where
For Poisson distribution, we have dpois, rpois, ppois and qpois as well. But since Poisson distribution is discrete, dpois(x,
> rpois(10, 20)[1] 17 20 16 20 25 16 19 28 27 22> ppois(6, 2)[1] 0.9954662> dpois(1.1, 2)[1] 0Warning message:In dpois(1.1, 2) : non-integer x = 1.100000
Reproducibility of random numbers can be achieved by setting identical random seeds with set.seed() function. For others’ convenience to replicate your computation, it is highly recommended to set the seed whenever conducting a simulation.
> set.seed(5)> rnorm(5)[1] -0.84085548 1.38435934 -1.25549186 0.07014277 1.71144087> rnorm(5)[1] -0.6029080 -0.4721664 -0.6353713 -0.2857736 0.1381082> set.seed(5)> rnorm(5)[1] -0.84085548 1.38435934 -1.25549186 0.07014277 1.71144087
Simulating a Linear Model
Suppose we want to simulate from a simple linear model as follows
where the noise
> set.seed(20) ## Important for reproducibility> x <- rnorm(100)> e <- rnorm(100, 0, 2)> y <- 0.5 + 2 * x + e> summary(y) Min. 1st Qu. Median Mean 3rd Qu. Max. -6.4080 -1.5400 0.6789 0.6893 2.9300 6.5050 > plot(x, y) ## Plot (x, y) in a graph
Now suppose x is binary, either 0 or 1 with 50/50 probability in each trial, then similarly
> set.seed(20)> x <- rbinom(100, 1, 0.5)> e <- rnorm(100, 0, 2)> y <- 0.5 + 2 * x + e> summary(y) Min. 1st Qu. Median Mean 3rd Qu. Max. -3.4360 -0.2675 1.7800 1.5720 2.8810 6.9170 > plot(x, y)
Now try simulating from a generalized linear model, e.g. Poisson model.
Suppose we have a Poisson model where
and assuming
> set.seed(20)> x <- rnorm(100)> log_lambda <- 0.5 + 0.3 * x> y <- rpois(100, exp(log_lambda))> summary(y) Min. 1st Qu. Median Mean 3rd Qu. Max. 0.00 0.00 2.00 1.69 3.00 5.00 > plot(x, y)
Note: Almost all of the arguments in functions associated with probability distribution can be vectors EXCEPT one which represents the number of variates to be generated in r* series functions.
E.g. rnorm(5, 1:5, 1:5) produces 5 variates which follow N(1, 1), N(2, 2), …, N(5, 5) respectively.
But rnorm(1:5, 5, 1:5) does not produce the expected result. Mapply(rnorm, 1:5, 5, 1:5) can achieve the desired goal.
Other than normal, Poisson, binomial and uniform distribution, there are a lot more built-in standard distributions, like Bernoull, geometric, hypergeometric, negative binomial, Beta, Gamma, chi-squared, exponential, log-normal, multinomial, Cauchy, F (ratio of two chi-squared), etc.
Random Sampling
The sample() function can be used to draw random samples from arbitrary vectors.
Argument list: sample(x, size, replace = FALSE, prob = NULL)
- x: an arbitrary vector or a single positive integer (treated as 1:x)
- size: number of samples to be drawn, which is the length of x by default
- replace: indicates whether to sample with replacement
- prob: a vector of weights implying the probabilities each element in the vector is to be sampled, no need to sum to one
If only the argument x is assigned, then by default, size is equal to the length of x and replace is set to FALSE, so the sampling result is equivalent to a permutation of the original vector x.
> set.seed(20)> sample(1:20, 5)[1] 18 15 6 9 16> sample(20, 5) ## x = 20 equivalent to x = 1:20[1] 20 2 19 6 17> sample(letters, 5) ## Any vector can be sampled from[1] "s" "z" "a" "r" "e"> sample(10) ## A permutation of 1:10[1] 5 3 1 9 10 7 8 2 4 6> sample(10, replace = TRUE) ## Sample with replacement[1] 1 10 10 1 7 4 5 9 2 6> sample(10, 5, replace = TRUE, prob = c(rep(1, 9), 100))[1] 10 10 10 10 10
Base Graphics
All data in the following sections are available here.
Basic Histogram
Have a look at our data about birthweight of male and female unicorns.
> unicorns <- read.table("unicorns.txt", header = T)> head(unicorns) birthweight sex longevity1 4.478424 Male 12 5.753458 Male 03 3.277265 Male 04 3.929379 Male 05 3.972810 Male 06 4.912954 Male 0
We use hist() function to plot a histogram, whose arguments are listed as follows.
> hist(unicorns$birthweight, # Values of x-axis breaks = 40, # Number of cells xlab = "Birth Weight", # X-axis label main = "Histogram of Unicorn Birth Weight", # Plot title ylim = c(0,80)) # Limits of the y axis (min, max)
By default, breaks will be computed according to a particular formula, xlab = vectorName, ylab = “Frequency”, main = “Histogram of” + vectorName.
The argument breaks can also be a vector indicating the breakpoints between the histogram cells.
> hist(unicorns$birthweight, breaks = c(0, 1, 4, 5, 7))
Basic Line Graph with Regression
We have data on the population of moomins, a common pest species in Finland, on the island of Ruissalo from 1971 to 2001.
> moomins <- read.table("Moomin Density.txt", header = T)> head(moomins) Year PopSize1 1971 5002 1972 5623 1973 5444 1974 5325 1975 5806 1976 590
> plot(moomins$Year, moomins$PopSize, # X variable and y variable type = "l", # Draw a line graph col = "red", # Line colour lwd = 3, # Line width xlab = "Year", # X-axis label ylab = "Population Size", # Y-axis label main = "Moomin Population Size on Ruissalo 1971 - 2001") # Plot title
Type indicates which type of plot should be drawn. Some possible options are “p” for points, “l” for lines, “b” for lines covered by points, “o” for lines not covered by points.
Lty (line type) argument controls the type of lines, e.g. “solid“, “dashed”, “dotted”, etc.
Col argument specifies the line color. Numbers can be colors too!
To show the population trend over these years, we can add a basic linear regression to the plot using abline().
> fit <- lm(PopSize ~ Year, data = moomins)> abline(fit, lty = "dashed")> text(x = 1978, y = 750, labels = "R2 = 0.896\nP = 2.615e-15")
Abline() function also supports the arguments of lty, lwd, col, etc.
Text() function adds texts to the plot at the specified (x, y) coordinates.
Scatterplot with Legend
In this section, we make use of a built-in dataset named iris.
> data("iris")> head(iris) Sepal.Length Sepal.Width Petal.Length Petal.Width Species1 5.1 3.5 1.4 0.2 setosa2 4.9 3.0 1.4 0.2 setosa3 4.7 3.2 1.3 0.2 setosa4 4.6 3.1 1.5 0.2 setosa5 5.0 3.6 1.4 0.2 setosa6 5.4 3.9 1.7 0.4 setosa
Use pairs() function to see plots on every two variables. We specify different colors of points according to their species.
> pairs(iris, col = iris$Species)
Now we come to produce a scatterplot on Sepal.Length and Petal.Length.
> plot(iris$Sepal.Length, iris$Petal.Length, # X variable, y variable col = iris$Species, # Color by species pch = 16, # Type of points cex = 1, # Size of points xlab = "Sepal Length", # X-axis label ylab = "Petal Length", # Y-axis label main = "Flower Characteristics in Iris") # Plot title
Pch argument specifies a particular type of point, usually by a number index or a character (e.g. “A”).
The list of indexed point types follows.
To tell these points apart, we’d better add a legend to the plot.
> legend(x = 4.5, y = 7, legend = levels(iris$Species), col = 1:3, pch = 16)
Note 1: It is possible to specify colors to the data frame.
Note 2: It is possible to specify lines in the legend by using lty instead of pch.
> iris$Colour <- ""> iris$Colour[iris$Species == "setosa"] <- "magenta"> iris$Colour[iris$Species == "versicolor"] <- "cyan"> iris$Colour[iris$Species == "virginica"] <- "yellow"> plot(iris$Sepal.Length, iris$Petal.Length, col = iris$Colour, pch = 16, cex = 1)> legend(x = 4.5, y = 7, legend = levels(iris$Species), col = c("magenta", "cyan", "yellow"), lty = "solid")
Boxplot with Reordered and Formatted Axes
Let us continue with the iris dataset.
> boxplot(iris$Sepal.Length ~ iris$Species, # x variable, y variable notch = T, # Draw notch las = 1, # Orientate the axis tick labels xlab = "Species", # X-axis label ylab = "Sepal Length", # Y-axis label main = "Sepal Length by Species in Iris", # Plot title cex.lab = 1.5, # Size of axis labels cex.axis = 1.5, # Size of the tick mark labels cex.main = 2) # Size of the plot title
In a boxplot, the bottom and top of the box are always the first and third quartiles, and the band inside is the median. But the ends of whiskers can represent some alternative values, among which the minimum and the maximum are the most common.
Notch argument allows you to compare the medians of the boxplot.
Las argument allows you to rotate the labels of axis.
Cex.lab, cex.axis and cex.main allows you to specify the size of labels, axes and the title respectively.
To reorder the species list in the label, we can reorder it by specifying a particular order like the following command.
iris$Species <- factor(iris$Species, levels = c("virginica", "versicolor", "setosa"))
Barplot with Error Bars :(
Drawing barplots with base graphics is not that comfortable.
> dragons <- data.frame( TalonLength = c(20.9, 58.3, 35.5), SE = c(4.5, 6.3, 5.5), Population = c("England", "Scotland", "Wales"))> dragons TalonLength SE Population1 20.9 4.5 England2 58.3 6.3 Scotland3 35.5 5.5 Wales> barplot(dragons$TalonLength, names = dragons$Population)
But it is really COMPLICATED to draw a barplot with error bars!
Limitations of Base Graphics
You could feasibly do anything you require in base graphics, but some common actions are not straightforward, such as
- Legends
- Dodged plots
- Faceting
- Error bars
- Formatting axes and plot area
Base graphics are best for quick, simple and exploratory graphics, but they are really time-consuming when it comes to complex graphs.
It is HIGHLY RECOMMENDED to use ggplot2 for statistical graphics.
Advantages of ggplot2:
- Consistent underlying grammar of graphics (Wilkinson, 2005)
- Plot specification at a high level of abstraction
- Very flexible
- Theme system for polishing plot appearance
- Mature and complete graphics system
- Many users, active mailing list
A blog on introduction to ggplot2 will come out this summer.
Programming Assignment B: Hospital Quality
- R Programming: Part 4 - Simulation & Base Graphics
- R learning -Base Graphics
- R Programming: Part 2 - Programming with R
- learning R with swirl-Base Graphics
- R Programming: Part 1 - Nuts and Bolts
- R Programming Note 4
- R Programming: Part 3 - Code Correctly and Efficiently
- R programming week #4 assigment
- Microsoft(R) XNA(TM) Unleashed: Graphics and Game Programming for Xbox 360 and Windows
- Practical WPF Graphics Programming
- UE4 Graphics Programming
- ARM Assembly Language Programming (part 4)
- A trip through the Graphics Pipeline 2011, part 4
- [R] Expression in R graphics
- R base常用函数
- learning R with swirl-simulation(模拟)
- EXPLORING IPHONE GRAPHICS PART 1
- EXPLORING IPHONE GRAPHICS PART 2
- mysql 之with rollup 的使用
- 【钻石展位】无线链接要求规则解读
- MySQL中TIMESTAMPDIFF和TIMESTAMPADD函数的用法
- 写出高效清晰Layout布局文件的一些技巧
- 到底怎么样才叫看书?
- R Programming: Part 4 - Simulation & Base Graphics
- java书籍介绍
- Kinect SDK V1.7 开发工具包概览
- Android中数据解析的实现
- 基数排序(桶排序)
- leetcode题解-349.Intersection of Two Arrays && 350. Intersection of Two Arrays II
- 使用libpng和GDI读取显示png图片
- win7下怎么配置ODBC数据源
- 如何做系列(1)- mybatis 如何实现分页?